A MapReduce-based approach to social network big data mining

Abstract

The rapid development of social networks has made it more convenient for users to receive information. Microblogs, as a network communication platform in daily use, hold a vast amount of information. To address the low efficiency and poor clustering quality of the K-means algorithm, this paper studies a parallel K-means clustering algorithm based on the MapReduce model. To ease the difficulty of calculating the similarity of microblog topic texts, the vector space model and semantic similarity are combined to calculate the similarity between texts and improve the quality of microblog text classification. The data expansion rate of the corresponding nodes under different data sets shows that the average expansion rate of the parallel K-means algorithm reaches 0.89 and that its running efficiency is the highest. The results show that the parallel K-means algorithm has good clustering stability and the highest clustering quality, reaching 1.24. Its clustering time is the shortest, with an average of 1.27 minutes, so both its clustering quality and its efficiency are the best. In the quality analysis of Weibo topic recommendation, the recommendation accuracy of P-K-means is 95.64% and user satisfaction is 98.64%, again the best among the compared algorithms. These results indicate that the parallel K-means clustering algorithm based on MapReduce performs best in microblog topic mining and recommendation, and that it can efficiently recommend topics of interest to users and enhance their microblogging experience.

Keywords

Social network big data MapReduce parallel K-means clustering algorithm Weibo topic

1. Introduction

With the development of technology, “knowing the world without going out” has become the norm. Social networks shorten the distance between people and attract the attention of many users [1]. On Weibo, a mainstream social networking site in China, users share content, find like-minded partners through topics, and follow social trends through popular topics. Intelligent and effective classification and recommendation of Weibo topics is therefore very important [2]. As the number of microblog users grows, the volume of information and data increases rapidly, and data mining technology that can analyze the information hidden in data is needed to classify microblog topics accurately. K-means clustering gathers similar topics and divides them into multiple clusters, which makes it much easier for users to find topics of interest [3]. However, the traditional K-means clustering algorithm selects its initial cluster centres at random, so the clustering quality is poor. In addition, growing data volumes slow the algorithm down, and improving its performance has become a research focus. Cloud computing technology has attracted attention for its ability to process massive data efficiently. MapReduce, one of the core technologies of cloud computing, offers a simple, abstract programming model that is convenient to use and easy to combine with other algorithms. Applying MapReduce to parallelize an algorithm can greatly improve its efficiency [4]. We therefore study a parallel K-means clustering algorithm based on the MapReduce model to classify Weibo topic text. To address the difficulty of text similarity calculation during clustering, we use the vector space model together with semantic similarity to calculate the similarity between texts. The algorithm is expected to improve the classification efficiency of microblog text, improve the accuracy of microblog topic recommendation, and achieve high-quality microblog topic recommendation.

2. Related work

With the popularity and development of social networks, the scale of information data has increased, and many researchers have used various methods to analyze the latent information in social networks. Zhang et al. [5] studied a sentence similarity model for analyzing the similarity of popular topics in social media. To improve the accuracy of the model, a sentence similarity calculation method was used to extract the semantic information of sentences. Experiments on a semantic corpus verified that the model has practical value and that its average deviation is better than that of other methods. Lopez-Castroman et al. [6] found that online bullying and internet addiction in social networks increase suicidal behavior. They used data mining to identify network risk patterns and detect at-risk groups, and applied the results in website applications so that targeted intervention can be carried out when users have suicidal thoughts. Huang et al. [7] addressed the uncertainty in locating influential information nodes in big data by studying an intelligent method that automatically identifies the influence of information nodes, and tested its ability to observe network information and spread information. The experiments showed that the method can effectively identify influential nodes and can detect the importance of spam-sending nodes in a mailbox, and that its performance is better than the baseline method. Su et al. [8] studied similarity discovery among social users. To recommend social topics effectively and infer user relationships, they built a Bayesian network model combining network topology and user relationships. In simulation tests, the algorithm improved the accuracy of similarity discovery between social users and analyzed social data effectively, which gives it practical value.

Choi and Chung [9] proposed a big data knowledge process for association mining using Hadoop's MapReduce software. It manages health knowledge services efficiently by using public data to collect and process abnormal information, and it uses MapReduce and the extracted association rules to create a knowledge base, which enhances technical value and efficiency and provides an effective basis for healthy living. Rani and Pushpalatha [10] found that the MapReduce framework can be used to mine sensor data, and proposed a highly efficient parallel distributed algorithm based on vertical partitioning. The concept of overlapping windows is implemented in the MapReduce framework to reduce communication overhead. The results show improvements in both the execution time and the output of the algorithm. Laccetti et al. [11] studied the K-means algorithm and found that it can effectively find similarities in large-scale data. To address its problems, namely that the value of K cannot be defined accurately and that execution efficiency is low, they introduced an adaptive parameter that allows K to be defined dynamically and proposed a parallelization strategy for multi-core CPUs. Their experiments show that the running cost of the adaptive K-means algorithm in a multi-core environment is significantly reduced, and that its accuracy and running speed are significantly better than those of the comparison algorithm.

A review of the work of domestic and foreign scholars shows that clustering algorithms perform well in analyzing social network information, and that parallel algorithms based on the MapReduce model stand out in particular. This study therefore uses a data mining algorithm to analyze microblog topics in social networks. To address the shortcomings of the K-means algorithm, a parallel K-means clustering algorithm based on the MapReduce model is proposed. The algorithm is expected to analyze users' interests effectively and recommend relevant topics accurately.

3. Parallel K-means clustering algorithm based on MapReduce for mining microblog topics

3.1 Parallel MapReduce-based K-means clustering algorithm

With the in-depth application of the Internet and the widespread use of personal terminal devices, the volume of network data keeps rising. Faced with large-scale data, traditional computing models can no longer meet the demand for processing speed. The MapReduce computing model uses “Map” and “Reduce” functions to split the data and takes care of work details such as task distribution and fault tolerance, so that users can concentrate on the computation itself and work more efficiently [12]. The MapReduce model is often used to process large-scale data. It divides a task into multiple subtasks, which are distributed across the nodes of the cluster and executed simultaneously; the outputs of the nodes are then consolidated according to defined rules. The process by which MapReduce handles data is shown in Fig. 1. After formatting, the original data is passed to the Map module in the form of key-value pairs. After the Map computation, closely related key-value pairs are combined and used as the input of the Reduce module. After analysis and processing by the Reduce module, a set of output key-value pairs is generated.

Figure 1.

Running process of MapReduce model.
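The flow in Fig. 1 can be illustrated with a minimal single-process sketch in Python. It is not the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are hypothetical, and word counting is used only as a stand-in task to show how key-value pairs move from Map to Reduce.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit one (word, 1) key-value pair per word in the input record."""
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single output pair."""
    return key, sum(values)

def run_job(records):
    # On a real cluster each record (or split) would be mapped on a different node.
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = run_job(["weibo topic mining", "topic recommendation", "weibo topic"])
```

Here `counts` maps each word to its total number of occurrences across the three toy records.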

Social networks generate large amounts of data that contain a great deal of hidden information. Mining this information helps provide users with effective search tags and helps them find like-minded partners. As the saying goes, “birds of a feather flock together”, and classification problems are common in daily life. The process of grouping objects according to some measure of similarity is called clustering, and each resulting collection of data objects is called a cluster. Every cluster follows the rule that “objects within the cluster are similar, but objects between clusters are different” [13]. The K-means clustering algorithm partitions the data into a specified number of clusters and iteratively updates the cluster centres so that the clustering objective function reaches its minimum. K-means is an unsupervised learning algorithm. Suppose the set of samples to be clustered is denoted as $X=\{{x_{i}|{x_{i}\in R^{p};i=1,2,\ldots,n}}\}$ , where $p$ is the dimension of the sample space, and $K$ samples are randomly selected as the initial cluster centres $Z=({z_{1},z_{2},\ldots,z_{K}})$ . Let $w_{j}({j=1,2,\ldots,K})$ denote the $K$ clusters; the arithmetic mean of the data belonging to the same cluster is given by Eq. (1).

$\displaystyle z_{j}=\frac{1}{N_{j}}\sum\limits_{x\in w_{j}}x$ (1)

In Eq. (1), $N_{j}$ represents the total number of data in the clustering category $w_{j}$ . The expression of the clustering objective function is shown in Eq. (2).

$\displaystyle J=\sum\limits_{i=1}^{K}{\sum\limits_{j=1}^{n_{i}}{d({x_{j},z_{i}})}}$ (2)

In Eq. (2), $J$ accumulates, over all clusters, the distance from each sample to the centre of its cluster, i.e. the clustering error that the algorithm minimizes; $n_{i}$ represents the number of samples in cluster $i$ , and $d({x_{j},z_{i}})$ represents the distance between a sample and the centre of its cluster, which is calculated by Eq. (3).

$\displaystyle d({x_{j},z_{i}})=\sqrt{\sum\limits_{k=1}^{p}{({x_{jk}-z_{ik}})^{2}}}$ (3)

In Eq. (3), $d({x_{j},z_{i}})$ represents the Euclidean distance between two data points. The K-means clustering algorithm first randomly specifies $K$ initial cluster centres $Z=({z_{1}(1),z_{2}(1),\ldots,z_{K}(1)})$ . The sample data are then grouped according to the distance formula, the arithmetic mean within each cluster is calculated according to Eq. (1), and the cluster centres are updated. The sample data are re-clustered according to the closest distance between each sample and the new cluster centres. These iterations are repeated to adjust the distribution of the sample data within the clusters until the cluster centres no longer change or the objective function reaches a minimum, as shown in Fig. 2.

Figure 2.

Execution flow of K-means clustering algorithm.
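The loop in Fig. 2 can be written down as a short serial sketch in Python. It is only an illustration of the standard K-means iteration, not the authors' implementation; the function names, the `seed` and `max_iter` parameters and the toy samples are assumptions made for the example.

```python
import math
import random

def euclidean(a, b):
    """Eq. (3): Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def mean(points):
    """Eq. (1): component-wise arithmetic mean of the points in one cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(samples, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centres = rng.sample(samples, k)            # random initial centres
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in samples:                       # assign each sample to its nearest centre
            j = min(range(k), key=lambda i: euclidean(x, centres[i]))
            clusters[j].append(x)
        # An empty cluster keeps its previous centre (a choice made for this sketch).
        new_centres = [mean(c) if c else centres[i] for i, c in enumerate(clusters)]
        if new_centres == centres:              # centres stopped moving: converged
            break
        centres = new_centres
    return centres, clusters

samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centres, clusters = kmeans(samples, k=2)
```

On these two well-separated groups the loop settles on one centre per group after a few iterations, whichever two samples are drawn as initial centres.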

As can be seen from Fig. 2, each iteration of the K-means algorithm requires two calculations: the arithmetic mean of each cluster and the distance between each sample and the cluster centres. The Euclidean distance depends only on the sample itself and the cluster centres, so the data can be divided into multiple subsets and the samples in each subset can be assigned to clusters independently and in parallel. The arithmetic mean of a cluster can likewise be obtained by first computing partial sums within each subset and then aggregating them to produce the new cluster centres, and the process repeats until the objective function reaches its minimum. The time complexity of the parallelised K-means algorithm is shown in Eq. (4).

$\displaystyle O\left({\frac{nKt}{m}+nK\log(m)}\right)$ (4)

In Eq. (4), $n$ represents the total number of samples, $K$ the number of clusters, $t$ the number of iterations, and $m$ the number of subsets computed in parallel. Because the K-means algorithm selects its initial cluster centres at random, the results differ from run to run, which makes the clustering quality less stable. The MapReduce model has several advantages over other parallel computing models: it can handle different kinds of problems, its simple yet powerful interface makes large-scale data computation straightforward, and it provides good load balancing while ensuring the quality of computation and the smooth operation of the system [14]. The parallel K-means clustering algorithm is therefore implemented on the MapReduce model. The sample data for clustering are treated as character-type vector structures, so the source data must first undergo format conversion. Following the Euclidean distance formula used in the clustering algorithm, the cluster abstraction is defined as $\textit{Cluster}({\textit{id},\textit{num},\textit{center},\textit{radius}})$ , where num, center and radius represent the number of samples contained in the cluster, the centre of the cluster and the cluster radius, respectively; they reflect the size of the cluster, its mean value and the dispersion of the samples within it. The sum of the weights of the sample data in a cluster is denoted $s_{0}$ and is expressed in Eq. (5).

$\displaystyle s_{0}=\sum\limits_{i=0}^{n}{u_{i}}$ (5)

In Eq. (5), $u_{i}$ indicates the weight of the $i$-th sample, i.e. the number of samples it represents in the cluster. The weighted sum of the samples within a cluster, $s_{1}$ , is shown in Eq. (6).

$\displaystyle s_{1}=\sum\limits_{i=0}^{n}{x_{i}u_{i}}$ (6)

The weighted sum of squares of the samples within a cluster, $s_{2}$ , is given by Eq. (7).

$\displaystyle s_{2}=\sum\limits_{i=0}^{n}{x_{i}^{2}u_{i}}$ (7)

The values of num, center and radius can then be calculated from $s_{0}$ , $s_{1}$ and $s_{2}$ as in Eq. (8).

$\displaystyle\left\{{\begin{array}[]{l}\textit{num}=s_{0}\\ \textit{center}=\frac{s_{1}}{s_{0}}\\ \textit{radius}=\frac{\sqrt{s_{0}s_{2}-s_{1}^{2}}}{s_{0}}\\ \end{array}}\right.$ (8)
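A small Python sketch may help show why these three sums suit parallel execution: partial sums computed on separate data splits can be merged by simple addition before Eq. (8) is applied. The sketch assumes one-dimensional samples and unit weights by default; the function names `summarize`, `merge` and `cluster_stats` are hypothetical.

```python
import math

def summarize(values, weights=None):
    """Return (s0, s1, s2) for one-dimensional samples, following Eqs. (5)-(7)."""
    if weights is None:
        weights = [1.0] * len(values)       # assumption: unit weight per sample
    s0 = sum(weights)
    s1 = sum(x * u for x, u in zip(values, weights))
    s2 = sum(x * x * u for x, u in zip(values, weights))
    return s0, s1, s2

def merge(a, b):
    """Partial sums from two data splits combine by plain addition."""
    return tuple(x + y for x, y in zip(a, b))

def cluster_stats(s0, s1, s2):
    """Eq. (8): num, center and radius from the three sums."""
    num = s0
    center = s1 / s0
    # Clamp at zero so that floating-point rounding cannot produce a negative root.
    radius = math.sqrt(max(s0 * s2 - s1 * s1, 0.0)) / s0
    return num, center, radius

part_a = summarize([1.0, 2.0])              # sums computed on one split
part_b = summarize([3.0, 4.0])              # sums computed on another split
num, center, radius = cluster_stats(*merge(part_a, part_b))
```

For the four toy values the merged sums are $s_{0}=4$ , $s_{1}=10$ and $s_{2}=30$ , so the centre is 2.5 and the radius equals $\sqrt{20}/4\approx 1.118$ , which is the population standard deviation of the values.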

Implementing the parallel K-means clustering algorithm on the MapReduce model requires decomposing K-means into three functions: Map, Combine and Reduce. The Map function calculates the distance from each sample to the cluster centres and assigns the sample to a cluster according to that distance. The Combine function merges the key-value pairs generated by Map and passes them as input to the Reduce function, which reduces the computation the Reduce function has to perform. The Reduce function calculates the weighted average within each cluster, which is used to update the cluster centre. These steps are repeated until the objective function converges. The running process is shown in Fig. 3.

Figure 3.

Running chart of parallel K-means algorithm based on MapReduce.
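The division of labour between Map, Combine and Reduce can be simulated in a single Python process. The sketch below is not Hadoop code and not the authors' implementation: `map_fn`, `combine_fn`, `reduce_fn` and `iterate` are hypothetical names, each inner list in `splits` stands for the data held by one node, and squared Euclidean distance is used for the assignment because it selects the same nearest centre as Eq. (3).

```python
from collections import defaultdict

def nearest(point, centres):
    """Index of the centre with the smallest squared Euclidean distance."""
    return min(range(len(centres)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centres[j])))

def map_fn(point, centres):
    """Map: key = nearest cluster id, value = (vector sum, count) for one sample."""
    return nearest(point, centres), (tuple(point), 1)

def combine_fn(pairs):
    """Combine: pre-aggregate one split into a single (sum, count) per cluster."""
    acc = {}
    for cid, (vec, cnt) in pairs:
        if cid in acc:
            old_vec, old_cnt = acc[cid]
            acc[cid] = (tuple(a + b for a, b in zip(old_vec, vec)), old_cnt + cnt)
        else:
            acc[cid] = (vec, cnt)
    return list(acc.items())

def reduce_fn(partials):
    """Reduce: merge the partial sums of one cluster and return its new centre."""
    total = [0.0] * len(partials[0][0])
    count = 0
    for vec, cnt in partials:
        total = [t + v for t, v in zip(total, vec)]
        count += cnt
    return tuple(t / count for t in total)

def iterate(splits, centres):
    grouped = defaultdict(list)
    for split in splits:                        # each split would run on its own node
        mapped = [map_fn(p, centres) for p in split]
        for cid, partial in combine_fn(mapped):
            grouped[cid].append(partial)        # shuffle: group partials by cluster id
    new_centres = list(centres)                 # clusters with no samples keep their centre
    for cid, partials in grouped.items():
        new_centres[cid] = reduce_fn(partials)
    return new_centres

splits = [[(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)],
          [(1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]]
centres = [(0.0, 0.0), (10.0, 10.0)]
for _ in range(20):                             # repeat jobs until the centres stop changing
    updated = iterate(splits, centres)
    if updated == centres:
        break
    centres = updated
```

Because Combine forwards only one (sum, count) pair per cluster and split, the amount of data passed to Reduce depends on the number of clusters and splits rather than on the number of samples.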

3.2 Microblog topic mining based on MapReduce parallel K-means clustering algorithm

With the rapid development of information technology, the Internet plays an ever larger part in daily life. Social networks are popular with young people and continue to expand. Microblogging is currently the main form of social networking in China, the number of active users grows year by year, and people's lives have become inseparable from social networks [15]. As social network data keeps growing, problems such as irregular wording and ambiguous topics have emerged, which makes it harder for users to search for topics of interest or popular topics. To help users mine the content they need from this mass of information, the study uses the parallel K-means clustering algorithm on the MapReduce model to mine popular topics on Weibo and make recommendations. Microblog text is limited in length, generally to no more than 140 characters, and tends to be colloquial, which makes it difficult to extract keywords. At the same time, popular microblog topics change quickly, so information mining must be efficient enough to keep up [16]. The process of microblog topic mining and recommendation is shown in Fig. 4. The latest microblog text is obtained through the microblog open system, the text is segmented into words, keywords are selected and their weights calculated, the text is divided and categorised using the parallel K-means clustering algorithm, and finally the range is narrowed down to recommend microblog topics.

Figure 4.

Weibo topic mining and recommendation process.

Representative data samples improve the quality of data mining. They are generally obtained through the microblogging open platform or with web crawlers. Because the raw data contains information that is not needed for clustering, the samples must be pre-processed before the microblog text is clustered [17]. First, useless characters are removed and valid text is extracted using regular expressions, which reduces clustering time and improves efficiency. The processed text is then segmented into words; because Chinese is prone to synonymy and ambiguity, Chinese word segmentation usually relies on corpus matching. In addition, effective feature selection reduces clustering time and improves clustering quality: weights are calculated for the features to assess how important each word is to the text. Operating on raw text directly is difficult, so the text is transformed into a mathematical model in which text computation becomes an algebraic operation in vector space, which greatly improves the efficiency of keyword selection [18]. TF-IDF (term frequency-inverse document frequency) measures the importance of a phrase to a text and can quickly identify keywords, while cosine similarity measures the similarity between texts represented as vectors. The TF-IDF weight of a phrase is calculated in two parts. The first is the term frequency (TF), the frequency with which the phrase occurs in the text, calculated as shown in Eq. (9).

$\displaystyle tf({t,d})=\frac{n_{t,d}}{\sum\nolimits_{l}{n_{l,d}}}$ (9)

In Eq. (9), $n_{t,d}$ represents the number of occurrences of the phrase $t$ in the text $d$ , and the denominator is the total number of occurrences of all words in the text. The second part is the inverse document frequency (IDF), which reflects how rarely the phrase occurs across all texts and is given by Eq. (10).

$\displaystyle\textit{idf}({t,D})=\log\frac{N}{|{\{{d\in D:t\in d}\}}|}$ (10)

In Eq. (10), $N$ is the total number of texts in the text base $D$ , and the denominator is the number of texts in the text base that contain the phrase. To avoid a zero denominator when no text in the text base contains the phrase, the denominator is usually taken as $|{\{{d\in D:t\in d}\}}|+1$ . The final formula for calculating the weight of a phrase, TF-IDF, is Eq. (11).

$\displaystyle\textit{tfidf}({t,d,D})=tf({t,d})\times\textit{idf}({t,D})$ (11)
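Eqs. (9) to (11) can be checked on a toy corpus with a few lines of Python. This is a minimal sketch rather than the authors' implementation: documents are assumed to be non-empty lists of already segmented tokens, the natural logarithm is assumed because the text does not specify a base, and the smoothed denominator with $+1$ is used.

```python
import math
from collections import Counter

def tf(term, doc):
    """Eq. (9): occurrences of `term` divided by the total token count of `doc`."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """Eq. (10) with the +1 smoothed denominator described in the text."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (containing + 1))

def tfidf(term, doc, corpus):
    """Eq. (11): product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, corpus)

corpus = [["weibo", "topic", "topic"],
          ["sports", "news"],
          ["music", "news"],
          ["weibo", "music"]]
score = tfidf("topic", corpus[0], corpus)
```

Here "topic" fills two of the three tokens of the first document and appears in one of four documents, so the score is $\frac{2}{3}\ln\frac{4}{2}\approx 0.462$ . One consequence of the smoothed denominator is that a term occurring in every document receives a slightly negative IDF.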

Cosine similarity uses the cosine of the angle between vectors to measure the similarity between texts. Let $A$ and $B$ be two text vectors; their cosine similarity is given by Eq. (12).

$\displaystyle\textit{similarity}({A,B})=\frac{A\cdot B}{\|A\|\|B\|}=\frac{\sum\limits_{i=1}^{n}{A_{i}\times B_{i}}}{\sqrt{\sum\limits_{i=1}^{n}{({A_{i}})^{2}}}\times\sqrt{\sum\limits_{i=1}^{n}{({B_{i}})^{2}}}}$ (12)
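As a quick illustration of Eq. (12), the following Python sketch computes the cosine similarity of two equal-length weight vectors. Returning 0.0 for an all-zero vector is an assumption of this sketch, since Eq. (12) is undefined in that case.

```python
import math

def cosine_similarity(a, b):
    """Eq. (12): dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0      # assumption: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing in the same direction score 1 regardless of their length, and vectors with no shared terms score 0, which is why the measure suits texts of different lengths.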

Text can be handled algebraically using the vector space model. However, language contains many synonyms and near-synonyms, so similarity computed directly from text vectors is often less than ideal, and semantic similarity is needed to improve the quality of data mining. The semantics of a word is related to the concepts of the word. Let the set of concepts of the word $W_{1}$ be $\{{S_{11},S_{12},\ldots,S_{1m}}\}$ and the set of concepts of the word $W_{2}$ be $\{{S_{21},S_{22},\ldots,S_{2n}}\}$ . The semantic similarity of the words $W_{1}$ and $W_{2}$ is obtained from the similarity of their concepts by taking the maximum over all concept pairs: $\textit{similarity}({W_{1},W_{2}})=\mathop{\max}\limits_{i=1\ldots m,j=1\ldots n}\textit{similarity}({S_{1i},S_{2j}})$ . The conceptual similarity between words is the weighted sum of the similarity of four features, which is expressed in Eq. (13).

$\displaystyle\textit{similarity}({S_{1},S_{2}})=\sum\limits_{i=1}^{4}{({\beta_{i}\times\textit{similarity}_{i}({S_{1},S_{2}})})}$ (13)
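The weighted sum in Eq. (13) and the maximum over concept pairs can be sketched in Python as follows. The weights are the $\beta$ values listed in the parameter table of Section 4; how the four per-feature similarities are obtained is not specified here, so the sketch takes them from a hypothetical callback `pair_features`, and the toy lookup table contains made-up values for illustration only.

```python
BETAS = (0.4, 0.3, 0.2, 0.1)    # beta_1..beta_4 from the parameter table; they sum to 1

def concept_similarity(feature_sims, betas=BETAS):
    """Eq. (13): weighted sum of the four per-feature similarities."""
    if len(feature_sims) != len(betas):
        raise ValueError("expected one similarity value per feature weight")
    return sum(b * s for b, s in zip(betas, feature_sims))

def word_similarity(concepts_1, concepts_2, pair_features):
    """Similarity of two words: the best-matching pair of concepts decides."""
    return max(concept_similarity(pair_features(s1, s2))
               for s1 in concepts_1 for s2 in concepts_2)

# Hypothetical per-feature similarities for two concept pairs (illustration only).
toy_features = {("s11", "s21"): (1.0, 0.5, 0.5, 0.0),
                ("s12", "s21"): (0.5, 0.5, 0.0, 0.0)}
score = word_similarity(["s11", "s12"], ["s21"],
                        lambda s1, s2: toy_features[(s1, s2)])
```

The first concept pair scores $0.4+0.15+0.1=0.65$ and the second $0.2+0.15=0.35$ , so the word similarity is 0.65.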

In Eq. (13), $\beta_{i}$ are adjustable parameters that satisfy $\sum\limits_{i=1}^{4}{\beta_{i}}=1$ and $\beta_{1}\geqslant\beta_{2}\geqslant\beta_{3}\geqslant\beta_{4}$ , indicating that the first feature has the greatest weight and the other features play a successively smaller role. Let $D_{p}=\{{t_{p1},t_{p2},\ldots,t_{pm}}\}$ and $D_{q}=\{{t_{q1},t_{q2},\ldots,t_{qn}}\}$ be two microblog text vectors. Semantic similarity is combined with the vector space model to calculate their similarity, as expressed in Eq. (14).

$\displaystyle\textit{similarity}({D_{p},D_{q}})=\frac{\sum\limits_{i=1}^{n}{({w^{\prime}_{pi}w^{\prime}_{qi}\times f({t^{\prime}_{pi},t^{\prime}_{qi}})})}}{\sqrt{\sum\limits_{i=1}^{n}{w_{pi}^{2}}}\times\sqrt{\sum\limits_{i=1}^{n}{w_{qi}^{2}}}}$ (14)

In Eq. (14), $w^{\prime}$ is the TF-IDF weight of the corresponding word in the vector, and $f({t^{\prime}_{pi},t^{\prime}_{qi}})$ is the semantic similarity function. The similarity of two text vectors can be regarded as the vector space similarity of their word groups scaled by the semantic deviation, so both the word-group weights and the semantic deviation affect the result. To handle word groups that match textually but differ greatly in meaning, a threshold $\eta$ is set when calculating similar word groups: only word groups whose similarity is greater than the threshold are treated as having reference value. The text clustering process uses Eq. (15) to calculate the vector distance, expressed as follows.

$\displaystyle d({D_{p},D_{q}})=\frac{\alpha({1-\textit{similarity}({D_{p},D_{q}})})}{\textit{similarity}({D_{p},D_{q}})}$ (15)

In Eq. (15), $\alpha$ is the adjustment parameter, representing the distance value at a similarity of 0.5. The set of vectors of the microblogging text is set to $T$ and the vector centre of the cluster $c_{i}$ is calculated as shown in Eq. (16).

$\displaystyle c_{i}=\frac{\sum\nolimits_{x\in T_{i}}x}{|{T_{i}}|}$ (16)

In Eq. (16), $T_{i}$ is the cluster, $x$ is a vector in the cluster, and $|{T_{i}}|$ is the total number of vectors in the cluster. When a new microblog post appears, its text is pre-processed, converted into a text vector and assigned to a cluster. The distances between this vector and the other vectors in the cluster are calculated, the closest vectors are selected and converted back to the original microblog text, and these posts are output as recommended topics. The topic recommendation process is shown in Fig. 5.
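The distance of Eq. (15), the cluster centre of Eq. (16) and the final nearest-neighbour selection can be sketched together in Python. This is an illustration under stated assumptions: plain cosine similarity stands in for the semantic similarity of Eq. (14), a similarity of zero is mapped to an infinite distance, and the names `recommend` and `top_n` as well as the toy posts are hypothetical.

```python
import math

def cosine(a, b):
    """Stand-in similarity for this sketch (Eq. (14) would be used in practice)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def distance(similarity, alpha=1.0):
    """Eq. (15): turn a similarity into a distance; equals alpha at similarity 0.5."""
    if similarity <= 0:
        return float("inf")     # assumption: zero similarity means infinitely far
    return alpha * (1.0 - similarity) / similarity

def centre(vectors):
    """Eq. (16): component-wise mean of the vectors in one cluster."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def recommend(new_vec, cluster, similarity_fn=cosine, top_n=3):
    """Return the texts of the posts in `cluster` closest to the new post."""
    ranked = sorted(cluster,
                    key=lambda item: distance(similarity_fn(new_vec, item[1])))
    return [text for text, _ in ranked[:top_n]]

cluster = [("post A", [1.0, 0.0]), ("post B", [1.0, 1.0]), ("post C", [0.0, 1.0])]
picks = recommend([1.0, 0.1], cluster, top_n=2)
```

With these toy vectors the new post is closest to "post A" and then "post B", so those two are returned in that order.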

4. Analysis of parallel K-means clustering algorithm based on MapReduce for mining microblog topics

4.1 Performance analysis of parallel K-means algorithm based on MapReduce

To test the clustering quality of the parallel K-means clustering algorithm, it is compared with two classical clustering algorithms, PAM (Partitioning Around Medoids) and CLARA (Clustering LARge Applications), and with the original K-means algorithm. The values of the adjustment parameters in the semantic similarity formulas are determined empirically, and a Hadoop cluster is built in a simulation laboratory for the experiments. The experimental software and hardware configuration and the parameter settings are shown in Table 1.

Table 1
Experimental software and hardware configuration and parameter setting

Hardware configuration		Software configuration	
CPU	8-core Intel(R) Xeon(R) E5620, 2.40 GHz	Operating system	Red Hat 3.4–5.2
Hard disk	500 GB SATA	Hadoop version	Hadoop 2.0.2
Memory	16 GB DDR3 ECC	JDK version	1.6.0_31

Parameter	Parameter interpretation	Parameter value	Reference
$\alpha$	The reference distance when the similarity is 0.5	1.0	Eq. (15)
$\beta_{1}$	Calculation weight of basic features	0.4	Eq. (13)
$\beta_{2}$	Calculation weight of other features	0.3	Eq. (13)
$\beta_{3}$	Calculation weight of relationship characteristics	0.2	Eq. (13)
$\beta_{4}$	Calculation weight of the relationship symbol	0.1	Eq. (13)
$\eta$	Similar word threshold	0.7	/

Figure 5.

Table 2
Quality analysis of microblog topic recommendation

Algorithm	Recommendation accuracy	RMSE	User satisfaction
PAM	85.47%	0.31	88.73%
CLARA	91.06%	0.19	94.67%
O-K-means	90.28%	0.20	93.81%
P-K-means	95.64%	0.13	98.64%

Figure 8.

Clustering time comparison chart.

Figure 9.

ROC curve.

As Table 2 shows, P-K-means achieves the highest microblog topic recommendation accuracy, 95.64%, while PAM has the lowest, 85.47%. A larger RMSE indicates a worse recommendation effect, so P-K-means, which has the smallest RMSE (0.13), gives the best recommendations. P-K-means also achieves the highest user satisfaction, 98.64%. These data show the superiority of the P-K-means algorithm. Based on the recall and accuracy of the recommended microblog topics, the ROC curves of the four algorithms are shown in Fig. 9.

The ROC curves in Fig. 9 show that the parallel K-means algorithm proposed in this study recommends the best microblog topics, with the highest recall and accuracy. Calculating the area under each curve gives AUC values of 0.8853, 0.8316, 0.8237 and … for the parallel K-means algorithm, the original K-means algorithm, the CLARA algorithm and the PAM algorithm, respectively. These results also show that the proposed algorithm achieves the best accuracy and performance in microblog topic recommendation, which is closely related to the clustering quality of the algorithms.
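The area under a curve of this kind is commonly approximated with the trapezoidal rule. The Python sketch below shows the calculation on made-up points; it assumes the curve is given as (x, y) pairs with the usual ROC axes (false positive rate, true positive rate) and does not reproduce the data behind Fig. 9.

```python
def auc(points):
    """Area under a piecewise-linear curve through (x, y) points (trapezoidal rule)."""
    pts = sorted(points)                        # order the points by increasing x
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0     # area of one trapezoid
    return area

# Hypothetical curve points, for illustration only.
example = auc([(0.0, 0.0), (0.5, 0.8), (1.0, 1.0)])
```

A diagonal from (0, 0) to (1, 1) gives an area of 0.5, the value expected from random ranking, and a curve that rises more steeply gives a larger area.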

5. Conclusion

As the information age continues to develop, mining the hidden information in social networks to make things more convenient for users plays a positive role. This study implements a parallel K-means clustering algorithm on the MapReduce model, combines it with similarity calculation between microblog texts for clustering analysis and topic recommendation, builds a dataset from microblog text in the open system, and analyzes the clustering quality and the microblog recommendation effect of several algorithms. The node expansion rates under different data sets show that the average expansion rate of the P-K-means algorithm reaches 0.89, which is 0.24, 0.11 and 0.12 higher than that of PAM, CLARA and the original K-means algorithm, respectively, so the parallel K-means algorithm has the highest efficiency. The clustering quality analysis shows that the average clustering quality of the parallel K-means algorithm reaches 1.24, which is 0.48, 0.42 and 0.28 higher than that of the other algorithms, respectively. The parallel K-means algorithm has a clustering time of 1.27 minutes; compared with the other algorithms, the average clustering efficiency is improved by 88.2%, 80.2% and 51.3%, respectively, and the clustering stability is also the best. The evaluation of Weibo topic recommendation shows that the P-K-means algorithm has the best recommendation accuracy, 95.64%, the highest user satisfaction, and the best recommendation effect. The ROC curves show that the AUC value of the parallel K-means algorithm is 0.8853, which is significantly higher than that of the other algorithms.
These results indicate that parallelization can improve the mining quality of a clustering algorithm, and that the parallel K-means algorithm based on MapReduce can effectively solve the problem of microblog topic classification and recommendation, save users time in finding topics, and help users understand social trends through topics. Although the research has made some progress, it still has shortcomings. It focuses on topic mining and recommendation for microblog text, whereas most users now publish information as a combination of pictures and text, for which the quality of topic mining with this method declines. Future work will combine techniques for pictures, videos and other media to carry out more comprehensive topic mining and make the recommended topics more accurate.

References

1. Shen, Liu, Han. Distrust prediction in signed social network. Chinese J Electron. 2019; 28(1): 188-194.

2. Lamprier, Gisselbrecht, Gallinari. Contextual bandits with hidden contexts: A focused data capture from social media streams. Data Min Knowl Disc. 2019; 33(6): 1853-1893.

3. Zhai. Information mining and visualization of highly cited papers on type-2 diabetes mellitus from ESI. Curr Sci. 2019; 116(12): 1965-1974.

4. Guo, Zhang. Measure user intimacy by mining maximum information transmission paths. Complexity. 2020; 2020: 2376451.

5. Zhang, Huang, Zhang. Information mining and similarity computation for semi-/un-structured sentences from the social data. Digit Commun Netw. 2020; 7(4): 518-525.

6. Lopez-Castroman, Moulahi, Azé, Bringay, Deninotti, Guillaume, et al. Mining social networks to improve suicide prevention: A scoping review. J Neurosci Res. 2020; 98(4): 616-625.

7. Huang, Yang, Zheng, Mumtaz. Recognizing influential nodes in social networks with controllability and observability. IEEE Internet Things. 2021; 8(8): 6197-6204.

8. Chen. Probabilistic graph model mining user affinity in social networks. Int J Web Serv Res. 2021; 18(3): 22-41.

9. Choi, Chung. Knowledge process of health big data using MapReduce-based associative mining. Pers Ubiquit Comput. 2020; 24(5): 571-581.

10. Rani, Pushpalatha. Generation of frequent sensor epochs using efficient parallel distributed mining algorithm in large IOT. Comput Commun. 2019; 148(9): 107-114.

11. Laccetti, Lapegna, Mele, Romano, Szustak. Performance enhancement of a dynamic K-means algorithm through a parallel adaptive strategy on multicore CPUs. J Parallel Distr Com. 2020; 145: 34-41.

12. Zhao, Chao, Zhang, Chiclana, Viedma. An incremental method to detect communities in dynamic evolving social networks. Knowl-Based Syst. 2019; 163(1): 404-415.

13. Guo, Zhang. Mining structural influence to analyze relationships in social network. Physica A. 2019; 523: 301-309.

14. Dokuz, Celik. Cloud computing-based socially important locations discovery on social media big datasets. Int J Inf Tech Decis. 2020; 19(2): 469-497.

15. Zhao, Lee, Sun. The use of social media to promote sustainable fashion and benefit communications: A data-mining approach. Sustainability. 2022; 14(3): Article Number: 1178.

16. Vikatos, Gryllos, Makris. Marketing campaign targeting using bridge extraction in multiplex social network. Artif Intell Rev. 2020; 53(1): 703-724.

17. Yang, Zhang, Bai. MTGK: Multi-source cross-network node classification via transferable graph knowledge. Inform Sciences. 2022; 589(1): 395-415.

18. Suh. SocialTERM-extractor: Identifying and predicting social-problem-specific key noun terms from a large number of online news articles using text mining and machine learning techniques. Sustainability. 2019; 11(1): Article Number: 196.
