Abstract
The rapid development of social networks has made it easier for users to receive information. Microblogging, a communication platform in daily use, carries an enormous volume of data. To address the low efficiency and poor clustering effect of the K-means algorithm, this study investigates a parallel K-means clustering algorithm based on the MapReduce model. To ease the difficulty of calculating the similarity of microblog topic texts, the vector space model and semantic similarity are used to calculate the similarity between texts and improve the quality of microblog text classification. The data expansion rate of corresponding nodes on different data sets shows that the average expansion rate of the parallel K-means algorithm reaches 0.89, and its running speed is the highest. The results also show that the parallel K-means algorithm has good clustering stability and the highest clustering quality, 1.24, together with the shortest clustering time, averaging 1.27 minutes, so its clustering effect and efficiency are the best. In the quality analysis of Weibo topic recommendation, the parallel K-means algorithm (P-K-means) reaches a recommendation accuracy of 95.64% and a user satisfaction of 98.64%, and its recommendation effect is also the best. These results indicate that the MapReduce-based parallel K-means clustering algorithm performs best in microblog topic mining and recommendation, can efficiently recommend topics of interest to users, and can enhance their microblogging experience.
Introduction
With the development of technology, “knowing the world without going out” has become the norm. Social networks shorten the distance between people and attract the attention of many users [1]. Weibo is a mainstream social networking site in China; its users share content, find like-minded partners through topics, and follow social trends through popular topics. Intelligent and effective topic classification and recommendation for Weibo is therefore very important [2]. As the number of microblog users grows, the volume of information and data has increased rapidly. To classify microblog topics accurately, data mining techniques that can analyze the hidden information in data have come into use. K-means clustering gathers similar topics and divides them into multiple clusters, which makes it much easier for users to find topics of interest [3]. However, the traditional K-means clustering algorithm selects its initial cluster centers at random, so its clustering effect is poor. At the same time, growing data volumes slow the algorithm down, and improving its performance has become a current research focus. Cloud computing technology has attracted attention for its ability to process massive data efficiently. MapReduce, one of the core technologies of cloud computing, offers a simple and abstract programming model that is convenient to use and easy to combine with other algorithms. Applying MapReduce to parallelize an algorithm can greatly improve its efficiency [4]. Therefore, we study a parallel K-means clustering algorithm based on the MapReduce model to classify Weibo topic texts. At the same time, to address the difficulty of text similarity calculation during clustering, we use the vector space model and semantic similarity to calculate the similarity between texts.
It is expected that the algorithm studied can improve the classification efficiency of microblog text, improve the accuracy of microblog topic recommendation, and achieve high-quality microblog topic recommendation.
Related work
With the popularity and development of social networks, the scale of information data has increased, and many researchers have used various methods to analyze the latent information in social networks. Zhang et al. [5] studied a sentence similarity model for analyzing the similarity of popular topics in social media. To improve the accuracy of the model, a sentence similarity calculation method was used to extract the semantic information of sentences. Experiments on a semantic corpus verified that the model has practical value and that its average deviation is better than that of other methods. Lopez-Castroman et al. [6] found that online bullying and internet addiction in social networks increase suicidal behavior. They used data mining to identify network risk patterns and detect at-risk groups, and applied the results in website applications so that targeted intervention can be carried out when users have suicidal thoughts. Huang et al. [7] addressed the uncertainty in locating influential information nodes in big data by studying an intelligent method that automatically identifies the influence of information nodes, and tested its ability to observe network information and spread information. Experiments showed that the proposed method can effectively identify influential nodes and detect the importance of spam-sending nodes in mailboxes, and that it outperforms the baseline method. Su et al. [8] studied similarity discovery among social users. To recommend social topics effectively and infer user relationships, they built a Bayesian network model that combines network topology and user relationships. Simulation tests showed that the algorithm improves the accuracy of similarity discovery between social users and can effectively analyze social data, which gives it practical value.
Choi and Chung [9] proposed a big data knowledge process for association mining using Hadoop’s MapReduce software. The process supports efficient management of health knowledge services by using public data to collect and process abnormal information, and it uses MapReduce and the extracted association rules to create a knowledge base, which enhances technical value and intelligence efficiency and provides an effective basis for healthy living. Rani and Pushpalatha [10] found that the MapReduce framework can be used to mine sensor data, and proposed a highly efficient, vertically partitioned, parallel and distributed algorithm to improve mining performance. The concept of overlapping windows is implemented in the MapReduce framework to reduce communication overhead. The results show improvements in both the execution time and the output of the algorithm. Laccetti et al. [11] studied the K-means algorithm and found that it can effectively find similarities in large-scale data. To address its problems, namely that the K value cannot be defined accurately and that execution efficiency is low, they introduced an adaptive parameter that allows the K value to be defined dynamically and proposed a parallelization strategy for multi-core CPUs. Experimental analysis shows that the running cost of the adaptive K-means algorithm in a multi-core environment is significantly reduced, and that its accuracy and running speed are significantly better than those of the comparison algorithms.
A review of the research results of domestic and foreign scholars shows that clustering algorithms perform well in analyzing social network information, and that parallelized algorithms based on the MapReduce model are particularly prominent. This study therefore uses a data mining algorithm to analyze the topics of microblogs in social networks. To address the shortcomings of the K-means algorithm, a parallel K-means clustering algorithm based on the MapReduce model is proposed. It is expected that this algorithm can effectively analyze the interests of users and accurately recommend relevant topics.
Parallel K-means clustering algorithm based on MapReduce for mining microblog topics
Parallel MapReduce-based K-means clustering algorithm
With the in-depth application of the Internet and the widespread use of personal terminal devices, the volume of network data keeps rising. Faced with large-scale data, traditional computing models can no longer meet the demand for data processing speed. The MapReduce computing model uses “Map” and “Reduce” functions to partition the data and process it efficiently; it handles low-level details such as fault tolerance so that users can concentrate on the computation itself, which improves working efficiency [12]. The MapReduce model is often used to process large-scale data. It divides a task into multiple subtasks, which are distributed across the nodes of the cluster and executed simultaneously, and the output of each node is then consolidated according to rules. The process by which MapReduce handles data is shown in Fig. 1. After formatting, the original data is passed to the Map module as key-value pairs. After the Map calculation, closely related key-value pairs are combined and used as the input of the Reduce module. After analysis and processing by the Reduce module, a set of output key-value pairs is finally generated.
Running process of MapReduce model.
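The key-value flow described above can be illustrated with a small single-process sketch. It is not Hadoop code and is not taken from the study; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen only to show how Map output is grouped by key before Reduce.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit one (word, 1) key-value pair for every word in a text record."""
    return [(word, 1) for word in record.split()]


def shuffle(pairs):
    """Group the values of identical keys, as happens between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single output pair."""
    return key, sum(values)


def run_job(records):
    """Run Map on every record, group by key, then run Reduce on every group."""
    intermediate = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input records, used only for illustration.
counts = run_job(["weibo topic mining", "topic recommendation", "weibo topic"])
print(counts)
```

In a real cluster each record batch would be mapped on a different node; here the loop over `records` stands in for that distribution.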
Social networks generate a large amount of data that contains much hidden information. Mining this hidden information helps to provide users with effective search tags and to find like-minded partners. As the saying goes, “birds of a feather flock together”, and classification problems are common in daily life. The process of grouping objects according to some measure of similarity is called clustering, and each resulting collection of data objects is called a cluster. Every cluster follows the rule that “objects within the cluster are similar, but objects between clusters are different” [13]. The K-means clustering algorithm is an unsupervised learning algorithm that partitions the data into a specified number of clusters. By iteratively updating the cluster centers, it drives the clustering objective function to its minimum value. Suppose the set of clustering samples is denoted as
In Eq. (1),
In Eq. (2),
In Eq. (3),
Execution flow of K-means clustering algorithm.
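The iteration described above can be sketched in a few lines of plain Python. This is the textbook form of K-means (Euclidean distance, arithmetic-mean centre update, sum of squared errors as objective) and is only assumed to match the study's Eqs. (1) to (3); the initial centres are passed in explicitly rather than chosen at random, and all function names are illustrative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(samples, centers):
    """Return, for every sample, the index of its nearest cluster centre."""
    return [min(range(len(centers)), key=lambda j: euclidean(s, centers[j]))
            for s in samples]


def update(samples, labels, centers):
    """Recompute every centre as the arithmetic mean of its members."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [s for s, label in zip(samples, labels) if label == j]
        if not members:
            new_centers.append(old)  # an empty cluster keeps its old centre
            continue
        new_centers.append(tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(len(old))))
    return new_centers


def sse(samples, labels, centers):
    """Objective: sum of squared distances of samples to their own centre."""
    return sum(euclidean(s, centers[label]) ** 2
               for s, label in zip(samples, labels))


def kmeans(samples, centers, max_iter=100):
    """Alternate assignment and update until the centres stop changing."""
    centers = [tuple(c) for c in centers]
    labels = assign(samples, centers)
    for _ in range(max_iter):
        new_centers = update(samples, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
        labels = assign(samples, centers)
    return centers, labels


# Hypothetical two-dimensional samples, used only for illustration.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
final_centers, final_labels = kmeans(points, [(1.0, 1.0), (9.0, 8.0)])
print(final_centers, final_labels, sse(points, final_labels, final_centers))
```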
As can be seen from Fig. 2, each iteration of the K-means algorithm requires two calculations: the arithmetic mean of each cluster, and the distance between each sample and the cluster centres. The Euclidean distance of a sample depends only on that sample and the current cluster centres, so the data can be divided into multiple subsets, and the samples of each subset can be assigned to clusters independently and in parallel. The arithmetic mean of a cluster can likewise be computed in parts: after each subset has been assigned, the partial sums of the cluster members in the individual subsets are calculated, and these are then aggregated to obtain the new cluster centres. The process repeats until the objective function reaches its minimum. The formula for the parallelised K-means algorithm is shown in Eq. (4).
In Eq. (4),
In Eq. (5),
The average weighted sum of the samples within a cluster is calculated by
The corresponding equations for num, center and radius can be calculated from
To implement the parallel K-means clustering algorithm on the MapReduce model, the K-means algorithm is decomposed into three functions: Map, Combine and Reduce. The Map function calculates the distance from each sample to the cluster centres and assigns the sample to a cluster according to that distance. The Combine function merges the key-value pairs generated by Map and passes them as input to the Reduce function, which reduces the computation required in Reduce. The Reduce function calculates the weighted average within each cluster, which is used to update the cluster centre. These steps repeat until the objective function converges. The running process is shown in Fig. 3.
Running chart of parallel K-means algorithm based on MapReduce.
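One iteration of the Map, Combine and Reduce decomposition can be sketched as follows. This is a single-process illustration under stated assumptions, not the study's implementation: each inner list of `splits` stands in for the data held by one node, the partial result is a (vector sum, count) pair, and the function names are hypothetical. The radius statistic mentioned in the text is not modelled.

```python
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance; enough for finding the nearest centre."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_map(sample, centers):
    """Map: emit (index of nearest centre, (sample, 1))."""
    nearest = min(range(len(centers)), key=lambda j: sq_dist(sample, centers[j]))
    return nearest, (sample, 1)


def kmeans_combine(pairs):
    """Combine: within one split, add up vectors and counts per cluster."""
    partial = {}
    for key, (vec, count) in pairs:
        if key in partial:
            acc, n = partial[key]
            partial[key] = (tuple(a + v for a, v in zip(acc, vec)), n + count)
        else:
            partial[key] = (tuple(vec), count)
    return list(partial.items())


def kmeans_reduce(key, partials):
    """Reduce: merge the partial sums of all splits into one new centre."""
    total, n = None, 0
    for vec, count in partials:
        total = vec if total is None else tuple(t + v for t, v in zip(total, vec))
        n += count
    return key, tuple(t / n for t in total)


def one_iteration(splits, centers):
    """Run Map and Combine per split, then Reduce per cluster."""
    grouped = defaultdict(list)
    for split in splits:
        mapped = [kmeans_map(sample, centers) for sample in split]
        for key, partial in kmeans_combine(mapped):
            grouped[key].append(partial)
    new_centers = list(centers)  # clusters that received no sample keep their centre
    for key, partials in grouped.items():
        new_centers[key] = kmeans_reduce(key, partials)[1]
    return new_centers


# Hypothetical data spread over two splits, used only for illustration.
splits = [[(1.0, 1.0), (8.0, 8.0)], [(1.0, 2.0), (9.0, 8.0)]]
print(one_iteration(splits, [(1.0, 1.0), (9.0, 8.0)]))
```

Because Combine sends only one (sum, count) pair per cluster and split, the amount of data passed to Reduce no longer depends on the number of samples, which is the saving the text attributes to the Combine step.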
With the rapid development of information technology, the Internet plays a growing role in daily life, and social networks, which are especially popular among young people, continue to expand. Microblog is currently the main social network in China; its active users are growing year by year, and people’s lives have become inseparable from social networks [15]. As social network data keeps growing, problems such as irregular wording and ambiguous topics have emerged, which hinder users from searching for topics of interest or popular topics. To help users mine the content they need from this massive volume of data, the study uses the parallel K-means clustering algorithm on the MapReduce model to mine popular topics on Weibo and make recommendations. Microblog posts are limited in length, generally to no more than 140 characters, and tend to be colloquial, which makes it difficult to extract text keywords. At the same time, popular microblog topics change quickly, so information mining must be efficient enough to keep up [16]. The process of microblog topic mining and recommendation is shown in Fig. 4. The latest microblog text is obtained through the microblog open system, the text is segmented into words, keywords are selected and their weights are calculated, the text is divided and categorised using the parallel K-means clustering algorithm, and finally the range is narrowed down to recommend microblog topics.
Weibo topic mining and recommendation process.
Representative data samples improve the quality of data mining. They are generally obtained through the microblogging open platform or with web crawlers. The raw data contains information that is not needed for dividing clusters, so the data samples must be pre-processed before the microblog text is clustered [17]. First, useless characters are removed and valid text is extracted using regular expressions, which reduces clustering time and improves efficiency. The processed text is then segmented into words. Because Chinese is particularly prone to synonymy and ambiguity, corpus matching is usually used for Chinese word segmentation. In addition, effective feature selection reduces the computation time of clustering and improves its effect: weights are calculated for the features to assess how important each word is to the text. Computing directly on text is difficult, so the text is transformed into a corresponding mathematical model in which text calculations become algebraic operations in a vector space, which greatly improves the efficiency of selecting keywords [18]. In the vector space model, TF-IDF (term frequency-inverse document frequency) measures the importance of a term to a text and can quickly identify keywords, while cosine similarity measures the similarity between texts. TF-IDF calculates the weight of a term in two parts. The first is the frequency TF with which the term occurs in the text, calculated as shown in Eq. (9).
In Eq. (9),
The
Cosine similarity uses the cosine of the angle between vectors to measure the similarity between texts. Let
Text can be operated on algebraically using the vector space model. However, natural language contains many synonyms and near-synonyms, so similarity computed directly from text vectors is often less than ideal, and semantic similarity must also be calculated to improve the quality of data mining.
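The vector-space part of the similarity calculation, TF-IDF weighting followed by cosine similarity, can be sketched as below. The exact TF and IDF formulas of the study are not reproduced here; the sketch assumes one common variant (TF as relative frequency, IDF as the natural logarithm of the number of texts divided by the number of texts containing the term), and the tokenised texts and function names are hypothetical. It also shows the limitation noted above: texts that share no term get similarity 0 even if their meanings are related.

```python
import math
from collections import Counter


def tf(tokens):
    """Term frequency: share of the text's tokens taken up by each term."""
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


def idf(corpus):
    """Inverse document frequency: log(N / number of texts containing the term)."""
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(len(corpus) / df) for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf_table, vocab):
    """Represent one text as a TF-IDF weight vector over a fixed vocabulary."""
    weights = tf(tokens)
    return [weights.get(term, 0.0) * idf_table[term] for term in vocab]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical, already segmented texts, used only for illustration.
corpus = [["weibo", "topic", "mining"],
          ["weibo", "topic", "recommendation"],
          ["movie", "review"]]
idf_table = idf(corpus)
vocab = sorted(idf_table)
vectors = [tfidf_vector(doc, idf_table, vocab) for doc in corpus]
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```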
In Eq. (13),
In Eq. (14),
In Eq. (15),
In Eq. (16),
Performance analysis of parallel K-means algorithm based on MapReduce
To test the clustering effect of the proposed parallel K-means clustering algorithm, it is compared with the classical clustering algorithms PAM (Partitioning Around Medoids) and CLARA (Clustering LARge Applications) and with the original K-means algorithm. The adjustment parameters in the semantic similarity formula are determined according to an empirical formula, and a Hadoop cluster is built in a simulation laboratory for the experiments. The experimental software and hardware configuration and the parameter settings are shown in Table 1.
Experimental software and hardware configuration and parameter setting
Recommended Weibo topic process.
The operating efficiency of corresponding nodes on different data sets was tested in the MapReduce model, and the resulting data expansion rates of the algorithms are shown in Fig. 6.
Comparison chart of data expansion rate.
It can be seen from Fig. 6 that, as the data set grows, the expansion rate at the corresponding nodes shows an overall downward trend. All four algorithms have their lowest expansion rate at node 10, where the minimum values of PAM, CLARA, the original K-means (O-K-means) and the parallel K-means (P-K-means) are 0.43, 0.61, 0.63 and 0.81 respectively. Their average expansion rates are 0.65, 0.78, 0.77 and 0.89 respectively. P-K-means has the highest expansion rate, with an average that is 0.24, 0.11 and 0.12 higher than those of PAM, CLARA and O-K-means respectively. This shows that the MapReduce-based parallel K-means algorithm proposed in the study has the best running speed and can better handle huge data volumes.
The clustering quality evaluation index is used to evaluate the clustering effect of the four algorithms, and the clustering quality results are shown in Fig. 7.
Clustering quality comparison chart.
As can be seen from Fig. 7, the average clustering quality curve of the CLARA algorithm fluctuates widely, from 0.57 to 1.09; its clustering stability is poor, and its average clustering quality is 0.82. The original K-means algorithm has better quality than the PAM algorithm: its average clustering quality is 0.96, against 0.76 for PAM. The parallel K-means algorithm has the best clustering quality, fluctuating within 1.16–1.31, and good stability. Its average clustering quality reaches 1.24, which is 0.28, 0.42 and 0.48 higher than that of the original K-means, CLARA and PAM algorithms respectively. Under the same conditions, the time consumed by clustering is measured, and the resulting comparison of clustering times is shown in Fig. 8.
It can be seen from Fig. 8 that PAM takes the longest to cluster, with an average clustering time of 10.83 min that rises as the data set grows. The average clustering time of CLARA is 6.14 min; its time consumption is lower, but its clustering stability is poor. The clustering time of O-K-means is 2.61 min, its curve fluctuates less, and its clustering speed is improved. The clustering time of the P-K-means algorithm is more stable than that of the other algorithms and is 1.27 min. Compared with the PAM, CLARA and original K-means algorithms, its average clustering efficiency is improved by 88.2%, 80.2% and 51.3% respectively. The improvement over the original K-means algorithm indicates that parallel technology can effectively improve the performance of the algorithm.
To test the clustering and recommendation effect of the parallel K-means clustering algorithm on microblog topics, the microblog texts of 50 users were collected, topics were clustered and recommended using each clustering algorithm, and the algorithms were evaluated on recommendation accuracy, root mean square error (RMSE), user satisfaction and other indicators. The specific results are shown in Table 2.
Experimental software and hardware configuration and parameter
Clustering time comparison chart.
ROC curve.
As can be seen from Table 2, the P-K-means algorithm has the highest accuracy of microblog topic recommendation, 95.64%, while PAM has the lowest, 85.47%. The larger the RMSE in Table 2, the worse the recommendation effect of the algorithm, so the P-K-means algorithm, which has the smallest RMSE, has the best recommendation effect. In terms of user satisfaction, P-K-means also performs best, at 98.64%. These data show the superiority of the P-K-means algorithm. The ROC curves of the four algorithms, obtained by calculating the recall rate and accuracy of the recommended microblog topics, are shown in Fig. 9.
From the ROC curves in Fig. 9, it can be seen that the parallel K-means algorithm proposed in the study recommends microblog topics best, with the highest recall and accuracy. Calculating the area under each curve, the AUC values obtained for the parallel K-means algorithm, the original K-means algorithm, the CLARA algorithm and the PAM algorithm were 0.8853, 0.8316, 0.8237 and The calculated results also show that the proposed algorithm achieves the best accuracy and performance in recommending microblog topics, which is closely related to the clustering effect of the algorithms.
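The area calculation mentioned above can be sketched with the trapezoidal rule. The study does not state how its AUC values were computed, so this is only one common approach; the curve points are hypothetical (false positive rate, true positive rate) pairs and are not the data behind Fig. 9.

```python
def auc_trapezoid(points):
    """Area under a curve given as (x, y) points, using the trapezoidal rule."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Each pair of neighbouring points contributes one trapezoid.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical ROC points, used only for illustration.
roc_points = [(0.0, 0.0), (0.2, 0.6), (0.5, 0.9), (1.0, 1.0)]
print(auc_trapezoid(roc_points))
```

A curve that coincides with the diagonal gives an area of 0.5, so values closer to 1 indicate a better ranking of recommended topics.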
As the information age continues to develop, mining the hidden information in social networks to provide convenience for users plays a positive role. This study implements a parallel K-means clustering algorithm on the MapReduce model, combines it with similarity calculation between microblog texts for clustering analysis and topic recommendation, builds a data set from microblog texts in the open system, and analyzes the clustering quality of several algorithms and their effect on microblog recommendation. The node expansion rates on different data sets show that the average expansion rate of the P-K-means algorithm reaches 0.89, which is 0.24, 0.11 and 0.12 higher than that of PAM, CLARA and the original K-means algorithm respectively, so the parallel K-means algorithm has the highest efficiency. The clustering quality analysis shows that the average clustering quality of the parallel K-means algorithm reaches 1.24, which is 0.48, 0.42 and 0.28 higher than that of PAM, CLARA and the original K-means algorithm respectively. The parallel K-means algorithm has a clustering time of 1.27 minutes. Compared with the other comparison algorithms, its average clustering efficiency is improved by 88.2%, 80.2% and 51.3% respectively, and its clustering stability is also the best. The evaluation of Weibo topic recommendation shows that the P-K-means algorithm has the best recommendation accuracy, 95.64%, the highest user satisfaction, and the best recommendation effect. The ROC curve results show that the AUC value of the parallel K-means algorithm is 0.8853, which is significantly higher than that of the other comparison algorithms.
These results show that parallel technology can improve the mining quality of clustering algorithms, and that the MapReduce-based parallel K-means algorithm can effectively solve the problem of topic classification and recommendation for microblogs, save users time in finding topics, and help them understand social trends through topics. Although the research has made some achievements, it still has shortcomings. It focuses mainly on topic mining and recommendation for microblog text, whereas most users now publish information as a combination of pictures and text, and for such posts the quality of topic mining with this method declines. In the future, more comprehensive topic mining will be carried out by incorporating pictures, videos and related technologies to make the recommended topics more accurate.
