Abstract
OBJECTIVE:
With Sina Weibo data as the background, support vector machine (SVM) and k-nearest neighbor (KNN) methods are used to predict and analyze users’ micro-blog emotions and related behavior in social networks, with the aim of extracting potential business value.
METHODS:
First, the Sina Weibo API is utilized to obtain the information of Sina Weibo users; then, Microsoft Excel is utilized to sort and analyze the extracted data and to extract the features of the micro-blogs posted by users. Second, the outputs of the SVM and KNN algorithms are combined by a weighted average to form a hybrid multi-classifier model, the Mixed Classifier Emotion Prediction Model (MCEPM). Finally, using the evaluation criteria of precision (P), recall rate (R), and harmonic mean (F1), the experimental results of SVM and KNN under different weight coefficients are compared with the prediction results of MCEPM.
RESULTS:
The prediction effect of MCEPM is associated with the weight coefficients of SVM and KNN. If the weight coefficients of SVM and KNN are 0.6 and 0.4, the prediction effect of MCEPM will be optimal. Comprehensive analysis shows that the MCEPM model can balance the prediction results of the positive and negative samples of the two classifiers.
CONCLUSION:
The MCEPM outperforms the individual SVM and KNN classifiers in micro-blog emotion prediction, which can help enterprises analyze users’ product inclinations and provide enterprises with accurate information on customer service requirements.
Introduction
The popularity of the Internet and the mass demand for social activities have gradually brought social network media into people’s daily lives, and micro-blogs have developed rapidly as a result. As of the third quarter of 2019, Sina Weibo (a micro-blog platform) had 216 million daily active users and 497 million monthly active users, with daily visits reaching tens of billions [1]. In August 2009, Sina Corporation launched the internal beta version of “Sina Weibo”, the Chinese counterpart of “Twitter”. Since then, Sina Weibo has entered the lives of Chinese people, and more and more people have begun posting micro-blogs on it [2]. Users can post messages or upload pictures through web pages, WAP pages, and mobile applications [3]. The “one-sentence blog” is a vivid description of Sina Weibo. Initially, posts were limited to 140 Chinese characters; now they can contain up to 2000 Chinese characters. Posts can include users’ feelings, news they have learned, or interesting information and pictures, shared anytime and anywhere [4]. In addition, users can follow other users, visit their Weibo home pages, and browse their posts [5]. The biggest highlight of Sina Weibo is that it provides a public API, which greatly facilitates research and makes studies based on Sina Weibo more authentic [6].
Taking Sina Weibo data as the research object, the proposed MCEPM integrates the advantages of the SVM and KNN algorithms to mitigate the effects of missing key data, uneven sample distribution, differing numbers of samples per category, and strong randomness on classification. The weighted average of the two classifiers’ outputs is calculated, and a model that balances the prediction of positive and negative samples is obtained through a reasonable distribution of the weight coefficients. The model predicts the emotion of Sina Weibo users and uses emotion score indicators to quantify the data, which can help enterprises analyze consumer feedback on products or detect negative reviews online, thereby improving their products, services, and marketing capabilities.
Literature review
In recent years, more and more scholars have turned to big data analysis, and micro-blog services have attracted growing attention [7]. Social network research started relatively early outside China. Rahman et al. extracted and visualized Twitter data and proposed community identification and interactive community visualization based on a new community detection algorithm, Tribase. This method can directly reveal the community structure and relevant features [8]. Research on social networks in China started relatively late, but as the Internet has developed and micro-blogs have become more popular, more and more Chinese scholars have devoted themselves to this area. Zheng et al. used real-time, searchable online social networking sites for tourism information to obtain users’ tourism-related search behavior [9]. Yu et al. studied user interaction models and behavior preference prediction in social networks [10].
A micro-blog is a kind of broadcast social medium, built on user relationships, for sharing, disseminating, and acquiring information; it spreads short real-time messages and hot events through the follow (attention) mechanism. The number of Sina Weibo users is growing. Users’ information and behaviors are stored as big data, but this valuable data is easily ignored by the public [11]. Based on micro-blogs, Yousef et al. studied the role of ego networks in online social networks and information dissemination, focusing on cases in which information appears many times but its dissemination through user forwarding and similar behaviors is poorly captured, and proposed a more user-oriented and reasonable method for analyzing interactive forwarding and comments. Finally, they developed a verification platform for information dissemination analysis [12]. Yao et al. analyzed the behavior of Sina Weibo online users and used implicit and explicit classification methods to predict user attributes [13].
At present, the stock of microblog has reached hundreds of billions, among which many data are text information, which is an important communication carrier. Mining and analyzing the text data will obtain rich potential business value [14].
Methods
Principle of SVM algorithm
SVM is a two-class classification model. Its basic form is a linear classifier defined on the feature space. The learning strategy is to maximize the margin, and the decision boundary is the maximum-margin hyperplane fitted to the learning samples; this hyperplane separates the data [15]. SVM covers both linearly separable and linearly inseparable cases; this study introduces the basic operating principles of SVM for linearly separable problems, as shown in Fig. 1.
Schematic diagram of support vector machine linear classification.
Figure 1 illustrates the basic operating principles of SVM. The squares and circles in the figure represent the two classes of samples to be classified. The solid line is the classification line. Each of the two dashed lines passes through the samples of one class that are closest to the classification line and is parallel to it. The distance from a dashed line to the solid line is defined as the classification interval. The ultimate goal is to find the optimal classification line, which correctly separates the two classes and maximizes the classification interval.
The dataset to be classified in this study is assumed as (
Where:
It is supposed that the distance from the vector to the hyperplane is
Maximizing
A Lagrangian function is constructed. The Lagrangian multiplier method is used to find an optimal solution
Then,
The resulting optimal decision function is:
In the above equations,
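For reference, the standard hard-margin formulation of a linearly separable SVM, which the derivation above appears to follow, can be sketched as below. The symbols used here ($x_i$, $y_i$, $w$, $b$, $\alpha_i$) are the conventional textbook choices and are an assumption, not necessarily the notation of the original equations.

```latex
% Training set and separating hyperplane (notation assumed)
\begin{align}
  &(x_i, y_i),\quad i = 1,\dots,n,\quad x_i \in \mathbb{R}^d,\quad y_i \in \{+1,-1\} \\
  &w \cdot x + b = 0 \\
  % Distance from a sample to the hyperplane
  &d_i = \frac{\lvert w \cdot x_i + b \rvert}{\lVert w \rVert} \\
  % Maximizing the margin 2/||w|| is equivalent to minimizing ||w||^2 / 2
  &\min_{w,b}\ \frac{1}{2}\lVert w \rVert^2
   \quad \text{s.t.}\quad y_i\,(w \cdot x_i + b) \ge 1,\quad i = 1,\dots,n \\
  % Lagrangian with multipliers alpha_i >= 0
  &L(w,b,\alpha) = \frac{1}{2}\lVert w \rVert^2
   - \sum_{i=1}^{n} \alpha_i \bigl[\, y_i\,(w \cdot x_i + b) - 1 \,\bigr] \\
  % Optimal parameters; x_j is any support vector (alpha_j^* > 0)
  &w^{*} = \sum_{i=1}^{n} \alpha_i^{*} y_i x_i, \qquad
   b^{*} = y_j - \sum_{i=1}^{n} \alpha_i^{*} y_i\,(x_i \cdot x_j) \\
  % Decision function
  &f(x) = \operatorname{sgn}\!\Bigl( \sum_{i=1}^{n} \alpha_i^{*} y_i\,(x_i \cdot x) + b^{*} \Bigr)
\end{align}
```

Only the samples with $\alpha_i^{*} > 0$ (the support vectors lying on the dashed lines of Fig. 1) contribute to the decision function.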
KNN is a commonly used method for text classification. It is also often used in the field of data mining. KNN is a lazy learning algorithm. It does not need to train the model. Whenever a new sample arrives, it compares the new sample with the training data set and finds the k nearest neighbors closest to the new sample in the training set [16]. The statistical method is used to select the
It is assumed that
The implementation of the KNN algorithm is shown in Fig. 2. In the KNN algorithm, the value of the parameter
Flowchart of KNN algorithm.
The KNN algorithm is simple and easy to understand and implement, which can be used for classification, regression, and non-linear classification. The training time complexity is 0
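The lazy-learning behavior described above can be illustrated with a minimal sketch. It assumes Euclidean distance and a simple majority vote among the k nearest neighbors; the function name `knn_predict` and the toy two-dimensional data are hypothetical and are not taken from the study.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # No training step: distances are computed only when a query arrives.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k training samples closest to the query.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most frequent label among the neighbors is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two clusters of feature vectors.
points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
labels = ["negative", "negative", "positive", "positive", "positive"]
prediction = knn_predict(points, labels, (0.95, 0.9), k=3)
```

With two classes, an odd k avoids tied votes, which is one practical reason the choice of k matters.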
Sina Weibo data can be collected mainly in two ways: through the API or through web crawlers. Sina Weibo provides a free and open platform and also provides a Java SDK installation package. Users can obtain information directly by applying for API access. The first step is to register as a developer, create a site application, fill in the required information, and submit it for approval. If the application is approved, the API becomes accessible. Otherwise, the reason for the rejection must be identified, and the application must be modified accordingly and resubmitted to the platform for review. Second, Sina Weibo users authorize the registered application through OAuth 2.0, after which a large amount of data can be obtained. A web crawler is a program or script that automatically obtains information from the Internet in accordance with certain rules. It can quickly and automatically collect the content it can access; its purpose is to get or update some of the content and retrieval methods of these websites. However, the requirements for data acquisition in this study are not very high, and there is no need to obtain a large amount of data in a short period of time. Therefore, the API provided by Sina Weibo is mainly used to extract data on users’ followers, the accounts that users follow, and hot topics. Finally, the obtained data are organized with Microsoft Excel. Manual cross-labeling is applied to the sorted 54,781 pieces of micro-blog data, i.e., three people receive the same micro-blog text content for emotional tagging. The tagging results with clear emotions are selected for this study. The resulting 25,000 pieces of micro-blog data with emotions are used as the experimental data in this study.
The experimental data are divided into the training set and the test set, in which 80% of the data are used as training set data, and 20% of the data are used as test set data. Based on the data extraction of Sina Weibo API, the detailed follower information, following information, and hot topics can be directly obtained. These micro-blog data can be organized and analyzed to provide a theoretical basis for the research on micro-blog emotion prediction.
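The 80/20 division described above could look roughly like the following sketch. Shuffling before splitting and the fixed seed are assumptions made here for reproducibility (the text does not say how samples were assigned), and the placeholder records stand in for the labelled micro-blogs.

```python
import random


def split_train_test(samples, train_ratio=0.8, seed=42):
    """Shuffle the samples and split them into training and test sets."""
    shuffled = list(samples)
    # Assumption: a seeded shuffle, so the split is random but repeatable.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Placeholder data: 25,000 (text, label) pairs with labels +1 / -1.
labelled = [(f"post {i}", 1 if i % 2 == 0 else -1) for i in range(25000)]
train_set, test_set = split_train_test(labelled)
```

For 25,000 labelled posts this yields 20,000 training samples and 5,000 test samples.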
Building an emotion prediction model of micro-blogs
In this study, two classification algorithms, SVM and KNN, are used to establish the Mixed Classifier Emotion Prediction Model (MCEPM). Figure 3 shows the basic principles of MCEPM. First, the corresponding features are extracted from the database that stores the data, and the extracted features are formatted. For example, the micro-blog features include the basic information of the user and the content information of the micro-blog. Then, the micro-blog text is folded. The appropriate classifier is selected, with different classification algorithms chosen according to different contents. Simulation training is performed according to the selected algorithm, and weights are assigned to each classifier. Finally, the results on the test data are used to analyze the features associated with the corresponding problem and to produce the prediction.
Prediction algorithm flow of MCEPM model.
Several representative micro-blog features are extracted from the collated data. The Chinese lexical analysis system ICTCLAS, developed by the Institute of Computing Technology of the Chinese Academy of Sciences, is applied to segment the micro-blog text. In addition, the “emotion analysis words” of China National Knowledge Infrastructure (CNKI), the Chinese affective polarity dictionary NTUSD of National Taiwan University, and the Chinese affective vocabulary ontology library of the Information Retrieval Laboratory of Dalian University of Technology are utilized to find the positive and negative words contained in the micro-blogs.
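The dictionary lookup described above can be sketched as follows. The sketch assumes the text has already been segmented into tokens (the study uses ICTCLAS for this step); the two tiny word sets are hypothetical stand-ins for the merged emotional dictionaries, and `count_polarity_words` is an illustrative name.

```python
def count_polarity_words(tokens, positive_words, negative_words):
    """Count how many tokens appear in the positive and negative lexicons."""
    positive = sum(1 for token in tokens if token in positive_words)
    negative = sum(1 for token in tokens if token in negative_words)
    return positive, negative


# Hypothetical miniature lexicons (the real dictionaries are far larger).
POSITIVE = {"喜欢", "开心", "满意"}   # like, happy, satisfied
NEGATIVE = {"失望", "讨厌", "糟糕"}   # disappointed, dislike, terrible

# A micro-blog post after word segmentation (assumed segmenter output).
tokens = ["今天", "很", "开心", "但是", "服务", "糟糕", "失望"]
counts = count_polarity_words(tokens, POSITIVE, NEGATIVE)
```

Using sets for the lexicons keeps each lookup at constant time, which matters when the dictionaries hold tens of thousands of entries.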
Text classification performance evaluation criteria for micro-blogs
While researching the social networks, precision (P), recall rate (R), and harmonic mean F1 (F1-score) are often used as indicators for the performance evaluation of micro-blog text classification [17]. Figure 4 shows the confusion matrix of the evaluation criteria:
The confusion matrix.
As shown in Fig. 4,
The recall rate
The
Finally, the accuracy rate
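The evaluation criteria named in this section can be computed from the four cells of a confusion matrix. The sketch below uses the standard definitions of precision, recall, F1, and accuracy; the variable names (`tp`, `fp`, `fn`, `tn`) and the example counts are assumptions for illustration and do not come from the study's tables.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 (harmonic mean of the two)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def accuracy(tp, fp, fn, tn):
    """Fraction of all samples that were classified correctly."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Hypothetical counts: 80 true positives, 20 false positives,
# 10 false negatives, 90 true negatives.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
acc = accuracy(tp=80, fp=20, fn=10, tn=90)
```

The metrics for the negative class are obtained the same way by swapping the roles of the two classes in the confusion matrix.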
In MCEPM, the input data are first kept unchanged, and two different classifiers, i.e., SVM and KNN, are trained and used for prediction separately. The classification results output by each classifier are weighted using the Python programming language, and the final result is obtained by the weighted average method. The specific process is as follows:
(1) A sample is input; the classification results F1 (m) and F2 (m) are output according to the predictions of the SVM and KNN classifiers, respectively.
(2) The classification results F1 (m) and F2 (m) are weighted; their weighted average is calculated, and the prediction is made from the result.
(3) The prediction tendency of each classifier is calculated separately: tag * score.
(4) The weight of each classifier is applied to obtain its predicted propensity value: tag * score * weight.
(5) The predicted propensity values are normalized to balance the impact of the two classifiers.
(6) The final result F (m)
The tag indicates the predictive label, and the score indicates the degree of membership belonging to this category. If the final score obtained is larger, the harmonic mean will be larger.
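The weighting steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: tags are encoded as +1 (positive) and -1 (negative), the score is a membership degree in [0, 1], normalization divides by the total weight, and the final label is the sign of the combined value. The function name `mcepm_combine` is hypothetical.

```python
def mcepm_combine(svm_tag, svm_score, knn_tag, knn_score,
                  svm_weight=0.6, knn_weight=0.4):
    """Combine two classifier outputs by a weighted average of tag * score."""
    # Steps 3-4: propensity of each classifier is tag * score * weight.
    svm_propensity = svm_tag * svm_score * svm_weight
    knn_propensity = knn_tag * knn_score * knn_weight
    # Step 5 (assumed form): normalize by the total weight.
    total_weight = svm_weight + knn_weight
    combined = (svm_propensity + knn_propensity) / total_weight
    # Step 6 (assumed form): the sign of the combined value is the label.
    label = 1 if combined >= 0 else -1
    return label, combined


# SVM predicts positive with membership 0.7; KNN predicts negative with 0.9.
label, value = mcepm_combine(svm_tag=1, svm_score=0.7,
                             knn_tag=-1, knn_score=0.9)
```

In this example the two classifiers disagree, and the larger SVM weight tips the combined value slightly positive; with the weights swapped, the KNN prediction would prevail.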
Experimental environment
The experiments in this study are all performed on the same computer. The software used in the experiments includes Eclipse, PyDev, the sklearn (scikit-learn) toolkit, and Excel. The experimental platform runs the 64-bit Windows 10 operating system (flagship version) on an Intel Core i7-8750H 2.20 GHz CPU, with an NVIDIA GeForce RTX 2060 6 GB GPU and 16 GB of RAM. The programming language is Python 3.6.4.
Results
Results of micro-blog feature extraction
The statistics of positive and negative words contained in micro-blog content found based on the combination of ICTCLAS and three emotional dictionaries are shown in Table 1.
Statistics of emotional dictionary
As shown in Table 1, negative words (20,869) outnumber positive words (15,591) overall, for a total of 36,460 words. The main difference lies in the NTUSD dictionary, in which the number of negative words is more than three times that of positive words. Using ICTCLAS to segment micro-blog posts and combining the three emotional dictionaries can effectively separate negative and positive words in the data sample, providing a basis for the subsequent operation of the MCEPM.
In this study, based on the results of classifier training, the precision P, the recall rate R, and F1 of the positive and negative samples of the SVM and KNN classifiers are obtained. The results are shown in Table 2.
Comparison of experimental results of emotion classification based on SVM and KNN
Weight verification results of precision P.
Weight verification results of recall rate R.
Weight verification results of F1 value.
Table 2 shows that the accuracy
A large number of weighted-average verifications are performed on the above classification prediction results to obtain the optimal weight distribution. The verification starts from the upper range of SVM weights and proceeds from large to small. The weight combinations are numbered 1 (SVM 0.9, KNN 0.1), 2 (SVM 0.8, KNN 0.2), 3 (SVM 0.7, KNN 0.3), 4 (SVM 0.6, KNN 0.4), 5 (SVM 0.5, KNN 0.5), and 6 (SVM 0.4, KNN 0.6), and represent the verification results of SVM and KNN at different weight coefficients. The weighted average values obtained for each combination are shown in Figs 5–7.
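The six weight combinations and the weighted average computed for each can be sketched as below. The metric values passed in (0.80 for SVM, 0.70 for KNN) are hypothetical placeholders, not results from Table 2, and the function names are illustrative.

```python
def weight_grid():
    """Return the six (svm_weight, knn_weight) pairs, from SVM 0.9 down to 0.4."""
    return [(round(w / 10, 1), round(1 - w / 10, 1)) for w in range(9, 3, -1)]


def weighted_average(svm_value, knn_value, svm_weight, knn_weight):
    """Weighted average of one metric (P, R, or F1) from the two classifiers."""
    return svm_weight * svm_value + knn_weight * knn_value


def sweep(svm_value, knn_value):
    """Evaluate every weight combination; numbering starts at 1 as in the text."""
    return [
        (number, ws, wk, weighted_average(svm_value, knn_value, ws, wk))
        for number, (ws, wk) in enumerate(weight_grid(), start=1)
    ]


# Placeholder metric values for one class (hypothetical, for illustration).
results = sweep(svm_value=0.80, knn_value=0.70)
```

Each entry of `results` holds the combination number, the two weights, and the resulting weighted average, which is the quantity compared across combinations.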
Comparison of P among SVM, KNN, and MCEPM.
A comprehensive comparison of the verification results in Figs 5–7 shows that different combinations of SVM and KNN weights yield different weighted averages, i.e., the precision P, the recall rate R, and the harmonic mean F1 of the positive and negative samples vary with the weights. The comparison shows that when the weight coefficients are set to SVM: 0.6 and KNN: 0.4, the prediction effects of the two models on the positive and negative samples are balanced, and the MCEPM has the optimal effect.
To verify the proposed MCEPM, its performance is compared with that of the individual classifiers. According to the weight verification above, the weighted average, i.e., the MCEPM performance, is better when the SVM and KNN weights are 0.6 and 0.4. Therefore, the weights are set to 0.6 and 0.4, respectively, for the calculation. The weighted average results of MCEPM are compared with those of SVM and KNN in terms of precision P, recall rate R, and harmonic mean F1. The comparison results are shown in Figs 8–10.
Comparison of R among SVM, KNN, and MCEPM.
Comparison of F1 among SVM, KNN, and MCEPM.
From the comparison of the data in Fig. 8, when the weight of SVM and KNN is 0.6 and 0.4, respectively, the accuracy
As shown in Fig. 9, when the weights of SVM and KNN are 0.6 and 0.4, respectively, the R values of MCEPM for both positive and negative samples are slightly higher than those of KNN and significantly higher than those of SVM. Thus, under the selected weight coefficients, the MCEPM misses the fewest samples in each class and achieves better results.
As shown in Fig. 10, when the weight of SVM and KNN is 0.6 and 0.4, respectively, the positive sample F1 value of MCEPM is slightly higher than that of KNN, and significantly higher than that of SVM. The F1 value of negative samples of MCEPM is significantly higher than that of negative samples of SVM and KNN. Therefore, the harmonic mean value F1 of the model is the best.
A micro-blog is a type of broadcast social medium, built on user relationships, for sharing, disseminating, and acquiring information; it spreads short real-time messages and hot-spot events through the follow (attention) mechanism. The users of Sina Weibo are constantly growing. User information and behaviors are stored as big data. However, this valuable data is easily ignored by the public [19]. At present, the stock of micro-blogs has reached hundreds of billions, and much of this data is text, which is an important communication carrier. Mining and analyzing text data can yield rich potential business value [20]. Yousef et al. studied the role of ego networks in online social networks and information dissemination based on micro-blogs. They focused on cases in which information appears multiple times but its dissemination through user forwarding and similar behaviors is poorly captured. They developed a more user-oriented and reasonable method for analyzing interactive forwarding and comments and, finally, a verification platform for information dissemination analysis [21]. Yao et al. analyzed the behavior of online users on Sina Weibo; they used implicit and explicit classification methods to predict user attributes [22].
In this study, based on Sina Weibo user information and the SVM and KNN algorithms, an emotion prediction model for Sina Weibo, MCEPM, is established. The experimental results of SVM, KNN, and MCEPM are compared. According to the prediction results of the SVM and KNN classifiers, the precision P, recall rate R, and harmonic mean F1 of the positive and negative samples are compared. The best weight coefficients, obtained by calculating the weighted average, are 0.6 for SVM and 0.4 for KNN. With these weights, the predictive ability of the MCEPM is relatively strong. The comparison of the P, R, and F1 values of positive and negative samples for MCEPM, SVM, and KNN shows that the MCEPM combines the advantages of both algorithms through the weighted average. With a reasonable allocation of weight coefficients, the emotion prediction ability of MCEPM is much better than that of either classifier alone. By merging the two classifiers, the model has a stronger generalization ability, which improves the accuracy of emotion prediction on micro-blog text content and reduces the impact of problems such as the lack of key data, uneven sample distribution, differing numbers of samples in each category, and strong randomness on the classification. The proposed MCEPM balances the prediction results of the individual classifiers on positive and negative samples.
Conclusion
The micro-blog emotion prediction model MCEPM, based on the SVM and KNN algorithms, is established. Using the experimental data, the prediction ability of the single classifiers is compared with that of the mixed classifier. The MCEPM is found to perform well in micro-blog emotion prediction. It can also analyze and predict users’ behavior through their expression of emotion and provide enterprises with accurate information on customer service demand, which brings certain value to the development of society. However, the work is not complete and needs further research and improvement, including the behavior and text emotion of users in social networks and the influence of micro-blog characteristics on micro-blog text prediction.
Acknowledgments
The authors thank the anonymous reviewers for their useful suggestions. This work has been supported in part by the Scientific Research Fund of Education Department of Hunan Province (19B245, 19C0852, 19K037, 18B353), in part by the Science Foundation of Hunan Province (2020JJ4340, 2018JJ2154, 2020JJ4341).
