Abstract
Analyzing the characteristics of users and their behaviors in social networks, studying personalized recommendation algorithms, exploring the inherent laws of event development, and predicting the movement of information or opinions are of great theoretical significance and practical value. This paper analyzes Weibo behavior through machine learning and cloud computing technology. It studies and analyzes traditional network algorithms and proposes a microblog recommendation algorithm based on statistical features. The research focuses on microblog content, user characteristics, user preferences, and influence levels. The algorithm has a simple structure and strong computing performance, and it mines feature data through cloud-computing big data methods, which makes it suitable for online mining of microblog behavior. In addition, the performance of the algorithm is analyzed through comparison experiments. The results indicate that the proposed algorithm has certain advantages, can be applied to the analysis and mining of network behavior, and can provide a theoretical reference for subsequent related research.
Introduction
In today’s information explosion, one of the main reasons people use the Internet is to quickly obtain the latest and most interesting information. There are two main ways to obtain information. One is through traditional search engines, such as Baidu and Google, which usually serve a clear purpose and direction. The other is to use social networks to access a wide variety of information resources. With the second way, how to efficiently obtain the information a user is actually interested in has become an issue of increasing concern to Internet users.
Weibo has the advantages of good interactivity, diverse coverage, a large and broad user base, and fast dissemination of hot information. Analyzing the behavior habits of Weibo users and the rules of information dissemination on the platform is therefore of great significance and far-reaching influence. For example, by analyzing the law of information dissemination and the behavior habits and interests of Weibo users, enterprises and individuals can actively recommend information content that may be of interest to users. Users often comment on or forward microblogs whose content interests them, which optimizes the user’s reading experience and promotes the flow and dissemination of information. Therefore, effective analysis of user behavior is significant in three respects. First, the more interesting the content is to the user, the more readily the user will read it, which speeds up the extraction of content matching the user’s interests from a large amount of information and saves a lot of time spent filtering information. Second, it enhances the interaction rate between users and increases the activity of interest-based information flow. Third, it helps users obtain information on Weibo more conveniently, thus helping Weibo expand its user base. In addition, enterprises can use the advantages of microblogs to learn the behaviors, hobbies, and interests of different types of Weibo users, so that they can target advertising more accurately, improve advertising efficiency, and fully exploit the commercial value and potential of Weibo.
Related work
Sicilia R et al. [1] used Information Gain (IG) to assign different weights to the influence of different Twitter characteristics on forwarding behavior. A higher feature weight means a greater impact on forwarding. Moreover, by comparing against two classification algorithms, SVM and logistic regression, they concluded that the weight model is well suited to predicting user forwarding behavior. However, the method has two shortcomings. One is that the extracted features do not involve the content of the tweet, such as the topic under discussion and its sentiment; the other is that the relationships between the features are not considered. Basak R et al. [2] explored the features that have the greatest impact on user forwarding behavior through Principal Components Analysis (PCA) and combined these features with Generalized Linear Modeling (GLM) to establish a predictive model. Through this model, they discussed the relationship between user characteristics and forwarding behavior. They concluded that a tweet containing a URL and a hashtag is more likely to be forwarded. Among user characteristics, the number of users, the number of fans, and the length of time since user registration also have a great influence on the possibility of forwarding. However, their work was limited to the statistical analysis of user forwarding behavior and did not use the obtained user characteristics to predict forwarding behavior. Amati G et al. [3] found that the forwarding behavior in Twitter is affected by the users, the content of tweets, and time, and proposed a semi-supervised graph model algorithm to predict the forwarding of tweets. Adewole K Se Li et al. [4] generated a forwarding tree from the microblog forwarding path and used an iterative method to predict each forwarding behavior on the forwarding path.
Srijith P K et al. [5] extracted the relevant features and found that homogeneity differences, micro-network structure, geographic distance, and gender all affect the formation of microblog forwarding behavior, with the homogeneity difference having the greatest impact. Peter G et al. [6] extracted microblog characteristics such as weight ratio and user personal information to study user microblog forwarding behavior, and then used a random forest forwarding prediction algorithm to predict that behavior. Kabakus A T et al. [7] took Twitter as the main research object. They first introduced the definition and background of Twitter, and then described retweet behavior in detail, including how users forward, why they forward, and what kind of tweets they forward. Abdelhaq H et al. [8] divided Twitter data into six categories: History Feature, Social Feature, Aggregate Lexical Feature, Local Content Feature, Posting Feature, and Sentiment Feature. They used Multiple Additive Regression Trees (MART) for training and removed the six feature types one at a time to find the one with the greatest effect on prediction accuracy, showing which feature type has the greatest impact on Twitter forwarding. Chen Z et al. [9] took a different approach and studied which users would forward tweets from the perspective of the repeaters. By studying the user’s follow relationships and analyzing six characteristics, including Retweet History, Follower Status, Follower Active Time, and Follower Interests, they identified the user who is most likely to forward a certain microblog. Belavagi M C et al. [10] studied the forwarding behavior of strangers on Twitter, predicted the probability of a user forwarding when mentioned by a stranger, and established a recommendation system to recommend users who prefer to disseminate information.
The above research involves various machine learning algorithms and prediction models, and extracts and trains different features, but there is no good analysis method for the relationship between features. Therefore, this paper analyzes the microblog behavior through machine learning and cloud computing technology.
Recommendation algorithm based on user characteristics and improved NBI model
(1) NBI personalized recommendation algorithm
Compared with the collaborative filtering recommendation algorithm, the NBI (Network-based Inference) recommendation model can effectively improve recommendation accuracy when the connection matrix is sparse, and it reduces the computational space complexity. The recommendation process of the NBI algorithm is shown in Figs. 1 to 4:

NBI-based recommendation algorithm step 1.

NBI-based recommendation algorithm step 2.

NBI-based recommendation algorithm step 3.

General form of the recommendation algorithm based on NBI.
Figure 1 shows step 1 of the NBI recommendation algorithm. Users A, B, C, and D are four users in the network, and X, Y, and Z are items in the network. The connection relationship between the users and the items constitutes the original network topology relationship matrix [11]:
Among them, $a_{ij}$ is the connection relationship of user j to product i. In step 1, user A is taken as the target user for recommendation. Since user A has selected items X and Z, each of these purchased items is given an initial value of 1.
In step 2, since item X is purchased by users A, B, and C, each of these users is assigned 1/3 of the value of X. Similarly, item Z is purchased by users A, C, and D, so the relationship of the items to the users after the second step is as shown in Fig. 2 [12]. Step 3 of the NBI-based recommendation algorithm is shown in Fig. 3.
Since the items X and Z purchased by user A are also purchased by users B, C, and D, a new network topology propagation relationship is formed between user A and users B, C, and D. In step 3, users B and C both purchase item Y, so item Y forms a recommendation relationship with user A. From the result of step 2, the value of user B is 1/3, and B purchases two items, X and Y, so each item is assigned 1/2 of B’s value; that is, user B’s contribution to Y is (1/3) * (1/2) = 1/6. Similarly, user C receives 1/3 from X and 1/3 from Z and purchases three items, so the contribution of user C to item Y is (1/3 + 1/3) * (1/3) = 2/9. Therefore, the total contribution of users B and C to item Y is 1/6 + 2/9 = 7/18. Thus, the correlation between user A and item Y is 7/18, that is, the recommendation force of item Y for user A is 7/18. The above network propagation relationship is abstracted into a general form as shown in Fig. 4 [13].
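The three steps above can be traced with a short script. This is only a minimal sketch of the worked example, not the authors' implementation: the `purchases` dictionary and the function name `nbi_scores` are illustrative names, and the purchase sets are read off the description of Figs. 1 to 3 (A buys X and Z, B buys X and Y, C buys X, Y, and Z, D buys Z).

```python
from fractions import Fraction

# Bipartite network of the worked example: items purchased by each user.
purchases = {
    "A": {"X", "Z"},
    "B": {"X", "Y"},
    "C": {"X", "Y", "Z"},
    "D": {"Z"},
}


def nbi_scores(purchases, target):
    """Return the NBI resource score of every item for the target user."""
    items = set().union(*purchases.values())
    # Step 1: every item selected by the target user starts with value 1.
    item_value = {i: Fraction(1 if i in purchases[target] else 0) for i in items}
    # Degree of an item: number of users who purchased it.
    item_degree = {i: sum(i in bought for bought in purchases.values()) for i in items}

    # Step 2: each item splits its value equally among its purchasers.
    user_value = {
        u: sum((item_value[i] / item_degree[i] for i in bought), Fraction(0))
        for u, bought in purchases.items()
    }

    # Step 3: each user splits the received value equally among purchased items.
    scores = {i: Fraction(0) for i in items}
    for u, bought in purchases.items():
        for i in bought:
            scores[i] += user_value[u] / len(bought)
    return scores


scores = nbi_scores(purchases, "A")
print(scores["Y"])  # 7/18
```

Exact fractions are used so that the printed value can be compared directly with the 7/18 derived in the text.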
When using the network propagation relationship topology of Fig. 4 and selecting any user as the target user for recommendation, the recommended rating of the item is shown in formula (2) [14].
The above calculation process is abstracted into an operation expression, in which the operation result of step 2 in Fig. 2 is [15]:
In formula (3), $f(x_i)$ is the initial value of item i, and the initial value of all items in the NBI algorithm is set to 1. $k(x_i)$ is the outflow of item i. If an item is purchased by more users, the contribution of item i to each user is smaller.
Equation (5) is the item recommendation score obtained by step 3 of the NBI algorithm, as shown in Fig. 4. The final recommendation score for item j obtained by the NBI algorithm for user j is:
The NBI algorithm converts the traditional dependency-to-user relationship description into the recommended conversion relationship of the item to the item, so formula (6) can be replaced by formula (7):
In the formula, i and j correspond to items, and β corresponds to the user. The numbers of nodes and items in the network are n and m, respectively. $a_{i\beta} = 1$ and $a_{\beta j} = 1$ together indicate that a network topology connection between items i and j can be generated through user β. $w_{ij}$ is defined as the recommendation force of item j for item i, and it is the corresponding element of the conversion matrix in formula (12). Therefore, the recommendation relationship between items after the transformation can be described by formula (19) [16].
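The item-to-item view can be illustrated on the same four-user example. The formulas of this section did not survive in the text, so the sketch below assumes the standard NBI projection from the literature, $w_{ij} = \frac{1}{k(x_j)} \sum_{\beta} \frac{a_{i\beta} a_{j\beta}}{k(\beta)}$, which may differ in notation from the paper's own equation. The names `a`, `weight`, and `final` are illustrative.

```python
from fractions import Fraction

items = ["X", "Y", "Z"]
users = ["A", "B", "C", "D"]
# a[i][u] = 1 if user u purchased item i (rows: items, columns: users).
a = {
    "X": {"A": 1, "B": 1, "C": 1, "D": 0},
    "Y": {"A": 0, "B": 1, "C": 1, "D": 0},
    "Z": {"A": 1, "B": 0, "C": 1, "D": 1},
}

item_degree = {i: sum(a[i].values()) for i in items}
user_degree = {u: sum(a[i][u] for i in items) for u in users}


def weight(i, j):
    """Share of item j's value that reaches item i through common purchasers."""
    shared = sum(Fraction(a[i][u] * a[j][u], user_degree[u]) for u in users)
    return shared / item_degree[j]


# Item-to-item conversion matrix (assumed standard NBI form).
w = {i: {j: weight(i, j) for j in items} for i in items}

# Initial values for target user A: purchased items X and Z start at 1.
f = {"X": 1, "Y": 0, "Z": 1}
final = {i: sum(w[i][j] * f[j] for j in items) for i in items}
print(final["Y"])  # 7/18
```

Under this assumption the matrix form reproduces the 7/18 obtained step by step for item Y, and each column of `w` sums to 1, so the total value placed on the items is conserved.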
In formula (19),
In the microblog recommendation algorithm, the recommended subject is a microblog topic that is clustered by short text. When the NBI algorithm recommendation is used for microblog recommendation, if user A publishes the microblog about topic a, and the microblog topic published by user B also contains the microblog information of topic a, then user A and B form a network topology connection relationship through topic a: A-a-B. In this information communication relationship, other microblogs published by user B form a certain recommendation relationship with user A [18].
In the original NBI relationship matrix of items to users, when user j has purchased item i, the matrix element $a_{ij} = 1$; otherwise $a_{ij} = 0$. Correspondingly, for microblog topics, if user j has posted a microblog on topic i, then $a_{ij} = 1$; otherwise $a_{ij} = 0$. One question that needs to be discussed is whether the original NBI algorithm is sufficient to describe the behavior of users posting on microblog topics. That is, given the A-a-B network topology relationship above, if another topic b posted by A is also posted by user C, can the A-a-B propagation relationship be treated as equivalent to A-b-C?
Hypothesis 1: User A has posted 10 microblogs, 9 of which belong to topic a and 1 of which belongs to topic b; user B has posted 2 microblogs, which belong to topic a and topic c respectively; user C has also published 2 microblogs, which belong to topic b and topic d respectively. Thus, the topics a and b posted by user A form two network propagation relationships, namely A-a-B and A-b-C. The ratio of topic a to topic b among the microblogs published by user A is 9:1, which indicates that user A has a much stronger preference for topic a, while topic b is not sufficiently representative of user A. Users B and C are associated with user A through topics a and b, respectively. According to the preference analysis of user A, user B’s topics are more similar to those of user A, so A and B should have higher user similarity. Moreover, the relationship between users A and C is formed by an incidental topic b, so the similarity between users A and C should be small. Therefore, in the recommendation relationship formed through user B, the microblog on topic c posted by user B should have a higher chance of interesting user A, whereas in the recommendation relationship formed through user C, topic d should not be strongly correlated with user A. The topic propagation relationship between users A, B, and C and topics a, b, c, and d is shown in Fig. 5 [19].

Comparison of recommendation algorithms based on NBI improved model (1).
Fig. 5a shows the original NBI recommendation model. User A pays attention to topic a and topic b. The elements of the user-topic matrix in the NBI model are binary (1 or 0), so the recommendation scores of topic c and topic d, passed to user A through users B and C, are both 1/4. Thus, in the NBI recommendation algorithm, the microblogs belonging to topic c and topic d are included in the recommendation list of user A with the same score of 0.25. That is, topic c is treated as equivalent to topic d. This conclusion contradicts the above analysis, so it is necessary to improve the calculation process of step 1 of the NBI algorithm [20].
The first improvement to the NBI algorithm in this paper is to replace the original binary item-to-user relationship matrix with a user-topic matrix of normalized attention intensity. Figure 5b shows the recommendation process of the improved NBI model. In the above example, user A has published 10 microblogs, 9 of which belong to topic a and 1 of which belongs to topic b. Therefore, the attention intensity of user A for topic a is defined as 9/10, and the attention intensity for topic b is 1/10. Thus, the improved user-topic relationship matrix is as shown in Equation (10), in which each element is the ratio $n(i,j)/N_j$, where $N_j$ is the total number of microblogs posted by the j-th user and $n(i,j)$ is the number of microblogs belonging to topic i posted by user j.
In the improved NBI model calculation result of Fig. 5b, the recommendation scores of topic c and topic d for user A change from 0.25 under the NBI algorithm to 9/40 and 1/40, respectively, which are clearly distinguishable. Therefore, the NBI recommendation algorithm improved by normalizing the initial intensity matrix is more in line with real microblog recommendation based on topic clustering results [21].
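The effect of the first improvement on Hypothesis 1 can be checked numerically. The following is a sketch under the assumption that only the initial values of step 1 change (binary in the original model, $n(i,j)/N_j$ in the improved one) while steps 2 and 3 still split values equally. The `posts` dictionary and the function `recommend` are illustrative names.

```python
from fractions import Fraction

# Number of microblogs per user and topic, taken from Hypothesis 1.
posts = {
    "A": {"a": 9, "b": 1},
    "B": {"a": 1, "c": 1},
    "C": {"b": 1, "d": 1},
}


def recommend(posts, target, normalized):
    """NBI scores for the target user with binary or normalized initial values."""
    topics = {t for counts in posts.values() for t in counts}
    total = sum(posts[target].values())
    # Step 1: initial value of each topic for the target user.
    initial = {
        t: (Fraction(posts[target].get(t, 0), total) if normalized
            else Fraction(1 if t in posts[target] else 0))
        for t in topics
    }
    topic_degree = {t: sum(t in c for c in posts.values()) for t in topics}
    # Step 2: each topic passes its value equally to the users who posted on it.
    user_value = {
        u: sum((initial[t] / topic_degree[t] for t in c), Fraction(0))
        for u, c in posts.items()
    }
    # Step 3: each user passes the received value equally to their topics.
    scores = {t: Fraction(0) for t in topics}
    for u, c in posts.items():
        for t in c:
            scores[t] += user_value[u] / len(c)
    return scores


original = recommend(posts, "A", normalized=False)
improved = recommend(posts, "A", normalized=True)
print(original["c"], original["d"])  # 1/4 1/4
print(improved["c"], improved["d"])  # 9/40 1/40
```

With binary initial values, topics c and d tie at 1/4. With normalized attention intensity, they separate into 9/40 and 1/40, matching Fig. 5.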
Hypothesis 2: There are three users in the network, A, B, and C, and each has published 5 microblogs. All 5 microblogs posted by user A belong to topic a; of the microblogs posted by user B, 4 belong to topic a and 1 belongs to topic b; of the microblogs published by user C, 1 belongs to topic a and the remaining 4 belong to topic c. User A is the target user for microblog recommendation. The A-a-B and A-a-C connections are then formed through the network information topology relationship. The recommendation relationship is shown in Fig. 6.

Comparison of recommendation algorithms based on NBI improved model (2).
Figure 6a shows the original NBI recommendation algorithm. Since user A only publishes microblogs belonging to topic a, the corresponding element $a_{aA}$ is equal to 1 in both the original matrix and the improved matrix with normalized intensity of interest. Since the out-degree of topic a is 3, users B and C each get 1/3 of the initial value of topic a. At the same time, since users B and C each pay attention to two topics, each of their topics is assigned 1/2 of the weight of B or C. Therefore, the recommendation scores of topic b and topic c for user A obtained by the original NBI recommendation algorithm are both 1/6, which has no distinguishing power. Figure 6b shows the modified NBI recommendation model with connection weights, where the out-degree of topic a is 3, and users A, B, and C each get 1/3 of the initial value of topic a. For the connections a-B and a-C, the connection weight is defined as $h_{ij}$, the normalized interaction between user j and topic i, that is, $h_{aB} = 4/5$ and $h_{aC} = 1/5$. In the improved algorithm, the weight that user B or user C obtains from topic a is the initial value of a divided by its out-degree, multiplied by the connection weight between the topic and the user. Therefore, the calculation results of users B and C in step 2 of the NBI algorithm are 4/15 and 1/15, respectively. The NBI calculation process then continues, and topics b and c obtain 1/2 of the weights of user B and user C, respectively. That is, the recommendation forces of topic b and topic c for user A are 2/15 and 1/30, respectively, which are clearly distinguishable. Therefore, the improved NBI model with connection weights is more in line with the actual recommendation of microblogs. In the improved NBI model that introduces connection weights, Equation (3) is replaced by Equation (11), where $h_{ij}$ is the connection weight.
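The Hypothesis 2 numbers can be reproduced with a small script. This is a sketch only: it assumes the connection weight is $h_{ij} = n(i,j)/N_j$ as described in the text, applies it in step 2, and reports scores only for topics the target user has not posted on. The names `h`, `step2`, and `step3` are illustrative.

```python
from fractions import Fraction

# Number of microblogs per user and topic, taken from Hypothesis 2.
posts = {
    "A": {"a": 5},
    "B": {"a": 4, "b": 1},
    "C": {"a": 1, "c": 4},
}
target = "A"

# Step 1: normalized attention intensity of the target user (topic a -> 1).
total_target = sum(posts[target].values())
initial = {t: Fraction(n, total_target) for t, n in posts[target].items()}


def topic_degree(topic):
    """Number of users who posted on the topic."""
    return sum(topic in counts for counts in posts.values())


def h(topic, user):
    """Connection weight: share of the user's microblogs that are on the topic."""
    return Fraction(posts[user].get(topic, 0), sum(posts[user].values()))


def step2(user, weighted):
    """Value a user receives from the target's topics, with or without h."""
    received = Fraction(0)
    for t, value in initial.items():
        if t in posts[user]:
            share = value / topic_degree(t)
            received += share * h(t, user) if weighted else share
    return received


def step3(weighted):
    """Scores of topics the target has not posted on (equal split per user)."""
    scores = {}
    for user, counts in posts.items():
        if user == target:
            continue
        received = step2(user, weighted)
        for t in counts:
            if t not in posts[target]:
                scores[t] = scores.get(t, Fraction(0)) + received / len(counts)
    return scores


original = step3(weighted=False)
improved = step3(weighted=True)
print(original["b"], original["c"])  # 1/6 1/6
print(improved["b"], improved["c"])  # 2/15 1/30
```

Without connection weights both derived topics score 1/6. With $h_{aB} = 4/5$ and $h_{aC} = 1/5$ they separate into 2/15 and 1/30, as in Fig. 6b.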
Hypothesis 3: There are two users, A and B, and three topics, a, b, and c, in the network. All the microblogs published by user A belong to topic a, and user B has posted 10 microblogs, of which 5 belong to topic a, 4 belong to topic b, and 1 belongs to topic c. The recommendation force relationship of the microblog topics under this hypothesis is shown in Fig. 7.

Comparison of recommendation algorithms based on NBI improved model (3).
In the original NBI recommendation algorithm, user B obtains a weight of 1/2 from topic a. Because user B pays attention to three topics at the same time, each topic receives 1/3 of B’s weight; that is, the recommendation force of both topic b and topic c for A is 1/6, which provides no discrimination. According to the idea of collaborative filtering and the similarity relationship between users A and B, topic b, which is closer to user B, should be preferentially recommended to user A. Therefore, in the step 3 calculation of the NBI algorithm, that is, formula (5), when user j transfers his weight to topic i, it is also necessary to add a connection weight $l_{ij}$ to distinguish the user’s attention to the topic. Unlike $h_{ij}$, $l_{ij}$ is the degree of interest of user j in the other topics after the topic from step 2 is removed. In Hypothesis 3, user B has 5 microblogs about topic a, and when the target user is A, user B’s initial weight comes from topic a, so the total number of microblogs posted by user B on other topics is 5. Among them, 4 belong to topic b, that is, $l_{Bb} = 4/5$, and 1 belongs to topic c, that is, $l_{Bc} = 1/5$. Therefore, in the improved algorithm, the weight of user B is 1/4 (the 1/2 share from topic a multiplied by $h_{aB} = 1/2$), the degree is k(B) = 3, $l_{Bb} = 4/5$, and $l_{Bc} = 1/5$. Then, in the weight transfer from the user to the topics, topic b receives 4/5 of user B’s weight after step 2 and topic c receives 1/5, each further divided by k(B). Finally, the recommendation power of topic b for user A is 1/15, and the recommendation power of topic c for user A is 1/60. After the improvement of step 3, the NBI algorithm is more suitable for the actual recommendation of microblogs. The improved algorithm is shown in Equation (12).
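The arithmetic of Hypothesis 3 can be verified in a few lines. The sketch below assumes, consistent with the final values 1/15 and 1/60 stated in the text, that the step 3 transfer multiplies user B's weight by $l_{Bb}$ or $l_{Bc}$ and divides by k(B). The variable names are illustrative.

```python
from fractions import Fraction

# User B's microblogs per topic, taken from Hypothesis 3.
posts_B = {"a": 5, "b": 4, "c": 1}
source_topic = "a"    # the topic through which A and B are connected
topic_a_degree = 2    # topic a is posted on by users A and B

# Step 2 with connection weight h_aB = 5/10.
h_aB = Fraction(posts_B[source_topic], sum(posts_B.values()))
weight_B = Fraction(1, topic_a_degree) * h_aB  # 1/2 * 1/2 = 1/4

# Step 3 weights l_Bt: B's interest in the topics other than the source topic.
k_B = len(posts_B)  # degree of user B, 3 topics
remaining = {t: n for t, n in posts_B.items() if t != source_topic}
remaining_total = sum(remaining.values())
l_B = {t: Fraction(n, remaining_total) for t, n in remaining.items()}

# Original NBI: B's unweighted value 1/2 is split equally over k(B) topics.
original = {t: Fraction(1, topic_a_degree) / k_B for t in remaining}
# Improved NBI: B's weighted value is scaled by l_Bt and divided by k(B).
improved = {t: weight_B * l_B[t] / k_B for t in remaining}

print(original["b"], original["c"])  # 1/6 1/6
print(improved["b"], improved["c"])  # 1/15 1/60
```

The original model cannot separate topics b and c, while the weights $l_{Bb} = 4/5$ and $l_{Bc} = 1/5$ rank topic b four times higher than topic c.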
In the same way, the final calculation result (12) is converted into the item-to-item recommendation relationship, and the conversion result is:
Formula (13) is the recommended scoring result for the NBI model microblog topic improved by three steps. In formula (14),
The intensity of attention of user A for topic a is defined as $w_a$. After the connection with user A is removed from topic a, the number of connections to the remaining users is $n-1$. The following discussion examines whether the recommendation force of a topic b, derived from topic a through the topological propagation relationship, can exceed user A’s original attention intensity for topic a.
It can be seen that, in any propagation relationship formed by topic a, the recommendation force of a derived topic for the target user is smaller than that of topic a itself, and this conclusion is in line with the objective phenomenon. That is, the user’s attention to the original topic should be greater than the user’s attention to a new topic derived through that topic. However, this conclusion does not mean that the attention intensity of every topic user A already follows is greater than the intensity of every new topic derived through the propagation relationship. In the topology propagation relationship shown in Fig. 5b, the original attention intensity of user A for topic b is 1/10. The attention intensity of user A for topic d, derived from topic b, is 1/40, which is smaller than the attention intensity of topic b. However, the attention intensity for topic c, derived from topic a, is 9/40, which is larger than the attention intensity of topic b that user A follows directly. In reality, a topic X that user A does not follow directly is usually derived, through the microblog topology propagation relationship, from the topics a, b, c, d⋯ that user A follows directly. Therefore, it is possible for an unknown topic X to have a recommendation force for user A that is greater than that of one of the topics a, b, c, d⋯ that the user follows directly. Numerically, the scores of followed topics and the scores of derived topics obtained by the improved NBI algorithm are interleaved. The calculation results are determined by the initial user attention intensity and the topological propagation relationship of the microblog topics. Finally, Equation (16) is an improved NBI recommendation model that includes repeated recommendations:
In the formula (16), if the topic i is a topic that the user β does not pay attention to, the recommended score is the calculation result of the formula (15). On the other hand, if the topic i is the topic that the user β has paid attention to, the recommendation score is the user’s corresponding topic attention intensity, that is, the corresponding element in the matrix of the formula (10).
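The case distinction described for formula (16) can be sketched as follows. The values are the ones quoted for user A in the Hypothesis 1 example (attention intensities 9/10 and 1/10, derived scores 9/40 and 1/40). The function name `final_score` and the two dictionaries are illustrative, and the sketch does not reproduce formula (15) itself.

```python
from fractions import Fraction

# Topics user A already follows, with normalized attention intensity.
attention = {"a": Fraction(9, 10), "b": Fraction(1, 10)}
# Topics derived through the propagation relationship, with their NBI scores.
derived = {"c": Fraction(9, 40), "d": Fraction(1, 40)}


def final_score(topic):
    """Followed topics keep their attention intensity; others use the derived score."""
    if topic in attention:
        return attention[topic]
    return derived.get(topic, Fraction(0))


ranking = sorted(set(attention) | set(derived), key=final_score, reverse=True)
print(ranking)  # ['a', 'c', 'b', 'd']
```

The derived topic c ranks above the followed topic b, which illustrates how followed and derived scores can interleave.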
Formula (17) is the microblog recommendation model based on user features and the improved NBI algorithm proposed in this chapter. In the formula, $S_{i,t}$ represents the recommendation score, for user i, of microblog t sent by user j.
To use formula (17), a suitable parameter α must be found by training on the data set, so that the score $S_{i,t}$ assigned to a microblog for the target user is as reasonable as possible.
The experimental data in this article were crawled from the Sina Weibo platform. Considering that zombie users and spammers may introduce a lot of noise that affects the final predictions, this article selects active users from the crawled data. Active users are filtered by the following conditions: (1) The number of fans and the number of followed users must be greater than 50. (2) The user must post more than 10 microblogs per week. According to these conditions, a total of 12,013 active users were selected. Starting from these users, a breadth-first traversal was carried out to fetch the follow network. The final network has 92,034 users and 1,272,871 follow relationships. The microblogs published from 2018.7.1 to 2018.9.30 were selected. The total number of microblogs published by all users was 9,913,495, of which 716,178 were original microblogs and 9,197,317 were forwarded microblogs. Table 1 gives the main characteristics of the experimental data set.
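The filtering and traversal described above could look roughly like the sketch below. It is not the authors' crawler: the `stats` and `follows` dictionaries are made-up toy data, the function names are hypothetical, and the code assumes that the threshold of 50 applies to the fan count and the followed-user count separately and that the traversal is not depth-limited.

```python
from collections import deque

# Hypothetical per-user statistics: (fans, followed users, microblogs per week).
stats = {
    "u1": (120, 80, 15),
    "u2": (30, 200, 40),   # too few fans
    "u3": (500, 60, 3),    # posts too rarely
    "u4": (75, 51, 11),
}
# Hypothetical follow graph: user -> users they follow.
follows = {"u1": ["u2", "u5"], "u4": ["u1", "u6"], "u5": ["u6"], "u6": []}


def is_active(fans, followed, posts_per_week):
    # Assumption: each count must exceed 50, and weekly posts must exceed 10.
    return fans > 50 and followed > 50 and posts_per_week > 10


def crawl_follow_network(seeds, follows):
    """Breadth-first traversal from the seed users over follow relationships."""
    visited = set(seeds)
    edges = set()
    queue = deque(seeds)
    while queue:
        user = queue.popleft()
        for followee in follows.get(user, ()):
            edges.add((user, followee))
            if followee not in visited:
                visited.add(followee)
                queue.append(followee)
    return visited, edges


seeds = sorted(u for u, s in stats.items() if is_active(*s))
users, edges = crawl_follow_network(seeds, follows)
print(seeds)                   # ['u1', 'u4']
print(len(users), len(edges))  # 5 5
```

A queue-based traversal visits each user once, so the resulting user set and edge set correspond to the user and relationship counts reported for the data set.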
Nature of data sets
With 2018.8.31 as the time demarcation point, the experimental data set is divided into a training set and a test set. In addition, as shown in Table 1, the number of forwarded microblogs is about 12 times that of the original microblogs. Therefore, to ensure data balance, the original and forwarded microblogs are sampled at a ratio of 1:2. Table 2 shows the actual data sets used in the experiment.
Characteristics of using data sets
To verify the performance improvement of the proposed prediction model, logistic regression (LR), support vector machine (SVM), and the Passive-Aggressive algorithm (PA) are selected as comparison algorithms and are trained and verified on the same data set. To evaluate the prediction results reasonably, this paper uses evaluation indicators from information retrieval: the accuracy rate (precision) measures the proportion of positive predictions that are correct, the recall rate measures the proportion of actual positives that are correctly predicted, and the F1 value is the compromise between the two. Among them:
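The three indicators can be computed as in the following sketch, which uses the usual information-retrieval definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = 2PR / (P + R)). It assumes that the paper's "accuracy rate" corresponds to precision and that label 1 marks a forwarded microblog. The label vectors are made-up toy data.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: 1 = forwarded, 0 = not forwarded (illustrative only).
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

In this toy case there are three true positives, one false positive, and one false negative, so all three indicators equal 0.75.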
Table 3 gives the overall experimental results with indicators of accuracy and recall. The forwarding behavior prediction algorithm represents the algorithm proposed in this paper.
Experimental results
It can be seen from the experimental results in Fig. 8 that, on the one hand, because the PA algorithm is too simple, its accuracy and recall rate are relatively low. On the other hand, compared with the logistic regression algorithm used as the benchmark, the proposed algorithm significantly improves the accuracy and recall rate. In addition, although the SVM achieves a better classification effect than the previous two linear algorithms, all of the comparison algorithms train a model on global data, so their overall classification effect is not as good as that of the proposed algorithm, which introduces local parameters and thereby improves the classification effect.

F1 value of the experimental results.
By comparing the F1 values in Fig. 8, it can be seen that the proposed algorithm can obtain better F1 values.
To determine the influence of local parameters on the model, the benchmark logistic regression algorithm was used for comparison, and to observe the influence of γ1 on the model, γ1 was varied over the values [0.01, 0.1, 1, 10, 100]. The experimental results in Fig. 9 show that when γ1 is too small, the model over-fits. When γ1 is too large, however, the local parameters in the model are effectively disabled, and the effect is close to that of ordinary logistic regression.

The impact of γ1 to model.
To verify the classification performance of the proposed model with different amounts of forwarding history data, the users are divided into five groups according to the number of microblogs: 1∼10, 10∼100, 100∼500, 500∼1000, and 1000∼2000. One user is selected from each group, and the selection is repeated 20 times. The SVM, which performs best among the comparison algorithms, is taken as their representative and compared with the algorithm proposed in this paper. The result is shown in Fig. 10.

Model comparison on a user collection.
As can be seen from Fig. 10, overall, the F1 value of the classification algorithms increases as the forwarding history data increase. For a given user, the F1 value of the prediction model proposed in this paper changes much less than that of the SVM, and the change becomes smaller still as the data gradually increase. This shows that data sparseness is alleviated by the association of forwarding behavior between users. In addition, the F1 value of the prediction model is always higher than that of the SVM, indicating that the local parameters improve the performance of the prediction model.
In addition, 20 users were randomly selected, and the prediction model and logistic regression model of the microblog user forwarding behavior were simultaneously used to predict the forwarding behavior. The result is shown in Fig. 11.

Model comparison on a random user set.
It can be seen from Fig. 11 that the F1 values predicted by the proposed model oscillate much less across users than those of the logistic regression model, and the proposed model also obtains better results for several users on whom the logistic regression model performs poorly. This shows that the model can meet individual requirements.
Social networks are characterized by rich data features, a large amount of information interaction, strong user subjectivity, network initiative, and self-organization. Traditional research methods and models often struggle to describe accurately the characteristics of users and of information dissemination in social networks. At the same time, massive data resources pose new challenges to the implementation and processing performance of existing data mining models. Based on this, and drawing on interdisciplinary ideas from computational science, statistical physics, probability theory, optimization theory, communication, and complex networks, this paper analyzes and studies data mining algorithms in social networks from the perspectives of Internet information collection and processing, empirical analysis of social network data, user influence and behavior analysis, personalized user recommendation algorithms, and machine learning-based information prediction algorithms. This paper not only pays attention to the theory and application of the algorithms, but also discusses the implementation of the related models in big data mode based on the massive data processing of the Internet, which provides some references and ideas for data mining methods in social networks.
