Abstract
Social networks have evolved into a popular information and communication platform, and the vast amount of data they generate changes and spreads rapidly. Thus, it is essential to detect and trace large events and burst topics in massive social network data using real-time Big Data parallel computing. In this paper, we propose a model that uses the Negative Binomial Distribution to fit the distribution of Weibo topic words. We then introduce the concepts of the ‘hot degree’ and the ‘dispersion degree’ of a topic, together with their computing methods, and validate the model using real data. Furthermore, we design a topic detection and trend-tracing algorithm based on stream data and implement it on Spark Streaming, a stream processing framework that uses in-memory computing. Finally, experiments on real data demonstrate that our proposal is effective and efficient in tracking bursting events.
Introduction
Social networking is an emerging information exchange and propagation platform that is attracting more and more users; it has a non-negligible impact on the management and development pattern of society as a whole [1].
At every moment, social networks generate new data. Some of these data may be directly related to public safety, such as early warnings of hurricanes [2], earthquakes [3] and influenza, and some concern national development and people’s livelihoods, such as messages regarding oil price increases [4]. Therefore, social networks, such as Twitter and Weibo, are the starting points of various important news stories and intense events. Unlike traditional news media, interactions among users on social networks significantly increase the speed and depth of information transmission [5]. However, the informal and arbitrary language used on Weibo often leaves valuable information submerged in a flood of data. Therefore, designing effective methods to detect topics in a timely manner in the presence of massive amounts of data is of great significance for useful information extraction and social emergency warnings [6].
Furthermore, the high arrival rates and high exchange rates of social network streams lead to high data velocity [7, 8]; thus, ‘timeliness’ has become a very important factor for burst topics [9, 10, 11, 12]. The value of the data decreases over time, and the detection of emerging events and topics in real time has become one of the key issues in on-line social network analysis [13]. In recent years, to solve the problems of real-time big-data processing and computing more effectively, the industry has implemented a variety of techniques, such as the Apache Hadoop parallel processing framework, the Apache Storm stream computing framework, and the Spark in-memory computing framework. The rational use of these technologies will further enhance the efficiency and capacity of topic analysis in social networks.
We use Sina Weibo, the most typical social platform in China, as the data source because its data are typical and representative. We then propose a Negative Binomial Distribution model to fit the distribution of Weibo topic words and evaluate the fitting results. Based on this model, we present a method to simulate the ‘hot degree’ and ‘dispersion degree’ of Weibo topics. Note that the ‘hot degree’ indicates the intensity or frequency with which a term is used in real time, and thus the trend of its use. Furthermore, considering the characteristics of Weibo data, we abstract the Weibo data into stream data and design an algorithm based on time windows to track topics through real-time computing. Compared with traditional trend prediction methods that are based simply on word frequencies, our method reduces randomness and improves accuracy. Through these three steps, real-time discovery and tracking of bursting topics is achieved by statistical methods.
In addition, considering the problem of timeliness, we apply the popular Spark Streaming, a real-time stream computing framework, to design and implement the real-time topic detection and tracking algorithm. This framework helps handle the big-data computation and analyze the data in real time. This research can be applied to applications that aim to increase public awareness of emergencies in real time and to support timely decision-making by relevant agencies.
This paper includes six sections. Section 1 is the introduction, followed by a detailed review of related work in Section 2. In Section 3, the Negative Binomial Distribution model is introduced and verified to be rational and correct when fitting the trend of the Weibo word frequency. Then the computing method of the corresponding parameters combined with Weibo data is proposed, and methods to track the topic trends by tracing the model are described. In Section 4, the proposed topic detection and tracing algorithm based on stream data is introduced, and the overall architecture and real-time implementation in the Spark Streaming framework are described. In Section 5, several experiments are conducted. First, we compare the fitting results between the Negative Binomial Distribution model and the usual Poisson Distribution model. Then, we run the proposed algorithm to track the topic trends and present a discussion. Comparison with real situations shows that the topic detection and tracing approaches proposed in this paper are feasible and rational. Finally, our conclusions are presented in Section 6.
Related work
Weibo is a social platform that covers massive dynamic information, and an increasing amount of research has been dedicated to data mining, topic propagation prediction, topic detection and related research. In relation to this topic, several applications have been reported.
Traditional topic detection is mostly based on clustering, classification, topic modeling and other methods [1], and most of these approaches are applied in an off-line fashion. Ref. [11] used an infinite-state automaton to model stream data, in which state transitions signal changes in the bursts. Ref. [14] analyzed the characteristics of information dissemination on social networks and proposed an information trend prediction algorithm based on information classification and regression algorithms. Some algorithms improve the accuracy of topic detection by applying a Latent Dirichlet Allocation (LDA) topic model to extract hidden topic information [15, 16]. However, much traditional research is based on static historical data and does not consider the evolution of topic trends or the timeliness of topic detection.
In recent years, more scholars have realized that the value of social network topics decreases as time progresses, especially for events such as natural disasters and health alerts, which are important for early warning systems and public security. Some studies on social networks have addressed topics such as tornados [2] and earthquakes [3]. Timeliness is considered in an increasing number of topic detection methods, in which stream data are processed dynamically. Ref. [17] proposed a real-time analysis framework for Twitter data to detect burst topics in a timely manner and to analyze the corresponding topic trends; however, historical data are still required to determine the optimum parameters for each keyword. Ref. [18] addressed the identification of the first occurrence of emergencies so that alerts could be issued as early as possible. Ref. [1] improved the on-line LDA model with a dynamic update mechanism based on time slicing; to detect new ‘burst’ topics, the evolution of topics is recalculated each time the data are updated. Ref. [19] proposed a topic detection algorithm based on a windowing variation technique; although a burst-word list was obtained with a keyword ranker module, the dynamic trends of the topics were not considered. Overall, algorithms that considered timeliness ignored the dynamic tracing of topic trends, and these studies were conducted on specific small data sets without considering the real-time distributed parallel processing of big data. However, the large volume of social networking data makes processing efficiency and computing capacity two key issues to be considered.
Several studies take the above issues into account. Ref. [20] proposed a sketch-based topic model and achieved real-time detection, but it could not handle data streams with multiple topics. Ref. [21] used hierarchical clustering of tweets and ranked the clusters to obtain the topics within a long period (e.g., 24 hours), but did not take timeliness and burstiness into consideration. Ref. [22] detected potential topics in real time by first clustering the tweets and then analyzing whether the clusters meet certain characteristics; however, its parameters were not set flexibly enough to accommodate various types of data, such as Weibo.
In response to the limitations of previous research, we propose a method that addresses these needs. Our study goes beyond the real-time detection of burst topics and dynamic trend tracing based on the Negative Binomial Distribution: we also design algorithms suited to Weibo’s stream data, taking its features and real-time requirements into account. Moreover, the presented algorithm is applied to the Spark Streaming framework, and a specific parallel plan is proposed to address the problem of real-time analysis and processing of Big Data.
Topic trend model
In Section 3.1, the Negative Binomial Distribution model, the model chosen for our study, is introduced, and the theoretical basis for choosing it is presented. To provide a theoretical foundation for the subsequent study, the meanings and computing methods of its two parameters, m and r, are described. In Section 3.2, the hypothesis that the occurrence frequencies of Weibo words match the Negative Binomial Distribution is examined. A topic consists of several specific keywords, and thus the selected keywords are taken to identify the topic. To explore this assumption, the topic of the Nepal earthquake is taken as an example, and its Weibo data are analyzed. This approach lays the foundation for the subsequent research on trend tracing. Finally, the topic trend-tracing method using the Negative Binomial Distribution model is demonstrated in Section 3.3.
Negative Binomial Distribution model
In some of the literature, the Poisson Regression Model is adopted to fit the distribution of discrete random events [23]. The Poisson Regression Model requires that the outcome variable follow a Poisson distribution and that the occurrences of random events be independent. In a Poisson distribution, the mean of the random variable is equal to its variance. However, in many practical applications, such as the distribution of infectious diseases in an area or biological aggregation distributions, the count data are more dispersed and do not follow a Poisson distribution. In addition, the occurrences of events are often correlated. Therefore, in some cases, the Poisson Distribution Model does not adapt well to the actual situation.
The Negative Binomial Distribution is a compound distribution for which the parameter
Weibo burst topics consist of some specific keywords. When a topic appears, the corresponding keywords show a certain state of aggregation over a period of time, and these keywords have their own life cycle. Considering the discreteness and non-independence of microblogging data, we use the Negative Binomial Distribution to simulate the varying pattern of Weibo word frequencies with respect to their life cycles. The related definitions of the Negative Binomial Distribution are as follows:
The expectation and variance of the Negative Binomial Distribution are shown as Eqs (2) and (3):
According to Eq. (2), we set the expectation equal to m, and, then, we obtain Eqs (4)–(6), which are the variance and probability mass function of the Negative Binomial Distribution.
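For reference, one standard mean-dispersion parameterization of the Negative Binomial Distribution that is consistent with the description above can be written as follows. The symbols p and k, and the exact correspondence to the numbered Eqs (1)–(6), are assumptions made for this sketch; only m and r are taken from the text.

```latex
% Assumed notation: success probability p, size (dispersion) parameter r, count k
P(X = k) = \binom{k + r - 1}{k}\, p^{r} (1 - p)^{k}, \qquad k = 0, 1, 2, \dots

E(X) = \frac{r(1 - p)}{p}, \qquad \mathrm{Var}(X) = \frac{r(1 - p)}{p^{2}}

% Setting E(X) = m gives p = r / (m + r), and therefore
\mathrm{Var}(X) = m + \frac{m^{2}}{r}

P(X = k) = \frac{\Gamma(k + r)}{k!\, \Gamma(r)}
           \left( \frac{r}{m + r} \right)^{r}
           \left( \frac{m}{m + r} \right)^{k}
```

In this form the variance always exceeds the mean for finite r, and the distribution approaches the Poisson distribution as r grows large.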
From the probability mass function, we can see that there are two parameters,
Suppose that
In the formulas above,
Suppose that in the time series, for any time interval
To verify the assumption above, we obtain actual Weibo data, namely the 50 minutes of public Weibo data returned by the Sina Weibo API after the Nepal earthquake occurred, and analyze the goodness of fit of the Negative Binomial Distribution. The data crawling program accesses the API once every 30 seconds and, each time, obtains the latest 200 microblog posts. After the word-segmentation and word-frequency counting steps, we obtain 100 frequency data points for each word tested; here, we choose the words ‘Nepal’ and ‘Earthquake’ to test the goodness of fit.
The arrival rate of “Nepal”.
The word frequency variations of these two words over time within 50 minutes are shown in Figs 1 and 2. Then, we use the 10-minute data (20 data points), 20-minute data (40 data points), 30-minute data (60 data points), 40-minute data (80 data points) and 50-minute data (100 data points) of these two words to validate the assumption that the word-occurrence frequencies in Weibo match the Negative Binomial Distribution (NB distribution). In this experiment, the Pearson Chi-Square test is adopted, and the fitting results are shown in Tables 1 and 2.
Fitting results of the NB distribution model for the word ‘Nepal’
Fitting results of the NB distribution model for the word ‘Earthquake’
The arrival rate of the word ‘Earthquake’.
The null hypothesis of the Pearson Chi-Square test is that the test data are consistent with the theoretical distribution. In the fitting results, the p-values are all greater than the significance level (0.05), so the null hypothesis is not rejected. Thus, the assumption that the word occurrence frequency in Weibo matches the Negative Binomial Distribution model is supported for both short and long time spans. Therefore, the Negative Binomial Distribution model can be employed to simulate the word frequencies in Weibo at different times. Furthermore, in Section 5.1, we compare the fitting degree between the Negative Binomial Distribution model and the frequently used Poisson Distribution model, which further validates the rationality of using the Negative Binomial Distribution to fit the Weibo data.
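The goodness-of-fit check described above can be sketched as follows. This is a minimal illustration, not the authors' test procedure: it assumes the mean-dispersion form of the distribution with parameters m and r, and the binning scheme (`max_k` plus one tail bin) and the function names are hypothetical choices.

```python
import math
from collections import Counter


def nb_pmf(k, m, r):
    """P(X = k) for a negative binomial with mean m and dispersion r (assumed form)."""
    log_p = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
             + r * math.log(r / (m + r)) + k * math.log(m / (m + r)))
    return math.exp(log_p)


def chi_square_nb(counts, m, r, max_k):
    """Pearson statistic over the bins 0, 1, ..., max_k - 1 plus a tail bin >= max_k."""
    n = len(counts)
    # Observations at or above max_k are pooled into the tail bin (key max_k).
    observed = Counter(min(c, max_k) for c in counts)
    probs = [nb_pmf(k, m, r) for k in range(max_k)]
    probs.append(1.0 - sum(probs))  # tail probability P(X >= max_k)
    stat = sum((observed.get(k, 0) - n * p) ** 2 / (n * p)
               for k, p in enumerate(probs))
    dof = len(probs) - 1 - 2  # two parameters (m, r) are estimated from the data
    return stat, dof
```

The p-value then follows from the chi-square distribution with `dof` degrees of freedom, for example from a statistical table or a library routine such as `scipy.stats.chi2.sf(stat, dof)`.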
Thus far, the model’s performance has been tested. In the next section, the word trend tracing method is introduced based on this model.
As mentioned above, if a discrete random variable is consistent with the Negative Binomial Distribution, then the two arguments can be calculated as Eqs (7) and (8).
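A minimal sketch of such a calculation is given below. It assumes the standard method-of-moments estimators that follow from E(X) = m and Var(X) = m + m²/r, which may differ in detail from Eqs (7) and (8); the function name is hypothetical.

```python
from statistics import mean, variance


def estimate_nb_params(freqs):
    """Return (m, r) by the method of moments (assumed estimator)."""
    m = mean(freqs)
    s2 = variance(freqs)  # sample variance
    if s2 <= m:
        # No overdispersion: Var = m + m^2/r has no finite positive solution for r.
        return m, float("inf")
    return m, m * m / (s2 - m)
```

For example, the frequencies `[1, 3, 5, 7, 9]` have mean 5 and sample variance 10, so the sketch returns m = 5 and r = 25 / (10 − 5) = 5.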
From a mathematical perspective,
Burst events often occur with an abnormal frequency growth of related Weibo words over a period of time. Changes in word frequency lead to variations in the model’s arguments. Thus, tracking the variations in
Using the model to fit the distribution of a certain Weibo word and capturing its word frequency in a time sequence, these two arguments can be calculated. In the model,
Algorithm design of real-time topic tracing and implementation based on Spark Streaming
This part consists of four sections. First, the overall computing framework of topic discovery and trend tracing based on the Weibo stream data is introduced in Section 4.1. Next, the methods of obtaining the data streams and preprocessing the data are presented in Section 4.2. In the subsequent section, the topic discovery and tracing algorithm based on sliding windows is demonstrated. Finally, the implementation of the real-time tracing algorithm on the popular real-time processing framework Spark Streaming is described.
The architecture of the calculation process
Weibo produces large amounts of data every moment, and the value of the data decreases over time. The timeliness of topic discovery has therefore become a key issue. Thus, we abstract Weibo data into stream data according to its characteristics and introduce a computing framework based on stream data, in which the data are processed continuously as they arrive. This is precisely real-time data analysis.
Figure 3 shows the overall computing framework of topic discovery and tracing based on the stream data. The procedure consists of the following steps.
1. Obtain data through the Weibo API, with the program set to access the API once every 30 seconds. A batch of the data stream is thus acquired every 30 seconds, and each time the raw data set is sent to the message-oriented middleware.
2. Preprocess the raw data, including word segmentation, stop-word filtering and word-frequency counting.
3. Use the algorithm proposed in Section 4.3 to compute the hot degree and the dispersion degree of each word based on the Negative Binomial Distribution model to further describe the word’s trends.
4. Repeat Steps 1–3 to obtain the hot degree and the dispersion degree over an interval of time. By analyzing the variations of these two factors, every word’s growth trend and popularity level can be estimated to generate a final data set of hot words. Thus, the trends in the hot words can be tracked continuously, and their life cycles can be indicated.
Calculation process of topic detection and trend tracing.
Sina Weibo, the largest microblogging service platform in China, provides developers with an open API service, including hundreds of API interfaces and mainstream SDKs. Through the corresponding API interface, the latest Weibo data can be obtained. The data accessing and preprocessing steps are as follows:
Through the official API, the latest microblog data are acquired in JSON format. To facilitate the computation and analysis of the model arguments in real time, we accomplish the secondary development on the basis of the official API, in which the latest 200 microblog data can be obtained for every For preprocessing the raw data, the open-source word segmentation tool, IK Analyzer,1
After the word segmentation, the microblog text is divided into a number of meaningful words. In this step, the word frequencies are counted for use in the subsequent topic discovery and trend tracing. Each piece of data from Step 2 becomes an ordered triple (word,
The method used for topic detection and trend computing, based on the Negative Binomial Distribution, has been introduced in Section 3.3. To achieve real-time topic analysis, we focus on the latest data and discard the old data [19, 17]. Considering the randomness of the word frequencies over some intervals of time, we propose a trend computing algorithm based on a sliding time window.
The main idea of the algorithm is to design a time window and wait for the word frequency to fill this window. When the window is full, the trend for this word is computed, and the sliding window moves forward one unit to wait for the next data. Therefore, the multiple sets of
As Fig. 4 demonstrates, suppose that the size of the time window is
The trend computing algorithm based on sliding time window.
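The sliding-window procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the two model parameters are written m and r as in Section 3, they are estimated by the method of moments, and the function name is hypothetical.

```python
from collections import deque
from statistics import mean, variance


def track_trend(freq_stream, window_size):
    """Yield (m, r) each time the window is full, then slide forward by one unit."""
    window = deque(maxlen=window_size)  # window_size must be at least 2
    for freq in freq_stream:
        window.append(freq)  # once full, the oldest value is dropped automatically
        if len(window) < window_size:
            continue  # still waiting for the window to fill
        m = mean(window)
        s2 = variance(window)
        # Assumed moment estimator; r is infinite when there is no overdispersion.
        r = m * m / (s2 - m) if s2 > m else float("inf")
        yield m, r
```

A stream of n frequencies with a window of size w yields n − w + 1 parameter pairs, one per window position, which is the sequence that the trend analysis then follows.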
As shown in Section 3.3, by computing the hot degree
The following algorithm describes the topic discovery and tracking process based on the time window. Lines 1–5 show the data preprocessing algorithm. First, the Weibo data, such as the Weibo contents and the corresponding time stamps, are extracted. Second, the Weibo contents are segmented into words through the Map function. At the same time, the stop-word dictionary is employed to filter out meaningless words, such as demonstrative pronouns, conjunctions, auxiliary words and punctuation, to reduce the computing cost of the subsequent analysis. Finally, the Reduce function is used to count the word frequencies in the Weibo content and to generate the tuple ([word, timestamp], freq).
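The preprocessing stage can be illustrated with a small single-machine sketch that mirrors the Map and Reduce steps. The whitespace-based `segment` function is only a stand-in for a real word segmenter, and the stop-word set and function names are hypothetical.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "and", "of", "in"}  # hypothetical stop-word dictionary


def segment(text):
    """Stand-in for a real word segmenter: lowercase and split on whitespace."""
    return text.lower().split()


def preprocess(posts):
    """Turn (timestamp, content) posts into ((word, timestamp), freq) tuples."""
    counts = Counter()
    for timestamp, content in posts:
        for word in segment(content):  # map step: content -> words
            word = word.strip(".,!?;:\"'")  # drop surrounding punctuation
            if word and word not in STOP_WORDS:
                counts[(word, timestamp)] += 1  # reduce step: sum per key
    return [((word, ts), freq) for (word, ts), freq in counts.items()]
```

Keying the counts by (word, timestamp) keeps one frequency value per word per sampling instant, which is the form the time-window computation consumes.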
Lines 6–12 demonstrate the process of topic discovery and trend tracing. First, it is determined whether the time window is filled. If not, a new arrival word frequency is added to fill the window; then, the values of
Spark Streaming, a scalable, high-throughput and fault-tolerant real-time stream data-processing framework based on Spark Core, was created to address the demands of the real-time responses and the increasing amount of data. Spark Streaming divides the stream computing into a series of short batch jobs, and each piece of data is converted to a Spark RDD (Resilient Distributed Dataset). Spark can convert these RDDs into intermediate results and store them in files or output them to an external device according to the business’s needs. At the same time, the Spark framework provides a distributed, scalable computing platform that can better resolve the performance bottleneck of the massive amount of data being processed.
Because the proposed algorithm is based on real-time stream data and the Weibo platform processes a large volume of data, the Spark Streaming computing framework is adopted to achieve the dynamic discovery and real-time tracing of topics. According to this framework, the batch-processing time interval, window length and sliding time interval are set to determine the processing time intervals of the Weibo data. These indicators can be adjusted depending on the actual situation.
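In Spark Streaming, the window length and the sliding interval must both be multiples of the batch interval. The following sketch checks this constraint and converts the durations into batch counts; the function name is hypothetical, and using the 30-second API access interval as the batch interval is an assumption.

```python
def window_plan(batch_s, window_s, slide_s):
    """Return (batches per window, batches per slide) for durations in seconds."""
    if window_s % batch_s or slide_s % batch_s:
        raise ValueError(
            "window length and sliding interval must be multiples of the batch interval")
    return window_s // batch_s, slide_s // batch_s
```

With a 30-second batch interval, a 20-minute window and a 2-minute sliding interval, `window_plan(30, 20 * 60, 2 * 60)` gives 40 batches per window and 4 batches per slide.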
The implementation logic of the topic discovery and tracing algorithm of the burst topics in Spark Streaming is designed based on the basic principle of Spark Streaming. The computing process has been introduced in Section 4.1, and the implementation architecture based on Spark Streaming is shown in Fig. 5.
Kafka, a distributed message queue with good throughput, reliability and scalability, is employed here as the data transfer middleware. It receives stream data from the Weibo platform via the Weibo API, accepts requests from Spark Streaming, and sends the corresponding data to the Spark Streaming cluster.
DStream (Discretized Stream) is the abstraction of stream data in Spark Streaming. It can be obtained either from an external input source or by transforming another input stream. Each DStream contains the stream data of a specific time interval. Thanks to the scalability provided by Spark, this implementation logic can be scaled out in the Spark cluster according to the amount of data, which ensures the data processing capacity. As shown in Fig. 5, the entire implementation logic processes and converts the Weibo data many times. The functions of each node are listed in Table 3.
Logical implementation of the algorithm based on Spark Streaming
Architecture of the topic detection and trend tracing algorithm based on Spark Streaming.
In this section, the feasibility of the Negative Binomial Distribution model is first validated by comparing it against the frequently used Poisson Distribution model. Then, the algorithm proposed in this paper is applied to show the results of burst Weibo topic detection and trend tracing.
The comparison of the fitting results between the Negative Binomial Distribution and Poisson Distribution
As described in Section 3.1, the Poisson Distribution is often used to simulate the probability distribution of the occurrence frequencies of random events in a large number of trials. However, because it requires the variance to equal the mean, the Poisson Distribution cannot adapt to the actual situation in some cases. For example, if the actual data are more dispersed, with a variance greater than the mean, the Poisson Distribution is not suitable. Therefore, considering the dispersion of the Weibo data, the Negative Binomial Distribution is adopted to fit the distribution of the Weibo topics.
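A quick way to see whether a frequency series violates the Poisson assumption is the variance-to-mean ratio. The following sketch is a simple diagnostic and not part of the authors' procedure; the function name is hypothetical.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Variance-to-mean ratio: about 1 for Poisson data, above 1 when overdispersed."""
    return variance(counts) / mean(counts)
```

A ratio well above 1 indicates that the data are more dispersed than a Poisson model allows, which is the situation in which the Negative Binomial Distribution is preferred.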
To verify that the Negative Binomial Distribution fits the Weibo topic data more accurately, we use the Weibo data after the Nepal earthquake, as mentioned in Section 3.2, as samples, and we compare the fitting results of the Negative Binomial Distribution and Poisson Distribution with the words ‘Nepal’ and ‘Earthquake’. Here, we adopt the Pearson chi-square test to verify the fitting degree, and we implement the test by using the R language. The comparison results are shown in Tables 4 and 5.
Comparison of the fitting results between NB and Poisson with the word ‘Nepal’
Comparison of fitting results between NB and Poisson with the word ‘Earthquake’
To enhance the credibility of our experiments, we analyzed the Weibo data after the large explosion at Tanggu, Tianjin, China on 12 Aug 2015 in the same way as above. The results are shown in Tables 6 and 7.
Comparison of fitting results between NB and Poisson with the word ‘Tianjin’
Comparison of fitting results between NB and Poisson with the word ‘explosion’
As shown in the tables, the Poisson distribution marginally succeeds at fitting the data over the short time periods but fails over the long term. Nevertheless, even though the Poisson Distribution can fit the occurrence frequency over a short time period, its p-value is smaller than that of the Negative Binomial Distribution. This finding indicates that the Negative Binomial Distribution performs better than the Poisson Distribution at fitting the Weibo data.
As previously mentioned, after the Nepal earthquake occurred, the Weibo data were obtained from the Weibo API as a data source. In the computing topology, the program is set to access the API once every 30 seconds to obtain the most recent 200 pieces of Weibo data.
The changes in the topic word sets obtained by running our proposed algorithm are shown in Table 8. Here, we set the time-window size to 20 minutes and update the results every 2 minutes.
The results of the topic word detection
Table 8 shows that some topic words, such as ‘Earthquake’ and ‘Nepal’, are persistently maintained in the word sets. This finding is consistent with the real situation. Some words, such as ‘rescue’, ‘urgency’ and ‘disaster’, also occurred frequently. These words are closely related to the Nepal earthquake. Furthermore, the words ‘constellation’ and ‘weather’ are popular on Weibo at any time, but they do not represent any meaningful burst events. This type of word can be filtered out in the data preprocessing step. In addition, advertising words, such as ‘red packets’, ‘taking a taxi’ and ‘coupon’, occur occasionally. These words are detected because advertising topics can occur frequently during a specific time period, but the lifecycles of these topic words end quickly.
In summary, for real burst topics and emergencies, the keywords occur continuously in the word set for a long time, whereas the lifecycles of advertising topics do not last very long.
Here, we choose the topic word ‘Nepal’ as an example, and we trace its trends and discuss the impact of the time window size on the results. Figure 6 shows the variations in the hot degree
Trend tracking results of a topic’s Hot Degree.
Figure 6 shows that the overall hot degree trends of the topic are similar even though the time window sizes are different. The value of the hot degree rises rapidly at the beginning, followed by smooth fluctuations over a period of time, and then, it declines at approximately 35 min and, finally, climbs back. During the total 50 minutes, two periods of obvious increase occur, from 10 min to 20 min and from 35 min to 45 min.
Moreover, the curve with the larger time window of 20 minutes is smoother than the curve with the time window of 10 minutes, because a larger time window has a smoothing effect on the random data. Furthermore, when the time window size is 20 minutes, the peaks appear later than they do with the time window of 10 minutes.
Figure 7 displays the changes in the dispersion degree
Tracking results of the topic’s Dispersion Degree.
As previously mentioned, there is a negative correlation relationship between the value of
At the same time, during the period in which the hot degree obviously increases, the dispersion degree also has an apparent downward trend. During the period in which the hot degree fluctuates continuously, the dispersion degree decreases and finally remains stable. This finding illustrates that when the topic’s hot degree changes significantly, the dispersion degree
Combined with Figs 6 and 7, when the time window size is small,
In this study, we investigate the occurrence regularity of Weibo words and propose using the Negative Binomial Distribution model to fit the word distribution; in addition, the Weibo topic trend is simulated by tracking the model arguments. A topic detection and trend-tracing method based on the time window is designed, which reduces the randomness of the data and improves the accuracy of the results, making it better suited to analyzing the Weibo stream data. Furthermore, based on the Spark Streaming real-time stream computing framework, the architecture of Weibo topic discovery and real-time tracing is designed to implement the relevant algorithm in real time. Experiments demonstrate the effectiveness of the proposed method, and we further discuss the influence of the time window size on the results. The research in this paper can be used for applications designed to inform the public of emergencies in a timely manner and for applications that assist related agencies in making important decisions in time.
Acknowledgments
This paper is supported by the National Natural Science Foundation of China under Grant No. 60940032, No. 61073034, and No. 61370064; the Program for New Century Excellent Talents in University of Ministry of Education of China under Grant No. NCET-10–0239; the Science Foundation of Ministry of Education of China and China Mobile Communications Corporation under Grant No. MCM20130371; the Open Project Sponsor of Beijing Key Laboratory of Intelligent Communication Software and Multimedia under Grant ITSM201503; and the National Social Science Foundation of China under Grant No. BCA150050.
