Abstract
Cyber Social Networks have increasingly become the main source of information propagation, owing to the rapid growth of micro-blogging activity between socially connected people. Detecting disaster events in huge volumes of data on a fast-streaming platform is quite challenging. In this paper, an information entropy based event detection framework is proposed to identify an event and its location by clustering tweets with a relatively high density ratio in Twitter data. The Shannon entropies of target users, locations, time intervals and hashtags are estimated to quantify the dissemination of events (“how-far about”) in the real world using an entropy maximization inference model. Geo-tagged (spatial) tweets are extracted for a specified time period (temporal) to identify the location of an event (“where-when about”), and the event is visualized on geo-maps. The evaluation parameters Entropy, Cluster Score, Event Detection Hit and False Panic Rate are measured during four major disaster events to illustrate the effectiveness of the proposed framework. The retweeting activity of Twitter users is classified as either human signatures or bots. The experimental outcome determines the scope and significant dissemination direction of events from a new perspective, and demonstrates an improved event detection accuracy of 96%.
Introduction
In recent years, Cyber-Social Networking (CSN) platforms like Twitter, Facebook, Sina Weibo and Tumblr have encapsulated huge chunks of information comprising insightful user opinions about various events [1]. The problem of evaluating unstructured data in user generated content has ubiquitous applications, including the identification of conversation topics and abnormal events [2]. Online events that match actual real-time events span diverse temporal and spatial scales; detecting them, particularly disasters such as earthquakes, floods and landslides, helps to deliver prior alerts, support and immediate recovery [3, 4]. CSN act as communication tools and double up as informative podiums that catch real human voices reaching out during disasters. However, information about ongoing events and sensitive issues often remains unnoticed. Fortunately, owing to the wide usage of smartphones, CSN are now used as an additional means of measuring the impact of disaster events [5]. Geographically linked CSN data containing geo-tags is accepted as a trustworthy source for detecting disasters and supports a wide range of accessible services [6]. In 2015, a study explored the geographic retweeting activity of the collective Twitter population around the globe during the Hurricane Sandy disaster [7]. Feature-rich classifiers have been used to determine the microblogs that are most relevant to a specified event [8]. Events can be recognized as abnormal spikes in activity [9] by monitoring the normal flow of microblogs using change point detection approaches [10]. Such monitoring determines the intensity of a sensitive or abnormal event from the fluctuations in content. The annotation scheme [11] for identifying relevant tweets after disasters is usually carried out using keyword-based collections without location features. In the FuSeO model [12], tweets were semantically examined to detect overlapping event communities by relative density-ratio estimation.
The community clusters were created using a keyword-based ontology which consists of the most appropriate hashtags and keywords. The challenge of quantifying the in-depth traces of event information dissemination is addressed in our proposed work by introducing new optimization parameters. The information entropy based framework proposed in this paper uses the metadata of a tweet to detect disaster event diffusion and to comprehend online user behaviour during crisis events. The proposed framework is specific to the Twitter platform and considers the retweeting activity of online users to address the challenges involved.
This paper is organized as follows. The existing event detection models are discussed in Section 2. The Shannon entropy maximization inference model to quantify events is postulated in Section 3. The proposed event detection framework is presented in Section 4. The experiments and results are discussed in Section 5. The conclusion and future work are discussed in Section 6.
Related Work
Various methods to evaluate CSN data and build supportive frameworks and models for online event detection have been proposed over time. In early 2015, a system for sensing complex social systems was introduced which uses data from Bluetooth-enabled mobile telephones to measure information using a Shannon entropy threshold method [13]. Social patterns were recognized across different contexts, namely regular user activity and user connectivity, to detect event locations. A graph based entropy model was deployed in [14] to study information flow in a social entity using a part-of-speech tagger. It focused on node entropy in the social entity graphs with random error pruning. The event detection studies in [15, 16] presented a critical survey of event detection methodologies using Twitter data. They classified event detection techniques as specified or unspecified, the task of detection as retrospective or new event detection, and the methods of detection as supervised or unsupervised. The Ghosh quantification model [17], which is based merely on tweet content, combined tweets and their meta features with ‘retweeting’ dynamics to examine the behaviour of Twitter users. The activities were categorized using user generated content, namely tweets and retweets. It investigated the process of spreading specific pieces of information using keyword sequences and the users’ viral diffusion behaviours, and is used for a comparison with our work. The Ghosh model categorized the tweeting and retweeting activity in Twitter as newsworthy information, robotic activity, advertising and promotion, and campaigns.
In [18], events were identified using topic summarization on microblog features, represented as a bipartite graph. The bipartite graph was generated to capture the relationship between two topics discussed during the same time period in order to summarize events. The node entropy in the bipartite graph was used with consistent event pairs to merge events. In [19], Twitter hashtags were used to indicate events, based on which the entropy of each user node (contributing author) was mined. An entropy based regression model [20] was used for identifying trending topics that follow a binomial distribution.
A study on the popular Sina Weibo platform computed cosine similarity to evaluate the similarity of the influential corpus [21]. The burst corpora which repeatedly occurred in batches of incoming messages were closely related to clusters. The methodology resulted in 75% accuracy, which is lower than that of the proposed approach. The temporal dynamics of tweets’ popularity, including lifecycles and tipping points, were used for event detection. The Mention-Anomaly-Based Event Detection (MABED) approach depends simply on tweets and the cluster score to assess the magnitude of event impact over the crowd [22]. MABED detected sub-events by performing spatial and temporal random indexing on textual descriptors.
In the field of disease outbreaks, the CSN agent [23] estimated the information gain in the event location, evaluating the potential time of influenza affected zones using keyword entropy. The study classified syntactic parse trees to detect disease-reporting sentences on Twitter. An unsupervised Bayesian model [24] extracted structured representations of events in the form of quadruples, namely entity, keyword, date and location, and further categorized the extracted events into event types. General query language models [25, 26] extracted events through a series of machine-learned approaches, which is adopted in our work. For our experimentation, the event detection process uses the Twitter parameters location coordinates, place check-ins and geo-tags for spatial information.
Information entropy based quantification model
The proposed work deals with Twitter user activity, which is transformed into an information theoretic method by calculating the information entropies to detect events. The entropy maximization inference quantification model is shown in Fig. 1. It takes Twitter tweets and retweets as input. The process uses the objective function and the entropy parameters, namely Hashtag Entropy (E_H), Time Interval Entropy (E_T), User Entropy (E_U) and Location Entropy (E_L), to detect events. The motivation towards the event detection problem is two-fold: first, maximizing the information entropy, which minimizes the overall prior knowledge established in the distribution of tweets; second, the proposed inference model, which provides the maximum entropy value over a period of time. The final convex-constrained optimization formulation of the Shannon entropy E for the given function is as follows,
Tweeters and Retweeters
CSN users are strongly correlated to carry information to extremes in terms of tweets and retweets. The tweets disseminate an event at any time interval from a location.
Entropy Distribution
The entropy distribution given in Equation (1) and Equation (2) measures the propagation of the trending event from its source. Users impulsively create more influence because of network size and density, as well as a tendency on the part of their followers to retweet posts whenever an event occurs.

Entropy Maximization Inference Model.
The set of Twitter posts of a user u ∈ U with a topic word is represented as P_{u,word}. Let n_{u,p} be the total number of posts by users. Let b represent the type of post, where b = 1 indicates normal tweets, b = 2 retweets (RT) and b = 3 replies (RE). The time interval of a particular user u, who posts p with the topic word W from (t - 1) to t, is expressed as Δt_{u,p,t}, where Δt_{u,p,0} = 0 because the first post is assumed to be at the starting time. The regularity of the traces of tweets is measured using the time interval entropy on topic words. The frequency of the word W in a given time interval Δt_{u,p,t} is calculated as follows.
The measure of similar user distribution on topic words represents the similar user entropy. Let the random variable D represent a distinct user in trace T_i with all the possible values {d_1, d_2, ..., d_n}. The number of retweets from user U_i is found in trace T_i, where P_F represents the probability density function of D, such that P_F(d_i) gives the probability of a retweet being generated by user U_j, and the frequency of user U_i ∈ U, with post M to user U_j ∈ U.
The hashtag entropy is a measure of the significant hashtags in a set of tweets. Consider N tweets containing the hashtag (HT), used frequently by K users and generated g_1, g_2, ..., g_K times relevant to a particular event,
The frequency of the location geo-tagged by diversified users is captured using the location entropy. Let {l : l_1, l_2, ..., l_n} be the set of locations which are geo-tagged in tweets by a set of users {u : u_1, u_2, ..., u_n}; the location entropy is formulated as
The cumulative score of the tweets that discuss the same event is computed using the keyword score (KS), tweet score (TS) and cluster score (CS) as follows.
Each word relevant to a disaster is assigned a value in a dictionary. Based on the word score, the tweet score is computed as follows.
The tweet score is calculated for each keyword and normalized to the interval 0 to 1. Moreover, the score of a cluster is computed as follows.
The higher the cluster score, the more likely the cluster is to represent an informative event. Therefore, the entropy is calculated for each cluster, which estimates the quantum of information spread from each cluster.
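As a minimal sketch of the per-cluster entropy estimate, assuming the standard Shannon form H = -Σ p_i log2 p_i over the empirical distribution of a categorical feature (hashtags, locations, users or binned time intervals), the computation could look as follows. The function names and the sample hashtags are illustrative assumptions and are not taken from the original framework.

```python
import math
from collections import Counter

def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum of -p * log2(p) over every distinct value.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def normalized_entropy(items):
    """Entropy divided by its maximum log2(k), so the result lies in [0, 1]."""
    k = len(set(items))
    if k <= 1:
        return 0.0
    return shannon_entropy(items) / math.log2(k)

# Hypothetical hashtags observed in one event cluster (assumed sample data).
hashtags = ["#AssamFloods", "#AssamFloods", "#Assam", "#flood"]
print(shannon_entropy(hashtags), normalized_entropy(hashtags))
```

The normalized variant is one possible way to compare features with different numbers of distinct values on a common 0 to 1 scale; whether the original work normalizes in this way is not stated.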
The proposed work identifies the disaster-related event based on the tweet corpus and evaluates the dynamics of the tweeting activity during the disaster. The framework performs change point detection and burst corpus identification, followed by location detection. The tweets collected from Twitter are stored in a tweet repository and accessed iteratively to retrieve the most recently collected tweets. The system is pre-configured with a corpus library for referencing burst words. The library holds the usage history of each word w, so the system recognizes the non-bursty and bursty occurrences of w over a specified time interval. During pre-processing, the most relevant tweets are filtered by eliminating noisy and irrelevant tweet corpora. Further, stop-word removal, stemming and tokenization, including lookups, are performed using the WordNet dictionary [27]. The keywords in an identical subset appear in similar topics of discussion. Instead of detecting the most similar words, the system identifies the event clusters that are most similar to preceding clusters in order to detect an on-going event. The proposed event detection framework is shown in Fig. 2.

Proposed Event Detection Framework.
Each tweet is converted into vector format using the bag-of-words vector space approach. After pre-processing, the burst corpus is identified from the meta-features, namely text, location, hashtag, time, URL and images. Each tweet is assigned a value using a set of five distinctive word features: words, tags, links, mentions and users. Each cell value in the matrix, ranging from 0 to 1, reflects the importance of the particular term in the specific tweet. The Document Incidence (DI) is the number of tweets in which the word appears. The Global Frequency Rate (GFR) is the total number of times the word appears within the tweet dataset between the First Interval (FI) and the Last Interval (LI). The Temporal Burst Ratio (TBR) is obtained by calculating the Z-score of the term between intervals to detect change points.
The Local-Global (LG) Ratio of a term is computed as the ratio of its local (interval) frequency to its global document frequency. The Event Link Ratio (ELR) is the ratio between the number of tweets containing links related to an event and the total number of tweets for that interval. The lower the ELR, the fewer the links between events; conversely, highly linked events have a value approaching unity. A high Z-score means either that the word is unusually frequent and therefore likely to be a good descriptive word being diffused, or that rare topics (high homogeneity index) are being discussed within the interval from the event location. This helps to determine the behavioural homogeneity among Twitter users. Finally, the event clusters are formed with k-means clustering using the event detection algorithm.
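The three burst features can be sketched as below. This is an illustrative reading of the definitions, assuming that TBR is the Z-score of the current interval count against the counts of earlier intervals (population standard deviation), and that a tweet is a dictionary with an optional "urls" list; these representations and all function names are assumptions, not the original implementation.

```python
from statistics import mean, pstdev

def temporal_burst_ratio(history, current):
    """Z-score of the current interval count against earlier interval counts."""
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0  # no variation in the history, so no burst can be measured
    return (current - mean(history)) / sigma

def local_global_ratio(local_freq, global_doc_freq):
    """Ratio of the interval (local) frequency to the global document frequency."""
    return local_freq / global_doc_freq if global_doc_freq else 0.0

def event_link_ratio(tweets):
    """Share of tweets in one interval that carry at least one link."""
    if not tweets:
        return 0.0
    with_links = sum(1 for t in tweets if t.get("urls"))
    return with_links / len(tweets)

# Hypothetical per-interval counts of one term, followed by a sudden spike.
history = [2, 4, 4, 4, 5, 5, 7, 9]
print(temporal_burst_ratio(history, current=11))
```

A term whose Z-score exceeds a chosen cut-off would then be treated as a burst candidate; the cut-off itself is a tuning decision that the text does not specify.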
Algorithm: Event Detection Algorithm
The event clusters are formed using k-means clustering based on the similarity measure. Similar users are clustered using the Jaccard Coefficient Index, which is used to find the distance between user pairs. The resultant k-means clusters are assigned a score, called the Cluster Score, using Equation (10) in order to retain meaningful event labels for the datasets tabulated in Table 1.
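The Jaccard Coefficient Index between two users can be illustrated with a short sketch, assuming each user is represented by the set of keywords appearing in their tweets; this representation and the sample keyword sets are assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two keyword collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as having no similarity
    return len(a & b) / len(union)

def jaccard_distance(a, b):
    """Distance between a user pair, defined as 1 minus the Jaccard coefficient."""
    return 1.0 - jaccard(a, b)

# Hypothetical keyword sets of two users tweeting about the same event.
user_a = {"flood", "assam", "rescue"}
user_b = {"flood", "assam", "relief"}
print(jaccard(user_a, user_b), jaccard_distance(user_a, user_b))
```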
Summary of Disaster Events Identified using Four Datasets
Online User behavioral analysis
Tweets with the highest burst ratio for a specific time interval are filtered. For the experiments, a training dataset collected using the Twitter API comprises a total of around 20,000 tweets captured during various disasters between 2015 and 2016, of which 16,000 are valid tweets that identify three disaster types, namely flood, landslide and earthquake events (Table 1), with a tweet arrival rate λ = 100 tweets per minute. The attributes considered for the analytics are Twitter user ID, time, time zone, tweet, geo-location, URL, retweet count, replies and followers. After pre-processing the tweets, word indexes are constructed to compute the statistical model for the distinct parameters. Similar users with similar content are clustered using the Jaccard Coefficient Index [28]. The TF-IDF is computed for 3-grams, since a uni-gram approach cannot significantly distinguish the event occurrence. Initially, the tweets are clustered by similar users who retweeted similar tweets. The Homogeneity Index (HOM) parameter is the percentage of tweets within a particular interval that use the same keywords. These categories are used as class labels, and the associated words in tweets are trained for each class. The overall statistics of the word indexes are given in Table 2. They represent the usage of burst keywords and location names to detect the event type as ‘flood’ and its location as ‘Assam’.
Tweet Word Statistics (Dataset: Assam Flood Tweets)
The observations made on the dynamics of retweeting activity are captured in four distributions: time interval, hashtag, location, and user distribution (including similar users). The arrival rate (time/hour) and the frequency of retweeting activity are recorded using Chorus TweetVis [29]. The retweeting frequency of popular Twitter accounts such as India News (online news posting on Twitter), BarkhaDutt (a popular Indian journalist and columnist on NDTV) and PTTV Online (a popular Tamil television channel online) is examined in our experiments and shown in Fig. 3. For the human signature on Twitter, inter-arrival rates of different lengths are likely to be equally distributed: India News, BarkhaDutt and PTTV Online tweeted on Twitter with a significant distribution of retweeting frequencies by their followers. The automated activity of the bot account Bloombrg Newsish, shown in Fig. 4, consists of retweets at regular intervals of time, culminating in an isolated maximum peak in the distribution. Figure 5 represents the retweeting activity by distinct users and hashtags, and shows that location based hashtags receive the maximum retweets. The timeline analysis of the tweet distribution for Assam flood-related tweets from July 1, 2016 to August 21, 2016 is shown in Fig. 6, which records the tweet spikes with maximum homogeneity.

Retweeting Activity of News Channels (Dataset #1).

Retweeting Activity of Bot (Dataset #1).

Retweeting activity by Distinct Users and Hashtags (Dataset #1).

Timeline Analysis (Tweet Arrival) during Assam Flood Event.
The retweeting activity in the Assam flood dataset shows the actual number of tweets that spiked, of which only 40% contain informative content (URLs, hashtags, geo-tags), and signifies 89% homogeneity of similar users who use a similar set of keywords to represent an event. In addition, the tweets are classified into two classes, Class A (without multimedia content) and Class B (with multimedia content), using a naïve Bayes classifier. The entropy values in Table 3 show that users retweet more multimedia content, such as images, videos and audio links, than raw text tweets without multimedia content. The procedure of spreading specific pieces of information, including corpus sequence and viral diffusion behaviours, is used for a comparison with our work. The activities are characterized as two distinct categories of retweeting activity on Twitter: robotic activity and newsworthy information. In contrast to the human accounts in Fig. 3, Bloombrg Newsish (bot account) [30] shows an automated retweeting frequency at regular intervals of time, culminating in an isolated maximum peak in the distribution, which correlates with the entropy maximization inference model. The most significant parameters for burst detection are identified as the tweet arrival rate λ and the number of tweets N containing the word.
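The contrast between human and automated retweeting can be illustrated with a small sketch that bins inter-arrival gaps and measures their Shannon entropy: perfectly regular retweets collapse into one bin and give zero entropy, whereas irregular gaps spread across bins. The bin width, the 0.5 threshold, the minimum sample size and the function names are all assumptions chosen for illustration, not parameters reported for the original experiments.

```python
import math
from collections import Counter

def interarrival_entropy(timestamps, bin_seconds=60):
    """Shannon entropy (bits) of binned gaps between consecutive retweets."""
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if not gaps:
        return 0.0
    bins = Counter(int(gap // bin_seconds) for gap in gaps)
    total = len(gaps)
    return sum(-(c / total) * math.log2(c / total) for c in bins.values())

def looks_automated(timestamps, threshold=0.5, bin_seconds=60):
    """Flag an account whose inter-arrival entropy falls below `threshold`."""
    if len(timestamps) < 3:
        return False  # too few retweets to judge regularity
    return interarrival_entropy(timestamps, bin_seconds) < threshold

# Hypothetical retweet times in seconds: one regular account, one irregular.
bot_times = [0, 600, 1200, 1800, 2400]
human_times = [0, 45, 400, 410, 2000]
print(looks_automated(bot_times), looks_automated(human_times))
```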
Evaluation of Entropy (Class A Vs Class B)
Geo-tags are extracted from tweets using the Geo-filter option offered by the Twitter Streaming API [31]. The user’s current location is assumed to be the ground truth for the location of the most recent event. In some cases, an event may have occurred at a particular location while the user who tweets about it is in a different location. The geographical information is extracted using a Named Entity Recognizer (NER). In order to obtain the geographical coordinates, the output of the NER for a specific location is geo-coded. The frequency of geo-tags associated with trending hashtags increases during events and trends, reflecting heightened user attention concerning the area of the event. The information is investigated in a different way, including what and whose content the individuals chose to broadcast. The Twitter-based relationships are examined to see whether they are newly formed because of the disaster. It is found that the general population’s approach to disaster support is based on retweeting local tweets and those with locally significant data. The location is extracted from the tweet using the Stanford NER [32], and the Geonames API [33] is used to carry out the geo-parsing and geo-coding processes.
Identifying the local terms for the next level of geographic places turned out to be difficult. Hence, a heuristic for selecting local terms is considered. First, the word statistics for the extracted location and the count of Twitter users who frequently used those words in their tweets are calculated. A threshold is fixed for words that frequently occur in tweets by at least N people in that location. The value of N is an optimally chosen parameter, N = 10, i.e., at least 10 users should be found frequently using the term ‘T’ from a particular location. Secondly, the geographic bounds of the event location are determined by the conditional probability of the users who used the said term in relation to the event; their location and the location of the event are compared. The specific location is plotted on a map with its label, as shown in Figs. 7(a) and 7(b). Finally, the difference of the conditional probabilities between user location and event location is calculated and compared with a threshold. If the comparison is successful, it ensures that the word used by the user is most relevant to an event at a particular location. The extraction of the location of the tweet message is shown in Table 4, with the maximum likelihood of the location where the user is at the event place.
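The first step of this heuristic can be sketched as follows, under simplifying assumptions: a tweet is a dictionary with "user", "location" and "terms" fields, "frequent use" is reduced to "used at least once", and P(term | location) is estimated as the share of users at a location who used the term. These choices, the function names and the sample data are hypothetical; the final threshold comparison is not sketched because its exact form is not given in the text.

```python
from collections import defaultdict

def local_terms(tweets, min_users=10):
    """Map each location to the terms used there by at least `min_users` distinct users."""
    users_by_key = defaultdict(set)  # (location, term) -> set of user ids
    for tweet in tweets:
        for term in set(tweet["terms"]):
            users_by_key[(tweet["location"], term)].add(tweet["user"])
    result = defaultdict(set)
    for (location, term), users in users_by_key.items():
        if len(users) >= min_users:
            result[location].add(term)
    return dict(result)

def term_probability(tweets, term, location):
    """Estimate P(term | location) as the share of users at `location` who used `term`."""
    users_here = {t["user"] for t in tweets if t["location"] == location}
    if not users_here:
        return 0.0
    users_with_term = {t["user"] for t in tweets
                       if t["location"] == location and term in t["terms"]}
    return len(users_with_term) / len(users_here)

# Hypothetical sample: 10 users mention "brahmaputra", 3 users mention "traffic".
sample = [{"user": f"u{i}", "location": "Guwahati", "terms": ["brahmaputra"]}
          for i in range(10)]
sample += [{"user": f"v{i}", "location": "Guwahati", "terms": ["traffic"]}
           for i in range(3)]
print(local_terms(sample))
```

With N = 10, only terms shared by a sufficiently large group of distinct users at a location survive, which filters out words that a few prolific accounts repeat.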

Visualization of Event using Geo-Coded Tweets.
Location Estimation using the Maximum Likelihood (user and event location)
The effectiveness of the proposed framework for safety-critical use is evaluated in terms of the probability of event detection hits P_hit and the false panic rate P_Fp. Effective event detection and disaster response establish a good balance between P_hit and P_Fp. The P_hit and P_Fp are computed as
The tuning parameters, namely keyword similarity η = 5, temporal τ = 0.61 minutes, locality ξ = 10 and hashtag ω = 5, are fixed. The clustering algorithm is applied in 3 scenarios by consolidating these major features: (i) Location + Time (LT), (ii) Location + Time + Tweet Content (LTC), and (iii) Location + Time + Content + Hashtag (LTCH). The Cluster Score (CS), Event Detection Hit (EDH) and False Panic Rate (FPR) are measured. The results in Table 5 show that LTCH gives a high cluster score in Equation (10), with better accuracy (EDH) in detecting events and a minimal FPR. The CS and EDH values of the proposed approach are relatively high compared with the Qian method [21] and the Guille approach [22].
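For orientation, the two rates can be illustrated with a short sketch. The exact expressions for the hit probability and the false panic rate are not reproduced in this text, so the sketch assumes common definitions: the hit rate is the share of actual events that were detected, and the false panic rate is the share of detections that match no actual event. Both definitions, the function name and the event labels are assumptions.

```python
def detection_rates(detected, actual):
    """Return (hit_rate, false_panic_rate) for sets of event labels.

    Assumed definitions (not taken from the original equations):
      hit_rate         = |detected ∩ actual| / |actual|
      false_panic_rate = |detected - actual| / |detected|
    """
    detected, actual = set(detected), set(actual)
    hit_rate = len(detected & actual) / len(actual) if actual else 0.0
    false_panic = len(detected - actual) / len(detected) if detected else 0.0
    return hit_rate, false_panic

# Hypothetical labels: four ground-truth events, one spurious detection.
actual_events = {"event_1", "event_2", "event_3", "event_4"}
detected_events = {"event_1", "event_2", "event_3", "spurious_alert"}
print(detection_rates(detected_events, actual_events))
```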
Experimental Result- Cluster Score (CS), Event Detection Hit (EDH), and False Panic Rate (FPR)
The hashtags are categorized in two ways: those with entropy less than 0.5 (Entropy A, with multimedia content) and those with entropy equal to or greater than 0.5 (Entropy B, without multimedia content). As time passes, relevant hashtags with higher entropy continue to trend, whereas low-entropy hashtags fade away. This indicates that real human activity tends to show high entropy values, whereas robotic (bot) activity tends to show very low entropy. The entropy values are estimated for all events with respect to location, user, hashtag and time interval using Equation (4), Equation (5), Equation (6) and Equation (7), and are shown in Fig. 8 for all 4 disaster datasets. It is evident that during disasters, users tend to turn to CSN to tweet and retweet, and that their posts are consistently embedded with the names of popular users/activists, hashtags, URLs, mentions and locations.

Entropy values for Event.
Relevant hashtags with higher location entropy and hashtag entropy continue to trend, whereas low entropy for user and time interval predicts important news sources of disasters on Twitter. This quantifies how vital hashtags and location-specific geo-tags are for disaster news diffusion. The scikit-learn library in Python [35, 36] and a machine learning library [37] are deployed to cluster 20k samples into k = 15 clusters with k-means for the 4 events.
The proposed framework is validated using the Kappa coefficient [34]; the achieved agreement is 91%. It is evident that the proposed quantification model results in high accuracy in detecting flood-related disasters. Tests show that the proposed framework is able to identify an event at diverse temporal and spatial locations of interest. Disaster-related events are detected robustly despite the uncertain and noisy information present in the data across different time intervals. The experiments are conducted on a desktop computer with modest hardware (one i7 processor, 16 GB memory) to collect tweet datasets of size 1 GB with n = 16000 tweets, requiring non-linear time to compute the word index statistics and run the machine learning algorithms.
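As a reminder of how a Kappa agreement score is obtained, the sketch below assumes Cohen's kappa between two labelings of the same tweets (for example, framework output against a human annotation). Whether this variant of the coefficient was used is not stated in the text, and the labels in the example are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two equal-length label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical labels for ten tweets from the framework and from an annotator.
framework = ["event"] * 5 + ["noise"] * 5
annotator = ["event"] * 4 + ["noise"] * 6
print(cohen_kappa(framework, annotator))
```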
The cumulative entropies of popular, typical users are calculated using the Ghosh quantification method [17] and the proposed method. The proposed quantification method gives higher quantification accuracy than the Ghosh quantification method, as shown in Fig. 9. Meaningful event clusters such as Event 1 (Assam Flood 2016), Event 2 (Uttarakhand Landslide 2016), Event 3 (Chennai Flood 2015) and Event 4 (Manipal Earthquake) and their accuracies are examined to ascertain the differences between the two methods. The Ghosh quantification model results in a moderate accuracy of 73%, whereas the proposed quantification method achieves up to 96%. The scalability of the proposed framework depends on the runtime of the maximum entropy optimization parameters.

Performance analysis of proposed work.
A framework for event detection in real-time CSN is proposed, and the level of information spread is quantified using Shannon entropy for time-bounded Twitter activity. The entropy maximization inference model reflects the demand for transforming CSN into an information-spreading platform. The proposed framework clusters the Twitter data and detects four major events during 2015-2016, with an event detection accuracy of 96%, which is acceptably high. The results demonstrate the advantage of adopting the Shannon entropy maximization inference model to quantify events in CSN. Based on their retweeting activity, Twitter users are classified as humans or bots. The integration of hashtag and location parameters with the keywords generated by users significantly improves the event detection accuracy. This establishes the importance of quantifying the information-disseminating features involved in Twitter data.
In future work, the framework can be extended to track disaster events in CSN in the user’s native language, within a big-data-enabled ecosystem that supports scalability. With substantial modifications, the proposed approach is also well suited to other sources of geo-social content.
