A correlation-based approach for event detection in Instagram

Abstract

Online social networks like Instagram have more than 600 million users and generate over 300 million new posts every day. All these data can be used to detect real-world events. Many techniques have been proposed in the literature to detect such events, but the task remains hard: it involves processing large volumes of data, the lack of a ground truth, and the need for an adaptive approach. Our work tackles these challenges with a semi-supervised learning approach using time series built from Instagram posts. Experimental studies demonstrate that similar time series can be used to generalize knowledge and predict the occurrence of an event. We also demonstrate that Support Vector Regression is a good alternative to Gaussian Process Regression, as the former provides good results using far fewer computing resources than the latter. Moreover, we made our labeled dataset public, hoping it can be useful to other researchers as well.

Keywords

Event detection, Social networks, Instagram, Pearson correlation

1 Introduction

The advancement of new technologies, the popularization of the internet, and the reduced prices of mobile devices have helped the proliferation of applications in this Web 2.0 era. These applications, especially on mobile platforms, have created a massive amount of data through user-friendly free tools for messages, photos, and videos, mainly within the context of social networks (e.g., Facebook, Twitter, and Instagram). Social networks are the most used applications because people share their experiences and opinions across them daily. Several works in the literature have focused on online social networks to solve real-world problems such as tragedy prediction, natural disasters, epidemics, and crime prevention [1, 30].

The importance of social networks in people's lives is evidenced by metrics showing that Facebook has almost 1.6 billion active users, close to one fifth of the world population. This makes Facebook the biggest online social network (OSN) in the world, and it generates more social news referrals than Google. Another popular application is Instagram, which has more than 600 million active users, while Twitter has almost 320 million. Together they generate over 500 million new posts every single day [3].

Most of the massive data created by social network users (e.g., images, texts, metadata, and geolocation) can be easily accessed through web crawlers or public APIs. Monitoring and analyzing these data can provide valuable information and enable a variety of work by practitioners and researchers.

In the literature, data streams from social networks have been used by researchers to identify valid, novel, and potentially useful patterns [12, 24]. Instagram, for example, has been used to analyze user behavior [21], to discover and report local news [31], and to detect local media content [14]. Recent research has also used Instagram to detect events in real time [34], but detecting and reporting such occurrences is not a trivial task.

There are several challenges in event detection tasks. First, the high seasonality of the volume of posts requires algorithms capable of nonlinear prediction, which are more complex. Second, the absence of a trustworthy dataset [22] for measuring algorithm performance makes benchmark comparisons more difficult and biased. Finally, as massive data streams, social networks require high computing power, which makes some experiments unviable.

In this sense, we propose a novel adaptive framework to detect candidate local events in real time through a semi-supervised learning approach. The framework first collects data from a social network (in our case, Instagram), creates a set of time series, and applies different unsupervised and supervised learning techniques (e.g., clustering and regression). Then, at 15 min intervals, it presents the candidate events for a specific geographic sub-region.

Furthermore, in this work, we apply Pearson correlation to group time series from different geographic sub-regions and use them as inputs for a regression technique. Thus, it is possible to reduce the amount of processing required to model all of the target sub-regions. Besides the framework and the proposed semi-supervised learning approach, our work also includes an analysis of different parameters for the creation of the time series and a manual classification of more than seventy-five thousand posts, which can be used as a trustworthy dataset. To the best of our knowledge, this is the first labeled Instagram dataset to become publicly accessible. Finally, we show that the framework is able to detect candidate events with reasonable accuracy using far fewer computing resources than the most well-known approaches in the literature.

The remainder of this paper is organized as follows. Section 2 provides a brief contextualization of events and event detection systems. Section 3 presents an overview of the different approaches to detect events. Section 4 presents the proposed technique. Section 5 explains the methodology for the experiments, including the techniques, tools, and definitions required. Section 6 covers the results found during the experiments, and Section 7 concludes our work.

2 Background

In this section, we present important concepts for a better understanding of this work.

2.1 Event detection

Despite the existence of several works on this subject, there is no formal definition of what an event is or what the event detection task is. In the literature, some authors use their own interpretations, which are not consistent with each other. To avoid adding to the confusion, we borrow definitions existing in the literature.

According to Allan et al. [4], an event is “something that happens at specific time and place with consequences”; by this definition, an event must have a time and a place in which it happens. However, there is no mention of the place where this can happen. Aggarwal et al. [2] define an event as “something that happens at specific time and place but is also of interest to the news media”. In other words, the authors complement the first definition by stating that the event should also be of interest to traditional news media such as newspapers. For Becker et al. [7], “an event is a real-world occurrence e with a time period T_e and a stream of Twitter messages discussing the event during the period T_e”; in that case, the event has to be reported on Twitter.

In this work, we consider the definition from Dong et al. [11] as the most comprehensive interpretation of events in online social networks: “Real world happenings that occur within similar time periods and geographical locations, and that have been mentioned by the online users in the forms of images, videos or texts”. Therefore, intuitively, an event detection system is a set of procedures created to solve the challenge of discovering events [25]. In this work, we are interested in detecting any type of local real-time event using Instagram as our primary data source.

2.2 Time series

To detect real-time events in our proposed framework, we define a time series as a time-ordered list of values X, each associated with a time in T, recorded at a regular interval for a specific small location, which we call a sub-region. Formally, a time series is a discrete univariate function X = {x₁, x₂, …, x_n} for T = {t₁, t₂, …, t_n}, where x₁ is the number of geo-tagged Instagram posts at time t₁, x₂ the number at time t₂, and so on.
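The definition above can be illustrated with a minimal Python sketch that bins post timestamps into fixed-width intervals. The function name `build_time_series` and its parameters are hypothetical and chosen only for illustration; the sketch assumes the timestamps of one sub-region are already available as `datetime` objects.

```python
from collections import Counter
from datetime import datetime, timedelta


def build_time_series(timestamps, start, end, interval_minutes=15):
    """Count posts per fixed-width bin in [start, end).

    Returns [x_1, ..., x_n], where x_i is the number of posts whose
    timestamp falls into the i-th interval after `start`.
    """
    width = timedelta(minutes=interval_minutes)
    n_bins = (end - start) // width  # number of whole intervals
    counts = Counter()
    for ts in timestamps:
        if start <= ts < start + n_bins * width:
            counts[(ts - start) // width] += 1
    return [counts.get(i, 0) for i in range(n_bins)]


# Four posts at 00:05, 00:10, 00:20, and 01:00 on the same day.
day = datetime(2016, 3, 20)
posts = [day + timedelta(minutes=m) for m in (5, 10, 20, 60)]
quarter_hourly = build_time_series(posts, day, day + timedelta(hours=2), 15)
hourly = build_time_series(posts, day, day + timedelta(hours=2), 60)
```

With a 15 min interval, a full day yields 96 data points per sub-region; with a 1 h interval, it yields 24.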

Figure 1 shows time series for all the four months of the collected data.

Fig.1

The volume change over time binned by hour.

2.3 Pearson correlation

The Pearson correlation is a measure of linear correlation between two variables. When computing correlation for time series, we are interested in the correlation between two variables over the same time span, in our case the volume of posts for two different sub-regions. To do that, we use the sample correlation coefficient, where the sample values are the measurements taken during a time range. The sample correlation coefficient r of two variables a and b is defined as their sample covariance q divided by the product of their sample standard deviations S: $r_{a, b} = \frac{q_{a, b}}{S_{a} S_{b}}$ (1)

Let N be the number of data points in each time series. The sample covariance q is defined as follows: $q_{a, b} = \frac{\sum_{i = 1}^{N} (a_{i} - \bar{a}) (b_{i} - \bar{b})}{N - 1}$ (2)

where a_i and b_i are the data points, and $\bar{a}$ and $\bar{b}$ are their averages.

Finally, the sample standard deviation S is defined as: $S_{a} = \sqrt{\frac{\sum_{i = 1}^{N} (a_{i} - \bar{a})^{2}}{N - 1}}$ (3)

In order to improve the generalization of our regression model, we group the trained models according to the similarity between their time series, measured with Pearson’s correlation coefficient [26], as done by [18, 27].
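Equations (1) to (3) translate directly into code. The following is a minimal sketch in plain Python; the function names are hypothetical, and raising an error for a constant series (zero standard deviation) is our own assumption, since the text does not say how such series are handled.

```python
import math


def sample_covariance(a, b):
    """Eq. (2): sample covariance q of two equally long series."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def sample_std(a):
    """Eq. (3): sample standard deviation S (square root of q(a, a))."""
    return math.sqrt(sample_covariance(a, a))


def pearson(a, b):
    """Eq. (1): sample correlation coefficient r of a and b."""
    denominator = sample_std(a) * sample_std(b)
    if denominator == 0:
        # Assumption: a constant series has no defined correlation.
        raise ValueError("correlation is undefined for a constant series")
    return sample_covariance(a, b) / denominator
```

For example, two hourly series that rise and fall together, such as `[1, 2, 3, 4]` and `[2, 4, 6, 8]`, give a coefficient of 1 (up to floating-point error), while a series and its reverse give -1.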

2.4 Support vector regression

Support Vector Regression (SVR) is a machine learning technique based on statistical learning theory. Given a set of time series data points x(t), where t ranges over N discrete samples t = 0, 1, 2, …, N - 1, the task is to predict some future value y(t + Δ), at a time greater than or equal to N, using a function of the form: $f (x) = (ω \cdot φ (x)) + b$ (4)

The goal is to map the data points x(t) to a higher-dimensional “feature” space via φ(x), a mapping induced by a kernel function, and then to perform a linear regression in that feature space [23]. In short, the goal is to find the “optimal” weights ω and threshold b.
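Before a regressor of this form can be fitted, the time series has to be arranged as input and target pairs. The sketch below shows one possible arrangement, assuming (this is not stated in the text) that the input is a window of the most recent values and the target is the value Δ steps ahead. The names `make_lagged_pairs`, `window`, and `delta` are hypothetical.

```python
def make_lagged_pairs(series, window=4, delta=1):
    """Turn a series into (features, target) pairs for regression.

    features: `window` consecutive values ending at index t
    target:   the value `delta` steps after t
    """
    if window < 1 or delta < 1:
        raise ValueError("window and delta must be positive")
    pairs = []
    for t in range(window - 1, len(series) - delta):
        features = series[t - window + 1 : t + 1]
        pairs.append((features, series[t + delta]))
    return pairs


# Six 15 min counts, a window of three values, one step ahead.
pairs = make_lagged_pairs([1, 2, 3, 4, 5, 6], window=3, delta=1)
```

The resulting pairs could then be passed to any SVR implementation, for instance the `SVR` class in scikit-learn's `sklearn.svm` module; which implementation and kernel the framework uses is not specified here.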

3 Related work

In this section, we present relevant works on event detection systems that introduced different solutions to the problem using social networks as their source of information.

3.1 Online social networks

As an active topic, event detection has attracted more intensive interest with the emergence of social networks, since the users of those networks can act as social sensors [30, 37], creating sensory information regarding events in the real world.

Some methods focus on specific event types. In this context, a system called SportSense [38] was proposed to detect events from football matches. By filtering tweets with specific words and using matched filtering, it can detect a game event, such as a touchdown or a field goal, within a short delay after the event happens. Another work targeting specific types of events is TEDAS [19], created to detect crime-related and disaster events. Although both provide excellent results, their strategy is tightly tied to a single type of event through a limited list of predetermined keywords.

From a different perspective, some authors tackle the event detection problem using trending topics, a feature of Twitter. GeoScope [8] detects events using trending topics and their locations, based on geo-tagged content from the Twitter data stream. The emerging trend can be, for instance, an emergency, a concert, or a national game. The algorithm was designed to model the frequency of a topic t_x in any location l_i within a time window W. While the algorithm provides good precision and recall, it has a significant error in estimating the location of the trending topics. Moreover, [3, 32] argue that, even though it enables real-time detection, modeling trending topics is not suitable for detecting events in small regions.

Clustering-based approaches have been extensively used in event detection, for example, by clustering features from Twitter messages and analyzing aspects such as temporal, social, and topical features [6, 7].

IBM researchers used clustering techniques in the Smart City project to identify events from Twitter messages in New York City. This work went beyond the others by also implementing the idea of users as data sensors: a novel crowd-sensing system is responsible for acquiring data from various sources such as social networks, mobile phones (SMS), and phone call transcripts. After collection, the data go through a preprocessing stage; features are then extracted, filtered, and clustered according to their semantic and contextual similarity [28].

In general, clustering techniques are valuable and yield decent results, but these methods have some drawbacks. First, they require an arbitrary threshold for creating a cluster that represents an event. Second, they may suffer from cluster over- and under-segmentation, making the choice of the threshold even more critical.

Following the strategy of detecting real-time events by monitoring the volume of posts in a time series, some authors used the Gaussian Process Regressor (GPR). GPR models the data as a multivariate normal random variable and then predicts the average value for the target vector, in this case the number of Instagram posts within an interval. Although it presents satisfactory results, the problem with GPR is its computational requirements. The method needs to optimize the hyper-parameters by maximizing the marginal likelihood, which is an O(n³) operation that needs to be executed every 24 hours, even for a small sample (around 500 records). This was only possible because the authors used a cluster of 60 machines running in parallel [32, 34].

The aforementioned works tried to solve the problem of real-time event detection using different approaches; however, each has limitations. To address these issues, our approach differs from the existing literature in several ways.

First, our framework is not tied to a single OSN; instead, we created a flexible approach that can be implemented using any OSN as the data source, or even a combination of several.

Second, we do not set specific keywords to detect a single type of event since our framework is based on changes in the volume of posts.

Fig.2

Architecture of the proposed event detection framework.

Third, we do not use a computationally expensive model; rather, we apply a lightweight regression technique (SVR) to predict and detect any type of event.

Fourth, we do not employ a static, traditional machine learning process based on a single history of data. Instead, we model the time series using an adaptive approach, clustering those that behave similarly over time.

Finally, the proposed framework uses the idea of burst detection in time series [16, 32]. We believe that every time a real event happens there is a sharp (or abnormal) increase in the activity of the social network. Thus, we created a framework capable of modeling this behavior and triggering alerts every time an event emerges.

4 A framework for event detection

Any event detection system can operate in the following ways: looking for past [35] or new events [5], doing it online [30] or offline [7], and using a Document Pivot or Feature Pivot approach [4]. We can classify the presented framework as a Correlation-based Online Feature Pivot event detection system for new events.

We aim at deciding whether a post belongs to an event as soon as it arrives (Online), without the need for time-consuming batch processing (Correlation-based). Furthermore, the applied strategy uses the idea of a “burst of activity”, with the volume of posts from a sub-region rising sharply in frequency as the event emerges (Feature Pivot). Accordingly, we propose an architecture formed by three components: (1) Data Collector; (2) Time Series Estimator; and (3) Burst Detector. Each component is responsible for a different task, as can be seen in Fig. 2.

Fig.3

Workflow for the event detection framework.

Fig.4

Elbow method determines the optimal number of clusters for three different days (Sunday, Monday, and Tuesday).

4.1 Architecture

4.1.1 Data collector

The first component is responsible for crawling data from the Instagram API. It continuously retrieves data in real time from the predefined area and stores it in a database. In addition to its location, each crawled post has an image and a creation-time attribute; it may also have text and other information such as the venue.

As in many other architectures from previous works [19], we implemented the data collector using a NoSQL technology for the database, because NoSQL databases are well suited to storing documents from the Instagram API in plain JSON format and can easily scale if needed.

4.1.2 Time series estimator

The Time Series Estimator uses the data collected by our Data Collector component to build the time series. Every 24 h, it generates two time series for each sub-region: the first, with a 1 h interval, is used to calculate the correlation coefficient, while the second, with a 15 min interval, is used by the regression algorithm to predict the volume of posts.

4.1.3 Burst detector

Event detection is achieved by listening to the data stream from Instagram in real time, identifying the sub-region of each post, counting the number of items in a sliding window (15 min), and comparing it with the volume predicted by the regression algorithm. If the volume of posts exceeds the predicted volume by more than the defined threshold (3σ), the Burst Detector triggers an alert and stores in the database all the Instagram posts within the event interval. We used this technique based on the results obtained by [32–34].
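The logic of the Burst Detector can be sketched as follows. This is an illustrative Python sketch, not the framework's implementation: the class `SlidingWindowCounter` and the function `is_burst` are hypothetical names, and the sketch assumes that posts arrive in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta


class SlidingWindowCounter:
    """Counts the posts of one sub-region inside a moving time window."""

    def __init__(self, window=timedelta(minutes=15)):
        self.window = window
        self._times = deque()

    def add(self, ts):
        """Register a post at time `ts` and return the current window count."""
        self._times.append(ts)
        # Drop posts that are a full window (or more) older than `ts`.
        while self._times[0] <= ts - self.window:
            self._times.popleft()
        return len(self._times)


def is_burst(observed, predicted, sigma, k=3.0):
    """True when the observed volume exceeds the prediction by more than k * sigma."""
    return observed > predicted + k * sigma


counter = SlidingWindowCounter()
start = datetime(2016, 3, 20, 18, 0)
window_counts = [counter.add(start + timedelta(minutes=m)) for m in (0, 5, 10, 16)]
```

Only upward deviations are flagged in this sketch, following the description of an event as a sharp increase in the volume of posts.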

4.2 Event detection process

As can be seen in Fig. 3, our event detection framework is composed of different steps, such as Clustering, Group Filtering, Traditional Approach, Correlation-based Approach, Regression Training, and Event Detection. Each step is explained below.

4.2.1 Clustering

In this step, we apply the K-means algorithm [20] to group the geographic sub-regions based on the total number of posts within an interval. The input to the clustering process is the 1 h interval time series generated by the Time Series Estimator. The most important parameter of the K-means algorithm is the desired number of clusters k, in other words, how many clusters we want to group the sub-regions into.

To determine the number of clusters (k), we use the Elbow method [29], a well-known way to find the optimal number of clusters for a target data collection. The idea behind this method is to run K-means clustering on a dataset for a range of values of k and, for each one, calculate the sum of squared errors between the instances of each cluster and its centroid. Then, we take the k beyond which the error e no longer decreases significantly.

In Fig. 4, the x-axis shows the different k values and the y-axis shows the error e computed for each weekday used in this work.
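The Elbow curve can be reproduced with a small, self-contained sketch. Since each sub-region is summarized by a single number (its total post count), a one-dimensional K-means is enough for illustration. The function names and the deterministic, evenly spread initialization are our own assumptions, and the sketch requires k to be no larger than the number of values.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on scalar values with a deterministic start."""
    ordered = sorted(values)
    # Assumption: spread the initial centroids evenly over the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            groups[nearest].append(v)
        # Keep the old centroid when a cluster ends up empty.
        updated = [sum(g) / len(g) if g else centroids[i] for i, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
    return centroids, labels


def sse(values, centroids, labels):
    """Sum of squared errors between each value and its cluster centroid."""
    return sum((v - centroids[label]) ** 2 for v, label in zip(values, labels))


def elbow_curve(values, k_values):
    """Map each candidate k to the SSE obtained by K-means with that k."""
    return {k: sse(values, *kmeans_1d(values, k)) for k in k_values}


# Toy totals with three obvious volume levels (low, medium, high).
totals = [1, 2, 3, 100, 101, 102, 1000, 1001, 1002]
curve = elbow_curve(totals, [1, 2, 3, 4])
```

On this toy input the error drops steeply up to k = 3 and only marginally afterwards, which is the kind of bend that is read off visually from a plot such as Fig. 4.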

4.2.2 Group filtering

Ideally, we wish to have at least 1 post for each data point of the time series, resulting in a minimum of 96 posts (one post every 15 min). However, sometimes the volume of posts from a geographic sub-region might be very low, almost zero. Therefore, in these cases, it is necessary to discard those sub-regions that do not hold any event.

Instead of defining a static value as the minimum number of posts for a sub-region, we apply an adaptive approach that clusters all the sub-regions and discards the cluster with the lowest volume of posts. For example, for the first weekday (Sunday) of our experiments, we cluster all 625 sub-regions into 5 groups and exclude the fifth, since all the geographic sub-regions within this cluster have just a few posts.
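A minimal sketch of this filtering step is shown below. The function name and the dictionary-based inputs are hypothetical, and identifying the "lowest volume" cluster by its mean number of posts per sub-region is an assumption made for illustration.

```python
def drop_lowest_volume_cluster(totals, labels):
    """Return the sub-region ids that survive group filtering.

    totals: dict mapping sub-region id -> total number of posts
    labels: dict mapping sub-region id -> cluster label
    """
    per_cluster = {}
    for region, label in labels.items():
        per_cluster.setdefault(label, []).append(totals[region])
    # Assumption: rank clusters by their mean number of posts.
    cluster_mean = {label: sum(v) / len(v) for label, v in per_cluster.items()}
    lowest = min(cluster_mean, key=cluster_mean.get)
    return sorted(r for r, label in labels.items() if label != lowest)


totals = {"r1": 2, "r2": 3, "r3": 150, "r4": 900}
labels = {"r1": 0, "r2": 0, "r3": 1, "r4": 2}
kept = drop_lowest_volume_cluster(totals, labels)
```

Because the cut-off follows from the clustering itself, it adapts to the overall activity of each weekday instead of relying on a fixed minimum.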

4.2.3 Traditional approach

With the remaining geographic sub-regions from the Group Filtering step, we train a support vector regression (SVR) model that will be used later to predict the volume of posts from a target sub-region. The advantage of this model over other techniques from the literature is the fact that SVR is a lightweight algorithm compared to other techniques (e.g., GPR) and produces similar accuracy in prediction tasks [10].

4.2.4 Correlation-based approach

Besides the traditional approach using SVR, we propose in our framework a novel approach based on the correlation coefficient. Our approach calculates Pearson’s correlation coefficient [26] for the time series of all geographic sub-regions using Eq. (1), regardless of the cluster assigned by the clustering step. As a result, we obtain the correlation matrix of M variables (x₁, x₂, …, x_M) as an M × M matrix, where each entry is defined as follows: $Q [i, j] = r_{x_{i}, x_{j}}$ (5)

Fig.6

A randomly created quantile-quantile plot for three time series using different sub-regions and days.

The goal is to find similar time series (i.e., geographic sub-regions), which can be used to predict the volume of posts of correlated sub-regions. In this work, we consider two regions similar if their correlation coefficient is greater than a minimum threshold, chosen empirically. We experimented with different values (0.65, 0.70, and 0.75): any value below 0.65 generates too many similar regions, while a value of 0.75 or higher did not provide a minimum number of them. We found 0.70 to be a good threshold for deciding whether a time series is similar to another one from a different sub-region.

Figure 5 shows the correlation matrix computed among the time series of all 625 geographic sub-regions for one weekday (Sunday). The same procedure was performed for all of the weekdays used in this work.
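The construction of the correlation matrix and the selection of similar sub-regions can be sketched as follows. The function names are hypothetical; the 0.70 default mirrors the threshold chosen above, and the sketch assumes that no series is constant (otherwise the coefficient is undefined).

```python
import math


def pearson(a, b):
    """Sample Pearson correlation; the (N - 1) factors cancel out."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(series):
    """Build the symmetric M x M matrix Q with Q[i][j] = r(x_i, x_j)."""
    m = len(series)
    q = [[1.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            q[i][j] = q[j][i] = pearson(series[i], series[j])
    return q


def similar_regions(q, threshold=0.70):
    """Map each index i to the indices j != i with Q[i][j] > threshold."""
    return {
        i: [j for j, r in enumerate(row) if j != i and r > threshold]
        for i, row in enumerate(q)
    }


# Three toy hourly series: the first two move together, the third is reversed.
series = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 11], [5, 4, 3, 2, 1]]
q = correlation_matrix(series)
neighbours = similar_regions(q)
```

Only the upper triangle is computed, since the matrix is symmetric and its diagonal is 1 by definition.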

Fig.5

Pearson correlation computed between all filtered sub-regions for Sunday (March 20th, 2016).

4.2.5 Regression training

In this step, we train an SVR model for each geographic sub-region R using the 15 min interval time series with 48 h of historical data for each weekday among all of the sub-regions. After building the models, we could make predictions for future volume in each sub-region.

The output of the model is a tuple [v (R, t), σ (R, t)], representing the predicted post volume (v (R, t)) and associated standard deviation (σ (R, t)) for each time t and sub-region R. The prediction for a target sub-region serves as the volumes of posts that we expect to observe given no event is happening for that sub-region at a specified time.

4.2.6 Event detection

Following the work from [32, 33], an event will be detected if the volume of posts from a sub-region deviates 3σ from the predicted value. According to the available literature [13], assessing the normality (normal distribution) assumption should be taken into account for using parametric statistical tests.

Fig.7

The dimensions and grid system of the land area covered by the data collector.

Therefore, we need to check whether the time series we are using follow a normal distribution. To do that, we apply a visual test using a quantile-quantile plot, which allows us to see at a glance whether our assumption of normality is plausible. The normality of the time series was tested using random sub-regions and random days, as one can see in Fig. 6.
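The points of such a quantile-quantile plot can be computed with the Python standard library alone. The sketch below is illustrative: the function name `qq_points` is hypothetical, and the plotting positions (i + 0.5) / n are one common convention that we assume here, not necessarily the one used for Fig. 6. The sample needs at least two distinct values.

```python
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Pair each sorted observation with its theoretical normal quantile."""
    ordered = sorted(sample)
    n = len(ordered)
    # Normal distribution fitted to the sample mean and standard deviation.
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Assumption: plotting positions (i + 0.5) / n, strictly inside (0, 1).
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Toy sample of 15 min post counts for one sub-region.
points = qq_points([12, 15, 9, 14, 11, 13, 10, 16])
```

If the sample is close to normal, the resulting points lie near the diagonal where the theoretical and observed quantiles are equal.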

Let us follow an example of how all the components of the framework work together to detect events. Suppose we wish to detect events for March 20th, 2016 (Sunday) using the collected historical data.

First, we generate time series with a 15 min and a 1 h interval for all the sub-regions using the two Sundays immediately before the desired date (March 13th, 2016 and March 6th, 2016). Thus, for each sub-region we have one time series with 192 data points and one with 48 data points, respectively.

Second, we group the sub-regions using the 1 h time series, by summing the number of posts and using the total as the input to the K-means algorithm. The optimum number of clusters is defined by the Elbow method, and all geographic sub-regions within the cluster with the lowest number of posts are discarded. Then, an SVR model is trained with the 15 min interval time series to predict the number of posts v and its standard deviation σ at a given time t.

Next, our novel approach based on Pearson correlation is used to calculate the similarity among all of the sub-regions remaining from the Group Filtering step.

Finally, in real time, the Burst Detector takes the values v and σ predicted from the similar sub-region and compares them with the actual volume in the target sub-region. If the actual number exceeds the predicted volume by more than our threshold (3σ_R, where σ_R is the predicted standard deviation), a candidate event is detected.

5 Experimental methodology

In order to create the event detection framework, we need to collect, label and store the data from Instagram. In this section, we describe the settings and definitions that have been adopted in this work based on some experiments and past works from the literature.

5.1 Dataset

5.1.1 Data collecting

As the first component in our framework, the Data Collector is responsible for gathering data from Instagram and storing them in a NoSQL database. We collected posts from a particular area of interest within New York from February 1st, 2016 to June 30th, 2016, which amounts to four months of data with a total of 2,972,248 posts.

The Instagram API offers a method to read its data stream based on a geographical point and a radius of at most 5 km. In this work, we chose the region of Manhattan in New York, using latitude 40.7626785 and longitude -73.9659409 as the parameters for the Instagram API. The total land area is around 60 km², with a population of over 1.6 million. Although in this work we focus on New York City, the framework can be generalized to any other city.

Due to the geographical nature of the island of Manhattan, we decided to split the area into four quadrants, as one can see in Fig. 7(a). Of the land area covered by the data collector, we chose the most significant quadrant: a 5 km × 5 km square region covering a total of 25 km² (see Fig. 7(b)). As a result, the geographical boundaries for our collector are (lat: 40.7485, long: -74.0140) and (lat: 40.7930, long: -73.9530).

Next, we created a 25 × 25 grid system, yielding a total of 625 sub-regions, each around 200 m × 200 m (about 0.04 km²) (see Fig. 7(c)). The strategy here is to split the entire area into smaller areas to achieve a better model fit for the target time series [34].
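Assigning a geo-tagged post to its sub-region then reduces to simple arithmetic on the bounding box. The following sketch uses the boundaries given above; the function names, the row-major numbering of cells, and the use of equal steps in degrees (rather than in meters) are assumptions made for illustration.

```python
LAT_MIN, LAT_MAX = 40.7485, 40.7930
LON_MIN, LON_MAX = -74.0140, -73.9530
GRID = 25  # 25 x 25 cells, 625 sub-regions in total


def sub_region(lat, lon):
    """Return the (row, col) cell containing the point, or None if outside."""
    if not (LAT_MIN <= lat < LAT_MAX and LON_MIN <= lon < LON_MAX):
        return None
    row = int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * GRID)
    col = int((lon - LON_MIN) / (LON_MAX - LON_MIN) * GRID)
    # Guard against floating-point rounding at the upper boundary.
    return min(row, GRID - 1), min(col, GRID - 1)


def sub_region_id(lat, lon):
    """Row-major identifier in [0, 624] (hypothetical numbering scheme)."""
    cell = sub_region(lat, lon)
    return None if cell is None else cell[0] * GRID + cell[1]
```

For instance, the south-west corner of the bounding box falls in cell (0, 0), and any point outside the box is ignored by returning `None`.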

The decision about how big or small a sub-region should be is a trade-off. If the area of a sub-region is too large, the system detects more events; consequently, more false positives may appear, which makes the event detection task even more difficult. On the other hand, a tiny sub-region becomes too sensitive and volatile, generating too many candidate events even with just a few posts. This problem is not new and has already been treated in previous works. Therefore, in our experiments, we used a 25 × 25 grid, following the same configuration as [34].

Figure 8 shows the volume of posts for two weeks in all geographical sub-regions with emphasis on the sub-regions with a high volume of posts (over 1000/day).

Fig.8

Bubble map for the volume of posts over the 625 sub-regions.

5.1.2 Data annotations

In Section 1, we drew attention to the lack of a reliable dataset as one of the challenges in event detection tasks. To overcome this difficulty, we created a manual annotation process to label 75,825 posts from the selected area. This is the total number of posts for three days, from March 20th, 2016 to March 22nd, 2016, considering all geographical sub-regions.

The number of labeled posts is enough for our experiments since we can show how our framework is able to detect events in real-time and adapt itself to the changes in the behavior of the time series (i.e., concept drift [15]).

In order to label the data, we created a web page that lists the posts from a sub-region in a 15 min interval. We then asked the annotators to check the list of photos, read their captions, follow all the hyperlinks in them, and determine whether there is at least one post that could clearly be defined as an event. Recall that an event is any kind of happening posted on Instagram, such as a car accident, an NBA match, a natural disaster, and so on.

Lastly, we made our labeled dataset available in a public repository, which can be accessed by anyone interested.

5.2 Evaluation metrics

To avoid a possible misunderstanding on the presented results, it is important to explain some nomenclatures of the evaluation metrics that we have adopted in this work. We hope, in the future, other researchers can find it useful and follow the same definitions in their works.

Actual Events: It is the truly existing number of events in a specific period from a dataset. This number has been achieved from our labeled dataset.

Detected Events: It is the number of candidate events detected by the framework.

Actual Events Detected: It is the number of actual events within the detected ones, excluding the false positives.

$\text{Precision} = \frac{\text{Actual Events Detected}}{\text{Detected Events}}$ (6)

$\text{Recall} = \frac{\text{Actual Events Detected}}{\text{Actual Events}}$ (7)

$\text{F-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ (8)
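The three metrics can be computed directly from the three counts defined above. In the following sketch, the function name is hypothetical, and returning 0.0 when a denominator is zero is our own convention, which the definitions above do not specify.

```python
def precision_recall_f(actual_events, detected_events, actual_events_detected):
    """Compute Eqs. (6)-(8) from the three event counts."""
    # Assumption: a metric with a zero denominator is reported as 0.0.
    precision = actual_events_detected / detected_events if detected_events else 0.0
    recall = actual_events_detected / actual_events if actual_events else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# 20 labeled events, 50 candidate events raised, 10 of them real.
precision, recall, f_measure = precision_recall_f(20, 50, 10)
```

In this example the precision is 0.2 and the recall is 0.5, giving an F-Measure of about 0.286.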

Table 1

Effectiveness results of our proposed correlation-based approach and the traditional approach from the literature

Weekdays	Traditional Approach				Correlation-based Approach
	Cluster	Precision	Recall	F-Measure	Cluster	Precision	Recall	F-Measure
Sunday	1	0.3990	0.3431	0.1638	1	0.4700	0.7192	0.4915
	2	0.4889	0.1648	0.1675	2	0.2831	0.4603	0.1915
	3	0.1616	0.3464	0.1875	3	0.1068	0.2335	0.1154
	4	0.0227	0.3939	0.040	4	0.0397	0.2872	0.0435
	Mean	0.2680	0.3120	0.1397	Mean	0.2249	0.4250	0.2104
Monday	1	0.1046	0.3267	0.1239	1	0.1865	0.7169	0.2534
	2	0.1091	0.3992	0.1326	2	0.1627	0.3565	0.1432
	3	0.0147	0.2424	0.0262	3	0.0386	0.2194	0.0467
	Mean	0.0761	0.3227	0.0942	Mean	0.2450	0.4309	0.1477
Tuesday	1	0.0585	0.2551	0.0939	1	0.0454	0.5051	0.0783
	2	0.0771	0.2908	0.0943	2	0.1250	0.2123	0.0728
	3	0.0050	0.1512	0.0096	3	0.0070	0.1607	0.0144
	Mean	0.0468	0.2323	0.0659	Mean	0.0591	0.2927	0.0551
ALL	Mean Average	0.1303	0.2890	0.0999	Mean Average	0.1763	0.3829	0.1377

Fig.9

Example of some events detected by our proposed framework using the Correlation-based approach.

6 Results and discussion

In this section, we perform three different analyses for real-time event detection tasks. First, we analyze the robustness of our proposed framework using two approaches: (1) the traditional approach, which uses the time series from a geographical sub-region to predict points of the same region; and (2) the correlation-based approach, which uses the time series from a geographical sub-region to predict points of the time series of other, highly correlated regions. Second, we compare the effectiveness of our best approach against the well-known approach from the literature. Finally, we show the efficiency results of our proposed framework using the two approaches and compare them with the processing time of the baseline approach.

6.1 Effectiveness analysis

Table 1 shows the effectiveness results among traditional and correlation-based approaches for time series of three different weekdays (Sunday, Monday, and Tuesday) and three evaluation metrics (Precision, Recall, and F-Measure). It is very important to recall that the number of clusters for each weekday has been defined by Elbow method on clustering step (See Fig. 4).

As it is possible to observe, for Sunday data, our proposed framework using the traditional approach achieved the best mean precision (0.2680) than the correlation-based approach (0.2249). However, the correlation-based approach achieved the best recall and F-Measure metrics with 0.4250 and 0.2104 against 0.3120 and 0.1397 achieved by the traditional approach.

For the Monday data, the correlation-based approach achieved better results than the traditional approach in all of the evaluation metrics. Finally, for the Tuesday data, both approaches achieved similar effectiveness results, with a slight gain for the correlation-based approach in two evaluation metrics (precision and recall). Notice that our proposed correlation-based approach achieved the highest mean recall on all three weekdays. This means that it is able to detect a greater number of the actual events existing in the labeled dataset.
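As a reminder of how the three metrics interact, the sketch below computes them for one set of detections, assuming the standard definitions of precision, recall, and F-Measure. How a detected burst is matched to a labeled event is an assumption here (simple identifier equality), and the event identifiers are hypothetical.

```python
def precision_recall_f(detected, labeled):
    """Precision, recall and F-measure for two collections of event identifiers."""
    detected, labeled = set(detected), set(labeled)
    true_positives = len(detected & labeled)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(labeled) if labeled else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical case: 10 alarms raised, 4 labeled events, 3 of them found.
detected = {f"alarm-{i}" for i in range(7)} | {"e1", "e2", "e3"}
labeled = {"e1", "e2", "e3", "e4"}

p, r, f = precision_recall_f(detected, labeled)
print(f"precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
```

The toy case mirrors the pattern in Table 1: a detector that raises many alarms can recover most labeled events (high recall) while a large share of its alarms remain unmatched (low precision).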

Furthermore, we compare our proposed approaches to a well-known approach from the literature based on Gaussian Process Regression (GPR) [34]. In [34], the authors achieved around 0.13 of mean average precision in their experiments using time series from the same area (Manhattan Island) and a number of sub-regions. In our experiments, our proposed framework achieved around 0.13 and 0.17 of mean average precision for the traditional and correlation-based approaches, respectively. Therefore, our results are very similar to those obtained in [34], which demonstrates that detecting local events in real time is an arduous task.
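The mean average values quoted above appear to be the arithmetic mean of the three weekday means in Table 1. The short check below recomputes them from the reported weekday means. The variable and function names are illustrative, and the numbers are copied from Table 1.

```python
# Weekday means (precision, recall, F-measure) as reported in Table 1.
traditional = {
    "Sunday": (0.2680, 0.3120, 0.1397),
    "Monday": (0.0761, 0.3227, 0.0942),
    "Tuesday": (0.0468, 0.2323, 0.0659),
}
correlation_based = {
    "Sunday": (0.2249, 0.4250, 0.2104),
    "Monday": (0.2450, 0.4309, 0.1477),
    "Tuesday": (0.0591, 0.2927, 0.0551),
}


def mean_average(weekday_means):
    """Average each metric over the weekdays, rounded to four decimals."""
    rows = list(weekday_means.values())
    return tuple(round(sum(column) / len(rows), 4) for column in zip(*rows))


print("traditional:      ", mean_average(traditional))
print("correlation-based:", mean_average(correlation_based))
```

Under this assumption the recomputed values agree with the ALL row of Table 1 to four decimal places, including the 0.13 and 0.17 precision figures.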

Figure 9 shows a sequence of three events, with actual posts from Instagram, detected by our proposed framework for a geographical sub-region. It is just one example among several other real events in the dataset.

6.2 Efficiency analysis

According to [32], Gaussian Process Regression (GPR), which is based on a robust probabilistic model, has been used as a time series prediction model due to its ability to adapt to various kinds of time series simply by replacing the kernels. On the other hand, the drawback of GPR is the high computational requirement to fit even a small sample of data. GPR is therefore a computationally expensive approach, taking hours to process even a moderate volume of data on a cluster of dozens of computers. The goal of calculating the correlation coefficient between the time series of the geographical sub-regions is to evaluate whether our proposed approach can provide good effectiveness results compared to the well-known approach from the literature while using much less computing resources. As the GPR approach proposed in [34] is not publicly available, it was not possible to run a fair processing-time experiment between our approaches and GPR. However, we performed an efficiency experiment between our two proposed approaches. Table 2 shows the mean processing time for each of them. As we can observe, our correlation-based approach is about three times faster than the traditional approach (688 s against 2094 s).

Table 2
Mean processing time of our two proposed approaches

Approach	Processing Time (sec)	# Sub-Regions
Traditional	2094	625
Correlation-based	688	625
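The relative gain can be read directly from the processing times in Table 2. The arithmetic below is only a worked check of the two usual ways of expressing it (speedup factor and share of time saved). The variable names are illustrative.

```python
# Mean processing times from Table 2, in seconds, for 625 sub-regions.
traditional_sec = 2094
correlation_sec = 688

speedup = traditional_sec / correlation_sec         # how many times faster
time_saved = 1 - correlation_sec / traditional_sec  # fraction of time saved

print(f"speedup: {speedup:.2f}x, time saved: {time_saved:.1%}")
# prints "speedup: 3.04x, time saved: 67.1%"
```

In other words, the correlation-based approach needs roughly one third of the time of the traditional approach on the same 625 sub-regions.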

7 Conclusions

Today, online social networks are among the most used applications, and their data can help to solve several real-world problems such as event detection, tragedy prediction, natural disasters, epidemics, and crime prevention.

Event detection is a research area that emerged in the last few years and is growing quickly. However, it is still an open challenge due to its complexity and requirements.

In this paper, we proposed a novel event detection framework using Pearson’s correlation coefficient and SVR to model time series and detect local events in real-time from Instagram.

In our experiments, we demonstrated that a time series prediction model trained with similar geographical sub-regions might be applied to generalize the knowledge to other sub-regions. This “cross-regions” prediction strategy therefore achieves effectiveness comparable to that of other approaches in the literature [34], while using much less computing resources and a fraction of the running time.

Our proposed framework uses technologies in its architecture (like MongoDB for storage) that are suitable for being extended to cover a larger geographic area without losing its characteristics and performance. Furthermore, since we did not adopt any Instagram-specific feature, our proposed framework is generic and might be used with other OSNs, such as Twitter.

Finally, we showed that our framework with the correlation-based approach is an easy and lightweight solution that can work as a first layer in a more complex system. As future work, we plan to improve our framework by adding another step on top of the burst detection algorithm to filter the true positive events.

Acknowledgments

This work is partially financed by CNPq Universal Project (408919/2016-7). GIBIS Lab. is partially supported by diverse projects and grants from FAPESP, CNPq, and CAPES.

References

1. Achrekar, Gandhe, Lazarus, Yu S.-H. and Liu, Predicting flu trends using twitter data, In Computer Communications Workshops (INFOCOM WKSHPS), 2011 IEEE Conference on, 2011, pp. 702–707. IEEE.

2. Aggarwal C.C. and Subbian, Event detection in social streams, Proceedings of the 2012 SIAM International Conference on Data Mining, 2012, pp. 624–635.

3. Ahmed, Hong and Smola A.J., Hierarchical geographical modeling of user locations from social media posts, Proceedings of the 22nd International Conference on World Wide Web, WWW ’13, New York, NY, USA, 2013, pp. 25–36. ACM.

4. Allan, Introduction to Topic Detection and Tracking, 2002, Springer US, Boston, MA, pp. 1–16.

5. Allan, Papka and Lavrenko, On-line new event detection and tracking, In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’98, New York, NY, USA, 1998, pp. 37–45. ACM.

6. Becker, Iter, Naaman and Gravano, Identifying content for planned events across social media sites, In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, 2012, pp. 533–542. ACM.

7. Becker, Naaman and Gravano, Beyond trending topics: Real-world event identification on twitter, ICWSM 11 (2011), 438–441.

8. Budak, Georgiou, Agrawal and El Abbadi, Geoscope: Online detection of geo-correlated information trends in social networks, Proc VLDB Endow 7(4) (2013), 229–240.

9. Crump, What are the police doing on twitter? Social media, the police and the public, Policy Internet 3(4) (2011), 1–27.

10. Cui and Fearn, Comparison of partial least squares regression, least squares support vector machines, and gaussian process regression for a near infrared calibration, Journal of Near Infrared Spectroscopy 25(1) (2017), 5–14.

11. Dong, Mavroeidis, Calabrese and Frossard, Multiscale event detection in social media, Data Mining and Knowledge Discovery 29(5) (2015), 1374–1405.

12. Fayyad U.M., Piatetsky-Shapiro, Smyth and Uthurusamy, Advances in Knowledge Discovery and Data Mining, Volume 21, AAAI Press, Menlo Park, 1996.

13. Field, Discovering Statistics using SPSS, Sage Publications, 2009.

14. Flatow, Naaman, Xie K.E., Volkovich and Kanza, On the accuracy of hyper-local geotagging of social media content, CoRR (2014), abs/1409.1461.

15. Gama J.A., Żliobaitė, Bifet, Pechenizkiy and Bouchachia, A survey on concept drift adaptation, ACM Comput Surv 46(4) (2014), 44:1–44:37.

16. Kleinberg, Bursty and hierarchical structure in streams, Data Mining and Knowledge Discovery 7(4) (2003), 373–397.

17. Kling C.C., Kunegis, Sizov and Staab, Detecting non-gaussian geographical topics in tagged photo collections, In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM ’14, New York, NY, USA, 2014, pp. 603–612. ACM.

18. Leydesdorff, Similarity measures, author cocitation analysis, and information theory, CoRR (2009), abs/0911.4292.

19. , Lei K.H., Khadiwala and Chang K.C.C., Tedas: A twitter-based event detection and analysis system, In 2012 IEEE 28th International Conference on Data Engineering, 2012, pp. 1273–1276.

20. MacQueen, et al., Some methods for classification and analysis of multivariate observations, In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, Oakland, CA, USA, 1967, pp. 281–297.

21. Manikonda, Hu and Kambhampati, Analyzing user activities, demographics, social network structure and user-generated content on instagram, CoRR (2014), abs/1410.8099.

22. McMinn A.J., Moshfeghi and Jose J.M., Building a large-scale corpus for evaluating event detection on twitter, In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM ’13, New York, NY, USA, 2013, pp. 409–418. ACM.

23. Müller K.-R., Smola A.J., Rätsch, Schölkopf, Kohlmorgen and Vapnik, Predicting time series with support vector machines, In International Conference on Artificial Neural Networks, 1997, pp. 999–1004. Springer.

24. Oliveira and Gama, An overview of social network analysis, Wiley Int Rev Data Min and Knowl Disc 2(2) (2012), 99–115.

25. Panagiotou, Katakis and Gunopulos, Detecting events in online social networks: Definitions, trends and challenges, Solving Large Scale Learning Tasks. Challenges and Algorithms - Essays Dedicated to Katharina Morik on the Occasion of Her 60th Birthday, 2016, pp. 42–84.

26. Pearson, Mathematical contributions to the theory of evolution, iii. regression, heredity, and panmixia, Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 187, 1896, 253–318.

27. Rodrigues P.P., Gama and Pedroso, Hierarchical clustering of time-series data streams, IEEE Transactions on Knowledge and Data Engineering 20(5) (2008), 615–627.

28. Roitman, Mamou, Mehta, Satt and Subramaniam, Harnessing the crowds for smart city sensing, In Proceedings of the 1st International Workshop on Multimodal Crowd Sensing, CrowdSens ’12, New York, NY, USA, 2012, pp. 17–18. ACM.

29. Rousseeuw P.J., Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics 20 (1987), 53–65.

30. Sakaki, Okazaki and Matsuo, Earthquake shakes twitter users: Real-time event detection by social sensors, In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, New York, NY, USA, 2010, pp. 407–415. ACM.

31. Schwartz, Naaman and Teodoro, Editorial Algorithms: Using Social Media to Discover and Report Local News, ACM International Conference on Web Logs and Social Media, 2015, 407–415.

32. Xia, Hu, Zhu and Naaman, What Is New in Our City? A Framework for Event Extraction Using Social Media Posts, Cham, Springer International Publishing, 2015, pp. 16–32.

33. Xia, Schwartz, Xie and Krebs, CityBeat: Real-time social media visualization of hyper-local city data, Proceedings of the International World Wide Web Conference Committee (IW3C2), 2014, pp. 167–170.

34. Xie, Xia, Grinberg, Schwartz and Naaman, Robust detection of hyper-local events from geotagged social media data, Proceedings of the Thirteenth International Workshop on Multimedia Data Mining - MDMKDD ’13, 2013, pp. 1–9.

35. Yang, Pierce and Carbonell, A study of retrospective and on-line event detection, In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’98, New York, NY, USA, 1998, pp. 28–36. ACM.

36. Zhao, Mitra and Chen, Temporal and information flow based event detection from social text streams, In Proceedings of the 22nd National Conference on Artificial Intelligence - Volume 2, AAAI ’07, 2007, pp. 1501–1506. AAAI Press.

37. Zhao, Zhong, Wickramasuriya and Vasudevan, Human as real-time sensors of social and physical events: A case study of twitter and sports games, arXiv preprint arXiv:1106.4300, 2011.

38. Zhao, Zhong, Wickramasuriya, Vasudevan, LiKamWa and Rahmati, SportSense: Real-Time Detection of NFL Game Events from Twitter, arXiv.org (2012), 1205–3212.
