Mining Twitter data for crime trend prediction

Abstract

While conventional crime prediction methods rely on historical crime records and geographical information about the location of interest, we pursue the question of whether a social media context can provide socio-behavioral “signals” for a crime prediction problem. The hypothesis is that publicly available crowd data on Twitter may include predictive variables that indicate changes in crime rates, without being limited by the availability of historical crime records for specific locations. We developed a model for crime trend prediction, where the objective is to employ Twitter content to predict crime rate directions in a prospective time-frame. The model employs content, sentiment, and topics as the predictive indicators to infer changes in crime indexes. Since our problem has a sequential order, we propose a temporal topic detection model to infer predictive topics over time. The main challenge of topic detection over time is information evolution, in which data are more related when they are close in time than when they are further apart. Our proposed topic detection model builds a dynamic vocabulary to detect emerging topics rather than considering the vocabulary in bulk. We applied our model to data collected from Chicago for crime trend prediction using historical tweets. The results reveal a correlation between the content-based features extracted from the tweets and the crime trends. Moreover, the results indicate the feasibility of our proposed temporal topic detection model in identifying the most predictive features over time, compared to a static model without time consideration. We also studied the contribution of socio-economic indexes and temporal features as auxiliary features. The experiment shows that the content-based features improve the prediction performance significantly compared to the auxiliary features. Overall, the study provides deep insight into the correlation between language and crime trends and the impact of social data as an extra resource in providing predictive indicators.

Keywords

Twitter data analytics, temporal data analytics, text mining, topic modeling, sentiment analysis, social trend prediction

1. Introduction

Crime, which can be defined as any unlawful act punishable by law, affects not only the individuals involved but society as a whole. In criminology, crime in all its facets occurs due to a set of situations known as the “social context” [1]. Crime and the risk of being victimized are variables that depend on the social context. Social context in general is viewed along two dimensions: physical and social. The physical dimension refers to the specific geographical locations where crime is more common, such as locations with a higher population and a lower economic status or limited access to education facilities [2]. The social dimension, in contrast, is concerned with socio-psychological factors such as individuals’ personality, communities, level of education, students’ behavior at school, or environment.

Crime analytics in general, and crime prediction in particular, have drawn the attention of researchers in very diverse fields, including law enforcement and policing, social science, and data mining. The main objective in crime analytics is to help law enforcement agencies allocate their scarce resources more effectively by predicting criminal movements, which requires mining vast amounts of crime data, demographic and socio-economic information, and, recently, social data.

Conventional crime prediction methods rely on historical socio-economic indexes and demographic information, which are collected from areas of concentrated crime, known as hot-spot maps, and applied to predict the distributions of crimes of different natures. However, there is some debate about whether these maps indicate the concentration of all crime types [3]. As an example, taxicab robberies take place in different locations which are not always representative of high crime regions [3]. Another major issue is the lack of data for the prediction model. In conventional methods, historical criminal records must be available for prediction models. Overall, the main drawback of these methods is that they reduce the social context to historical crime records while ignoring socio-behavioral data of the community, including both victims and criminals. In fact, the contextual data can be leveraged as a signal to predict upcoming incidents.

Observing the socio-behavioral data of a large society is a challenging task, but social media allows users to share their concerns, ideas, and daily activities. The content shared by individuals, when combined, provides a rich resource of naturally occurring data. Twitter, as one of the most popular social services, provides natural social data; users are more willing to express their opinions, interests, and activities without being worried about others’ opinions [4]. Therefore, Twitter provides content which is representative of its users’ social behavior. Numerous studies have explored extracting behavioral patterns from Twitter, including personality detection [5], language differences [6], and crowd behavior to monitor lifestyles [7].

In this study, we hypothesize that information captured from Twitter may provide socio-behavioral signals to predict crime rate directions as a “crime trend”. A future trend is predicted based on the information observed from the content of tweets posted earlier. We propose a prediction model which converts the trend prediction to a classification problem. In order to handle the lack of data in supervised learning, the model addresses automatic data annotation. In this model, the learning examples are aggregated tweets with different smoothing windows. The examples are labeled with the knowledge inferred from the problem (in our case, crime trend) in a prospective time-frame. In fact, the concept of data annotation is similar to other labeling approaches such as the classic lexicon-based approach. In this approach, polarities or strengths are inferred, based on a set of dictionaries. Thus inspired, in our prediction model we infer labels based on the objective trend. The content of collective individual users is labeled positive or negative if the trend goes up or down, respectively, in the prospective time-frame.

In contrast to keyword-based methods [8], our model does not exploit any topics or content that explicitly refer to criminal activities. In fact, filtering messages based on keywords limits the study to specific groups of users. Furthermore, searching for targeted topics, keywords, or content may be more descriptive than predictive, meaning that those targeted signals narrate stories that have already happened rather than predict future events. For example, if we rely on searching for terms such as “shooting” in tweets as a predictive signal, we will most likely collect tweets that report a recent shooting in the neighborhood. We believe relevant content posted on Twitter includes hidden signals that reflect users’ concerns, sentiments, and social events which are not necessarily expressed in explicit terms. Moreover, our model captures collective patterns from the user crowd rather than from a selected group of users [9]. We consider all observed users as a crowd, as opposed to user-centric approaches, to take advantage of more content and to follow the theory that every user’s opinions and activities tend to have a social impact on the community [10]. We explore the correlation of content, sentiment, and topics, as features, with the crime trend. However, there are some challenges in exploiting these variables. Besides the problems of processing tweet content (abbreviations and the limited number of characters), time plays a crucial role in a topic detection model. In other words, information changes over time, and new text documents are very likely to carry new vocabulary. Therefore, the detected topics are shown to have birth and death over time [11, 12]. We develop a temporal model which addresses the temporal characteristics of observed terms. The idea is to extract emerging topics as the predictive features over time. In this regard, the proposed topic model builds a dynamic vocabulary, which is updated over time to infer emerging topics and to fade out vocabulary that is no longer popular.

The study aims to achieve the following objectives:

1. Ascertain the contribution of the content voluntarily shared on Twitter to a crime prediction model. Our findings could potentially contribute to real decision support systems and facilitate research on understanding the causes of crime. In addition, the proposed model can help decision makers in law enforcement agencies and police to manage their limited resources efficiently.
2. Identify an efficient temporal prediction model which captures the most relevant features, such as temporal topics, for prediction. Understanding how topics evolve over time is potentially important research in text analysis. It has applications for online topic detection models and time-dependent topic models. As an example, temporal topic detection in news documents, where new and old documents differ in terms of word usage, is very important for providing insight into how information evolves and spreads.
3. Explore the temporal effects of the content on crime directions to determine the lag between the predictive features and the crime trends. This helps socio-behavioral investigators understand the effects of social behavior on different crime types, with the delay related to each type. It is important to understand how users’ social activities, interests, and opinions over time are related to illegal activities of different types.

The remainder of the paper is structured as follows: the next section provides a review of historical crime index prediction methods as well as recent studies applying social context to crime prediction. Section 3 describes the datasets and how they were obtained. Section 4 discusses the prediction model along with the features involved in the prediction. Section 5 introduces our temporal topic detection model for trend prediction. Section 6 presents our experimental results. Finally, the conclusions of the study, together with open problems, are summarized in Section 7.

2. Related works

Crime analytics has significantly improved the ability of law enforcement to make decisions quickly and act intelligently on different incidents. In fact, advances in knowledge discovery and developments in information technology help law enforcement analyze past data and predict future incidents. In crime analytics, technologies and applications use data from three different categories: historical activities of criminals, spatio-temporal information and, more recently, social media. The first set of applications focuses on criminals and their networks, the so-called “heat list” (Chicago predictive models in 2013). However, these applications are prone to targeting specific profiles, such as race, which raises many issues. The next group applies spatio-temporal features, such as location and time of day, to predict future incidents. In contrast, the third group does not target specific persons but focuses on publicly available social media. For instance, FBI tools look for emerging threats in social media. However, the main issue is the validity of tracking significant words for the prediction of future incidents, which requires deep analysis of the keywords included in the content.

However, conventional techniques used by law enforcement agencies are mostly based on generating hot-spot maps [3, 13], which are unique to a specific location and thus cannot be generalized. To overcome this problem, other techniques have been proposed that incorporate background knowledge about spatial features, such as the distance to intersections and highways, schools and businesses, and other information about the neighborhood [14, 15]. Mohler et al. [16] proposed a framework that models future crimes as following on from those currently committed. Another line of research considers the social fabric of the neighborhood as a key factor that influences criminal activities [17, 18]. The most recent works studied the predictive power of mobile network activities for a similar problem [19, 20]. Overall, the historical data needed for the mentioned models is grouped into three main categories: the location of criminals (i.e. the locations where they live and appear, such as escape routes and tourist regions), time and weather, and criminal networks.

As discussed earlier, the main characteristic of the crime prediction methods is to leverage historical data of high crime areas for the prediction model. Despite the fact that crimes mostly happen in high crime neighborhoods, the information in a hot-spot map of a small geographical size does not necessarily represent crime rates in bigger communities. Studies have shown that at the community level activities in hot-spot maps are not always representative of future crime rates [21]. The other challenge in using the hot-spot map approach is that it is location-specific.

As a result, crime prediction of a specific location cannot be easily generalized to the other locations. The location-specific characteristic of hot-spot map models implies that we need to collect enough data from the location of our interest. An alternative approach is to build a generalizable model from a type of data which is freely and publicly available and not restricted to a geographic neighborhood. The social media data has these characteristics in addition to its contextual features that can implicitly represent the socio-behavioral state of the public.

There have been enormous efforts in utilizing micro-blog data to predict real-time notifications, social conflicts, and public health risks [22, 23, 8]. Leveraging user-generated data reveals underlying patterns in different domains. Chen et al. [24] applied the textual content of Twitter, in the form of user language, to detect name-calling harassment. Bollen et al. [25] also successfully implemented trend prediction on Twitter. In their work, individual behaviors are extracted from the content of daily tweets and utilized to predict socio-economic indexes. In another study, Hale et al. [26] studied the validity of the language gap between different locations. In that research, the latent factors extracted from user-generated content are utilized to detect communities.

Compared to other social topics, far too little attention has been paid to on-line user-generated data and its association with crime prediction. Some studies have leveraged the density of data captured from social media in crime prediction. Bogomolov et al. [19] explored the predictive power of data coming from mobile phones, defined as “human behavioral data”, for crime prediction. Similarly, in another study [27], the frequency of violent mobile messages was compared to the residential population for capturing crime hot-spots. However, both studies exploit demographic information, while contextual social data do not contribute to the prediction model.

The idea of applying social data to crime prediction can be observed in the works conducted by Wang et al. [9], Gerber [28], and Chen et al. [29]. The first of these was the first to bring social media context into the problem of crime prediction. Wang et al. [9] extracted event-based topics from posted tweets to predict hit-and-run incidents in Charlottesville, Virginia. Although the approach is novel, the source of data is limited to a set of manually selected news agencies and neglects the vast amount of information contributed by citizens. Also, the assumption that the content of these posts reflects the most recent local events is not always valid. Finally, it is unclear whether the same predictability will be observed when forecasting incidents other than hit-and-runs. Gerber [28] recently utilized social media data to enhance Kernel Density Estimation (KDE), a technique widely adopted by the criminology community. Unlike the previous authors, Gerber did not impose any restrictions on the source of tweets. He also assessed how much improvement can be achieved by adding topics extracted from Twitter for different crime categories. Similarly, Chen et al. [29] utilized the sentiment of Twitter data along with weather conditions in KDE for predicting the locations and times of theft. This study is limited to spatial information such as weather data for specific times and regions. In the mentioned studies, KDE, as a location-dependent technique, cannot be easily generalized to other cities. There are also some types of crime which do not occur in the vicinity of previous incidents, and the population of an area may change frequently [27].

While most of the research on crime prediction is limited to specific locations, crime types, communities and users, or focused on specific events, our proposed approach is one of the first crime prediction approaches that can be generalized. Furthermore, the proposed model learns the direction of changes in crime indexes rather than values. The importance of detecting crime trend direction is that policy makers and law enforcement agencies are mostly interested in determining if the crime in a neighborhood is declining or not. Another advantage of our approach compared to some of the previous research is that it works for a wide range of crime types. Our method does not target any specific communities, keywords, terms, hashtags, or events.

3. Dataset description

We collected Twitter data and crime rates from Chicago, Illinois between July 1, 2010 and November 30, 2013. Chicago was selected due to its importance as the third most populous city in the U.S., as well as being among the top three cities that attracted the highest number of visitors during 2012.¹ During 2013, it was also ranked first in number of murders, second in robberies, and third in number of property crimes, based on the FBI report.²

¹ http://en.wikipedia.org/wiki/Chicago.

² U.S. Department of Justice, FBI: http://www.fbi.gov.

3.1 Crime data

The criminal records were extracted from the Chicago Data Portal.³ This data portal is a rich resource providing all reported incidents on a daily basis, retrieved from the Chicago Police Department system. Information on all crimes reported between July 2010 and November 2013 was collected. Each record contains its timestamp, exact location, and crime type. The dates refer to the time of primary investigation, and the crime type is derived from the FBI classification system. Figure 1 presents the crime rate time series (aggregated rates of all different crime types). The sharp spikes and troughs coincide with some specific events and dates. However, they might be the result of missing data. A major decrease in overall crime rates is observed over the entire period, continuing a decline that started in the US in the 1990s [30].

³ City of Chicago Data Portal: https://data.cityofchicago.org.

Figure 1.

Daily aggregated crime rates.

3.2 Twitter data

In order to retrieve the historical tweets, a set of Twitter users was collected using Coupling From The Past (CFTP) [31]. This approach guarantees a better convergence for perfect sampling of online social networks which is not biased toward active users. Historical timelines of the selected users were retrieved and restricted to the same timeframe, between July 1, 2010 and November 30, 2013. Daily statistics of the number of posts are presented in Fig. 2. The observed spikes in Twitter activity correspond to important events in Chicago. The sharp spike in 2012 coincided with the presidential election in November. The high number of tweets in February 2013 is associated with the Super Bowl Sunday period. The last spike occurred one day after the Chicago Blackhawks won the Stanley Cup.

Figure 2.

Daily number of tweets.

3.3 Auxiliary resources

Further, we also collected other datasets, such as unemployment rates⁴ and weather conditions,⁵ as auxiliary resources to investigate the usefulness of incorporating them with Twitter content and to understand the contribution of content to prediction relative to the other datasets.

⁴ Economic Research Federal: http://research.stlouisfed.org.

⁵ The Weather Channel: http://www.weather.com.

4. Prediction model

Crime index prediction, like the prediction of any non-deterministic signal such as a stock price, is a difficult, if not impossible, task. For example, predicting that exactly 25 homicide incidents will occur the next day seems impossible. On the other hand, the question “what direction might the crime trend take tomorrow?” may, to some extent, have an attainable answer. By “direction” we mean the sign of the change in the signal at $t(i)$ compared to some reference such as $t(i-d)$ , where $d$ is the lag. A positive change means the signal has a rising trend, a negative change means the opposite, and zero indicates no change in the signal between $t(i-d)$ and $t(i)$ .

In contrast to many classification problems, the proposed trend classification does not suffer from a lack of annotated data. Training data is generated by annotating the available content, as learning examples, with knowledge inferred from the trend. This model infers the labels from the environment, events, metadata, or any background knowledge captured from the problem. The approach falls between semi-supervised learning and unsupervised learning. It is similar to semi-supervised learning in that both employ unlabeled examples in the training phase. However, here the available observations are fully unlabeled. In this work, the observations are represented by a set of features such as terms, sentiment, and temporal topics captured from Twitter posts.

4.1 Document generation

The problem of trend prediction is converted to a binary classification problem where the objective is to detect the direction of the target trend. Let $X=\{x_{1},x_{2},\ldots,x_{n}\}$ be a set of temporal examples, or in general temporal data, each of which is defined as a state in time. The state is represented by a vector of features $x_{j}=(f_{1},f_{2},\ldots,f_{|V|})$ , where $V$ is the global vocabulary. Since each state $x_{j}$ is sampled at time $t(j)$ , $X=\bigcup^{n}_{j=1}x_{j}$ is the result of $n$ consecutive samplings. One important pre-processing task for time-series data is smoothing, which increases predictability and reduces noise and outliers. Hypothetically, temporal data, which form a high-dimensional time series, can also be smoothed. In our model, each state is represented by a document, and a naive smoothing is a rolling average over the temporal documents:

$z_{i}=\frac{1}{q}\sum_{j=i-q+1}^{i}x_{j},\quad Z=\bigcup^{n}_{i=1}z_{i},\quad q\in[1,n]$ (1)

where $q$ is the size of the aggregation window and $z_{i}$ is an example, which is represented by a single document. As a result, $Z$ is an $n\times|V|$ document-term matrix, where $V$ is the global vocabulary. The vocabulary $V$ is simply the set of all distinct words appearing in all collected, relevant tweets. Although no keyword search is conducted, a blind filtering, including stopword removal and low-frequency term removal, is applied to the vocabulary. As a result, $z_{i}$ is defined as the average of the documents from day $i$ back to day $i-q+1$ .
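As a minimal sketch of the rolling average in Eq. (1), the snippet below averages each daily term vector with the documents of the preceding days. It assumes that the first few days, which have fewer than $q$ predecessors, are averaged over whatever days are available; the text does not specify this boundary case, and the function name `rolling_average` and the toy data are illustrative only.

```python
from typing import List


def rolling_average(X: List[List[float]], q: int) -> List[List[float]]:
    """Average each daily term vector with up to q-1 preceding days."""
    if q < 1:
        raise ValueError("q must be at least 1")
    Z = []
    for i in range(len(X)):
        # Assumption: the window is truncated at the start of the series.
        start = max(0, i - q + 1)
        window = X[start:i + 1]
        width = len(window)
        # Column-wise mean over the documents in the window.
        Z.append([sum(column) / width for column in zip(*window)])
    return Z


# Three daily documents over a two-term vocabulary (toy counts).
X = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
Z = rolling_average(X, q=2)
```

With `q=1` the documents are returned unchanged, so the window size directly controls how much smoothing is applied.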

In our prediction model, the objective is to transform a prediction problem into a supervised classification task. In other words, we avoid solving a multi-variable regression on a continuous target variable (i.e. the crime index in this research) and prefer a classification based on a categorical target variable (i.e. the crime rate direction or change).

Let $Y=\{y_{1},y_{2},\ldots,y_{n}\}$ be the target time series whose future values are to be predicted. The time series $Y$ is sampled in time steps $t(i)$ , $1\leqslant i\leqslant n$ . To convert regression-based prediction into the classification, the continuous signal $Y$ has to be mapped into a categorical set which is called a set of labels. There are several techniques to infer the labels from a continuous variable such as quantization or the direction of changes in rates. Due to the nature of the research, we adopt trend analysis of the continuous rates for labeling:

$l_{i}=\textit{sgn}(y_{i+d}-y_{i}),\quad \left\{\begin{array}{l}d>0:\ \textit{lag}\\ d\leqslant 0:\ \textit{lead}\end{array}\right.,\quad L=\bigcup_{i=1}^{n-d}l_{i}$ (2)

where $d$ is the lead or lag between the current state ( $z_{i}$ ) and the target label, $l_{i}$ is the label at $t(i)$ , and $L$ is the sequence of labels over $n-d$ consecutive time steps. In the case of crime rate prediction, the label of document $z_{i}$ is the sign of the change in the rate for the chosen lag or lead, $l_{i}=\textit{sgn}(y_{i+d}-y_{i})$ . After inferring the labels, a set of annotated documents is generated by associating the high dimensional temporal data with the one dimensional target labels inferred from the time series of interest: $\forall z_{i}\in Z,z_{i}\rightarrow l_{i}$ , so that $n-d$ training examples of the form $\{(z_{1},l_{1}),\ldots,(z_{n-d},\;l_{n-d})\}$ are generated.
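The labeling rule of Eq. (2) can be sketched as follows for the case $d>0$. The helper names (`sign`, `trend_labels`, `annotate`) and the toy crime counts are hypothetical. The sketch keeps a label of 0 when the rate does not change; how such ties are mapped onto the two classes used later is not stated in the text, so that step is left out.

```python
from typing import List, Sequence, Tuple


def sign(value: float) -> int:
    """Return 1, 0 or -1 according to the sign of value."""
    return (value > 0) - (value < 0)


def trend_labels(y: Sequence[float], d: int) -> List[int]:
    """l_i = sgn(y[i+d] - y[i]) for every i for which y[i+d] exists."""
    if d <= 0:
        raise ValueError("this sketch only covers d > 0")
    return [sign(y[i + d] - y[i]) for i in range(len(y) - d)]


def annotate(Z: Sequence[list], y: Sequence[float], d: int) -> List[Tuple[list, int]]:
    """Pair the first n-d documents with their inferred labels."""
    labels = trend_labels(y, d)
    return list(zip(Z[:len(labels)], labels))


# Toy daily crime counts (not taken from the dataset).
y = [10, 12, 12, 9, 11]
labels = trend_labels(y, d=1)
pairs = annotate([[0.1], [0.2], [0.3], [0.4], [0.5]], y, d=2)
```

Because the last $d$ documents have no future rate to compare against, the annotated set is $d$ examples shorter than the document collection.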

4.2 Proposed model for crime prediction

The objective of the proposed method is to predict whether the crime rate will increase or decrease in the prospective time-frame. Therefore, a set of training data ( $D$ ) is given to a binary classifier as follows:

$D=\{(z_{i},l_{i})|z_{i}\in R^{|V|},l_{i}\in\{-1,1\}\},1\leqslant i\leqslant n-d$ (3)

where in our target problem (crime trend prediction) $z_{i}$ and $l_{i}$ are defined as follows:

Aggregated tweets at the time slice $i$ $(z_{i})$ : All tweets which have been posted at the time slice $i$ (for instance day $i$ ) are aggregated as a single document ( $z_{i}$ ). Several preprocessing tasks, such as low-frequency term reduction, stopword removal, stemming, sentiment detection, and topic identification, may be applied to $z_{i}$ .

Trend direction, class label at the time slice $i$ $(l_{i})$ : It is derived from the changes in the crime indexes when comparing the current index ( $i$ ) with the index of ( $i+d$ ), where $d$ is the time interval such as one day or one week (see Eq. (2)).
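The aggregation of tweets into one document per time slice can be illustrated with a small sketch. It assumes whitespace tokenization, a toy stopword list, and a minimum-count threshold for low-frequency terms; none of these specifics come from the text, and stemming is omitted. All function names are illustrative.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumption: a tiny illustrative stopword list, not the one used in the study.
STOPWORDS = {"the", "a", "is", "to", "and", "in"}


def daily_documents(tweets: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Merge all tweets posted on the same day into one bag of words."""
    docs: Dict[str, Counter] = defaultdict(Counter)
    for day, text in tweets:
        tokens = [t for t in text.lower().split() if t not in STOPWORDS]
        docs[day].update(tokens)
    return dict(docs)


def build_vocabulary(docs: Dict[str, Counter], min_count: int = 2) -> List[str]:
    """Keep terms whose total count reaches min_count (hypothetical cut-off)."""
    totals: Counter = Counter()
    for counts in docs.values():
        totals.update(counts)
    return sorted(term for term, count in totals.items() if count >= min_count)


def to_matrix(docs: Dict[str, Counter], vocab: List[str]) -> List[List[int]]:
    """Build the document-term matrix with one row per day, in date order."""
    return [[docs[day][term] for term in vocab] for day in sorted(docs)]


tweets = [
    ("2013-02-03", "watching the game tonight"),
    ("2013-02-03", "game night in chicago"),
    ("2013-02-04", "traffic is slow in chicago"),
]
docs = daily_documents(tweets)
vocab = build_vocabulary(docs, min_count=2)
matrix = to_matrix(docs, vocab)
```

Note that no crime-related keywords are used to select tweets; the only filtering is the blind removal of stopwords and rare terms.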

Terms as features refer to the unigram model, without filtering for any specific keywords. One might speculate that we should collect keywords that emphasize offensive language implying a rough context. Nevertheless, content is rich data containing valuable hidden variables, including activities, topics of discussion, public interests, and sentiments, which are not necessarily carried by offensive language.

Sentiments are captured as another set of predictive features. Linguistic Inquiry and Word Count (LIWC) [32] was applied to extract the polarity of five different sentiments: “positive”, “negative”, “anxiety”, “anger” and “sad”. Sentiment features are estimated by the frequency of the words in each sentiment category. Let $\textit{Score}_{s}^{z_{i}}$ be the score given by LIWC for the document $z_{i}$ , where $s\in$ {positive, negative, anxiety, anger, sad}. The scores are normalized using the maximum and minimum sentiment scores of the corresponding document:

$\textit{Score}_{s}^{\prime\,z_{i}}=\frac{\textit{Score}_{s}^{z_{i}}-\textit{Score}_{s_{\textit{min}}}^{z_{i}}}{\textit{Score}_{s_{\textit{max}}}^{z_{i}}-\textit{Score}_{s_{\textit{min}}}^{z_{i}}}$ (4)
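A minimal sketch of the min-max normalization in Eq. (4) is given below. The raw scores are made-up numbers standing in for LIWC output, and returning zeros when all five scores are equal is an assumption added to avoid division by zero; the text does not cover that case.

```python
from typing import Dict


def normalize_scores(scores: Dict[str, float]) -> Dict[str, float]:
    """Min-max normalize the sentiment scores of a single document."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # Assumption: identical scores are mapped to 0.0 for every sentiment.
        return {sentiment: 0.0 for sentiment in scores}
    return {s: (value - low) / (high - low) for s, value in scores.items()}


# Hypothetical raw scores for one daily document.
raw = {"positive": 4.0, "negative": 2.0, "anxiety": 0.5, "anger": 1.0, "sad": 0.5}
normalized = normalize_scores(raw)
```

After normalization, the strongest sentiment of a document is 1 and the weakest is 0, which makes documents of very different lengths comparable.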

Figure 3 demonstrates the daily scores of the different sentiments over the observation period. The figure indicates that the overall “negative” rates have increased during the past four years compared to the other sentiments.

Figure 3.

Sentiment scores during the observation time.

5. Temporal topic model

Topic models are extensively applied to many text mining tasks, either to indicate the similarities between two sets of documents [33] or to map high dimensional documents to a set of well-structured variables for visualization and exploration [34]. Moreover, with the increasing amount of user-generated data in microblogs, there has been a great demand for topic models that learn meaningful patterns from data. Latent Dirichlet Allocation (LDA) [35] is the most popular topic modeling approach for identifying latent topics in a corpus, where documents can be assigned to a set of semantic topics. LDA is well-studied in document clustering [36, 37], indicating the similarities between two sets of documents [33], finding trending topics [38], and event discovery [39]. In fact, in our study, LDA converts a high dimensional feature matrix to a low dimensional abstraction of documents, which is important in text analytics. Although LDA ignores word order, the documents in our dataset (tweets) are short, so much of the information can be inferred from the bag of words without considering the order. In addition, in contrast to deep analysis approaches such as word2vec [40], LDA leverages local context, which captures many semantic relationships within a specific domain.

Table 1
The list of notations employed in this section

Notation	Description
$Z$	$n\times|V|$ sparse document-term matrix.
$n$	Number of documents.
$V$	Global vocabulary.
$m$	Size of a partition.
$T$	A topic indexed by $k$, where $1\leqslant k\leqslant K$ and $K$ is the total number of selected topics.
$P$	Documents grouped in partition $t$ .
$Q$	Documents grouped in partition $t+m$ .
$K^{P}$	Total number of topics for partition $P$ .
$K^{Q}$	Total number of topics for partition $Q$ .
$z_{i}$	$i^{th}$ document in $Z$ .
$w$	A word in document $z_{i}$ .
$Z^{K}$	Document-topic matrix of size $n\times K$ .
$\alpha,\beta$	Dirichlet prior parameters.
$\theta$	Topic distribution for a document.
$\phi$	Word distribution of a topic.
$Z_{(-P)}$	Documents from $Z$ excluding $P$ .
$Z_{(-Q)}$	Documents from $Z$ excluding $Q$ .

In LDA, the inputs are bag-of-words representations of documents and the outputs consist of latent topics. A topic in LDA is a multinomial distribution over the words in the vocabulary, while a document is a multinomial distribution over topics. In LDA, documents are given as a batch, which builds a static vocabulary for inferring topic distributions. However, in temporal analysis, where information changes over time and upcoming text documents most likely carry new terms, predefining a fixed vocabulary is not practical and raises many issues. First, topics have been shown to have birth and death [11] when extracted from temporal text streams. Therefore, there is a significant need for dynamic vocabularies over time, to address emerging terms in topic inference and to fade out vocabulary which is no longer popular. Second, in static LDA, insignificant topics have a high chance of becoming involved. Because of the broad range of documents (in our model, daily aggregations of conversations over a long period of time), topics consisting of common words are more likely to be generated if the documents are given as a single group for topic inference. Third, topics related to significant social events may not be captured. As an example, in Fig. 2 a high number of tweets was observed during specific days which coincide with some events. If documents collected during a long period of time are given as a single batch, topics related to a specific event are less likely to be inferred, due to the large number of words observed over time.

To tackle the aforementioned issues and to identify an efficient temporal topic model for trend prediction, the following characteristics have to be addressed:

A model whose size does not grow over time. The vocabulary is regenerated, and terms that were seen previously but no longer appear fade away.

A model which can infer emerging topics over time.

A model whose topics do not converge after a long period of time. Convergence can be prevented by training a new LDA model for each period of time rather than transferring learned parameters from the previously trained model.

A model which can detect domain specific topics related to the significant events by regenerating vocabulary during different periods of time.

A model which can handle sequential data and can be easily updated by introducing new documents.

A range of different approaches was proposed for time varying topic identification [41, 42]. In a temporal based topic model such as Online LDA [38], topics are extracted from the current time slice. Reassigning topics for new documents is performed by updating parameters based on the previous model, whereas we aim at detecting different sets of topics without the contribution of the previous model. Our proposed approach identifies different sets of topics in each time slice and the accumulation of detected temporal topics generates the main identified features for the prediction model. In this case, the temporal topic detection model constructs a dynamic vocabulary over time.

The generic procedure of the proposed temporal topic model is presented in Algorithm 1. In this model, an LDA model is trained separately for each time slice (partition) and a set of topics is inferred for each partition. A time slice or partition is a unit of time (a month, a year, $\ldots$ ) into which documents are grouped based on their timestamps. In every iteration, topic similarities between two partitions are estimated. More precisely, the approach measures how similar the topics extracted in the prospective time slice ( $\textit{partition}_{t+1}$ ) are to those of the current time slice ( $\textit{partition}_{t}$ ). The topics identified at $\textit{partition}_{t+1}$ which are similar to the topics already detected at $\textit{partition}_{t}$ are not selected. After the proper topics have been selected, topic distributions are inferred for the other unseen documents, based on the LDA models corresponding to the partitions. Table 1 presents the notation used to describe the temporal model along with the definitions. The following subsections discuss the major steps of the proposed temporal topic model.

Algorithm 1. Procedures of the temporal topic model

while $i\leqslant u$ do ( $u$ is the number of partitions)
    $P_{j}\leftarrow\textit{partition}_{j}$
    $Q_{j}=\textit{partition}_{j+1}$
    $\textit{lda}_{j}=\textit{lda}(P_{j},k^{P_{j}})$ (estimating LDA parameters based on the training data in $P_{j}$ )
    $Z^{K^{P_{j}}}\leftarrow\textit{lda}_{j}[Z]$ (inferring topic distribution based on trained model $\textit{lda}_{j}$ )
    $\textit{lda}_{j+1}=\textit{lda}(Q_{j},k^{Q_{j}})$ (estimating LDA parameters based on the training data in $Q_{j}$ )
    $Z^{K^{Q_{j}}}=\textit{lda}_{j+1}[Z]$ (inferring topics for each document based on trained model $\textit{lda}_{j+1}$ )
    while $k\leqslant K^{Q_{j}}$ do
        $\textit{Sim}(T_{k}^{Q_{j}},T_{K}^{P_{j}})=\sum\limits_{h=1}^{K^{P_{j}}}\textit{Distance}(T_{k}^{Q_{j}},T_{h}^{P_{j}})$
        if $\textit{Sim}(T_{k}^{Q_{j}},T_{K}^{P_{j}})<\textit{threshold}$ then
            Regenerate $Z^{K^{Q_{j}}}$
        end if
    end while
end while
$Z_{n*K}=$ accumulation of all $Z^{K^{Q_{j}}}$ , $Z^{K^{P_{j}}}$
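The overall loop of Algorithm 1 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation used in the experiments: `train_topics` is a hypothetical stand-in for fitting an LDA model on one partition, `distance` is any non-negative dissimilarity in which larger values mean less similar topics, and a topic of the next partition is kept when its summed distance to the topics of the current partition reaches the threshold.

```python
def select_temporal_topics(partitions, train_topics, distance, threshold):
    """Collect topics over consecutive partitions, keeping only novel ones.

    train_topics(docs) -> list of topics (hypothetical stand-in for LDA training)
    distance(a, b)     -> non-negative float, larger means more dissimilar
    """
    if not partitions:
        return []
    # Train one independent topic model per partition (no parameter transfer).
    topics_per_partition = [train_topics(docs) for docs in partitions]
    selected = list(topics_per_partition[0])
    for j in range(len(partitions) - 1):
        topics_p = topics_per_partition[j]       # current time slice P_j
        topics_q = topics_per_partition[j + 1]   # next time slice Q_j
        for topic_q in topics_q:
            summed = sum(distance(topic_q, topic_p) for topic_p in topics_p)
            # Assumption: a large summed distance marks an emerging topic.
            if summed >= threshold:
                selected.append(topic_q)
    return selected


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of words."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def toy_topics(docs):
    """Toy stand-in: every document becomes one 'topic' (its set of words)."""
    return [set(doc.split()) for doc in docs]


partitions = [["gun police"], ["gun police", "cancer campaign"]]
selected = select_temporal_topics(partitions, toy_topics, jaccard_distance, 0.5)
```

In the toy run, the repeated topic of the second partition is dropped and only the emerging one is added to the accumulated set.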

Figure 4.

Document partitioning.

5.1 Document partitioning

Given all documents $Z$ with their timestamps, the documents are placed into different partitions. As an example, if the observation period is 12 months and the size of partition is one month, the documents are partitioned monthly according to their timestamps. In each iteration, two different sequential partitions are processed for topic inference. The documents in the first partition are considered as a true representation ( $P$ ) and the second partition ( $Q$ ) denotes the model (Fig. 4). Therefore, partitions are created as follows:

$\displaystyle P_{j}=\bigcup^{m}_{r=1}z_{[(j-1)m+r]},\qquad Q_{j}=\bigcup^{m}_{r=1}z_{[jm+r]}$ (5)
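As a rough illustration of Eq. (5), the following Python sketch groups a time-ordered list of documents into consecutive partitions of $m$ documents and pairs each partition $P_{j}$ with its successor $Q_{j}$. The function names and the toy documents are hypothetical, and the sketch assumes that the documents are already sorted by timestamp.

```python
def make_partitions(docs, m):
    """Split a time-ordered list of documents into consecutive blocks of m."""
    return [docs[start:start + m] for start in range(0, len(docs), m)]


def consecutive_pairs(partitions):
    """Yield (P_j, Q_j), where Q_j is the partition that follows P_j."""
    for j in range(len(partitions) - 1):
        yield partitions[j], partitions[j + 1]


docs = ["z1", "z2", "z3", "z4", "z5", "z6"]   # toy documents in time order
partitions = make_partitions(docs, m=2)
pairs = list(consecutive_pairs(partitions))
```

With $m=2$, the first pair consists of $P_{1}=\{z_{1},z_{2}\}$ and $Q_{1}=\{z_{3},z_{4}\}$, which matches the index ranges $(j-1)m+r$ and $jm+r$ for $j=1$.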

5.2 Topic inference

In the temporal model, the LDA parameters are inferred as follows:

1. For each topic $k^{P_{j}}=[1,K]$ :
2. Draw $\phi_{k}^{P_{j}}\sim\textit{Dirichlet}(\beta^{P_{j}})$
3. For document $z_{i}\in P_{j}$ :
4. Draw a distribution over topics, $\theta^{P_{j}}_{z_{i}}\sim\textit{Dirichlet}(\alpha^{P_{j}})$
5. For each word in document $w\in z_{i}$ :
6. Draw a topic $T\sim\textit{Multinomial}(\theta^{P_{j}}_{z_{i}})$
7. Draw a word $w\sim\textit{Multinomial}(\phi^{P_{j}}_{T})$

At the arrival of a new document for the next partition ( $Q_{j}$ ), the LDA parameters are updated using the same procedure that has been used for $P_{j}$ :

1. For each topic $k^{Q_{j}}=[1,K]$ :
2. Draw $\phi_{k}^{Q_{j}}\sim\textit{Dirichlet}(\beta^{Q_{j}})$
3. For document $z_{i}\in Q_{j}$ :
4. Draw a distribution over topics, $\theta^{Q_{j}}_{z_{i}}\sim\textit{Dirichlet}(\alpha^{Q_{j}})$
5. For each word in document $w\in z_{i}$ :
6. Draw a topic $T\sim\textit{Multinomial}(\theta^{Q_{j}}_{z_{i}})$
7. Draw a word $w\sim\textit{Multinomial}(\phi^{Q_{j}}_{T})$
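The generative steps listed above for a single partition can be simulated with a short Python sketch. It is meant only to make the sampling order concrete: it assumes symmetric Dirichlet priors, draws each word from the word distribution of the topic sampled for that word (as in standard LDA), and uses a hypothetical toy vocabulary. It does not perform posterior inference.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Symmetric Dirichlet draw obtained by normalising Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_partition(vocab, num_topics, num_docs, doc_len, alpha, beta, seed=0):
    rng = random.Random(seed)
    # Steps 1-2: one word distribution phi_k per topic.
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_topics)]
    thetas, docs = [], []
    for _ in range(num_docs):
        # Steps 3-4: a topic distribution theta for this document.
        theta = sample_dirichlet(alpha, num_topics, rng)
        words = []
        for _ in range(doc_len):
            # Step 6: draw a topic; step 7: draw a word from that topic.
            topic = rng.choices(range(num_topics), weights=theta)[0]
            words.append(rng.choices(vocab, weights=phi[topic])[0])
        thetas.append(theta)
        docs.append(words)
    return phi, thetas, docs


vocab = ["gun", "game", "cancer", "campaign"]   # toy vocabulary
phi, thetas, docs = generate_partition(vocab, num_topics=2, num_docs=3,
                                       doc_len=5, alpha=0.5, beta=0.5, seed=1)
```

Running the same function on the documents of $Q_{j}$ with its own vocabulary corresponds to the second listing, since both partitions follow the same procedure with separate parameters.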

The process of topic inference continues by taking the next two partitions (Eq. (5)). By applying new partitions, the dictionary is periodically updated over time and does not become too large.

5.3 Topic selection

In temporal topic detection, every two partitions are considered heterogeneous sources since they were generated at different times. Accordingly, the topics derived from each partition, which serve as the predictive variables in our prediction model, vary because of information emerging over time. Figure 5 presents the frequency of the top 2000 words (stemmed) over two consecutive years (2012 and 2013). From Fig. 5, we can also observe that word frequencies vary over time. As an example, “gun”, “energi”, and “basketball” were popular in 2012, while “cancer” and “campaign” were more popular terms in 2013. However, some words such as “data” and “develop” recur over the two years with similar frequencies.

In the temporal topic model, we also address topic evolution by ignoring topics repeated over time and selecting emerging topics in new partitions. This topic selection process allows us to select topics which are diverse enough to represent emerging context and to provide more predictive features. Topic selection is implemented in two steps. First, topic similarities are calculated; then, based on a predetermined threshold value, topics with a similarity smaller than the threshold are selected.

Figure 5.

Word frequency over two different years. The second graph is a zoomed version of the first graph.

Similarities between topics captured in partitions $P_{j}$ and $Q_{j}$ are processed on a one-to-one level. Two different distance measures, the Jaccard index and KL-divergence, were applied. While the Jaccard index represents information flow at the word level, KL-divergence also takes word distributions into account. Jaccard addresses emerging words in the selection of topics, whereas KL-divergence measures a non-symmetric relation between topics and explains how diverse the upcoming topics ( $K^{Q_{j}}$ ) are compared to those of the current time slice ( $K^{P_{j}}$ ). Each topic inferred from $Q_{j}$ is compared with all the topics inferred from $P_{j}$ . The topic similarity for each topic $k$ , where $k\in K^{Q_{j}}$ , is calculated as follows:

$\textit{Sim}(T_{k}^{Q_{j}},T_{K}^{P_{j}})=\sum\limits_{h=1}^{K^{P_{j}}}\textit{Distance}(T_{k}^{Q_{j}},T_{h}^{P_{j}})$ (6)

where the distances are summed if a one to one linkage has low similarity. Distance is the distance function calculated based on the Jaccard $(\textit{Distance}_{J})$ or KL-divergence $(\textit{Distance}_{KL})$ measurements.
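To make Eq. (6) concrete, the sketch below computes both distance measures for toy topics and sums them over all topics of $P_{j}$. Several details are assumptions made for illustration only: topics are represented as word-to-probability dictionaries, the Jaccard measure is taken as one minus the Jaccard index of the two word sets, and KL-divergence is computed from the topic in $Q_{j}$ to the topic in $P_{j}$ with a small smoothing constant so that unseen words do not produce infinite values.

```python
import math


def jaccard_distance(topic_q, topic_p):
    """1 - Jaccard index of the two topics' word sets (word-level novelty)."""
    words_q, words_p = set(topic_q), set(topic_p)
    union = words_q | words_p
    if not union:
        return 0.0
    return 1.0 - len(words_q & words_p) / len(union)


def kl_distance(topic_q, topic_p, eps=1e-9):
    """Smoothed KL(q || p) over the union vocabulary (asymmetric)."""
    vocab = set(topic_q) | set(topic_p)
    q = {w: topic_q.get(w, 0.0) + eps for w in vocab}
    p = {w: topic_p.get(w, 0.0) + eps for w in vocab}
    q_total, p_total = sum(q.values()), sum(p.values())
    return sum((q[w] / q_total) * math.log((q[w] / q_total) / (p[w] / p_total))
               for w in vocab)


def summed_distance(topic_q, topics_p, distance):
    """Eq. (6): sum of distances from one topic of Q_j to every topic of P_j."""
    return sum(distance(topic_q, topic_p) for topic_p in topics_p)


topics_p = [{"gun": 0.6, "police": 0.4}, {"game": 0.7, "team": 0.3}]
repeated = {"gun": 0.6, "police": 0.4}        # already present in P_j
emerging = {"cancer": 0.5, "campaign": 0.5}   # shares no words with P_j

jaccard_repeated = summed_distance(repeated, topics_p, jaccard_distance)
jaccard_emerging = summed_distance(emerging, topics_p, jaccard_distance)
kl_repeated = summed_distance(repeated, topics_p, kl_distance)
kl_emerging = summed_distance(emerging, topics_p, kl_distance)
```

Under these assumptions, the topic that repeats an existing topic of $P_{j}$ receives a smaller summed distance than the topic built from previously unseen words, for both measures.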

The next step is to rank and select novel topics. Topic novelty is measured by one of the two characteristics as follows: (i) a novel topic should have a different word distribution compared to the previous partition; or (ii) a novel topic should introduce emerging words to the dictionary.

So far, for each pair of partitions, the asymmetric one-to-one distances between topics have been measured. To select the best emerging topics, a hybrid score is calculated as a linear combination of the two distance measures. All topics in $Q$ are then ranked by the mean of the scores given by the distance measures.

$\textit{Score}(T_{k}^{Q_{j}})=\frac{\sum\limits_{h=1}^{K^{P_{j}}}\textit{Distance}_{KL}(T_{k}^{Q_{j}},T_{h}^{P_{j}})+\sum\limits_{h=1}^{K^{P_{j}}}\textit{Distance}_{J}(T_{k}^{Q_{j}},T_{h}^{P_{j}})}{2}$ (7)
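Eq. (7) can be illustrated with a few lines of Python. The sketch assumes that the summed KL and Jaccard distances of every topic in $Q_{j}$ have already been computed (the numbers below are made up for illustration) and that a larger score indicates a more novel topic, so that topics are ranked from the highest score to the lowest.

```python
def hybrid_scores(kl_sums, jaccard_sums):
    """Eq. (7): mean of the summed KL and summed Jaccard distances per topic."""
    return [(kl + jac) / 2.0 for kl, jac in zip(kl_sums, jaccard_sums)]


def rank_topics(scores):
    """Topic indices ordered from the highest score to the lowest."""
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)


# Hypothetical summed distances for three topics of Q_j.
kl_sums = [0.4, 3.0, 1.2]
jaccard_sums = [1.0, 2.0, 1.8]

scores = hybrid_scores(kl_sums, jaccard_sums)
ranking = rank_topics(scores)
```

Because the two measures live on different scales (Jaccard is bounded per pair of topics, KL is not), rescaling them before averaging may be worth considering in practice; the sketch applies the plain mean of Eq. (7).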

5.4 Document representation

After topic inference and selection, each document ( $z_{i}$ ) is represented by a set of novel topics. If we assume $K$ is the overall number of selected topics, each document is represented by a vector of topic distributions as follows:

$\displaystyle z_{i}=(T_{1},T_{2},\ldots,T_{K}),\qquad T_{k}\in[0,1],\ 1\leqslant k\leqslant K$ (8)

Since the topics were extracted from different partitions, we applied a normalization to standardize the topic distributions. Each topic distribution is normalized with respect to the partition from which the topic was inferred. As an example, if the topic was extracted from partition $P_{j}$ , then the score is calculated using the minimum and maximum topic distribution in $P_{j}$ as follows:

$T_{k}^{\prime P_{j}}=\frac{T_{k}^{P_{j}}-T_{\textit{min}}^{P_{j}}}{T_{\textit{max}}^{P_{j}}-T_{\textit{min}}^{P_{j}}}$ (9)

where $T_{k}^{P_{j}}$ refers to the topic distribution for the document $z_{i}$ of partition $P_{j}$ .
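The min-max normalization of Eq. (9) can be sketched as follows. The sketch assumes that the minimum and maximum are taken over the topic weights observed within one partition, and it maps a constant list to zeros to avoid division by zero; both choices, as well as the example values, are illustrative assumptions.

```python
def min_max_normalize(values):
    """Eq. (9): rescale the topic weights of one partition to [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a constant column carries no information, map it to 0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical weights of one topic for three documents of partition P_j.
weights_in_partition = [0.10, 0.25, 0.40]
normalized = min_max_normalize(weights_in_partition)
```

Normalizing per partition keeps topic weights comparable when the final feature vector mixes topics inferred from different LDA models.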

Figure 6.

Training and test data are split into different sizes. In each experiment, the F-measure is presented for the holdout and progressive CV.

6. Experimental results and discussions

In this section, the experimental results are presented based on the contribution of different features. In the content-based model, the predictability of different smoothing windows was examined. In addition, a set of experiments was conducted to study the predictability of the content compared with auxiliary features. We also present how a prediction is different with the availability of historical data. For the topic model, experiments indicate that there is a need for an appropriate temporal model for detecting latent topics. It is shown how the topics are variant when inferred from the temporal model in terms of document-topic and term-topic matrices. Moreover, we also examined the predictability of topics detected by the temporal model compared with the batch model. Similar to the content-based model, the predictability over different crime types as well as different lags is presented.

For the classifier, we applied LinearSVC, which is an implementation based on liblinear [43]. LinearSVC is faster than LinearSVM [44], since kernel transforms are not used, and it scales better to large datasets in a linear classification problem. For topic identification, Online LDA proposed by Hoffman et al. [45] was applied. Their model uses variational Bayes for posterior inference, which has been shown to be faster for the analysis of large datasets. Since the model identifies novel topics in each iteration and adds them to the final set of topics, the number of topics ( $K$ ) is not predefined prior to topic extraction. Features are normalized to the range $[0,1]$ . In the topic extraction phase, we applied partitions of different sizes, ranging from yearly to monthly. The baseline is batch LDA with no time consideration.

The evaluation is performed by calculating the macro-averaged F-measure under two different scenarios, holdout and progressive Cross-Validation (CV) [46]. The dataset is divided into two sets: training and test data. In the holdout scenario, the training data consist of documents 1 to $i$ , and the test data consist of documents $i+1$ to $n$ , where $n$ is the size of the dataset. The prediction is made by a single trained model. In the progressive CV scenario, however, the classifier needs to be retrained $n-i$ times. The training set initially consists of the first $i$ documents and the model is tested on the $(i+1)$ th document. In the second iteration, the training set is extended by one document (the first $i+1$ ) and the model is tested on the $(i+2)$ th document. This process continues until all the test data are classified. The model is therefore updated day by day, and with each iteration the training window moves one day forward. In this way, for continuous daily prediction, the content of current tweets is not missed and the training data are updated on a daily basis, which can be a remedy for model deterioration. Nevertheless, we are also interested in how often the trained model should be updated.
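The two evaluation scenarios and the macro-averaged F-measure can be sketched in Python as follows. The sketch uses 0-based indices, assumes that the progressive training set grows by one document per step (the first $i$ , then the first $i+1$ , and so on), and computes the macro average over the two labels $+1$ and $-1$. The function names are illustrative and do not refer to any library.

```python
def holdout_split(n, i):
    """Train on the first i documents, test on the remaining n - i."""
    return list(range(i)), list(range(i, n))


def progressive_cv_splits(n, i):
    """Yield (train_indices, test_index); the model is retrained n - i times."""
    for t in range(i, n):
        yield list(range(t)), t


def macro_f1(y_true, y_pred, labels=(+1, -1)):
    """Unweighted mean of the per-label F-measures."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


train_idx, test_idx = holdout_split(n=5, i=3)
progressive = list(progressive_cv_splits(n=5, i=3))
score = macro_f1([+1, +1, -1, -1], [+1, -1, -1, -1])
```

In the progressive scenario, each yielded split would be used to fit a fresh classifier on `train_indices` and to predict the single document at `test_index`; the predictions collected over all splits are then scored once with `macro_f1`.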

To illustrate the performance of the two evaluation scenarios, the dataset is split into training and test sets of different sizes. The results are displayed in Fig. 6. The figure illustrates that in most cases the progressive CV outperformed the holdout. For long-term prediction, updating the trained model is therefore crucial. However, as shown in Fig. 6, when only 10% of the data is left for testing, the holdout is sufficient to yield a decent performance and avoids the cost of retraining the classifier many times.

6.1 Content-based prediction

In content-based prediction, documents are represented as n-grams where $n\in\{1,2,3\}$ . We removed stopwords and low-frequency terms. The documents were represented with both binary and tf-idf representations. The best results were achieved using $n=1$ and binary representations. In the following subsections, we explain the results of using content for the targeted prediction.
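A binary n-gram representation of the kind described above can be sketched in plain Python. The whitespace tokenizer, the stopword list, and the document-frequency cut-off are hypothetical choices for illustration; the preprocessing actually used in the experiments is not specified at this level of detail.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def binary_features(docs, n_values=(1,), stopwords=frozenset(), min_doc_freq=1):
    """Return (vocabulary, binary document-term matrix)."""
    tokenized = [[t for t in doc.lower().split() if t not in stopwords]
                 for doc in docs]
    doc_terms = [{g for n in n_values for g in ngrams(tokens, n)}
                 for tokens in tokenized]
    # Count in how many documents each term occurs, then drop rare terms.
    doc_freq = {}
    for terms in doc_terms:
        for term in terms:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    vocab = sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)
    matrix = [[1 if term in terms else 0 for term in vocab]
              for terms in doc_terms]
    return vocab, matrix


docs = ["the game was great", "great game tonight", "the traffic was bad"]
vocab, matrix = binary_features(docs, n_values=(1,),
                                stopwords={"the", "was"}, min_doc_freq=2)
```

Passing `n_values=(1, 2, 3)` would add bigrams and trigrams to the vocabulary; each matrix entry only records whether a term occurs in a document, not how often.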

6.1.1 Smoothing temporal data

As discussed in Section 4, each temporal document ( $z_{i}$ ) is generated using different smoothing windows (see Eq. (1)). In this part, the results of the experiments with different aggregation windows $q$ , where $q\in[1,7]$ , are presented. The F-measure for each crime type is reported in Table 2. While the results vary across crime types, daily ( $q=1$ ) aggregation is considered to be the best window size.

Table 2
The prediction performance based on different aggregation windows ( $q$ )

Crime type	Frequency	$q=1$	$q=2$	$q=3$	$q=4$	$q=5$	$q=6$	$q=7$
Total	1,137,790	0.63	0.63	0.64	0.61	0.64	0.62	0.60
Theft	247,617	0.67	0.65	0.66	0.66	0.64	0.60	0.60
Battery	204,041	0.78	0.80	0.72	0.72	0.73	0.65	0.59
Narcotics	124,890	0.78	0.74	0.71	0.62	0.63	0.64	0.59
Criminal damage	120,934	0.74	0.74	0.68	0.71	0.70	0.64	0.61
Burglary	79,420	0.73	0.72	0.71	0.66	0.65	0.68	0.56
Assault	65,954	0.65	0.66	0.61	0.63	0.61	0.62	0.59
Other offense	63,672	0.66	0.68	0.64	0.69	0.61	0.61	0.57
Motor vehicle theft	57,227	0.61	0.60	0.56	0.60	0.62	0.63	0.58
Robbery	45,458	0.58	0.56	0.53	0.58	0.60	0.57	0.60
Deceptive practice	40,917	0.72	0.73	0.69	0.64	0.65	0.63	0.56
Criminal trespass	28,682	0.66	0.65	0.64	0.64	0.60	0.61	0.61
Weapons violation	12,408	0.58	0.60	0.62	0.63	0.65	0.62	0.56
Public peace violation	10,661	0.63	0.63	0.63	0.61	0.60	0.58	0.58
Offense involving children	7,343	0.65	0.61	0.60	0.59	0.72	0.59	0.63
Prostitution	7,311	0.77	0.75	0.71	0.57	0.70	0.60	0.61
Crime sexual assault	4,330	0.63	0.67	0.68	0.57	0.56	0.64	0.61
Sex offense	3,344	0.60	0.63	0.63	0.57	0.57	0.54	0.60
Interference with public officer	2,982	0.57	0.60	0.55	0.67	0.55	0.59	0.58
Gambling	2,587	0.64	0.58	0.59	0.63	0.59	0.60	0.62
Liquor law violation	1,939	0.62	0.68	0.65	0.65	0.63	0.61	0.55
Homicide	1,547	0.55	0.56	0.56	0.57	0.59	0.58	0.59
Arson	1,542	0.60	0.57	0.57	0.56	0.55	0.53	0.58

6.1.2 The impact of historical data

Another set of experiments was conducted to measure the impact of historical data on prediction performance, that is, to find out whether the crime trend becomes more predictable as more historical data are observed. In contrast to the previous experiments, the size of the test data remains unchanged (August 2013 to November 2013), and the training data initially consist of the 31 most recent days of historical data (July 2013). In the next experiment, the training data are extended by sliding the start of the training window 31 more days into the past. Thus, in each experiment, the size of the training data is increased by including more documents retrospectively. The experiments are repeated until all the historical data are included. Figure 7 depicts the results with the different historical training windows for all the incidents. The highest predictability is obtained when the whole historical data contribute to the prediction model. However, the result obtained with seven months of training data is comparable to the overall performance, and adding more historical data contributes little.
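The retrospective growth of the training window can be sketched as a small generator. The sketch works on month labels rather than on 31-day blocks, which is a simplification, and the labels other than June and July 2013 are hypothetical.

```python
def retrospective_windows(train_periods):
    """Yield growing training windows that always end at the most recent period.

    train_periods must be ordered from oldest to newest.
    """
    for size in range(1, len(train_periods) + 1):
        yield train_periods[-size:]


periods = ["2013-05", "2013-06", "2013-07"]   # oldest to newest
windows = list(retrospective_windows(periods))
```

Each window would be used to train a separate classifier that is evaluated on the same fixed test period, so that differences in the F-measure reflect only the amount of history included.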

Figure 7.

Test data consist of documents between August 2013 and November 2013. The first experiment used training data from July 2013. For the next experiment, the training window is extended by one more month retrospectively (June 2013 and July 2013). The experiments were repeated until the whole historical training data was included. The figure shows the F-measure for each experiment. For some of the results, the period of the contributing training data is indicated.

6.2 Performance of content vs. auxiliary features

Although the main contribution of the paper is to study the correlation between content and crime trend, we also employ auxiliary datasets which are widely applied in crime prediction. As discussed before, several studies have investigated the incorporation of socio-economic indexes and spatio-temporal features in crime prediction [47]. We apply these resources in our prediction model to understand the contribution of the content-based features in comparison to the other predictive variables. The selected non-content features are as follows:

• Unemployment rate: Unemployment rates have been shown to have a direct relationship with crime rates [48, 49]. These rates were used as a socio-economic factor and were obtained as discussed in the dataset section.
• Weather: The normalized monthly average temperature was also employed, as it has been shown to be effective in crime index prediction [47, 29].
• Crime rates: Crime rates were employed as another set of features. As discussed before, conventional prediction models employ historical crime records to predict future incidents. In our model, the crime record at time $t$ is labeled with the crime record at time $t+d$ , where $d$ is the lag. The idea is to investigate how predictive a crime rate is of future records.
• Number of tweets: The number of tweets per day, normalized between 0 and 1.
• Day of week: The day of the week on which a document is generated.
• Events: The days before and after a set of specific events such as Halloween, Thanksgiving, Christmas, New Year’s Day, Martin Luther King Day, Valentine’s Day, St Patrick’s Day, 4th of July, Super Bowl, and the Presidential election.

We evaluated the performance of each feature as well as the content-based features (n-grams) in predicting crime rate directions with lags up to 7. Figure 8 presents the performance of “day of week”, “number of tweets”, and “content” for predicting the increase and the decrease of crime rates. The results indicate that content-based features significantly improve the F-measure, whereas the other features did not provide comparable results. The remaining features, such as “unemployment rate” and “events”, could not achieve high performance compared to the other auxiliary features (in the best case, F-measure $=$ 0.4). Overall, content shows high predictability in trend prediction compared to the other features. The number of tweets is shown to be more effective than day of week and the crime rate. Although predicting crime trend is challenging, the contribution of the content-based features to the prediction is considerable compared to that of the auxiliary features.

Figure 8.
Performance of different features for predicting crime rate directions.

Figure 9.
The most frequent terms distributions for the top 20 topics inferred by (a) baseline, and (b) temporal model.

Figure 10.
Topic distribution for each document based on different sizes of partition.

6.2.1 Prediction based on sentiment analysis

Unlike the previous experiments, which were conducted using content-based features, this experiment tests the predictability of sentiment features. The features are computed as explained before, and a holdout evaluation is applied to evaluate the predictions. The experiment is conducted on all the incidents for five individual sentiment variables and one combined sentiment variable. We then repeated the experiment by adding the sentiment variables to the content. The results indicated low predictability for sentiments. In the best case, negative sentiment, the F-measure reached 0.55. The sentiment analysis did not perform better than the content-based features in any of the cases.

6.3 Prediction based on temporal topics

The characteristics of topics, extracted from temporal and batch models, are discussed. We also evaluate the predictability of topics as features in the proposed prediction model.

6.3.1 Characteristics of temporal topics

The topics identified by the temporal model have been compared with the baseline, which is batch LDA without the time dimension. The comparison has been made in two phases: first, we compared how much the term distributions vary; second, we analyzed their differences at the document-topic level.

Term-Topic Distribution: Adopting the visualization method proposed by [34], Fig. 9 visualizes the top 20 terms and their distributions for each individual latent topic. The figure reveals that the topics extracted by the baseline are similar to each other, as they share more words, while in the temporal model topics tend to have fewer terms in common. As shown in Fig. 9, the vocabulary generated by the temporal model is larger than that of the baseline; therefore, more distinct topics were identified. The second characteristic of the identified topics is the topic-term distribution, which is visualized by solid dots of different sizes. It suggests that in the temporal model the term distribution varies more, which means the extracted topics are more diverse than those of the baseline.

Document-Topic Distribution: The extracted topics show different characteristics in terms of document-topic distribution. Figure 10 presents the distribution of the most popular topics in each document (each day) for the batch and the temporal model. In the batch model, the extracted topics for each day have low distributions, while one topic has a high value. On most days, the most popular topics are topics 16 to 20, with low values. This results in poor topic identification for the entire period. In the temporal model, where the number of partitions is between 2 and 20, the topics are fairly distributed over the documents. However, in the case of extreme partitioning, where the number of partitions is 20, the identified topics seem to be general.

6.3.2 Temporal topics as features

To assess the predictability of the temporal topic model, the experiments were expanded to 22 different crime types. For the baseline model, a predefined number of topics was extracted from the training corpus. In this case, any topic shift is ignored, whereas the temporal topic model takes topic shifts and the time dimension into account, as discussed before. Table 3 displays the best results for each individual crime type as well as for all crimes together. It shows that the temporal model, which detects novel topics, outperformed the baseline in 17 cases and the content-based (n-gram) features in most cases. For the most predictable crime type (Burglary), the temporal topics raised the F-measure by 0.21 over the baseline (from 0.73 to 0.94). Further analysis investigated the predictability of the proposed model for different lags.

Table 3
F-measure of the best results for different crime types

Crime type	Content	Batch model	Temporal model
All crimes	0.63	0.60	0.76
Theft	0.67	0.69	0.79
Battery	0.78	0.75	0.85
Narcotics	0.78	0.70	0.88
Criminal damage	0.74	0.67	0.78
Burglary	0.73	0.73	0.94
Assault	0.65	0.73	0.70
Other offense	0.66	0.64	0.70
Motor vehicle theft	0.61	0.65	0.60
Robbery	0.58	0.66	0.72
Deceptive practice	0.72	0.60	0.73
Criminal trespass	0.66	0.66	0.67
Weapons violation	0.58	0.72	0.67
Public peace violation	0.63	0.66	0.75
Offense involving children	0.65	0.65	0.78
Prostitution	0.77	0.70	0.79
Crime sexual assault	0.63	0.72	0.73
Sex offense	0.60	0.67	0.62
Interference with public officer	0.57	0.59	0.60
Gambling	0.64	0.66	0.54
Liquor law violation	0.62	0.66	0.66
Homicide	0.55	0.63	0.67
Arson	0.60	0.59	0.73

Table 4

Labeling approach for lag $=$ 1 and lag $=$ 2

Lag $=$ 1	Lag $=$ 2
$z_{1}\rightarrow l_{1}:sgn\|y_{2}-y_{1}\|=+1$	$z_{1}\rightarrow l_{1}:sgn\|y_{3}-y_{1}\|=+1$
$z_{2}\rightarrow l_{2}:sgn\|y_{3}-y_{2}\|=+1$	$z_{2}\rightarrow l_{2}:sgn\|y_{4}-y_{2}\|=+1$
$z_{3}\rightarrow l_{3}:sgn\|y_{4}-y_{3}\|=+1$	$z_{3}\rightarrow l_{3}:sgn\|y_{5}-y_{3}\|=-1$
$z_{4}\rightarrow l_{4}:sgn\|y_{5}-y_{4}\|=-1$	$z_{4}\rightarrow l_{4}:sgn\|y_{6}-y_{4}\|=-1$
$z_{5}\rightarrow l_{5}:sgn\|y_{6}-y_{5}\|=-1$	$z_{5}\rightarrow l_{5}:sgn\|y_{7}-y_{5}\|=-1$
$z_{6}\rightarrow l_{6}:sgn\|y_{7}-y_{6}\|=-1$	$z_{6}\rightarrow l_{6}:sgn\|y_{8}-y_{6}\|=-1$
$z_{7}\rightarrow l_{7}:sgn\|y_{8}-y_{7}\|=+1$	$z_{7}\rightarrow l_{7}:sgn\|y_{9}-y_{7}\|=+1$

Figure 11.

The labeling approach based on (a) lag $=$ 1 and (b) lag $=$ 2.

Figure 12.

Holdout evaluation results for different crime types over 7 lags.

6.3.3 Performance of prediction for different lags

The predictability of the proposed model with temporal topics was examined for different lags. A set of test scenarios was implemented for this purpose. Each document $z_{i}$ generated at time $t_{i}$ is labeled with the prospective crime trend $l_{i}$ (see Eq. (2)). The lag does not stand for a day of the week; it is a window of time over which crime rate directions are captured. As an example, if $\text{lag}=1$ , each document is labeled with the direction of the crime trend one day later. For each lag, the classifier is fed with the correspondingly generated training data separately. Figure 11 shows the crime trend of BATTERY over a period of 14 days and the generated labels (either $+$ 1 or $-$ 1) for lag $=$ 1 and lag $=$ 2. For these two lags, the documents are labeled as presented in Table 4, where $z_{1}$ is a document aggregated at time $t_{1}$ and $l_{1}$ is its assigned label. The performance of the classifier for lag $=$ 1 and lag $=$ 2 is evaluated separately. Figure 12 illustrates the results of using temporal topics for different lags up to 7 ( $d\in\{1,\ldots,7\}$ ). The intention is to find the best lag between the temporal topics and the crime trend. According to the results, the best performance is mostly captured when $d\in\{1,3\}$ compared to the other lags. However, this can vary across crime types. Overall, the results demonstrate that the proposed prediction model with temporal topics performs well compared to the other features.
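The labeling scheme of Table 4 can be sketched in Python as follows. Each document $z_{i}$ receives the sign of the difference between the crime count $d$ days later and the count on its own day. The crime counts below are made up for illustration, and labeling an unchanged count as $-1$ is an assumption, since the treatment of ties is not specified here.

```python
def label_with_lag(crime_counts, lag):
    """Label document i with the direction of y[i + lag] - y[i]."""
    labels = []
    for i in range(len(crime_counts) - lag):
        diff = crime_counts[i + lag] - crime_counts[i]
        # Assumption: "no change" is grouped with a decrease.
        labels.append(+1 if diff > 0 else -1)
    return labels


counts = [10, 12, 15, 14, 9, 11]   # hypothetical daily crime counts y_1..y_6
lag1 = label_with_lag(counts, lag=1)
lag2 = label_with_lag(counts, lag=2)
```

The last `lag` documents receive no label because their target day lies outside the observed series, so the labeled training set shrinks slightly as the lag grows.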

7. Discussion and conclusions

In this paper, a prediction model for crime rate direction is presented based on mining tweets posted from a relevant geographic area. The conclusions and findings of the paper are as follows: (i) The proposed method does not need any previously reported training data; the model annotates its own training data. In our prediction model, the labels are derived from a target signal (here, the crime index) and then assigned to the input data. (ii) Using the prediction model, crime index prediction is reduced to a binary classification. Given the input data, the classifier predicts whether the crime index will go up or down in the future. (iii) To evaluate the predictability of user-generated content in social media, no keywords or specific terms related to crime were targeted. (iv) The lexicon-based sentiment analysis offered poor or negligible predictive power. (v) In addition to raw content and sentiment, the hidden topics of posted tweets were extracted and employed as features in the classifier, and they were shown to have the highest predictability compared to content and sentiment. Due to the time-varying nature of the content, a temporal topic model based on LDA was proposed and compared with static (batch) LDA. This model infers the hidden topics using a dynamic vocabulary: the vocabulary is regenerated in different time-frames to address information evolution in topic inference. The best topics, which are selected based on diversity and novelty, are used as the predictive features in the prediction model.

We evaluated our method on crime trend prediction for Chicago; however, the model can be applied to other direction prediction problems of various natures without relying on a specific location. Despite the fluctuating crime time series, the results suggest a strong correlation between content and crime index direction. The content-based features significantly improve the prediction results compared to the auxiliary variables. Particularly when temporal topics are used, the results for most crime types suggest a strong correlation between the content of social media and the direction of crime rates.

Although we do not track any specific keywords for our prediction model, the way Twitter is sampled is important for avoiding missing data. The sparsity of users’ activities plays a crucial role: the prediction model relies on the availability of content over time, which is affected by the absence of users’ tweets. In the future, we would like to propose a sampling approach that avoids missing users’ activities as much as possible and collects Twitter data from users who are historically active. Overall, the study supports the importance of considering Twitter content as an extra data resource that does not suffer from the lack of training data. The predictability of some variables derived from Twitter content was demonstrated in this study, but further analysis on extracting other informative signals may be undertaken. We would like to analyze textual content semantically to better understand the relationship between features. In this study, we were interested in presenting the effectiveness of the content-based model, but further analysis is needed to examine the incorporation of other socio-economic indexes and geographical information that are correlated with criminal activities. The content can be applied as a valuable extra source of information along with other resources that have been shown to correlate with different incidents.

Acknowledgments

This study was supported in part by the Natural Sciences and Engineering Research Council of Canada and the Ontario Trillium Scholarship. The authors would like to thank Kenton White for providing the Twitter dataset.

Lag $= 1$	Lag $= 2$
$z_{1}\rightarrow l_{1}:\mathrm{sgn}(y_{2}-y_{1})=+1$	$z_{1}\rightarrow l_{1}:\mathrm{sgn}(y_{3}-y_{1})=+1$
$z_{2}\rightarrow l_{2}:\mathrm{sgn}(y_{3}-y_{2})=+1$	$z_{2}\rightarrow l_{2}:\mathrm{sgn}(y_{4}-y_{2})=+1$
$z_{3}\rightarrow l_{3}:\mathrm{sgn}(y_{4}-y_{3})=+1$	$z_{3}\rightarrow l_{3}:\mathrm{sgn}(y_{5}-y_{3})=-1$
$z_{4}\rightarrow l_{4}:\mathrm{sgn}(y_{5}-y_{4})=-1$	$z_{4}\rightarrow l_{4}:\mathrm{sgn}(y_{6}-y_{4})=-1$
$z_{5}\rightarrow l_{5}:\mathrm{sgn}(y_{6}-y_{5})=-1$	$z_{5}\rightarrow l_{5}:\mathrm{sgn}(y_{7}-y_{5})=-1$
$z_{6}\rightarrow l_{6}:\mathrm{sgn}(y_{7}-y_{6})=-1$	$z_{6}\rightarrow l_{6}:\mathrm{sgn}(y_{8}-y_{6})=-1$
$z_{7}\rightarrow l_{7}:\mathrm{sgn}(y_{8}-y_{7})=+1$	$z_{7}\rightarrow l_{7}:\mathrm{sgn}(y_{9}-y_{7})=+1$
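
The table above pairs the features $z_{t}$ of each time step with a label $l_{t}$ that records whether the crime index $y$ rises or falls `lag` steps later. As a minimal sketch of that labeling rule, the Python function below computes $l_{t}$ as the sign of $y_{t+\mathrm{lag}}-y_{t}$. The function name `trend_labels` and the sample series are hypothetical, chosen only so that the output reproduces the signs in the table. Labeling a zero difference as 0 is also an assumption, since the table shows only $+1$ and $-1$.

```python
from typing import List, Sequence


def trend_labels(y: Sequence[float], lag: int = 1) -> List[int]:
    """Return l_t = sgn(y_{t+lag} - y_t) for every t that has a value lag steps ahead."""
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    labels = []
    for t in range(len(y) - lag):
        diff = y[t + lag] - y[t]
        # Assumption: a zero difference is labelled 0 (the table shows only +1 and -1).
        labels.append((diff > 0) - (diff < 0))
    return labels


# Hypothetical crime index values y_1..y_9, not taken from the study's data.
y = [10, 12, 14, 16, 13, 11, 9, 10, 12]

lag1 = trend_labels(y[:8], lag=1)  # labels l_1..l_7 from y_1..y_8
lag2 = trend_labels(y, lag=2)      # labels l_1..l_7 from y_1..y_9
print(lag1)
print(lag2)
```

A larger lag shortens the usable series by the same number of steps, which is why the lag-2 column needs $y_{9}$ to label $z_{7}$.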

1 http://en.wikipedia.org/wiki/Chicago.

3 City of Chicago Data Portal: https://data.cityofchicago.org.

4 Economic Research Federal: http://research.stlouisfed.org.

Table 1 The list of notations employed in this section

Table 2 The prediction performance based on different aggregation windows ( q )

Table 3 F-measure of the best results for different crime types
