Crime rate detection using social media of different crime locations and Twitter part-of-speech tagger with Brown clustering

Abstract

Crime rates are rising sharply in many countries. Governments and social organizations therefore urgently need lasting solutions and deterrent penalties to prevent crime. Social media can play an important role in crime rate detection and, in turn, in reducing crime. In this paper, we analyze Twitter data collected from Twitter accounts for seven locations in India (Ghaziabad, Chennai, Bangaluru, Chandigarh, Jammu, Gujarat, and Hyderabad) from January 2014 to November 2018; India is chosen as a case study to illustrate the efficiency of the proposed work. Sentiment analysis is used to analyze users’ behavior and psychology through their tweets in order to track criminal activity. A Twitter part-of-speech tagger, which is a first-order maximum entropy Markov model, is used for part-of-speech tagging of online conversational text. Brown clustering is applied to a large set of unlabeled tweets. The results are compared with real crime rates for the same locations obtained from an authorized source of information. We also measure the latest crime trends for the highest-crime city (Ghaziabad, Uttar Pradesh) and the lowest-crime city (Jammu) in India over a period of 7 days (23 January 2019 to 30 January 2019). The analyses show that the obtained results match the real crime rate data. We believe that studies of this type will help to detect real-time crime rates for different locations and to identify crime patterns easily.

Keywords

Crime detection, Twitter data, social media, sentiment analysis, Brown clustering

1 Introduction

Criminal activity increases year after year in many countries, and strong measures are required to prevent it. Crime rate detection plays a vital role in monitoring criminal activity and improving public safety. Social media are useful for detecting crime rates at different locations in a country and can therefore help to reduce crime. Social media are not only a communication tool but also a source of information [1, 2]. With more than 300 million users, Twitter is a good candidate for data analysis [3]. Users share their feelings, ideas, emotions, and anger on this platform. However, it is not easy to extract information from Twitter for crime detection, because tweets express the intentions of users and may consist of symbols or other formats. For these reasons, each tweet should be analyzed carefully [4].

Researchers have used social media data to detect or predict crime from the vicinity of past crimes [5], to detect crime by statistical topic modelling across a city in the US [4], to identify the profiles of high-ranking criminals [6], to track the attractiveness of actors [7], to predict election results [8] and to detect high-repute profiles [9]. In [10], several large online social networks were analyzed. In [11], a self-exciting point process for modelling crime was presented. In [12], a hybrid model was proposed for understanding actual crime data from Virginia. Goswami et al. [13] surveyed event detection on social media, which is helpful for detecting natural disasters. Bosque et al. [14] presented several techniques for predicting rude comments on social media. Egele et al. [9] explored how similar techniques can be used to recognize compromises of high-repute profiles that show consistent behavior. Social media have also been used in surveys of emotion frameworks and emotion detection [15] and of the issues raised by short text [16]. In [17], events in tweets were correlated using singular value decomposition (SVD). Review analyses have examined the challenges of collecting data from social media and some advanced tools that address these challenges [4, 18–22].

Although the related works used social media to detect the crime rate, they ignored data collected from authorized security agencies, which may affect tracking performance. This motivates the present research, whose objective is to detect the crime rate at different locations using data collected from social media.

We have three main goals: (I) extract information from Twitter by recording users’ tweets; (II) obtain location-based real crime data from security agencies; (III) carry out a comparative analysis between the Twitter data and the real crime data.

A few sensitive keywords directly related to crime, such as Criminal, Rape, Murder, and Attack, are used to collect data from the social media site. From a large number of tweets, 3801 are retained and used as the dataset for the study. The keyword-based filtering is applied to tweets from the 4-year period. Data cleaning is then performed to eliminate non-alphanumeric characters [20]. The rate of cognizable crimes is also collected from the security agencies, and a comparison between the datasheet generated from Twitter data and the real crime data received from security units is a significant part of our study.

Here, Twitter data means the number of tweets recorded from Twitter using keywords such as ‘Murder’, ‘Crime’, ‘Encounter’, ‘Hit and Run’, ‘Rape’ and ‘Fight’ for seven selected locations in India (Ghaziabad, Chennai, Bangaluru, Chandigarh, Jammu, Gujarat, and Hyderabad) from January 2014 to November 2018. In this way, 3801 tweets were collected and saved to the database. The real crime data for the selected locations were collected from the security agencies (i.e. National Crime Records Bureau, http://ncrb.gov.in/) and are used to check the accuracy of the present research.

Many researchers work in different domains of social networks, and many models have been proposed for predicting the rank of users from their social media accounts [32]. Centrality measures and network topology have been investigated [33]. Approaches such as random-walk-based or diffusion-based algorithms have been established for ranking social media nodes [34].

To measure the influence of users on Twitter, Leavit et al. [35] used four different features, including retweets, replies and mentions. Cha et al. [36], on the other hand, discussed types of influence on Twitter such as retweet influence, indegree influence and user mentions; their approach was computed for 6 million users. Belief function theory in weighted networks has also been discussed for assessing the influence of users [37]; this was the first use of belief functions for assessing the influence of users on the Twitter network.

In this research, we perform the following steps for crime rate detection using social media.

- Collection of Tweets: The collection of data from Twitter accounts has been done from January 2014 to November 2018 by scrolling the search outputs for different accounts of seven locations (Ghaziabad, Chennai, Bangaluru, Chandigarh, Jammu, Gujarat, and Hyderabad) of India.

- Real crime data collection: The real crime data for the selected locations were collected from the security agencies (National Crime Records Bureau, http://ncrb.gov.in/) for 2014, 2015 and 2016, and are used to check the accuracy of the present research.

- Data cleaning: Tweets express the sentiments or intentions of people [31]. Users sometimes type non-alphanumeric characters in their tweets, and these characters need to be removed. The steps of the data cleaning process are explained in Section 2.2.7.

- Check system accuracy: A comparative analysis is performed between the recorded and real crime data. The final analyses establish that the results obtained from the collected tweets approximately overlap with the real crime data gathered from security forces. The rank-based analysis shows that the three highest-crime cities and the lowest-crime city have the same rank in the real crime data and in the Twitter data.

This paper is organized into four sections, of which the first is the introduction. Section 2 discusses materials and methods, including the data collection and data cleaning procedures. Section 3 presents the highest and lowest crime rates obtained from Twitter data and from real crime data, the rank-wise comparative analysis, and the latest crime trends. Section 3 also summarizes a critical evaluation of related works in Table 8. Conclusions and future work are given in Section 4.

2 Materials and methods

2.1 Data collection

2.1.1 Twitter data collection

Data were collected from Twitter accounts from January 2014 to November 2018 by scrolling through the search results for different accounts in seven locations of India (Ghaziabad, Chennai, Bangaluru, Chandigarh, Jammu, Gujarat, and Hyderabad). The latitude and longitude of these locations, extracted from Google Maps, are given in Table 1. In total, 3801 tweets were collected and saved to the database. We search social media posts for the selected keywords and record the matching tweets; this is a word tagging operation. Because the identification of keywords is central to our analysis, we choose words that are directly related to crime. The following keywords are selected for the analysis: ‘Murder’, ‘Crime’, ‘Encounter’, ‘Hit and Run’, ‘Rape’, and ‘Fight’. For the first location (Ghaziabad), 999 tweets are retained, and “fight” is the keyword shared most often by users. In the second location (Chennai), the keyword “crime” is the most widely used. The number of tweets collected for the seven cities is shown in Table 2.

Table 1
Seven different crime locations in India

Location	Latitude and Longitude
Ghaziabad, India	28.6692° N, 77.4538° E
Chennai, India	13.0827° N, 80.2707° E
Bangaluru, India	12.9716° N, 77.5946° E
Chandigarh, India	30.7333° N, 76.7794° E
Jammu, India	32.7266° N, 74.8570° E
Gujarat, India	22.2587° N, 71.1924° E
Hyderabad, India	17.3850° N, 78.4867° E

Table 2

Numbers of tweets collected for the seven crime cities

Region	Location	Murder	Crime	Encounter	Hit and run	Rape	Fight
1	Ghaziabad	117	228	121	43	222	268
2	Chennai	137	207	58	10	115	190
3	Bangaluru	170	112	43	32	111	158
4	Chandigarh	90	52	27	3	99	87
5	Jammu	41	43	78	1	44	99
6	Gujarat	62	50	37	8	91	101
7	Hyderabad	52	98	48	12	80	156
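
The keyword matching described in this subsection can be sketched as follows. This is only an illustrative sketch, not the collection procedure actually used: the sample tweets are partly abridged from the examples quoted in Section 2.2.1 and partly hypothetical, and the helper names `match_keywords` and `count_by_keyword` are assumptions introduced here.

```python
import re
from collections import Counter

# Keywords listed in Section 2.1.1.
KEYWORDS = ["murder", "crime", "encounter", "hit and run", "rape", "fight"]

# One case-insensitive pattern per keyword; the \b word boundaries keep
# "rape" from matching inside an unrelated word such as "grapes".
PATTERNS = {
    kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
    for kw in KEYWORDS
}


def match_keywords(tweet):
    """Return the crime keywords that occur in a tweet."""
    return [kw for kw, pattern in PATTERNS.items() if pattern.search(tweet)]


def count_by_keyword(tweets):
    """Count how many tweets mention each keyword (a tweet may count twice)."""
    counts = Counter({kw: 0 for kw in KEYWORDS})
    for tweet in tweets:
        counts.update(match_keywords(tweet))
    return counts


# Illustrative sample only; not part of the collected dataset.
sample = [
    "Encounter breaks out between terrorists and security forces",
    "2 accused were arrested in connection with gang rape",
    "Hit and run case reported near the bus terminal",
    "Fresh grapes on sale today",
]
counts = count_by_keyword(sample)
```

Applied per location, counts of this kind would give one row of a table shaped like Table 2.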

These seven locations were chosen because they lie in the Indian states most popular with international and domestic tourists, and tourist security is a major concern for the Indian Government. According to the National Crime Records Bureau, five of the seven locations (Ghaziabad, Chennai, Bangaluru, Gujarat, and Hyderabad) are in the states with the highest crime rates in India.

The Twitter API is also part of the study. To verify the proposed approach, we measure the latest crime trends for the highest-crime city (Ghaziabad, Uttar Pradesh) and the lowest-crime city (Jammu) in India. These trends were recorded over 7 days (23 January 2019 to 30 January 2019) using TAGS v6.1.

2.1.2 Real crime data collection

Maintaining crime records is important for detecting and preventing crime in any country. Over the years, the Indian police have adopted many new approaches to improve the efficiency of the crime records system. In 1985, the Indian Government set up a task force on maintaining crime records for different locations in India and, following its recommendations, constituted the National Crime Records Bureau (NCRB).

NCRB records provide the real crime data for the selected locations, which are used to check the accuracy of the present research. Real crime data for the seven cities for 2014, 2015 and 2016 are shown in Table 3. The combined dataset of collected tweets and NCRB crime totals is shown in Table 4.

Table 3
Real crime data collected for seven crime cities

Location	Year 2014	Year 2015	Year 2016
Ghaziabad	240475	241920	282171
Chennai	193200	187558	179896
Bangaluru	137338	138847	148402
Chandigarh	37162	37983	40007
Jammu	23848	23583	24501
Gujarat	131385	126935	147122
Hyderabad	106830	106282	108991

Table 4

Combined dataset of tweets collected and NCRB crimes

Location	Total tweets collected	Total crime according to NCRB
Ghaziabad	999	764566
Chennai	717	560654
Bangaluru	626	424587
Chandigarh	358	115152
Jammu	306	71932
Gujarat	349	405442
Hyderabad	446	322103
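
As a consistency check, the totals in Table 4 can be recomputed from Tables 2 and 3. The short sketch below simply sums the published per-keyword and per-year figures; the variable names are assumptions introduced for illustration.

```python
# Per-keyword tweet counts from Table 2
# (order: Murder, Crime, Encounter, Hit and run, Rape, Fight).
TWEETS = {
    "Ghaziabad": [117, 228, 121, 43, 222, 268],
    "Chennai": [137, 207, 58, 10, 115, 190],
    "Bangaluru": [170, 112, 43, 32, 111, 158],
    "Chandigarh": [90, 52, 27, 3, 99, 87],
    "Jammu": [41, 43, 78, 1, 44, 99],
    "Gujarat": [62, 50, 37, 8, 91, 101],
    "Hyderabad": [52, 98, 48, 12, 80, 156],
}

# Real crime counts from Table 3 (order: 2014, 2015, 2016).
NCRB = {
    "Ghaziabad": [240475, 241920, 282171],
    "Chennai": [193200, 187558, 179896],
    "Bangaluru": [137338, 138847, 148402],
    "Chandigarh": [37162, 37983, 40007],
    "Jammu": [23848, 23583, 24501],
    "Gujarat": [131385, 126935, 147122],
    "Hyderabad": [106830, 106282, 108991],
}

# (total tweets, total NCRB crimes) per city, as laid out in Table 4.
combined = {city: (sum(TWEETS[city]), sum(NCRB[city])) for city in TWEETS}
total_tweets = sum(tweets for tweets, _ in combined.values())
```

The sums agree with every row of Table 4, and the tweet totals add up to the 3801 tweets reported for the dataset.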

2.2 Proposed crime detection model

Figure 1 presents the proposed model, which combines three steps.

Fig. 1

Proposed crime detection model.

In the first step, tweets shared by users are collected by searching for keywords. This yields a large number of tweets, so manual filtering is applied to extract the useful ones. Approximately four thousand tweets are recorded after filtering.

In the second step, the real crime data for the selected locations, covering three years, are collected from the security agencies.

In the third step, a comparative analysis is performed between the recorded and real crime data. In what follows, we describe the whole model in detail.

2.2.1 Data labelling

Data labelling consists of reading the tweets carefully and extracting the required information. Data shared on Twitter have been used to predict election results, to forecast the weather at a location, to track high-profile accounts, and to predict the publicity of celebrities. Here, we perform this operation to detect the crime rate. A few recorded tweets that report criminal activity at different locations are given below as examples:

“Jammu and Kashmir: Encounter breaks out between terrorists and security forces in#shopian”

The post shows that an encounter is going on between the terrorists and security forces in Jammu and Kashmir.

“2 accused were arrested today in connection with gang rape of a Russian national on 26th Nov in Manali. They were produced before Court.”

In the above post, a Twitter user shares information about the rape of a Russian national in Manali.

2.2.2 Data tagging

Part-of-speech tagging is a major problem for online conversational text. Here, we use large-scale unsupervised word clustering and new lexical features to make our analysis more accurate. In [20], the authors released a new version of datasets that helps to solve the part-of-speech tagging problem for online conversational text. The clusters and annotation guidelines are available at http://www.ark.cs.cmu.edu/TweetNLP. With these features, state-of-the-art tagging results can be achieved on Twitter tasks [20]. Tweets express the sentiments or intentions of people, and users sometimes type non-alphanumeric characters that need to be removed. To address this issue, we use the Twitter part-of-speech tagger proposed by Olutobi et al. [23].

2.2.3 Tagger

The tagging model is a first-order maximum entropy Markov model, for which decoding and training are extremely efficient [24]. The probability of the tag y_t is conditioned on the input sequence x and the previous tag y_{t−1}: $p(y_{t} = k \mid y_{t-1}, x, t; β) \propto \exp\left(β_{y_{t-1}, k}^{(trans)} + \sum_{j} β_{j, k}^{(obs)} f_{j}(x, t)\right)$ (1)

Transition features are used for every pair of labels, and base observation features are extracted from token t and its neighboring tokens. The Viterbi algorithm, which runs in O(|x| K²) time, is used for prediction [25]; here, K is the number of tags. The model also permits left-to-right greedy decoding: for t = 1, …, |x|, $\hat{y}_{t} \leftarrow \arg\max_{k} p(y_{t} = k \mid \hat{y}_{t-1}, x, t; β)$ (2)
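
Equations (1) and (2) can be illustrated with a toy example. The sketch below is not the tagger used in this work: the two-tag set, the weights in `TRANS` and `OBS`, and the feature function are all hypothetical values chosen for illustration. It normalizes the scores of Equation (1) into a distribution and then applies the left-to-right update of Equation (2) directly, so it is a greedy decoder rather than a Viterbi search.

```python
import math

TAGS = ["N", "V"]  # hypothetical two-tag set (noun, verb)

# Hypothetical transition weights beta^(trans)[previous tag][current tag];
# "<s>" is an assumed start-of-tweet symbol.
TRANS = {
    "<s>": {"N": 0.5, "V": -0.5},
    "N": {"N": -0.2, "V": 0.8},
    "V": {"N": 0.6, "V": -0.6},
}

# Hypothetical observation weights beta^(obs)[feature][tag].
OBS = {
    "word=police": {"N": 2.0, "V": -1.0},
    "word=arrest": {"N": -0.5, "V": 1.5},
    "suffix=s": {"N": 0.7, "V": 0.1},
}


def features(tokens, t):
    """Binary observation features f_j(x, t) that fire for token t."""
    word = tokens[t].lower()
    active = ["word=" + word]
    if word.endswith("s"):
        active.append("suffix=s")
    return active


def tag_distribution(prev_tag, tokens, t):
    """Equation (1), normalized: p(y_t = k | y_{t-1}, x, t; beta)."""
    scores = {}
    for k in TAGS:
        score = TRANS[prev_tag][k]
        score += sum(OBS.get(f, {}).get(k, 0.0) for f in features(tokens, t))
        scores[k] = math.exp(score)
    z = sum(scores.values())
    return {k: v / z for k, v in scores.items()}


def greedy_decode(tokens):
    """Equation (2): pick the most probable tag left to right."""
    prev, tags = "<s>", []
    for t in range(len(tokens)):
        dist = tag_distribution(prev, tokens, t)
        prev = max(dist, key=dist.get)
        tags.append(prev)
    return tags


tags = greedy_decode(["police", "arrest", "suspects"])
```

With these toy weights the decoder returns the tags N, V, N for the three tokens.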

2.2.4 Regularization and training

In training, the log-likelihood of a tagged tweet (x, y) is the sum over the observed token tags y_t, each conditioned on the observed previous tag and the tweet being tagged: $ℓ(x, y, β) = \sum_{t = 1}^{|x|} \log p(y_{t} \mid y_{t-1}, x, t; β)$ (3)

The parameters β are optimized with an L₁-capable variant of L-BFGS [26, 27] to minimize the regularized objective: $\arg\min_{β} \; -\frac{1}{N} \sum_{〈x, y〉} ℓ(x, y, β) + R(β)$ (4)

Here, N is the number of tokens and the sum ranges over all tagged tweets 〈x, y〉. The elastic net regularization of [28] is a combination of L₁ and L₂ penalties, where j indexes the features: $R(β) = λ_{1} \sum_{j} |β_{j}| + \frac{1}{2} λ_{2} \sum_{j} β_{j}^{2}$ (5)

Finally, a small L₁ penalty should be used to eliminate noisy or irrelevant features.
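
The regularized objective of Equations (4) and (5) can be evaluated numerically as in the sketch below. The function names and all numeric values (weights, log-likelihoods, λ₁, λ₂) are hypothetical and serve only to show how the terms combine; the optimization itself is not reproduced here.

```python
def elastic_net(beta, lam1, lam2):
    """Equation (5): lam1 * sum|beta_j| + 0.5 * lam2 * sum beta_j^2."""
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam1 * l1 + 0.5 * lam2 * l2


def objective(log_likelihoods, n_tokens, beta, lam1, lam2):
    """Equation (4): negative mean log-likelihood plus the penalty R(beta).

    log_likelihoods holds one value of Equation (3) per tagged tweet and
    n_tokens is the total number of tokens N.
    """
    return -sum(log_likelihoods) / n_tokens + elastic_net(beta, lam1, lam2)


# Hypothetical values for illustration only.
beta = [1.0, -2.0, 0.0]
penalty = elastic_net(beta, lam1=0.5, lam2=0.1)
value = objective([-1.2, -0.8], n_tokens=4, beta=beta, lam1=0.5, lam2=0.1)
```

For these values the penalty is 0.5 · 3 + 0.05 · 5 = 1.75 and the objective is 0.5 + 1.75 = 2.25. A weight that is exactly zero, such as the third entry of `beta`, contributes nothing to either penalty term, which is why an L₁ term that drives weights to zero removes features from the model.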

2.2.5 Data sampling distribution

A linear transformation Y = BX maps X ⟶ Y by means of a non-singular matrix B.

The Jacobian of the transformation, $\frac{D(Y)}{D(X)}$, taken with positive sign, is |B|. Therefore, the differential elements are connected by the relation ${dy}_{1} {dy}_{2} \cdots {dy}_{n} = |B| \, {dx}_{1} {dx}_{2} \cdots {dx}_{n}$ (6) that is, dY = |B| dX. If B is an orthogonal matrix, then |B| = 1.

Depending on the type of orthogonal matrix, the transformation takes one of two forms.

Condition 1. If B is a fully orthogonal matrix, then X’X ⟶ Y’Y and ${(X - μ)}^{'} (X - μ) \to {(Y - ξ)}^{'} (Y - ξ)$ (7) where μ denotes the mean value and ξ = Bμ. This transformation preserves distances.

Condition 2. If B is a partitioned matrix in which each block B_i is n_i × n and ∑n_i = n, then $B = \begin{pmatrix} B_{1} \\ \vdots \\ B_{k} \end{pmatrix}$ (8) with $B_{i} B_{j}^{'} = 0$ for $i \neq j$.

That is, the sub-matrices are orthogonal to each other, but each one is not orthogonal by itself. According to Equation (8), the transformation can be rewritten as $Y_{1} = B_{1} X, \ldots, Y_{k} = B_{k} X$ (9) where Y₁, …, Y_k are exclusive subsets of the new variables. The quadratic forms then transform as $\begin{matrix} X^{'} X \to Y_{1}^{'} C_{1} Y_{1} + \cdots + Y_{k}^{'} C_{k} Y_{k} \\ {(X - μ)}^{'} (X - μ) \to {(Y_{1} - ξ_{1})}^{'} C_{1} (Y_{1} - ξ_{1}) + \cdots + {(Y_{k} - ξ_{k})}^{'} C_{k} (Y_{k} - ξ_{k}) \end{matrix}$ (10) where $C_{i} = {(B_{i} B_{i}^{'})}^{-1}$ and ξ_i = B_iμ.

Equation (10) shows that the quadratic forms split into k parts, one for each exclusive subset of the new variables. When B is fully orthogonal, so that each row is orthogonal to every other row, the splitting is complete.

A quadratic form Q in n variables x₁, x₂, …, x_n is a homogeneous quadratic function of the variables, $Q = \sum_{i = 1}^{n} \sum_{j = 1}^{n} a_{ij} x_{i} x_{j} = X^{'} AX$ (11) where X is the column vector of variables. In our problem, the sample data to be clustered form a set of n-dimensional points, Q = {x₁, …, x_j, …, x_k}. Each datum x_j (1 ≤ j ≤ k) is an n-dimensional point:

$x_{j} = (x_{j1}, \ldots, x_{ji}, \ldots, x_{jn}), \quad 1 \leq i \leq n.$

Assume that D is a hypercube and that all data points x_j (1 ≤ j ≤ k) belong to D. Therefore, $D = \prod_{1 \leq i \leq n} [s_{i}, t_{i}], \quad s_{i} \leq \min_{1 \leq j \leq k} (x_{ji}), \quad t_{i} > \max_{1 \leq j \leq k} (x_{ji})$ (12)

Each interval [s_i, t_i] is split into p_i sub-intervals of length l_i. The partition of each [s_i, t_i] is defined as $[s_{i}, t_{i}] = \bigcup_{0 \leq m < p_{i}} [s_{i} + m l_{i}, s_{i} + (m + 1) l_{i}]$ (13) where $l_{i} = \frac{t_{i} - s_{i}}{p_{i}}$. The partition P of D therefore contains $p = \prod_{1 \leq i \leq n} p_{i}$ slices.

These slices cover the domain D. Each slice P_r (1 ≤ r ≤ p) is defined as $P_{r} = \prod_{1 \leq i \leq n} [α_{r, i}, α_{r, i} + l_{i}]$ (14) where α_{r,i} = s_i + p_{r,i} l_i and 0 ≤ p_{r,i} < p_i.

The centroid of each slice P_r is an n-dimensional data point x_r: $x_{r} = (x_{r, 1}, \ldots, x_{r, i}, \ldots, x_{r, n}), \quad x_{r, i} = α_{r, i} + \frac{l_{i}}{2}, \quad 1 \leq i \leq n$ (15)

The centroids x_r (1 ≤ r ≤ p) of the partition P define a set W that samples the data space in a discrete fashion. Every initial data point x_j (1 ≤ j ≤ k) belongs to one slice of the domain D and is associated with the centroid of that slice. Thus, each datum is associated with one point of W.
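
The grid sampling described above can be sketched in a few lines. The function below is an illustrative assumption, not the authors' implementation: it takes the bounds s_i and t_i and the number of sub-intervals p_i per dimension, uses l_i as the interval length divided by p_i, and returns the centroid of the slice that contains a given point, following Equations (14) and (15). The 2-D data are hypothetical.

```python
import math


def slice_centroid(point, lower, upper, parts):
    """Map an n-dimensional point to the centroid of its grid slice.

    lower[i] and upper[i] play the role of s_i and t_i, parts[i] of p_i.
    """
    centroid = []
    for x, s, t, p in zip(point, lower, upper, parts):
        length = (t - s) / p                          # l_i
        index = min(int(math.floor((x - s) / length)), p - 1)  # p_{r,i}
        alpha = s + index * length                    # alpha_{r,i}
        centroid.append(alpha + length / 2)           # Equation (15)
    return tuple(centroid)


# Hypothetical 2-D data inside the box [0, 10) x [0, 4).
points = [(1.0, 0.5), (1.9, 0.9), (7.2, 3.1)]
sampled = {slice_centroid(p, (0, 0), (10, 4), (5, 4)) for p in points}
```

The first two points fall into the same slice and share the centroid (1.0, 0.5), while the third maps to (7.0, 3.5), so three data points are represented by two points of W.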

2.2.6 Clustering

Clustering is performed with Brown clustering [29] on a large set of unlabeled tweets. The algorithm divides words into a set of 1000 clusters. According to the study in [30], Brown clustering yields effective features and performs considerably better than some older models. The algorithm also scales easily to large amounts of data. For example:

A1: Nt, n0t, nottttt, _not_
A2: //u, //you, (you, iyou
A3: wh0, ehowhodeho

Here, each row is a single cluster containing a few challenging spelling variants found in tweets. Cluster A1 represents the word “not”, cluster A2 the word “you”, and cluster A3 the word “who”. With this approach, challenging words can easily be normalized and used in our analysis. Users also sometimes write shortened forms such as “prbly” for “probably”; such forms can be decoded with the same clustering approach.

Table 5 shows examples of cluster paths together with their most frequent words. Cluster B1 groups variants of the word “always”, B2 variants of “gonna”, B3 variants of “who”, and B4 contractions such as “I’ve”. This approach is very useful for sorting the challenging words and creating a data set for our analysis.

Table 5
Clusters for twitter words

Cluster	Cluster path	Words (most frequent)
B1	001011111110	always alwayz alway allways inevitably alwys alwayss
B2	0011000	gonna gunna gona gna guna gnna ganna qonna gonnna gana qunna gonne goona gonnaa
B3	000110	who who’ve whu wh0 -who #ilikepeoplewho<URL-real.com>whod who’ve who’d whotf eho
B4	0001111	i’ve you’ve ive we’ve they’ve i’ve youve you’ve u’ve we’ve uve
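
One way to use cluster paths such as those in Table 5 is as a lookup table. The sketch below is a hedged illustration: the paths and member words are copied from Table 5, but choosing the first listed word as the canonical form and using path prefixes of length 2, 4 and 6 as features are assumptions made here, not choices stated in this paper.

```python
# Cluster paths and a few member words taken from Table 5.
CLUSTER_PATHS = {
    "always": "001011111110", "alwayz": "001011111110", "alwys": "001011111110",
    "gonna": "0011000", "gunna": "0011000", "gna": "0011000",
    "who": "000110", "wh0": "000110", "whu": "000110",
}

# Assumption: the first word listed for a cluster is its canonical form.
CANONICAL = {"001011111110": "always", "0011000": "gonna", "000110": "who"}


def normalize(token):
    """Replace a spelling variant by the canonical word of its cluster."""
    path = CLUSTER_PATHS.get(token.lower())
    return CANONICAL.get(path, token)


def prefix_features(token, lengths=(2, 4, 6)):
    """Bit-string prefixes of the cluster path, usable as tagger features."""
    path = CLUSTER_PATHS.get(token.lower())
    if path is None:
        return []
    return [path[:n] for n in lengths if n <= len(path)]


normalized = [normalize(w) for w in ["alwayz", "gunna", "wh0", "police"]]
```

Words outside the table, such as "police" in this example, are returned unchanged. Because the paths come from a binary hierarchy, words that share a longer prefix sit closer together in the cluster tree.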

2.2.7 Data cleaning

Tweets express the sentiments or intentions of people [31]. Users sometimes type non-alphanumeric characters in their tweets, and these characters need to be removed. The data cleaning process consists of the following steps.

Step 1. Convert tweets to lowercase letters.

We convert the tweets to lowercase letters and then store them in our database. Some examples of these tweets are shown in Table 6.

Table 6
Some Tweet examples

Tweets	Data cleaning
#Dadri Akhlaq murder case: 9th accused arrested from Delhi ISBT.	Dadri akhlaq murder case 9th accused arrested from Delhi Interstate bus terminal.
Rape& Murder HinduGirl In Church land (India) then Cry”Our sentiments R hurt”!Christian Hypocrisy at it’s very best !	Rape & murderhindu girl in church land India, then cry our sentiments r hurt christian hypocrisy at very best

Step 2. Remove stop words.

Stop words (such as so, is, its, etc.) are removed from the recorded tweets. Some examples of these tweets are shown in Table 6.

Step 3. Expand abbreviations.

Many social media users write abbreviations in their tweets instead of the full terms. These abbreviations have to be expanded during data cleaning. Some examples of these tweets are shown in Table 6.

Step 4. Apply the clustering method.

Clustering is necessary to clean the data recorded from social media. The clustering method is explained in Section 2.2.6, and some examples are shown in Table 5.

Step 5. Remove emojis and special characters.

Emojis and special characters are removed from the recorded tweets. Some examples of these tweets are shown in Table 6.

This conversion process yields an authentic and accurate data sheet. With the same methodology, data can be collected for different locations and used for a comparative analysis with the real crime data for those locations.
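
Steps 1, 2, 3 and 5 of the cleaning procedure can be sketched as a small pipeline. This is a minimal sketch under stated assumptions: the stop-word list is a short hypothetical sample, the abbreviation table contains only the expansion of "ISBT" seen in Table 6, and Step 4 (clustering) is left out because it depends on the cluster table of Section 2.2.6.

```python
import re

# Hypothetical short lists; a real run would need fuller ones.
STOP_WORDS = {"so", "is", "its", "the", "a", "an"}
ABBREVIATIONS = {"isbt": "interstate bus terminal"}


def clean_tweet(tweet):
    """Lowercase, strip symbols, expand abbreviations, drop stop words."""
    text = tweet.lower()                          # Step 1
    text = re.sub(r"[^a-z0-9\s]", " ", text)      # Step 5: emojis, '#', punctuation
    expanded = []
    for token in text.split():                    # Step 3
        expanded.extend(ABBREVIATIONS.get(token, token).split())
    return " ".join(t for t in expanded if t not in STOP_WORDS)  # Step 2


cleaned = clean_tweet(
    "#Dadri Akhlaq murder case: 9th accused arrested from Delhi ISBT."
)
```

For the first tweet of Table 6 this returns "dadri akhlaq murder case 9th accused arrested from delhi interstate bus terminal", which matches the cleaned form in the table apart from capitalization.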

3 Results

The numbers of tweets collected from the seven cities are shown in Fig. 2, where the horizontal axis represents the seven cities in India and the vertical axis gives the number of collected tweets. We also identify the cities with the highest and lowest crime rates according to the number of collected tweets; these are labeled in Fig. 3. According to our analysis, Ghaziabad has the highest crime intensity, as expected, because it has a large population of around 2.7736 million and is the second-largest industrial city in Uttar Pradesh, India (http://indiapopulation2018.in). In contrast, Jammu has the lowest crime rate; its population is around 783,317 and its density is 596 persons per square kilometer. Our observations suggest that the likelihood of crime increases with the population of a city, as expected.

Fig. 2

Bar chart representation for the tweets collected.

Fig. 3

Highest and lowest crime rates representation using Twitter data.

The graphs in Figs. 4 and 5 show the real crime rate for the seven selected locations. In Fig. 4, the horizontal axis represents the seven cities in India and the vertical axis shows the real crime data obtained from security units. We again identify the cities with the highest and lowest crime rates. According to our analysis of the real data, Fig. 5 shows that Ghaziabad has the highest crime intensity and Jammu the lowest. The final analyses establish that the results obtained from the collected tweets approximately overlap with the real crime data gathered from security forces. Further comparative analyses and discussions are given in the following subsection.

Fig. 4

Highest and lowest real crime rates representation.

Fig. 5

Highest and lowest crime rates representation using real crime data.

In what follows, we compare the datasheet generated from Twitter data with the real crime data received from security units. The objectives of this section are to verify the crime rate calculated from Twitter data against the real crime values and to demonstrate that social media can be a good option for detecting the crime rate of a city.

A comparison of Figs. 3 and 5 shows that the highest-crime and lowest-crime cities are the same in both data sources. As shown in Fig. 3, Ghaziabad is the highest-crime city and Jammu the lowest-crime city among the locations studied. This result is significant for security operations, because verified knowledge of the highest-crime city in a country is important for preventing and anticipating crime. Verified social media analyses of the kind obtained in this paper can also help to predict future crime rates in a given city so that measures can be taken in advance. In addition, our study can provide security units with verified statistical outcomes about social media applications such as Twitter, and can provide cybercrime units with verified records for monitoring and detecting cybercriminals. Tagged social media accounts and people can be identified by the method presented in this paper.

3.1 Rank wise comparative analysis

This part assigns a rank to each city according to its crime intensity, which provides a reliable basis for analysis. The rank-based comparison between Twitter data and real crime data is presented in Table 7. With this comparison, we can verify whether our proposed framework works correctly.

Table 7
Comparison analysis based on rank

Location	Rank (Twitter data)	Rank (Real crime rate)
Ghaziabad	1	1
Chennai	2	2
Bangaluru	3	3
Chandigarh	5	6
Jammu	7	7
Gujarat	6	4
Hyderabad	4	5

The rank-based analysis shows that the three highest-crime cities and the lowest-crime city have the same rank in the real crime data and in the Twitter data. On the other hand, the ranks of Chandigarh, Gujarat and Hyderabad differ between the two. For instance, Chandigarh ranks 5th according to the Twitter data but 6th in the real crime data. This discrepancy is acceptable and can be reduced by improving the filters used in the tweet filtering process, increasing the number of collected tweets, enhancing the data analysis methods, improving the expression detection process, using natural language processing methods, applying error correction based on a comparison of the collected data with real data, using further statistical analysis methods, and using new-generation machine learning and deep learning approaches. Here, the most significant task is to detect the cities with the highest and lowest crime intensity.

Thus, we obtain approximately 70% accuracy in detection. The most significant point is detecting the cities that have the highest and lowest crime rates; however, we aim to reach 100% accuracy by improving our filtering process and software.
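
The ranks in Table 7 follow directly from the totals in Table 4, as the sketch below shows. It also computes Spearman's rank correlation between the two rankings. That coefficient is not used in this paper and is not the accuracy measure quoted above; it is offered here only as one possible summary of the agreement.

```python
# Totals from Table 4: (tweets collected, crimes according to NCRB).
TOTALS = {
    "Ghaziabad": (999, 764566),
    "Chennai": (717, 560654),
    "Bangaluru": (626, 424587),
    "Chandigarh": (358, 115152),
    "Jammu": (306, 71932),
    "Gujarat": (349, 405442),
    "Hyderabad": (446, 322103),
}


def ranks(values):
    """Rank cities from highest (1) to lowest; assumes there are no ties."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {city: position for position, city in enumerate(ordered, start=1)}


twitter_rank = ranks({city: v[0] for city, v in TOTALS.items()})
ncrb_rank = ranks({city: v[1] for city, v in TOTALS.items()})

# Spearman's rho for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
n = len(TOTALS)
d_squared = sum((twitter_rank[c] - ncrb_rank[c]) ** 2 for c in TOTALS)
spearman = 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Cities whose rank is identical in both sources.
same_rank = [c for c in TOTALS if twitter_rank[c] == ncrb_rank[c]]
```

With the Table 4 totals this reproduces every rank in Table 7. The squared rank differences sum to 6, which gives a Spearman coefficient of about 0.89, and four cities (Ghaziabad, Chennai, Bangaluru and Jammu) have identical ranks in both sources.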

3.2 Latest crime trends

To verify the proposed approach, we also measure the latest crime trends for the highest-crime city (Ghaziabad, Uttar Pradesh) and the lowest-crime city (Jammu) in India. The trends were recorded over 7 days (23 January 2019 to 30 January 2019) using TAGS v6.1. For each city, we show the number of tweets recorded and a graph of users versus user friend counts. Figure 6 shows the recent trends recorded for Ghaziabad, Uttar Pradesh, India using TAGS v6.1.

Fig. 6

Latest crime trends for Ghaziabad, Uttar Pradesh.

Over the seven days of crime trends, 93 user tweets were recorded for the highest-crime location in India. The user friend count intensity is also very high for this area, close to seventy-two thousand. This latest crime trend analysis for Ghaziabad, Uttar Pradesh supports the results of our analysis. The corresponding measurement for the lowest-crime city (Jammu) is shown in Fig. 7.

Fig. 7

Latest crime trends for Jammu.

Over the seven days of crime trends, 22 user tweets were recorded for the lowest-crime location in India. The user friend count intensity, near twelve thousand two, is much lower than that of Ghaziabad, Uttar Pradesh. This latest crime trend analysis for Jammu also supports the results of our analysis. Comparing the seven-day crime trends for the two locations, Ghaziabad, Uttar Pradesh and Jammu, shows that the crime intensity of Ghaziabad, Uttar Pradesh is very high compared to that of Jammu. A comparison of existing work and the proposed work is given in Table 8.

Table 8

Theoretical comparison between different research works

No	Authors	Descriptions	Limitations
1.	Mohammed Bekkali et al. (2019)	Categorization of short text for the Arabic language using SVM and NB classifiers.	Other semantic resources need to be included; limited to the Arabic language only.
2.	Hossny, A.H. et al. (2018)	Introduced a singular value decomposition approach for the textual features in tweets recorded from social media.	The approach is not suitable for eliminating keywords that may appear due to the spurious nature of the data.
3.	Egele, M. et al. (2017)	An approach for identifying compromises of high-profile accounts in order to detect attacks on popular companies.	An attacker who is aware of the social network account detection system can keep the compromised account from being detected.
4.	Bosque, L.P.D. et al. (2016)	A number of approaches were presented for the prediction of aggressive comments recorded from social media.	The dataset used is very small. Verification of the results is not presented in the paper.
5.	Xinyu Chen et al. (2015)	Investigation of sentiment content in social media for the prediction of criminal incidents.	Advanced methods such as support vector machines need to be included to verify the non-linear effect between crime incidents and polarity.
6.	Gerber, S.M. (2014)	Statistical topic modeling and linguistic analysis were used to identify keywords across a geographical location in the US.	The analysis was performed for only one geographical location, and verification against the real crime rate is absent. Temporal modeling and deeper semantic analysis of the data are required.
7.	Sarvari, H. et al. (2014)	Demonstrated a social graph magnitude using a set of email addresses of criminals.	Only the top-ranked criminals were included in the social media graph concerned.
8.	Proposed Work	The methodology starts with the collection of tweets, continues with filtering them, and ends with comparing them with real crime data recorded from NCRB. Our comparison results indicate that social media is a good way to detect the crime rate of a city.	The use of smarter, next-generation filtering tools could further increase accuracy.

An algorithm for the proposed work is detailed below.

Algorithm 1: Determine the partition data point (P) for the centroid of a cluster
Input: n-dimensional dataset
Output: partition data point (P) for the centroid of a cluster
Step 1: Begin
Step 2: Take n-dimensional data from Twitter posts
Call Viterbi_prediction (k number of tags)
for t = 1 ... |x| do
$\hat{y}_{t} \to \arg\max_{k} p(y_{t} = k \mid \hat{y}_{t-1}, x, t; \beta)$
end for
Step 3: To eliminate noise, call tweets (x, y)
$R(\beta) = \lambda_{1} \sum_{j} |\beta_{j}| + \frac{1}{2} \lambda_{2} \sum_{j} \beta_{j}^{2}$, where ∀j corresponds to the indexes
Step 4: Relation between all differential elements: $dy_{1}\, dy_{2} \cdots dy_{n} = |B|\, dx_{1}\, dx_{2} \cdots dx_{n}$
if B is a complete orthogonal matrix, then X’X ⟶ Y’Y
else if B is a partitioned matrix, ∀ B_i is n_i × n and ∑ n_i = n, then as per equation 8
Step 5: If data D is a hypercube, then all data points x_j (1 ≤ j ≤ q) belong to D as per equation 12
Step 6: In domain D, for each slice partition P
for P_r = 1 ... P
$P_{r} = \prod_{1 \leq i \leq n} [\alpha_{r, i}, \alpha_{r, i} + l_{i}]$
end for
Step 7: End
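Two of the steps in Algorithm 1 are closed-form expressions and can be illustrated in a few lines. The following is a minimal sketch, not the authors' implementation: `elastic_net_penalty` evaluates the regularizer R(β) of Step 3, and `slice_partitions` enumerates the boxes P_r of Step 6 under the added assumption that the hypercube is cut into equal slices along each dimension. The names `lam1`, `lam2`, `splits`, `slice_partitions` and `contains` are hypothetical and introduced only for this example.

```python
from itertools import product
from typing import List, Sequence, Tuple

# One (low, high) interval per dimension.
Box = List[Tuple[float, float]]


def elastic_net_penalty(beta: Sequence[float], lam1: float, lam2: float) -> float:
    """Step 3: R(beta) = lam1 * sum_j |beta_j| + 0.5 * lam2 * sum_j beta_j^2."""
    l1_term = sum(abs(b) for b in beta)
    l2_term = sum(b * b for b in beta)
    return lam1 * l1_term + 0.5 * lam2 * l2_term


def slice_partitions(
    lower: Sequence[float], upper: Sequence[float], splits: Sequence[int]
) -> List[Box]:
    """Step 6: cut the hypercube [lower, upper] into equal boxes.

    Each box is a product of intervals [alpha_i, alpha_i + l_i], where
    l_i = (upper_i - lower_i) / splits_i (equal slicing is an assumption).
    """
    lengths = [(hi - lo) / k for lo, hi, k in zip(lower, upper, splits)]
    boxes: List[Box] = []
    for index in product(*(range(k) for k in splits)):
        box = [
            (lo + j * length, lo + (j + 1) * length)
            for lo, length, j in zip(lower, lengths, index)
        ]
        boxes.append(box)
    return boxes


def contains(box: Box, point: Sequence[float]) -> bool:
    """True if the point lies inside the box (boundaries included)."""
    return all(lo <= x <= hi for (lo, hi), x in zip(box, point))


if __name__ == "__main__":
    print(elastic_net_penalty([1.0, -2.0], lam1=0.5, lam2=2.0))
    boxes = slice_partitions([0.0, 0.0], [1.0, 1.0], [2, 2])
    print(len(boxes), "boxes; first box:", boxes[0])
    print([contains(b, (0.25, 0.75)) for b in boxes])
```

Setting `lam2` to zero reduces the penalty to a pure L1 term and setting `lam1` to zero gives a pure L2 term, which is the usual motivation for combining both in one regularizer.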

4 Conclusion

The objective of this paper is to perform real-time crime data analysis by comparing social media datasets with real data obtained from security units. Seven strategic cities were selected for the case study. Twitter is one of the most popular social media platforms, and people use it to share their feelings with followers all around the world. For this reason, we used Twitter as the social media source for collecting crime data in this study. Collecting data from Twitter carries a heavy computational load because tweets express the intentions of users and can be specific and in any format. Our methodology starts with the collection of tweets, continues with filtering them, and ends with comparing them with real crime data. Our comparison results indicate that social media is a good way to detect the crime rate of a city.

According to our detection, we obtain exactly the same results as the real data from security units for five cities out of seven in India. Thus, we obtain approximately 70% detection accuracy. The most significant point is detecting the cities that have the highest and lowest crime rates; nevertheless, we aim to enhance our study to reach 100% accuracy by improving our filtering process and software tools. To enhance the accuracy of our results, we aim to enlarge our dataset and to use machine learning, deep learning methods, and next-generation language processing approaches. In addition, the use of smarter, next-generation filtering tools can increase our accuracy.

References

1. Vo and Duong T.H., Personalized Facets for Semantic Search Using Linked Open Data with Social Networks, IBICA 2012, Kaohsiung, Taiwan (2012), 312–317.

2. Nguyen and Le C.T., Race recognition using deep convolutional neural networks, Symmetry 10 (2018), 564.

3. Chen, Cho and Jang S.Y., Crime prediction using twitter sentiment and weather, IEEE System and Information Engineering Design Symposium (2015), 63–71.

4. Gerber S.M., Predicting crime using Twitter and kernel density estimation, Decision Support Systems 61 (2014), 115–125.

5. Chainey, Tompson and Uhlig, The utility of hotspot mapping for predicting spatial patterns of crime, Security Journal 21 (2008), 4–28.

6. Sarvari and Abozinadah, Constructing and Analyzing Criminal Networks, IEEE Security and Privacy Workshops, San Jose, CA, USA, 17-18 (2014), 24–31.

7. Qasem, Jansen, Hecking and Ulrich Hoppe, Using attractiveness model for actors ranking in social media networks, Computer Social Network (2017), 45–55.

8. Bermingham and Smeaton, On using Twitter to monitor political sentiment and predict election results, in: Proceedings of the Workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2011), 31 Asian Federation of Natural Language Processing, Chiang Mai, Thailand (2011), pp. 2–10.

9. Egele, Stringhini, Kruegel and Vigna, Towards detecting compromised accounts on social networks, IEEE Transactions on Dependable and Secure Computing 14(4) (2017), 358–364.

10. Mislove, Marcon, Gummadi K.P., Druschel and Bhattacharjee, Measurement and analysis of online social networks, in: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (2007), pp. 29–42, ACM.

11. Mohler G.O., Short M.B., Brantingham P.J., Schoenberg F.P. and Tita G.E., Self-exciting point process modeling of crime, Journal of the American Statistical Association 106 (2011), 100–108.

12. Wang, Brown and Gerber, Spatio-temporal modeling of criminal incidents using geographic, demographic, and Twitter-derived information, in: Intelligence and Security Informatics, Lecture Notes in Computer Science, IEEE Press (2012), 154–162.

13. Goswami and Kumar, A survey of event detection techniques in online social networks, Social Network Analysis and Mining (2016), 32–41.

14. Bosque L.P.D. and Garza S.E., Prediction of aggressive comments in social media: Exploratory study, IEEE Latin America Transactions 14(7) (2016), 142–151.

15. Sailunaz, Dhaliwal, Rokne and Alhajj, Emotion detection from text and speech: A survey, Social Network Analysis and Mining (2018), 85–94.

16. Bekkali and Lachkar, An effective short text conceptualization based on new short text similarity, Social Network Analysis and Mining (2019), 78–85.

17. Hossny A.H., Moschuo, Osborne, Mitchell and Lothian, Enhancing keyword correlation for event detection in social networks using SVD and k-means: Twitter case study, Social Network Analysis and Mining (2018), 104–110.

18. Kwak, Lee, Park and Moon, What is twitter, a social network or a news media?, in: Proceedings of the 19th International Conference on World Wide Web (2010), pp. 591–600, ACM.

19. Kumar, Novak and Tomkins, Structure and evolution of online social networks, in: Link Mining: Models, Algorithms, and Applications (2010), 337–357.

20. Owoputi, O’Connor, Dyer, Gimpel and Schneider, Part-of-speech tagging for Twitter: Word clusters and other advances, Technical Report CMU-ML-12-107, Carnegie Mellon University, 2012.

21. Marivate and Moiloa, Catching crime: Detection of Public Safety Incidents using Social Media, 2016 Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech), Stellenbosch, South Africa (2016), 25–34.

22. Stark T.H., Collecting Social Network Data, in: Vannette, Krosnick (eds.), The Palgrave Handbook of Survey Research, Palgrave Macmillan, Cham, Springer, 2018.

23. Owoputi Olutobi, O’Connor Brendan, Dyer Chris, Gimpel Kevin, Schneider Nathan and Smith Noah A., Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters, HLT-NAACL 2013, 380–390.

24. Ratnaparkhi, A maximum entropy model for part-of-speech tagging, in: Proc. of EMNLP (1996), 85–92.

25. Lafferty, McCallum and Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in: Proc. of ICML (2001), 594–603.

26. Liu and Nocedal, On the limited memory BFGS method for large scale optimization, Mathematical Programming 45(1) (1989), 196–204.

27. Andrew and Gao, Scalable training of L1-regularized log-linear models, in: Proc. of ICML (2007), 89–95.

28. Zou and Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(2) (2005), 301–320.

29. Brown P.F., de Souza P.V., Mercer R.L., Della Pietra V.J. and Lai J.C., Class-based n-gram models of natural language, Computational Linguistics 18(4) (1992), 87–91.

30. Blunsom and Cohn, A hierarchical Pitman-Yor process HMM for unsupervised part of speech induction, in: Proc. of ACL (2011), 124–131.

31. Shankar, Visualization of the Sentiment of the Tweets, Master’s Thesis, North Carolina State University, Raleigh, NC, 2011.

32. Riquelme and González-Cantergiani, Measuring user influence on twitter: a survey, Inf Process Manag., 2016.

33. Sun and Tang, A survey of models and algorithms for social influence analysis, in: Social network data analytics, Berlin: Springer (2011), pp. 177–214.

34. Kleinberg J.M., Authoritative sources in a hyperlinked environment, J ACM 46(5) (1999), 604–32.

35. Leavitt, Burchard, Fisher and Gilbert, The influentials: new approaches for analyzing influence on Twitter, Web Ecol Proj (2009), 4:1–8.

36. Cha, Haddadi, Benevenuto and Gummadi K.P., Measuring user influence in twitter: the million follower fallacy, in: 4th International AAAI Conference on Weblogs and Social Media (ICWSM), 2010.

37. Cai, Daijun, Yong, Sankaran and Yong, A modified evidential methodology of identifying influential nodes in weighted networks, Phys A Stat Mech Appl 392(21) (2013), 5490–500.