Enhancing patterns with linguistic information for criminal event recognition

Abstract

Criminal events fall into several categories. One of them, criminal events against people, directly affects the guarantees of a person or a family. These events are reported in digital media, including digital news media in Spanish. Recognizing criminal events against people is relevant for obtaining useful information about the public security of citizens. Natural Language Processing offers techniques that make their identification possible; however, fine-grained linguistic analysis is required to carry out this task. This paper considers enhancing discovered patterns with linguistic information (morphological and POS categories) to recognize criminal events against people in Spanish newspapers. Six categories of criminal events are considered: killing, violation, assault, suicide, kidnapping and sexual exploitation. An experiment is carried out with a gold standard data set of criminal events, and it shows promising results.

Keywords

Criminal events, morphological and linguistic information, natural language processing, event recognition

1 Introduction

Criminal events against people appear in the majority of digital news media. In particular, social networks offer real-time information about what is happening in terms of crime around a square, suburb, town, city, state or country. Thus, several questions about criminal events can be answered: “what happened”, which expresses the criminal event; “when it happened”, the time expression that reflects the instant or period in which something happened; “why it happened”, the cause of the criminal event; “who is (are) involved”, the agents or people implicated in a crime; and “where it occurred”, the place where the crime occurred.

In Spanish digital news media, thousands of criminal events are reported and described every day, and Mexican newspapers are no exception. Recognizing such events is very relevant for government decision-making or, simply, for a citizen who wants to know the crime incidence of a region, area or zone. Recognizing crime events by hand is a tedious and time-consuming task, so their recognition becomes a research challenge in text analysis. Fortunately, Natural Language Processing (NLP) is a field of Artificial Intelligence that proposes methods and techniques for automatic text analysis. However, recognizing criminal events in Spanish text is still considered a major challenge for NLP researchers.

The National Institute of Statistics and Geography (INEGI) categorizes criminal events in [21], where it presents the Statistical Classification of Crimes (CED), which allows consulting the universe of crimes in Mexico. The CED is sufficiently detailed to characterize each crime and, at the same time, to allow useful and practical groupings. It is structured into levels: main group, subgroup, unit group and class of crime. This structure seeks to give flexibility to the classification, from the most general to the most specific, and to facilitate its implementation.

Therefore, in this paper, we present an approach for enhancing patterns with morphological information and POS categories in order to recognize and extract criminal events from news reported in Mexican digital newspapers.

Six criminal events, from main groups of the CED presented in [21], are considered (killing, violation, assault, suicide, kidnapping and sexual exploitation). The automatic recognition of these six criminal events, extracted from daily Spanish news, is crucial for the government to make decisions about the implementation of crime prevention policies and strategies to avoid violent events in the near future.

In our particular case, we have delimited the roles closely related to each event of interest, extending the security-related events presented in [11], as follows.

A killing is a criminal event in which a human is deprived of life by any means. It involves murderer(s) and murdered, defined as:

Murdered is someone who has been killed.

Murderer is someone who has killed a person.

In a violation event, the perpetrator deprives the victim of freedom and sexual security through the forced or deceptive performance of copulation. It involves perpetrator(s) and victim(s), described as:

Perpetrator is someone who has sexually abused a human.

Victim is someone who has been sexually abused by a perpetrator.

In an attack, specifically an assault event, the closely related roles are attacker(s) and assaulted. The attacker takes advantage of the assaulted under spatial and temporal conditions of insecurity, using violence, in order to cause damage or loss. They are described as:

Attacker is someone who has committed an assault.

Assaulted is someone who has been robbed by an attacker.

A suicide event involves conduct in which a person takes his or her own life, so only the suicidal person intervenes, described as:

Suicidal is someone who commits suicide.

A kidnapping is a criminal event in which the kidnapper deprives the kidnapped of personal and spatial freedom as a means of pressure, in order to obtain a monetary benefit granted by the victim or by a third person. It involves kidnapper(s) and kidnapped, defined as:

Kidnapper is someone who deprives a person of spatial freedom.

Kidnapped is someone who has been locked up by a kidnapper.

A sexual exploitation is a criminal event in which the perpetrator deprives the exploited victim of sexual freedom by marketing, disposing of and exploiting the victim’s sexuality in order to obtain a monetary benefit, or provides the necessary means for that purpose. Its roles are defined as:

Perpetrator is someone who deprives a person of sexual freedom.

Exploited victim is someone who has been locked up by a perpetrator for sexual trafficking.

These criminal events become our events of interest, which cover the Statistical Classification of Crimes presented in [21]. As a starting point, we take these six classes and first discover linguistic patterns from Spanish news using a semi-supervised method; the patterns are then enhanced with linguistic information (morphological and POS categories) to recognize criminal events in non-labelled Spanish news.

The rest of this paper is organized as follows. Section 2 briefly describes the related work and compares it with the approach presented in this paper. Section 3 describes the semi-supervised method for extracting and enhancing patterns to recognize criminal events. Section 4 presents an evaluation based on the precision and recall of each event group, together with results and discussion. Finally, Section 5 presents the conclusions and future work.

2 Related work

Event extraction or recognition from text has been addressed in several works, using a great variety of approaches.

In [22], the authors use corpus-based linguistic resources such as vocabulary, n-grams and frame-based patterns, which play a central role in the extraction of financial events from Russian texts. Domain-independent real-world events are identified in social media texts in [6], using time-dependent semantic similarity measures that are consistent with static similarity measures but provide high temporal resolution.

Social media is a resource rich in events. Therefore, it has been widely used as a starting point for event extraction approaches. In [2], a weakly supervised approach is proposed for extracting new categories of events from Twitter, using seed-based machine learning.

On the other hand, newspapers also contain events, as shown in [4] and [13], which investigate global social event extraction by processing online news; the authors build an event model and extract event information by classifying English news text using lexico-semantic patterns. Events related to finance are extracted from news articles in [1], exploiting a framework with components such as gazetteers, where a domain-specific ontology is used as a seed for the event detection process. A system architecture for integrating event information into a global crisis monitoring system is described in [10], addressing several NLP-related tasks: news geo-tagging, automatic pattern learning, a pattern specification language and information aggregation. Security-related event extraction from online news is introduced in [16] through an event classification method based on domain-specific inference rules, an approach to event geo-tagging based on lexical or semantic patterns, a simple method for cross-lingual event information fusion, and techniques for scoring the relevance rank.

In the biomedical domain, events involve biomolecules such as genes, transcription factors or enzymes, which have a central role in biological processes and functions and provide valuable information for describing physiological and pathogenesis mechanisms. Event extraction from biomedical literature has a broad range of applications, including support for information retrieval, knowledge summarization, and information extraction and discovery, as in [12, 18], which use scientific documents as a starting point. In [20], a predictive model is inferred from a manually tagged corpus; it is important to note that these authors focus their work on Spanish texts.

In recent years, linguistic approaches have been proposed, such as linguistic patterns for analyzing French texts [7], deep learning techniques [23], approaches that consider language phenomena [19], and methods for news event extraction based on subject elements, which combine news topic sentence extraction with event extraction in Chinese texts [5].

Approaches based on textual patterns, whether linguistic, semantic, syntactic or lexical, have been used for the extraction of several kinds of events. Examples include [1], which uses semantics-based patterns for detecting economic events; [10], which uses syntactic patterns for real-time news event extraction about global crises; and [4], where the authors use named entity recognition to create patterns for extracting global social events. The use of patterns to recognize, extract or identify events has been widely considered. However, there is a significant lack of approaches for recognizing events in Spanish texts. Accordingly, in this paper, we rely on an approach that enriches extraction patterns with linguistic information for criminal event recognition in Spanish news.

3 Enhancing patterns and criminal event recognition

In this section, we describe our approach for enhancing patterns for criminal event recognition. The entire process consists of the following phases: a) pre-processing of news, a step necessary to prepare the news text for the coming phases; b) verbal and nominal phrase extraction, in order to generate an initial set of candidate phrases; c) filtering of candidate phrases and their enrichment to create a final set of patterns; d) criminal event recognition, which involves implementing each final pattern as a JAPE rule to extract the criminal events of interest. Fig. 1 shows the entire process for criminal event recognition, including the corresponding phases.

Fig. 1

Approach to enhance patterns for criminal event recognition.

3.1 Pre-processing Spanish news

In this step, a segmentation process is performed to obtain a string of tokens. In addition, two processes are applied to the texts in order to normalize them for use in later steps: lemmatization and Part-of-Speech tagging, both carried out using TreeTagger [9].

Lemmatization of Spanish news converts each word to its corresponding lemma, without inflections, so words can be analyzed as a normalized item. The lemmatization process allows grouping several inflected forms of the same word into a single item.

Part-of-Speech (POS) tagging is the process of assigning to each word of a text its corresponding grammatical category. TreeTagger is used to perform it; it considers the context in which the word appears in order to assign the suitable grammatical category.

Table 1 shows an example of a phrase decomposed into tokens with their lemmas and POS tags.

Table 1
An example of phrase lemmatization and POS tagging

String token	Lemma	POS tag
the/la	the/el	ART
victim/víctima	victim/víctima	NC
was/fue	be/ser	VSfin
kidnapped/secuestrada	kidnap/secuestrar	VLadj
with/con	with/con	PREP
car/carro	car/carro	NC

Both processes, lemmatization and POS tagging, are used in later phases. On the one hand, they are applied in verbal and nominal phrase extraction to support and adjust the rules for the automatic identification of candidate phrases. On the other hand, they are used to enrich the final extracted patterns to recognize criminal events in Spanish news.
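As an illustration of the data produced by this step, the following minimal Python sketch loads tagged tokens into records that keep the surface form, POS tag and lemma. It assumes a tab-separated layout of token, POS tag and lemma per line; the `Token` class, the `parse_tagged_lines` helper and the `TABLE_1` list are hypothetical names introduced here, and the sample lines simply restate the Spanish side of Table 1.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str   # surface form as it appears in the news text
    pos: str    # Part-of-Speech tag
    lemma: str  # normalized form without inflection


def parse_tagged_lines(lines):
    """Turn 'token<TAB>POS<TAB>lemma' lines into Token records."""
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        text, pos, lemma = line.split("\t")
        tokens.append(Token(text, pos, lemma))
    return tokens


# Spanish side of Table 1, written in the assumed tab-separated layout.
TABLE_1 = [
    "la\tART\tel",
    "víctima\tNC\tvíctima",
    "fue\tVSfin\tser",
    "secuestrada\tVLadj\tsecuestrar",
    "con\tPREP\tcon",
    "carro\tNC\tcarro",
]

tokens = parse_tagged_lines(TABLE_1)
lemmas = [t.lemma for t in tokens]
pos_tags = [t.pos for t in tokens]
```

Keeping the three fields together lets later phases match on the lemma or the POS tag while still reporting the original text.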

3.2 Verbal and nominal phrase extraction

Natural language texts are a way of communicating situations; therefore, understanding them involves paying particular attention to the events contained in the texts of interest. It is necessary to create a representation of what is expressed in terms of the space, time, causes and agents related to events. In texts, linguistic structure serves as the means to express how real-world situations are constructed.

Consequently, the language structures used to express events should be considered and analyzed in text. Events are present in most historical, journalistic and specialized texts, in which they are characterized by either a verbal phrase [15, 17] or a nominalization [3].

The theories of both L. Tesnière [15] and M. Halliday [17] affirm that the verb is the core around which all elements of the sentence revolve. Elements of the sentence are divided into actor roles (agent, object and recipient) and circumstantial roles (instrument, force, time and locative). These theories ground the proposed approach in the idea that the verbal core of a sentence represents an event and, therefore, verbal phrases must be characterized.

A nominalization is a word formation that uses a verb as a noun, i.e. nouns derived from verbs. They are known as action nominals [3]. Being derived from verbs, nominalizations also express events. In this paper, nominalizations are further characterized for the Spanish language in order to extract nominal phrases that describe criminal events.

For Spanish, nominalization, according to Hernando [14], creates nominal derivatives by suffixation, where the suffix can be denominal, deadjectival or deverbal. We have a special interest in deverbal nominalization, because it is formed from the base or root form of a verb plus a suffix. In addition, we rely on the fact that nominalizations refer to a verb and, consequently, to an event.

In this paper, we adopt the following two formal definitions of events. In addition, we characterize these events in order to extract them from newspapers in Spanish.

We use JAPE rules to characterize categories of criminal events in order to annotate them in text. JAPE [8], the Java Annotation Patterns Engine, provides finite state transduction over annotations based on regular expressions and is a version of CPSL (Common Pattern Specification Language).

Definition 1. (Verbal event) A verbal event is composed by a verbal phrase (VP), which is characterized as follows.

Rule: VerbalEvent
(
  (VP)
):ve
-->
:ve.VerbalEvent = {rule = "VerbalEvent", text = :ve@string}

A verbal phrase (VP) combines two or more verbal forms: an auxiliary verbal form and other forms with head verbs. VP is characterized with macros in JAPE rules as follows.

Macro: AUXILIARVERB
(
  {Token.pos == "VSadj"} | {Token.pos == "VSger"} |
  {Token.pos == "VHadj"} | {Token.pos == "VHger"} |
  {Token.pos == "VEadj"} | {Token.pos == "VEger"} |
  {Token.pos == "VMadj"} | {Token.pos == "VMger"}
)

Macro: HEADVERB
(
  {Token.pos == "VLinf"} | {Token.pos == "VLger"}
)

Macro: VP
(
  (AUXILIARVERB)[0,2] {SpaceToken}
  (HEADVERB)
)

Here, according to TreeTagger [9], the following tags are distinguished: VS or VE (verb “to be”), VH (verb “to have”) and VM (modal verb). The suffixes “ger” and “adj” mean gerund and adjectival (participle) verb forms, respectively. For the “HEADVERB” macro, two POS tags are used: VLinf (lexical verb in infinitive) and VLger (lexical verb in gerund). Finally, the notation in square brackets gives the number of times that the element can appear, in this case “[0, 2]”, zero to two times.

The rule called “VerbalEvent” is applied, and each match is extracted from the Spanish news with a window of five words or up to the beginning/end of the phrase.

Example 1. “... the victim has been assaulted on the street ... /... la víctima ha sido asaltado en la calle ...”

Example 2. “... the trafficker sexually exploited two women ... /... el tratante explotaba sexualmente a dos mujeres ...”

It is worth noting that non-relevant phrases will also be extracted, as shown in Example (3).

Example 3. “... Police secure three buildings in the colony ... /... Policía aseguran tres inmuebles en la colonia ...”
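The verbal phrase structure described for the VP macro (zero to two auxiliary forms followed by a head verb) can be sketched in Python over a sequence of POS tags. This is only an illustration under stated assumptions, not the JAPE implementation: the auxiliary tags are read as a choice among the listed tags, whitespace tokens are assumed to be already removed, and the function name `find_verbal_phrases` and the sample tag sequences are hypothetical.

```python
# Tags taken from the AUXILIARVERB and HEADVERB macros.
AUXILIARY_TAGS = {
    "VSadj", "VSger", "VHadj", "VHger",
    "VEadj", "VEger", "VMadj", "VMger",
}
HEAD_TAGS = {"VLinf", "VLger"}
MAX_AUXILIARIES = 2  # the "[0, 2]" repetition range


def find_verbal_phrases(tags):
    """Return (start, end) index pairs, end exclusive, of VP candidates."""
    spans = []
    i = 0
    while i < len(tags):
        j = i
        # Consume at most two consecutive auxiliary forms.
        while (j < len(tags) and j - i < MAX_AUXILIARIES
               and tags[j] in AUXILIARY_TAGS):
            j += 1
        # A candidate is kept only if a head verb follows immediately.
        if j < len(tags) and tags[j] in HEAD_TAGS:
            spans.append((i, j + 1))
            i = j + 1
        else:
            i += 1
    return spans


# Hypothetical tag sequence: article, noun, auxiliary, head verb, ...
sample = ["ART", "NC", "VSger", "VLinf", "PREP", "NC"]
spans = find_verbal_phrases(sample)
```

Because the sketch is generic, it also returns spans that have nothing to do with crime, which is why a later filtering step is needed.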

Definition 2. (Nominal event) A nominal event (NE) is characterized by a deverbal nominalization, which follows the structure shown in the following JAPE rule.

Rule: NominalEvent
(
  (NE)
):nominalevent
-->
:nominalevent.NominalEvent = {rule = "NominalEvent"}

In Spanish, a nominal event is defined as the concatenation of a verbal root and a corresponding suffix, as shown below.

$NE = V_{r} + {Af}_{n}$

Here V_r is the root or stem of a Spanish verb, obtained by removing its inflectional endings; e.g. the stem of the verb “violate/violar” is “viol”.

Af_n represents the suffix that completes the meaning of the nominalization. Some suffixes are, for example, -(at)oria, -sión, -(a, i)m(i)ento, -(a, i)ción, -a, -eo, -o. Because of the nature of the Spanish language, they have no English translation, with the exception of the suffixes -tion/-sión and -(a, i)tion/-(a, i)ción.
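The concatenation of a verbal root and a suffix can be checked with a short Python sketch. It is a hedged illustration, not the rule used in the paper: the regular expression only encodes the example suffixes listed above, the function name `nominal_suffix` is hypothetical, and apart from “viol” the verb roots in the usage lines are illustrative assumptions.

```python
import re

# Example suffixes from the text, as one regular expression:
# -(at)oria, -sión, -(a, i)m(i)ento, -(a, i)ción, -eo, -a, -o
SUFFIX_RE = re.compile(r"(?:(?:at)?oria|sión|[ai]mi?ento|[ai]ción|eo|a|o)")


def nominal_suffix(word, verb_root):
    """Return the suffix if word == verb_root + listed suffix, else None."""
    if not word.startswith(verb_root):
        return None
    rest = word[len(verb_root):]
    return rest if SUFFIX_RE.fullmatch(rest) else None


suffix_violacion = nominal_suffix("violación", "viol")     # root from the text
suffix_secuestro = nominal_suffix("secuestro", "secuestr")  # assumed root
suffix_violar = nominal_suffix("violar", "viol")            # infinitive, no match
```

Short suffixes such as -a and -o match many ordinary nouns, so a check of this kind only produces candidates that still need filtering.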

The examples illustrate the two linguistic structures considered in this paper to characterize our events. In Example (1), a verbal event expresses an attack (assault) event through the verbal phrase “has been assaulted/ha sido asaltado”. A nominal event, in turn, expresses a violation event through a deverbal nominalization, formed by the verb root (violate/viol) plus the -(a, i)tion/-(a, i)ción suffix, as in Example (4).

The rule called “NominalEvent” produces, with a window of five words or up to the beginning/end of the phrase, the phrase shown in Example (4).

Example 4. The violation of a woman in ... /La violación de una mujer en ...

Both rules produce a set of candidate phrases to be converted into the final patterns for recognizing criminal events.

3.3 Filtering and enhancing patterns

After the automatic extraction of verbal and nominal phrases with the defined JAPE rules, a set of candidate phrases is obtained. Because generic rules were created for extracting nominal and verbal phrases, this set contains irrelevant phrases that do not characterize criminal events. Therefore, a filtering process is indispensable. From the list of candidate phrases, human experts manually filter the extracted verbal and nominal phrases to obtain the relevant and useful ones. They decide, by manually labeling each phrase, whether or not it belongs to a criminal event in one of the six categories of interest. To become familiar with the domain, the human experts read the definition and characterization of each criminal event provided by [21]. The retained phrases form the knowledge base that is enriched to produce the final patterns.

The filtered phrases are grouped by matching, and duplicates are eliminated, which yields a useful list of patterns. This list is then enriched with linguistic information from the lemmatization and POS tagging processes.

The enriching process takes all inflections of verbs (tense, person and gender) and nouns (number and gender) and substitutes the lemma for the surface string. For example, “assault/asalta”, “assaults/asalta” and “assaulted/asaltó” are replaced by the lemma “assault/asaltar”, and “theft/robo” and “thefts/robos” are replaced by the lemma “theft/robo”. Also, determiners, pronouns and prepositions are replaced with the proper POS tag. For example, “el | un | los | unos” is replaced with the DT (determiner) tag and “a | con | contra | en | sobre | de” is replaced with PREP (preposition).

Enriching patterns with linguistic information from lemmatization and POS tagging process reduces the dimension of patterns, retaining their properties for characterizing criminal events.
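The reduction described above can be illustrated with a minimal Python sketch. It assumes that each filtered phrase is available as a list of (surface form, lemma) pairs and uses only the determiner and preposition lists quoted in the text; the `enrich` function and the candidate phrases are hypothetical examples, not data from the training corpus.

```python
# Word lists quoted in the text for the DT and PREP replacements.
DETERMINERS = {"el", "un", "los", "unos"}
PREPOSITIONS = {"a", "con", "contra", "en", "sobre", "de"}


def enrich(phrase):
    """Map a list of (surface, lemma) pairs to an enriched pattern string."""
    parts = []
    for surface, lemma in phrase:
        word = surface.lower()
        if word in DETERMINERS:
            parts.append("DT")      # determiner replaced by its POS tag
        elif word in PREPOSITIONS:
            parts.append("PREP")    # preposition replaced by its POS tag
        else:
            parts.append(lemma)     # inflected form replaced by its lemma
    return " ".join(parts)


# Hypothetical filtered phrases with different inflections.
candidates = [
    [("asalta", "asaltar"), ("a", "a")],
    [("asaltó", "asaltar"), ("con", "con")],
    [("los", "el"), ("robos", "robo")],
    [("un", "un"), ("robo", "robo")],
]

# Duplicates collapse once lemmas and POS tags replace surface strings.
patterns = sorted({enrich(p) for p in candidates})
```

In this toy case four surface phrases reduce to two enriched patterns, which is the dimension reduction the paragraph describes.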

After the semi-supervised learning method and the enhancing process with morphological and POS categories were carried out, a final set of patterns was obtained for each criminal event: 17 for the killing event, 19 for violation, 24 for the attack or assault event, 13 for kidnapping, 33 for the suicide event and 16 for sexual exploitation. Tables 2, 3, 4, 5, 6 and 7 list the three most frequent patterns for each event category.

Table 2
Patterns for killing event

Id pattern	Pattern
1	(to kill) (PREP)? // (asesinar) (PREP)?
2	(to be) (kill) // (ser/estar) (asesinar)
3	(DT)? (murder) // (DT)? (asesinato)

Table 3

Patterns for assault event

Id pattern	Pattern
1	(DT)? (theft | assault) // (DT)? (robo | asalto)
2	(to assault | to rob) (PREP)? // (asaltar | robar) (PREP)?
3	(to be) (assault) // (ser/estar) (asaltar)

Table 4

Patterns for violation event

Id pattern	Pattern
1	(DT)? (sexual | incestuous) (abuse) // (DT)? (abuso) (sexual | incestuoso)
2	(to violate) (PREP)? // (violar) (PREP)?
3	(sexually) (to abuse) (PREP)? // (abusar) (sexualmente) (PREP)?

Table 5

Patterns for suicide event

Id pattern	Pattern
1	(commit) (suicide) // (se) (suicidar)
2	(take) (PPO) (life) // (se) (quitar) (PPO | DT) vida
3	(hang) (PP) // (se) (ahorcar)

Table 6

Patterns for kidnapping event

Id pattern	Pattern
1	DT kidnapping (PREP)? // DT secuestro (PREP)?
2	(to be)? kidnap (PREP)? // (ser | estar)? secuestrar (PREP)?
3	(to catch | to capture)? kidnapper (express | integrated)? // (caer | atrapar | capturar)? secuestrador (exprés | integrado)?

Table 7

Patterns for sexual exploitation event

Id pattern	Pattern
1	DT? sexual exploitation PREP? // DT? explotación sexual PREP?
2	(human | minor | women | children) trafficking // trata PREP (persona | menor | infante | mujer | blanco)
3	(to be)? obligate PREP (to exercise | to dedicate)? PREP? DT? (sex service | to prostitute | have sex) // (ser | estar)? obligar PREP (ejercer | dedicar)? PREP? DT? (sexoservicio | prostituirse | tener relación sexual)

It should be emphasized that some very specific patterns rarely appear in the training corpus; this does not mean that they are excluded from the extraction phase. For example, the following pattern for identifying the killing event appears, at most, three to five times: “(to be) (calcinate | lynch) // (ser | estar) (calcinar | linchar)”.

A less frequent pattern for the assault event is as follows: “storm // tomar PREP asalto”.

A less frequent pattern for the violation event is illustrated as follows: “(touch) intimate part // (tocar) parte íntima”.

A less frequent pattern for the suicide event is shown as follows: “tie PP PREP PPO neck // se atar PREP PPO cuello”.

Finally, Table 7 shows the top three patterns for the sexual exploitation event.

It is relevant to note the following behavior of the patterns in each event with respect to verbal and nominal phrases. For the killing event, the two most frequent patterns are verbal phrases, whereas for the assault event the most frequent pattern is a nominalized phrase. For the suicide event, verbal phrases that begin with the pronoun “himself // se” prevail. In addition, the first two patterns for the sexual exploitation category are nominal phrases.

3.4 Pattern implementation

All extracted, filtered and enriched patterns are structured in JAPE rules for their implementation. As an example, a pattern of the killing event is given as follows.

Rule: KillingEvent
(
  {Token.lemma == "kill"}      // asesinar
  ({Token.pos == "PREP"})?     // a, por, de
):ke
-->
:ke.KillingEvent = {rule = "KillingEvent"}

The “?” symbol means zero or one occurrence of the element. In this manner, the rule called “KillingEvent” can extract chunks of Spanish news that describe a criminal (killing) event, such as “X killed Y // X asesinó a Y” or “X kills with Z // X asesina con Z”, where X is the murderer, Y is the victim and Z can be an instrument used to perpetrate the murder.
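To make the behavior of such a rule concrete, the following Python sketch applies the same idea (a token whose lemma is “asesinar”, optionally followed by a preposition) to lemmatized and POS-tagged tokens. It is an assumption-laden illustration rather than the JAPE engine: the `Token` class, the `match_killing` function, and the tags given to the placeholder tokens X, Y and Z are hypothetical.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    lemma: str
    pos: str


def match_killing(tokens):
    """Return text chunks: lemma 'asesinar' plus an optional preposition."""
    chunks = []
    for i, tok in enumerate(tokens):
        if tok.lemma != "asesinar":
            continue
        end = i + 1
        # The "?" in the rule: zero or one following preposition.
        if end < len(tokens) and tokens[end].pos == "PREP":
            end += 1
        chunks.append(" ".join(t.text for t in tokens[i:end]))
    return chunks


# "X asesinó a Y" and "X asesina con Z" from the text (tags are assumed).
first = [
    Token("X", "X", "NP"), Token("asesinó", "asesinar", "VLfin"),
    Token("a", "a", "PREP"), Token("Y", "Y", "NP"),
]
second = [
    Token("X", "X", "NP"), Token("asesina", "asesinar", "VLfin"),
    Token("con", "con", "PREP"), Token("Z", "Z", "NP"),
]
first_chunks = match_killing(first)
second_chunks = match_killing(second)
```

Matching on the lemma is what allows one pattern to cover both inflected forms of the verb.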

As in the example presented above, 122 patterns were recognized and filtered from the training corpus of news and organized into JAPE rules that are ready to be applied to any Spanish text. The 122 patterns are divided into 31 nominal-based patterns and 91 verbal-based patterns.

In our case, we use these linguistically enriched patterns to discover six criminal events in Spanish news. The linguistic pattern-based extraction task is evaluated in the next section.

4 Evaluation

In this section, we present the training and testing news corpus. In addition, we present the evaluation and results of the proposed approach.

4.1 Training and testing corpus

We use a data set divided into a training and a testing corpus. The corpus is composed of Spanish news extracted from Mexican newspapers using web scraping, a technique used by software programs to extract information from specific websites. Headlines or short descriptions are extracted for each news item. The main aim is to filter the news that describe one of the six events: killing, violation, assault, suicide, kidnapping and sexual exploitation.

The training corpus is composed of 1600 Spanish news items for each criminal event, giving 9600 headlines of criminal news in the training corpus. It should be noted that one news item can contain one or more criminal events. This corpus is used to recognize verbal and nominal phrases in the first step; these phrases are then filtered and enriched to become the linguistic patterns used for extracting criminal events.

The testing corpus is composed of 500 Spanish news items for each criminal event, giving 3000 news items in total. It should be clarified that they are different from those in the training corpus. From the testing corpus, the computational approach based on 122 patterns recognized 3120 events grouped into killing, violation, assault, suicide, kidnapping and sexual exploitation. The number of events is greater than the number of news items because one news item can contain one or more criminal events. These events are evaluated to determine the performance and quality of our approach in terms of the accuracy achieved in recognizing criminal events. Table 8 shows the distribution of events in each category.

Table 8
Criminal events to be evaluated

Criminal event	Number of events
killing	554
violation	573
assault	595
suicide	492
kidnapping	556
sexual exploitation	350
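As a quick consistency check of these figures, the short Python sketch below adds up the per-category counts of Table 8 and relates them to the size of the testing corpus. The variable names are introduced here only for illustration.

```python
# Per-category event counts copied from Table 8.
TABLE_8 = {
    "killing": 554,
    "violation": 573,
    "assault": 595,
    "suicide": 492,
    "kidnapping": 556,
    "sexual exploitation": 350,
}

total_events = sum(TABLE_8.values())          # events recognized in total
news_items = 6 * 500                          # 500 news items per category
events_per_news = total_events / news_items   # average events per news item
```

The total matches the 3120 recognized events reported for the testing corpus, slightly more than one event per news item on average.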

The sexual exploitation category has the fewest events because it is the crime least mentioned in digital news media.

The results of evaluating the criminal event recognition task on the testing corpus are presented in terms of precision, recall and F-measure. The list of recognized criminal events is reviewed to decide whether each one is a relevant event.

4.2 Evaluation, results and discussion

The evaluation uses the final list of criminal events recognized in the testing corpus to assess the accuracy of our approach based on 122 patterns.

Our evaluation of criminal event recognition consists of comparing the events extracted by the proposed approach against a gold standard, which is constructed manually by human experts from the testing corpus. Each news item is given to three experts, who decide which criminal event or events it contains. A news item can contain more than one criminal event, so each event is tagged separately. In this evaluation, the accuracy of the event extraction task is expressed by the number of hits or errors with respect to the gold standard. The following well-known metrics, usually employed in Information Retrieval, are used to measure it.

The precision (P) is the ratio between the number of criminal events extracted by the proposed approach (killing, violation, assault, suicide, kidnapping and sexual exploitation) that are part of the gold standard, known as relevant extracted criminal events, and the total number of criminal events extracted by the proposed approach, as shown in Equation (1).

$Precision = \frac{| {Relevant events} \cap {Extracted events} |}{| Extracted events |}$ (1)

The recall (R) is computed using Equation (2), which expresses the ratio between the number of criminal events extracted by the proposed approach that are part of the gold standard, known as relevant extracted criminal events, and the number of criminal events in the gold standard.

$Recall = \frac{| {Relevant events} \cap {Extracted events} |}{| Relevant events |}$ (2)

The F-measure (F1) is computed using Equation (3), which represents the harmonic mean of precision and recall. Therefore, it can be used as a global metric for evaluating each criminal event category.

$F1 = \frac{2 * Precision * Recall}{Precision + Recall}$ (3)
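Equations (1) to (3) can be written as a short Python sketch over sets of events. The function name and the event identifiers below are hypothetical and serve only to show how the three metrics relate; they are not data from the evaluation.

```python
def precision_recall_f1(relevant, extracted):
    """Compute Equations (1)-(3) from two sets of event identifiers."""
    hits = len(relevant & extracted)  # relevant extracted events
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: four gold-standard events, three extracted, two in common.
gold_standard = {"e1", "e2", "e3", "e4"}
system_output = {"e2", "e3", "e5"}
p, r, f1 = precision_recall_f1(gold_standard, system_output)
```

In this toy case the precision is 2/3, the recall is 1/2 and the F1 is 4/7, which illustrates that the harmonic mean stays closer to the lower of the two values.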

All experiments are executed using the 122 patterns over the testing corpus. We analyze the results by category of criminal event: killing, violation, assault, suicide, kidnapping and sexual exploitation. The results for each criminal event are then compared to determine the best results for the corpus. Table 9 presents the results obtained in criminal event extraction for the testing corpus. All categories are considered in order to compare their results.

The results shown in Table 9 indicate that an average precision of 0.719 is obtained for the six event categories. Although the results are not as encouraging for the recall metric, the proposed approach demonstrates good overall effectiveness for recognizing criminal events in Spanish news: its average F1 is close to 64%. The best effectiveness is achieved for the violation event, which reaches an F1 value above 65% with consistent precision and recall values. The worst results were obtained for the kidnapping event, because some phrases in this category were not extracted or are ambiguous. For example, the phrase “X was plagiarized // X fue plagiado” is not extracted by the human experts, and it has two senses (“retain a person by force” and “copy an idea”). Nonetheless, the proposed approach can support media analysts in the automatic recognition of criminal events, reducing the time invested in the manual analysis of this information.

Table 9

Results of criminal event recognition

Criminal event	P	R	F1
killing	0.756	0.560	0.643
violation	0.713	0.609	0.657
assault	0.715	0.567	0.632
suicide	0.739	0.553	0.633
kidnapping	0.671	0.569	0.616
sexual exploitation	0.698	0.582	0.635
average	0.719	0.573	0.636
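As a worked check of Equation (3), the following Python sketch recomputes the F1 column of Table 9 from its precision and recall columns and averages the reported F1 values. The dictionary and variable names are introduced here for illustration; the numbers are copied from the table.

```python
# (precision, recall, reported F1) per category, copied from Table 9.
TABLE_9 = {
    "killing": (0.756, 0.560, 0.643),
    "violation": (0.713, 0.609, 0.657),
    "assault": (0.715, 0.567, 0.632),
    "suicide": (0.739, 0.553, 0.633),
    "kidnapping": (0.671, 0.569, 0.616),
    "sexual exploitation": (0.698, 0.582, 0.635),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall, as in Equation (3)."""
    return 2 * precision * recall / (precision + recall)


# F1 recomputed from P and R, rounded to three decimals like the table.
recomputed = {
    name: round(f1_score(p, r), 3) for name, (p, r, _) in TABLE_9.items()
}

# Unweighted (macro) average of the reported per-category F1 values.
macro_f1 = sum(f for _, _, f in TABLE_9.values()) / len(TABLE_9)
```

Each recomputed value agrees with the reported F1 to three decimals, and the unweighted average of the F1 column is 0.636, the value given in the last row.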

5 Conclusions and future work

This paper has presented an approach for criminal event extraction based on linguistic enriched patterns, which are originated on verbal and nominal phrases. The proposed method is basically focused on training phase and recognition phase. A set of patterns enriched with linguistic information is obtained from training phase using a semi-supervised learning approach, in which verbal and nominal phrases are automatically extracted. Then, human experts manually filter them to get final set of patterns; finally, in this phase patterns are enriched with linguistic information about lemma and POS tags. In extraction phase, a test corpus is used to extract criminal events using the extracted and enriched patterns. As a result of the presented method a knowledge base with criminal events, it can be very useful for applications about data extraction, question answering systems.

It is important to note that this paper focuses on news in Spanish, a language for which few works have been proposed. Therefore, the verbal and nominal phrases converted into patterns are a relevant resource, and they are evaluated for recognizing criminal events.

The main contributions of this paper are: a) a set of patterns linguistically enriched with morphological information and POS categories, focused on six criminal events: killing, violation, assault, suicide, kidnapping and sexual exploitation; b) a knowledge base of criminal events extracted from Spanish newspapers, considering their verbal and nominal features; c) an event-based evaluation of the linguistic patterns, which have outperformed the baseline.

The evaluation uses the macro-averaged F1 measure, which considers the precision and recall of event extraction. The categories of criminal events used in the training and testing corpora are balanced. The evaluation shows promising results for event extraction in Spanish texts.

As future work, other categories of criminal events could be considered, such as discrimination, narcotics trafficking, extortion or fraud. Also, the obtained knowledge base can be enriched with information about the time and place of the events already extracted in this paper.

Finally, early detection of criminal events is highly important for preventing crime in society. Thus, a system for predicting crimes using the set of patterns presented in this paper would be a research challenge to be addressed.

Acknowledgments

This work was partly supported by SEP-PRODEP. The authors would like to thank the Autonomous Metropolitan University Azcapotzalco and SNI-CONACyT.
