Abstract
Criminal events fall into several categories, one of which targets people directly: criminal events against people, which violate the guarantees of a person or a family. These events are reported in digital media, including digital news media in Spanish. Recognizing criminal events against people yields useful information about the public security of citizens. Natural Language Processing offers techniques that make their identification possible; however, fine-grained linguistic analysis is required to carry out this task. This paper considers enhancing discovered patterns with linguistic information (morphological and POS categories) to recognize criminal events against people in Spanish newspapers. Six categories of criminal events are considered: killing, violation, assault, suicide, kidnapping and sexual exploitation. An experiment is carried out with a gold standard data set of criminal events, and it shows promising results.
Introduction
Criminal events against people appear in the majority of digital news media. In particular, social networks offer real-time information about what is happening in terms of crime around a square, suburb, town, city, state or country. Thus, several questions about criminal events can be answered: “what happened”, which expresses the criminal event; “when it happened”, the time expression that reflects the instant or period in which it happened; “why it happened”, the cause of the criminal event; “who is (are) involved”, the agents or people implicated in the crime; and “where it occurred”, the place where the crime occurred.
In Spanish digital news media, thousands of criminal events are reported and described every day, and Mexican newspapers are no exception. Recognizing such events is highly relevant for government decision-making or, simply, for citizens who want to know the crime incidence of a region, area or zone. Recognizing crime events manually is tedious and time-consuming, so doing it automatically is a research challenge in text analysis. Fortunately, Natural Language Processing (NLP) is a field of Artificial Intelligence that proposes methods and techniques for automatic text analysis. Even so, recognizing criminal events in Spanish text is considered a major challenge for NLP researchers.
The National Institute of Statistics and Geography (INEGI) categorizes criminal events in [21], where it presents the Statistical Classification of Crimes (CED), which allows consulting the universe of crimes in Mexico. The CED is sufficiently detailed to characterize each crime and, at the same time, to allow useful and practical groupings. It is structured into levels: main group, subgroup, unit group and class of crime. This structure seeks to give the classification flexibility, from the most general to the most specific level, and to facilitate its implementation.
Therefore, in this paper, we present an approach for enhancing patterns with morphological information and POS categories in order to recognize and extract criminal events from news reported in Mexican digital newspapers.
Six criminal events from the main groups of the CED presented in [21] are considered: killing, violation, assault, suicide, kidnapping and sexual exploitation. The automatic recognition of these six criminal events, extracted from daily Spanish news, is crucial for the government to make decisions about crime prevention policies and strategies to avoid violent events in the near future.
In our particular case, we have delimited the roles closely related to each event of interest and extended the security-related events presented in [11], as follows.
A killing is a criminal event in which a human is deprived of life by any means. It involves murderer(s) and murdered, defined as follows: the murdered is someone who has been killed; the murderer is someone who has killed a person.
In a violation event (rape; Spanish “violación”), the perpetrator deprives the victim of his or her freedom and sexual security through forced or deceptive copulation. It involves perpetrator(s) and victim(s), described as follows: the perpetrator is someone who has sexually abused a human; the victim is someone who has been sexually abused by a perpetrator.
In an attack, specifically an assault event, the closely related roles are attacker(s) and assaulted. The attacker takes advantage of the assaulted under spatial and temporal conditions of insecurity, using violence, in order to cause damage or loss. They are described as follows: the attacker is someone who has committed an assault; the assaulted is someone who has been robbed by an attacker.
A suicide event involves conduct in which a person takes his or her own life, so only the suicidal person intervenes: the suicidal is someone who commits suicide by himself or herself.
A kidnapping is a criminal event in which the kidnapper deprives the kidnapped of personal and spatial freedom as a means of pressure, in order to obtain a monetary benefit granted by the victim or by a third person. It involves kidnapper(s) and kidnapped, defined as follows: the kidnapper is someone who deprives a person of spatial freedom; the kidnapped is someone who has been locked up by a kidnapper.
A sexual exploitation is a criminal event in which the perpetrator deprives the exploited victim of sexual freedom by marketing, disposing of and exploiting the victim’s sexuality in order to obtain a monetary benefit, or provides the necessary means for that purpose. The roles are defined as follows: the perpetrator is someone who deprives a person of sexual freedom; the exploited victim is someone who has been locked up by a perpetrator for sexual trafficking.
Such criminal events become our events of interest, which cover the Statistical Classification of Crimes presented in [21]. As a starting point, we take these six classes and first discover linguistic patterns from Spanish news using a semi-supervised method; the patterns are then enhanced with linguistic information (morphological and POS categories) to recognize criminal events in non-labelled Spanish news.
The rest of this paper is organized as follows. Section 2 briefly describes the related work and compares it with the approach presented in this paper. Section 3 describes the semi-supervised method for extracting and enhancing patterns to recognize criminal events. Section 4 presents an evaluation based on the precision and recall of each event group, together with results and discussion. Finally, Section 5 presents the conclusions and future work.
Related work
Event extraction or recognition from text has been addressed by several works, using a great variety of approaches.
In [22], the authors use corpus-based linguistic resources such as vocabulary, n-grams and frame-based patterns, which play a central role in extracting financial events from Russian texts. Domain-independent real-world events are identified in social media texts in [6], by using time-dependent semantic similarity measures that are consistent with static measures of similarity but provide high temporal resolution.
Social media is a resource rich in events. Therefore, it has been widely used as a starting point for event extraction approaches. In [2], a weakly supervised approach is proposed for extracting new categories of events from Twitter, using seed-based machine learning.
On the other hand, newspapers also contain events, as shown in [4] and [13], which investigate global social event extraction by processing online news; the authors build an event model and extract event information by classifying English news text using lexico-semantic patterns. Events related to finances are extracted from news articles in [1], exploiting a framework with components such as gazetteers, where a domain-specific ontology is used as a seed for the event detection process. A system architecture is described in [10] for integrating event information into a global crisis monitoring system, addressing several NLP tasks: news geo-tagging, automatic pattern learning, a pattern specification language, and information aggregation. Security-related event extraction from online news is introduced in [16] through an event classification method based on domain-specific inference rules, an approach to event geo-tagging based on lexical or semantic patterns, a simple method for cross-lingual event information fusion, and techniques for scoring the relevance rank.
In the biomedical domain, events involving biomolecules such as genes, transcription factors, or enzymes have a central role in biological processes and functions and provide valuable information for describing physiological and pathogenesis mechanisms. Event extraction from biomedical literature has a broad range of applications, as in [12, 18], which use scientific documents as a starting point. A predictive model inferred from a manually tagged corpus is presented in [20], including support for information retrieval, knowledge summarization, and information extraction and discovery; it is important to note that these authors have focused their work on Spanish texts.
In recent years, linguistic approaches have been proposed, such as linguistic patterns for analyzing French texts [7], deep learning techniques [23], approaches that consider language phenomena [19], and methods for news event extraction based on subject elements, which combine news topic sentence extraction with event extraction in Chinese texts [5].
Approaches based on textual patterns, whether linguistic, semantic, syntactic or lexical, have been used to extract several kinds of events. We have found works such as [1], which uses semantics-based patterns for detecting economic events; [10], which uses syntactic patterns for real-time news event extraction about global crises; and [4], where the authors use named entity recognition to create patterns for extracting global social events. The use of patterns to recognize, extract or identify events has been widely considered. However, there is a significant lack of approaches for recognizing events in Spanish texts. Accordingly, in this paper, we rely on an approach that enriches extraction patterns with linguistic information for criminal event recognition in Spanish news.
Enhancing patterns and criminal event recognition
In this section, we describe our approach for enhancing patterns for criminal event recognition. The entire process consists of the following phases: a) pre-processing of the news, a step needed to prepare the news text for the subsequent phases; b) verbal and nominal phrase extraction, in order to generate an initial set of candidate phrases; c) filtering of the candidate phrases and their enrichment to create a final set of patterns; d) criminal event recognition, which implements each final pattern as a JAPE rule to extract the criminal events of interest. Fig. 1 shows the entire process for criminal event recognition, including the corresponding phases.

Fig. 1. Approach to enhance patterns for criminal event recognition.
In the pre-processing step, a segmentation process is performed to split the text into a string of tokens. In addition, two processes are applied to the texts in order to normalize them for use in later steps: lemmatization and Part-of-Speech tagging, both carried out using TreeTagger [9].
Lemmatization of Spanish news converts each word to its corresponding lemma, without inflections, so that words can be analyzed as normalized items. The lemmatization process allows several inflected forms of the same word to be grouped into a single item.
Part-of-Speech (POS) tagging is the process of assigning the corresponding grammatical category to each word of a text. TreeTagger is used to perform it; it considers the context in which the word appears in order to assign the suitable grammatical category.
Table 1 shows an example of a phrase decomposed into tokens with their lemmas and POS tags.
Table 1. An example of phrase lemmatization and POS tagging
Both processes, lemmatization and POS tagging, are used in later phases. On the one hand, they are applied in verbal and nominal phrase extraction to support and adjust the rules for the automatic identification of candidate phrases. On the other hand, they are used to enrich the final extracted patterns that recognize criminal events in Spanish news.
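To make the output of this step concrete, the following minimal sketch produces (token, lemma, POS) rows for a short phrase. It does not call TreeTagger: the small lexicon `TOY_LEXICON`, the function `tag_phrase`, the sample phrase and the tags `NC`, `VSfin` and `UNK` are assumptions introduced only for illustration, while `DT`, `PREP` and `VLadj` follow the tag names used in this paper.

```python
# Hypothetical hand-written lexicon standing in for real tagger output.
# Each entry maps a surface form to (lemma, POS tag).
TOY_LEXICON = {
    "un": ("un", "DT"),
    "hombre": ("hombre", "NC"),
    "fue": ("ser", "VSfin"),
    "asaltado": ("asaltar", "VLadj"),
    "en": ("en", "PREP"),
    "la": ("el", "DT"),
    "calle": ("calle", "NC"),
}


def tag_phrase(phrase):
    """Split on whitespace and return (token, lemma, POS) triples.

    Unknown tokens keep their surface form as lemma and get the tag UNK.
    """
    rows = []
    for token in phrase.lower().split():
        lemma, pos = TOY_LEXICON.get(token, (token, "UNK"))
        rows.append((token, lemma, pos))
    return rows


if __name__ == "__main__":
    for token, lemma, pos in tag_phrase("Un hombre fue asaltado en la calle"):
        print(f"{token}\t{lemma}\t{pos}")
```

A real pipeline would replace the lexicon lookup with the tagger's output, which also resolves the category of a word from its context.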
Natural language texts are a way of communicating situations; therefore, understanding them involves paying particular attention to the events contained in the texts of interest. It is necessary to create a representation of what is expressed in terms of the space, time, causes and agents related to events. In texts, the linguistic structure serves as the means of expressing how real-world situations are constructed.
Consequently, the language structures used to express events should be considered and analyzed in text. Events are present in most historical, journalistic and specialized texts, where they are characterized by either a verbal phrase [15, 17] or a nominalization [3].
The theories of both L. Tesnière [15] and M. Halliday [17] affirm that the verb is the core around which all elements of the sentence rotate. The elements of the sentence are divided into actor roles (agent, object and recipient) and circumstantial roles (instrument, force, time and locative). These theories ground the proposed approach in the idea that the verbal core of a sentence represents an event and, therefore, verbal phrases must be characterized.
A nominalization is a word formation that uses a verb as a noun, i.e. a noun derived from a verb. Such nouns are known as action nominals [3]. Being derived from verbs, nominalizations also express events. In this paper, nominalizations are further characterized for the Spanish language in order to extract nominal phrases that describe criminal events.
For Spanish, nominalization, according to Hernando [14], creates nominal derivatives by suffixation. The suffixes can be denominal, deadjectival or deverbal. We have a special interest in deverbal nominalization, because it is formed from the base or root form of a verb plus a suffix. In addition, we rely on the fact that nominalizations refer to a verb and, consequently, to an event.
In this paper, we adopt the following two formal definitions of events. In addition, we characterize these events in order to extract them from newspapers in Spanish.
We use JAPE rules to characterize categories of criminal events in order to annotate them in text. JAPE [8] is the Java Annotation Patterns Engine, which provides finite state transduction over annotations using regular expressions, based on the Common Pattern Specification Language (CPSL).
A verbal phrase (VP) combines two or more verbal forms: an auxiliary verbal form and other forms with head verbs. The VP is characterized as macros in JAPE rules as follows.
Here, according to TreeTagger [9], the following tags are distinguished: VS or VE (verb “to be”), VH (verb “to have”) and VM (modal verb). The suffixes “ger” and “adj” mean gerund and adjective verbs respectively. For the “HEADVERB” macro, three POS tags are enforced, among them VLinf (lexical verb in infinitive) and VLger (lexical verb in gerund). Finally, the notation in square brackets gives the number of times that the element can appear; in this case “[0, 2]” means zero to two times.
The rule called “VerbalEvent” is implemented and, for each match, a window of five words, or up to the beginning/end of the phrase, is extracted from the Spanish news.
It is worth noting that non-relevant phrases will also be extracted, as shown in Example (3).
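The matching logic described for the “VerbalEvent” rule can be sketched in Python as follows. This is not the JAPE rule itself but a rough illustration under stated assumptions: any tag starting with `VL` is treated as a head verb, the window is taken as up to five tokens on each side of the match, and the function name `find_verbal_events` and the sample tags `NC` and `VSfin` are hypothetical.

```python
AUX_PREFIXES = ("VS", "VE", "VH", "VM")  # "to be", "to have", modal (per the text)
HEAD_PREFIX = "VL"                       # assumption: any lexical-verb tag is a head verb


def find_verbal_events(tagged, window=5):
    """Find verbal phrases in a list of (token, pos) pairs.

    A verbal phrase is zero to two auxiliaries followed by one head verb.
    Returns a list of (phrase_tokens, context_tokens), where the context is
    the phrase plus up to `window` tokens on each side, clipped at the
    beginning and end of the input.
    """
    results = []
    n = len(tagged)
    i = 0
    while i < n:
        j = i
        # Consume zero to two auxiliary verbs.
        while j < n and j - i < 2 and tagged[j][1].startswith(AUX_PREFIXES):
            j += 1
        if j < n and tagged[j][1].startswith(HEAD_PREFIX):
            phrase = [tok for tok, _ in tagged[i:j + 1]]
            start = max(0, i - window)
            end = min(n, j + 1 + window)
            context = [tok for tok, _ in tagged[start:end]]
            results.append((phrase, context))
            i = j + 1
        else:
            i += 1
    return results


if __name__ == "__main__":
    sentence = [
        ("un", "DT"), ("hombre", "NC"), ("fue", "VSfin"),
        ("asaltado", "VLadj"), ("en", "PREP"), ("la", "DT"), ("calle", "NC"),
    ]
    for phrase, context in find_verbal_events(sentence):
        print(" ".join(phrase), "|", " ".join(context))
```

Because the rule is generic, it fires on every verbal phrase, which is why a later filtering step is needed to discard phrases that do not describe a criminal event.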
In Spanish, a nominal event is defined as a concatenation of verbal root plus a corresponding suffix, as shown below.
Here, V_r denotes the root or stem of a Spanish verb, obtained by removing its inflectional endings; e.g. the stem of the verb “violate/violar” is “viol”.
Af_n represents the suffix that completes the meaning of the nominalization. Some suffixes are, for example, -(at)oria, -sión, -(a, i)m(i)ento, -(a, i)ción, -a, -eo, -o. Because of the nature of the Spanish language, these suffixes have no English translation, with the exception of -tion/-sión and -(a, i)tion/-(a, i)ción.
In the following examples, we show the two linguistic structures considered in this paper to characterize our events. In Example (1), a verbal event is shown: an attack (assault) event expressed by the verbal phrase “was assaulted/fue asaltado”. Then, in Example (2), a nominal event is shown: a violation event expressed by a deverbal nominalization, formed from the verb (violate/violar) plus the -(a, i)tion/-(a, i)ción suffix.
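The composition of a verbal root and a nominalizing suffix can be illustrated with a short Python check. This is only a sketch: the expansion of the suffix notation into the concrete strings in `SUFFIXES` and the function name `nominal_suffix` are assumptions made for illustration, and the verb stem is supplied by hand rather than computed.

```python
# Concrete suffix strings expanded from the notation in the text, e.g.
# -(a, i)ción -> "ación", "ición". The expansion is an assumption.
SUFFIXES = (
    "atoria", "oria", "sión",
    "amiento", "imiento", "amento", "imento",
    "ación", "ición", "a", "eo", "o",
)


def nominal_suffix(word, stem):
    """Return the suffix if `word` is exactly `stem` plus a listed suffix.

    Returns None when the word does not start with the stem or when the
    remainder is not one of the nominalizing suffixes.
    """
    word = word.lower()
    if not word.startswith(stem):
        return None
    rest = word[len(stem):]
    return rest if rest in SUFFIXES else None


if __name__ == "__main__":
    print(nominal_suffix("violación", "viol"))  # stem of "violar"
    print(nominal_suffix("asalto", "asalt"))    # stem of "asaltar"
    print(nominal_suffix("violento", "viol"))   # adjective, not a listed suffix
```

Short suffixes such as -a and -o also match inflected verb forms, so a check like this would plausibly be combined with the POS tag of the word before a candidate is accepted as a nominal event.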
The rule called “NominalEvent” produces, with a window size of five words or the beginning/end of the phrase, the phrase shown in Example (4).
Both rules produce a set of candidate phrases to be converted into final patterns to recognize criminal events.
Filtering and enhancing patterns
After the automatic extraction of verbal and nominal phrases with the defined JAPE rules, a set of candidate phrases is obtained. As generic rules have been created for extracting nominal and verbal phrases, this set contains irrelevant phrases that do not characterize or recognize criminal events. Therefore, a filtering process is indispensable. From the list of candidate phrases, human experts manually filter the extracted verbal and nominal phrases to keep the relevant and useful ones. The experts decide, by manual labeling, whether each phrase belongs to one of the six categories of criminal events of interest. To become familiar with the domain, they read the definition and characterization of each criminal event provided by [21]. The retained phrases make up our knowledge base, which is then enriched to form the final patterns.
The filtered phrases are grouped by matching, and duplicates are eliminated, which yields a useful list of patterns. This list is then enriched with the linguistic information that comes from the lemmatization and POS tagging processes.
The enriching process takes all inflections of verbs (time, person and gender) and nouns (person and gender) and substitutes the lemma for the surface string. For example, “assault/asalta”, “assaults/asalta” and “assaulted/asaltó” are replaced by the lemma “assault/asaltar”, and “theft/robo” and “thefts/robos” are replaced by the lemma “theft/robo”. Also, determiners, pronouns and prepositions are replaced with the proper POS tag. For example, “el | un | los | unos” is replaced with the DT (determiner) tag and “a | con | contra | en | sobre | de” is replaced with PREP (preposition).
Enriching patterns with linguistic information from the lemmatization and POS tagging processes reduces the dimension of the pattern set while retaining the properties that characterize criminal events.
After the semi-supervised learning method and the enhancing process with morphological and POS categories were carried out, a final set of patterns was obtained for each criminal event: 17 for the killing event, 19 for violation, 24 for the attack or assault event, 13 for kidnapping, 33 for the suicide event and 16 for sexual exploitation. Tables 2, 3, 4, 5, 6 and 7 list the three most frequent patterns for each event category.
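The enrichment and duplicate elimination can be sketched as follows. The determiner and preposition lists are the ones quoted in the text; the small `LEMMAS` table is a hypothetical stand-in for the lemmatizer output, and the function names `enrich` and `enrich_all` are introduced only for this illustration.

```python
DETERMINERS = {"el", "un", "los", "unos"}
PREPOSITIONS = {"a", "con", "contra", "en", "sobre", "de"}

# Hypothetical stand-in for lemmatizer output (surface form -> lemma).
LEMMAS = {"asalta": "asaltar", "asaltó": "asaltar", "robos": "robo"}


def enrich(phrase):
    """Replace determiners and prepositions by POS tags and words by lemmas."""
    out = []
    for tok in phrase.lower().split():
        if tok in DETERMINERS:
            out.append("DT")
        elif tok in PREPOSITIONS:
            out.append("PREP")
        else:
            out.append(LEMMAS.get(tok, tok))
    return " ".join(out)


def enrich_all(phrases):
    """Enrich every phrase and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(enrich(p) for p in phrases))


if __name__ == "__main__":
    candidates = ["asalta a", "asaltó con", "robos en"]
    print(enrich_all(candidates))
```

In this toy run, two surface phrases collapse into a single enriched pattern, which illustrates how enrichment reduces the size of the pattern set.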
Table 2. Patterns for killing event
Table 3. Patterns for assault event
Table 4. Patterns for violation event
Table 5. Patterns for suicide event
Table 6. Patterns for kidnapping event
Table 7. Patterns for sexual exploitation event
It should be emphasized that some very specific patterns rarely appear in the training corpus; this does not mean that they are excluded from the extraction phase. For example, the following pattern for identifying the killing event appears three to five times at most: “(to be) (calcinate | lynch) // (ser | estar) (calcinar | linchar)”.
A less frequent pattern for the assault event is as follows: “storm // tomar PREP asalto”.
A less frequent pattern for the violation event is illustrated as follows: “(touch) intimate part // (tocar) parte íntima”.
A less frequent pattern for the suicide event is shown as follows: “tie PP PREP PPO neck // se atar PREP PPO cuello”.
Finally, Table 7 shows the top three patterns for the sexual exploitation event.
It is relevant to note the following behavior of the patterns in each event with respect to verbal and nominal phrases. For the killing event, the two most frequent patterns are verbal phrases, whereas for the assault event the most frequent pattern is a nominalized phrase. For the suicide event, verbal phrases that begin with the pronoun “himself // se” prevail. In addition, the first two patterns for the sexual exploitation category are nominal phrases.
All extracted, filtered and enriched patterns are structured in JAPE rules for their implementation. As an example, a pattern of the killing event is given as follows.
The “?” symbol means zero or one occurrence of an element. In this manner, the rule called “KillingEvent” can extract chunks of Spanish news that describe a criminal (killing) event, such as “X killed Y // X asesinó a Y” or “X kills with Z // X asesina con Z”, where X is the murderer, Y is the victim and Z can be an instrument used to perpetrate the murder.
As in the example presented above, 122 patterns were recognized and filtered from the training corpus of news and organized into JAPE rules that are ready to be applied to any Spanish text. The 122 patterns are divided into 31 nominal-based patterns and 91 verbal-based patterns.
In our case, we use these linguistically enriched patterns to discover six criminal events in Spanish news. The linguistic pattern-based extraction task is evaluated in the upcoming section.
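How an enriched pattern with an optional element is applied can be sketched in Python. This is not the “KillingEvent” JAPE rule; it assumes a simplified pattern of the form “asesinar PREP?” (the lemma followed by an optional preposition), and the function name `match_killing` and the tags `NP` and `VLfin` in the sample input are hypothetical.

```python
def match_killing(tagged):
    """Match the lemma 'asesinar' followed by an optional PREP token.

    `tagged` is a list of (token, lemma, pos) triples. Returns the matched
    surface strings. The optional PREP mirrors the '?' operator (zero or
    one occurrence).
    """
    spans = []
    for i, (_, lemma, _) in enumerate(tagged):
        if lemma != "asesinar":
            continue
        end = i + 1
        # Optional element: extend the match only if a preposition follows.
        if end < len(tagged) and tagged[end][2] == "PREP":
            end += 1
        spans.append(" ".join(tok for tok, _, _ in tagged[i:end]))
    return spans


if __name__ == "__main__":
    news = [
        ("X", "X", "NP"), ("asesinó", "asesinar", "VLfin"),
        ("a", "a", "PREP"), ("Y", "Y", "NP"),
    ]
    print(match_killing(news))
```

Matching on the lemma rather than the surface form is what lets one pattern cover both “asesinó” and “asesina”.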
Evaluation
In this section, we present the training and testing news corpus. In addition, we present the evaluation and results of the proposed approach.
Training and testing corpus
We use a data set divided into a training corpus and a testing corpus. The corpus is composed of Spanish news extracted from Mexican newspapers using web scraping, a technique used by software programs to extract information from specific websites. The headline or short description is extracted for each news item. The main aim is to keep the news that describe one of the six events: killing, violation, assault, suicide, kidnapping and sexual exploitation.
The training corpus is composed of 1600 Spanish news items for each criminal event, giving 9600 headlines of criminal news in total. It should be noted that one news item can contain one or more criminal events. This corpus is used to recognize verbal and nominal phrases in the first step; these phrases are then filtered and enriched to become the linguistic patterns used for extracting criminal events.
The testing corpus is composed of 500 Spanish news items for each criminal event, giving 3000 news items in total. These items are different from those in the training corpus. From the testing corpus, the computational approach based on the 122 patterns recognized 3120 events grouped into killing, violation, assault, suicide, kidnapping and sexual exploitation. The number of events is greater than the number of news items because one news item can contain one or more criminal events. These events are evaluated to determine the performance and quality of our approach in terms of the accuracy achieved in recognizing criminal events. Table 8 shows the distribution of events in each category.
Table 8. Criminal events to be evaluated
The sexual exploitation category has the smallest number of events because it is the crime least mentioned in digital news media.
The results of evaluating the criminal event recognition task on the testing corpus are presented in terms of precision, recall and F-measure. The list of recognized criminal events is reviewed to decide whether each one is a relevant event.
The evaluation uses the final list of criminal events recognized in the testing corpus to assess the accuracy of our approach based on the 122 patterns.
Our evaluation of criminal event recognition consists of comparing the events extracted by the proposed approach against a gold standard, which is constructed manually by human experts from the testing corpus. Each news item is given to three experts, who decide which criminal event or events it contains. A news item can contain more than one criminal event, so each event is tagged separately. In this evaluation, the accuracy of the event extraction task is expressed by the number of hits or errors with respect to the gold standard. The following well-known metrics, usually employed in Information Retrieval, are used to measure it.
The precision (P) is the ratio between the number of criminal events (killing, violation, assault, suicide, kidnapping and sexual exploitation) extracted by the proposed approach that are part of the gold standard, known as relevant extracted criminal events, and the total number of criminal events extracted by the proposed approach, as shown in Equation (1).
The recall (R) is computed using Equation (2), which expresses the ratio between the number of criminal events extracted by the proposed approach that are part of the gold standard, known as relevant extracted criminal events, and the number of criminal events in the gold standard.
The F-measure (F1) is computed using Equation (3), which is the harmonic mean of precision and recall. Therefore, it can be used as a global metric for evaluating each criminal event category.
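Following the prose definitions above, Equations (1) to (3) can be written as shown next. This is a reconstruction, not the original typesetting: the symbols E (the set of criminal events extracted by the proposed approach) and G (the set of criminal events in the gold standard) are notation assumed here.

```latex
\begin{align}
P   &= \frac{|E \cap G|}{|E|} \tag{1}\\
R   &= \frac{|E \cap G|}{|G|} \tag{2}\\
F_1 &= \frac{2 \cdot P \cdot R}{P + R} \tag{3}
\end{align}
```

Here $|E \cap G|$ is the number of relevant extracted criminal events, so precision and recall share the same numerator and differ only in the denominator.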
All experiments are executed using the 122 patterns over the testing corpus. We analyze the results by category of criminal event: killing, violation, assault, suicide, kidnapping and sexual exploitation. The results for each criminal event are then compared to determine the best results for the corpus. Table 9 presents the results obtained in criminal event extraction for the testing corpus. All categories are considered in order to compare their results.
The results shown in Table 9 indicate that an average precision of 0.719 is obtained over the six event categories. Although the results for recall are not as encouraging, the proposed approach demonstrates good overall effectiveness for recognizing criminal events in Spanish news, close to 64% overall. The best average effectiveness is achieved for the violation event, which reaches a value above 65% with consistent values in precision and recall. The worst results were obtained for the kidnapping event, because some phrases in this category were not extracted or are ambiguous. For example, the phrase “X was plagiarized // X fue plagiado” was not extracted by the human experts, and it has two senses (“retain a person by force” and “copy an idea”). Nonetheless, the proposed approach can support media analysts in the automatic recognition of criminal events, reducing the time invested in manual analysis of this information.
Table 9. Results of criminal event recognition
Conclusions and future work
This paper has presented an approach for criminal event extraction based on linguistically enriched patterns, which originate from verbal and nominal phrases. The proposed method consists of a training phase and a recognition phase. In the training phase, a set of patterns enriched with linguistic information is obtained using a semi-supervised learning approach: verbal and nominal phrases are automatically extracted, human experts manually filter them to get the final set of patterns and, finally, the patterns are enriched with linguistic information about lemmas and POS tags. In the recognition phase, a test corpus is used to extract criminal events using the extracted and enriched patterns. The result of the presented method is a knowledge base of criminal events, which can be very useful for applications such as data extraction and question answering systems.
It is important to note that this paper focuses on news in Spanish, a language for which few works have been proposed. Therefore, the verbal and nominal phrases converted into patterns are relevant, and they are evaluated for recognizing criminal events.
The main contributions of this paper are: a) a set of patterns linguistically enriched with morphological information and POS categories, focused on six criminal events: killing, violation, assault, suicide, kidnapping and sexual exploitation; b) a knowledge base of criminal events extracted from Spanish newspapers, considering their verbal and nominal features; c) an event-based evaluation of the linguistic patterns, which has outperformed the baseline.
An evaluation is carried out using the macro F1 measure, which considers the precision and recall of event extraction. The categories of criminal events used in the training and testing corpus are balanced. The evaluation shows promising results for event extraction in Spanish texts.
As future work, other categories of criminal events could be considered, such as discrimination, narcotics trafficking, extortion or fraud. Also, the obtained knowledge base can be enriched with information about the time and place of the events already extracted in this paper.
Finally, early detection of criminal events is highly important for preventing crime in society. Therefore, a system for predicting crimes using the set of patterns presented in this paper would be a research challenge to be addressed.
Acknowledgments
This work was partly supported by SEP-PRODEP. The authors would like to thank the Autonomous Metropolitan University Azcapotzalco and SNI-CONACyT.
