Abstract
This work presents an experimental study on the task of Named Entity Recognition (NER) for a narrow domain in the Spanish language. The study considers two approaches commonly used for this kind of problem, namely a Conditional Random Fields (CRF) model and a Recurrent Neural Network (RNN). For the latter, we employed a bidirectional Long Short-Term Memory (Bi-LSTM) network with ELMo’s pre-trained word embeddings for Spanish. The comparison between the probabilistic model and the deep learning model was carried out on two collections: the Spanish dataset from CoNLL-2002, with four classes under the IOB tagging schema, and a Mexican Spanish news dataset, with seventeen classes under the IOBES schema. The paper presents an analysis of the scalability, robustness, and common errors of both models. This analysis indicates that, in general, the BiLSTM-ELMo model is more suitable than the CRF model when there is “enough” training data, and also that it is more scalable, as its performance was not significantly affected in the incremental experiments (adding one class at a time). On the other hand, the results indicate that the CRF model is more adequate for scenarios with small training datasets and many classes.
Introduction
The Named Entity Recognition (NER) task has been studied for the last two decades with the aim of extracting information from news, scientific documents, medical records, social media, and other domains in different languages, including Spanish. The term named entity was coined at the Message Understanding Conference-6 (MUC-6), which introduced the task of recognizing names of people, organizations and geographical locations, as well as time, currency and percentage expressions in texts [11]. Several approaches have been proposed since then; for example, Maximum Entropy Markov Models (MEMM) [19, 28] and Support Vector Machines (SVM) [2, 14] were very popular in the first years. Later on, in the context of CoNLL-2002, [3] introduced a binary Adaptive Boosting (AdaBoost) algorithm to extract named entities in Spanish and Dutch, and several participants applied Conditional Random Fields (CRF) with great success. More recently, most works have considered Recurrent Neural Network (RNN) approaches and have used pre-trained word embeddings to enhance the quality and generalization of data in the training process.
Despite the extensive work on NER, it has mainly been addressed in the English language, in the news domain, and in scenarios having large training datasets and a small number of classes to recognize. Most recent works have used probabilistic and deep learning approaches with competitive results, but there is no clear and detailed comparison of them that would help to determine the best option for handling NER in Spanish under a narrow domain scenario consisting of a small and imbalanced dataset. This work focuses precisely on this problem.
Mainly, this work introduces an experimental study of two NER approaches, one probabilistic and one based on deep learning techniques, on two Spanish datasets, CoNLL-2002 (refer to Table 1) and MX-News (refer to Table 3). For the probabilistic approach, we used a Conditional Random Field model, whereas, for the deep learning approach, we applied two different Recurrent Neural Network models: a Bidirectional Long Short-Term Memory, and a Bidirectional Long Short-Term Memory with ELMo embeddings. We selected these models because of the prominent results reported for them in the state of the art.
The contribution of this paper is twofold. First, it compares the performance of two state-of-the-art NER approaches on two Spanish datasets and under incremental/decremental class scenarios. Second, it provides an in-depth analysis of the robustness, scalability and types of errors of each model, aiming to provide a detailed characterization of them for their future application in Spanish-related tasks. It is important to mention that the source code and datasets used in this paper are available in the GitHub repository 1 .
The remainder of the paper is organized as follows. Section 2 describes some NER works using statistical and deep learning approaches. Then, Section 3 explains the CRF and Bi-LSTM models used. Section 4 reports the datasets and the experimental settings. Section 5 contains the experiments and results. Finally, the conclusions are presented in Section 6.
Related works
This section describes the two main current approaches for NER, one based on Conditional Random Fields (CRF approach from now on), and the other based on the use of recurrent and convolutional neural networks.
The CRF approach
This probabilistic model allows segmenting and labeling data sequences [15]. It resolves the label bias problem, which usually affects the performance of Maximum-Entropy Markov Models (MEMMs), and it also shows greater robustness than Hidden Markov Models. In [25], a CRF model is proposed to address the sequence labeling task, using shallow parsing features with POS tags and considering the IOB (Inside, Outside, Beginning) schema to label each word. Then, for the Fine-Grained Named Entity Recognition (FGNER) task, [17] used a CRF to detect the boundaries of named entities and a Maximum Entropy (ME) model to classify them. The experiments considered fifteen base entities and one hundred forty-seven fine-grained categories. In another FGNER work, [24] proposed distributed and parallelized feature extraction, as well as parameter estimation in CRF, to recognize six fine-grained geospatial concepts in German texts. On the other hand, the work of [22] reported a method that combines a logistic regression classifier and a CRF classifier to recognize food entities.
Moreover, the CRF approach has been used for recognizing medical entities in texts. For example, [1] proposed a hybrid method that combines semantic and statistical approaches under the IOB schema to label entities from two medical corpora. [7] presented a feature generation method to incorporate multiple segmentation representations, such as IOB, IOBES, and BIES, into a CRF model to perform the NER task in biomedical and general domain corpora. The best results were obtained with BIES and IOBES. Similarly, [8] used eight different classifiers (CRF among them) to extend segmentation representations (IOB and IOBES included) on the medical i2b2-2010 corpus. In this work, an extra entity is used to represent entity ambiguity. The best results were obtained with the CRF classifier and the IOBES schema.
Neural networks approaches
Several types of neural network architectures have been designed over the past years; among them is the Bidirectional Long Short-Term Memory (Bi-LSTM), which has achieved the best results in the NER task. Bi-LSTMs are based on Recurrent Neural Networks (RNNs) [23] and Long Short-Term Memories (LSTMs) [12]; they were introduced in [9] and were compared against other network architectures on the framewise phoneme classification task using the TIMIT corpus. Bi-LSTM networks usually outperformed unidirectional LSTMs and are faster to train than RNNs. They have been extensively used in speech recognition because they are able to store past and future context internally. The experiments reported in [10] on two speech datasets showed state-of-the-art results.
Related to the NER task, [13] described several experiments in sequence tagging tasks, such as the Penn TreeBank POS tagging, the CoNLL-2000 chunking, and the CoNLL-2003 named entity tagging. An interesting contribution of this work was the combination of a Bi-LSTM with an extra CRF layer. They also considered spelling features, context features, word embeddings, and gazetteer features. In the experiments, they evaluated different combinations of LSTM networks, but the best results were achieved by the proposed Bi-LSTM-CRF network, even outperforming other state-of-the-art approaches. Following this work, [18] proposed a new neural network architecture for sequence labeling. Basically, it introduced an end-to-end model requiring no task-specific resources, feature engineering, or data pre-processing; it only required word embeddings pre-trained on unlabeled data. The proposed model used Convolutional Neural Networks (CNNs) [16] to encode character and word level representations that were then fed to a Bi-LSTM-CRF network. In the experiments, they used four different types of word embeddings, reporting better results than previous works. In the same direction, [6] implemented a Bi-LSTM-CNN using a character-level representation on the CNN layer and word embeddings (Senna 2 , Glove 3 , Word2vec 4 ) on a Bi-LSTM network for the NER task. For the experiments on the CoNLL-2003 and OntoNotes 5.0 datasets, they considered a combination of word embeddings, capitalization features, and lexicon features.
Most works cited above have used large training datasets and have focused on recognizing a few types of named entities. Motivated by this fact, this work contributes by comparing state-of-the-art approaches, that is, Conditional Random Fields and Bidirectional Recurrent Neural Networks, when they are applied to more realistic scenarios, consisting of narrow domain data, small training sets and many unbalanced classes 5 .
Models used for the NER task
This section briefly explains the CRF and Bi-LSTM models used; their implementation details are described in Section 4.
CRF model
Conditional Random Fields are a sequence modeling framework introduced in [15]. They have all the advantages of MEMMs, but solve the label bias problem. The main difference of CRFs is that they use a single exponential model for the joint probability p(y|x) of the entire sequence of labels given the observation sequence. In a CRF, two random variables are defined: X = x_1, …, x_T, which is the variable over data sequences of observations (i.e., tokens) to be labeled, and Y = y_1, …, y_T, which is the variable over the corresponding label sequences (i.e., named entity tags) [15, 26]. Formally, a linear-chain CRF can then be defined as follows:
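In its usual textbook form, the linear-chain CRF over the variables X and Y above can be written as shown next. The feature functions f_k, their weights θ_k, the number of features K, and the normalizer Z(x) are the conventional symbols and are assumed here rather than taken from this text.

```latex
p(\mathbf{y} \mid \mathbf{x}) =
  \frac{1}{Z(\mathbf{x})}
  \prod_{t=1}^{T} \exp\left\{ \sum_{k=1}^{K} \theta_k \, f_k(y_t, y_{t-1}, \mathbf{x}_t) \right\},
\qquad
Z(\mathbf{x}) =
  \sum_{\mathbf{y}'} \prod_{t=1}^{T}
  \exp\left\{ \sum_{k=1}^{K} \theta_k \, f_k(y'_t, y'_{t-1}, \mathbf{x}_t) \right\}
```

Each f_k scores a pair of adjacent labels together with the observation at position t, and Z(x) sums over all possible label sequences so that the scores form a probability distribution.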
Figure 1 depicts an example of a CRF structure, where the sentence X is “López Obrador viaja a Puebla” (López Obrador travels to Puebla, in English), and the corresponding label sequence Y is “PER-B, PER-E, O, O, GPE-S”. The inputs and outputs are directly connected, as opposed to LSTM and bidirectional LSTM networks, where memory cells/recurrent components are employed [13].

A Long Short-Term Memory (LSTM) is a special kind of RNN architecture proposed by Hochreiter and Schmidhuber [12] and further studied in [9]. It consists of a set of recurrently connected blocks, each one containing one or more recurrently connected memory cells, with the ability to remove or add information to the cell state. It is regulated by three multiplicative units: the input, output and forget gates. Given an input sequence x = (x_1, …, x_T), a standard RNN computes the hidden vector sequence h = (h_1, …, h_T) and the output vector sequence y = (y_1, …, y_T) by iterating the following equations from t = 1 to T. In the NER task, x and y represent input features and tags respectively. In contrast, an LSTM contains three gates, which are functions of the current input x_t and the previous hidden state h_{t-1}: the input gate i_t, the forget gate f_t and the output gate o_t [10, 13]. Figure 2 illustrates a single LSTM cell, which is implemented as follows:
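A common way to write both sets of equations is sketched below. The weight matrices W, the biases b, the cell state c_t, and the hidden-layer activation \mathcal{H} are conventional names and are assumed here. The variant shown has no peephole connections, so each gate depends only on x_t and the previous hidden state h_{t-1}; the formulation in [10, 13] may include additional terms.

```latex
% Standard RNN, iterated from t = 1 to T
h_t = \mathcal{H}\left( W_{xh} x_t + W_{hh} h_{t-1} + b_h \right), \qquad
y_t = W_{hy} h_t + b_y

% LSTM cell (peephole-free variant, assumed)
i_t = \sigma\left( W_{xi} x_t + W_{hi} h_{t-1} + b_i \right) \\
f_t = \sigma\left( W_{xf} x_t + W_{hf} h_{t-1} + b_f \right) \\
c_t = f_t \odot c_{t-1} + i_t \odot \tanh\left( W_{xc} x_t + W_{hc} h_{t-1} + b_c \right) \\
o_t = \sigma\left( W_{xo} x_t + W_{ho} h_{t-1} + b_o \right) \\
h_t = o_t \odot \tanh\left( c_t \right)
```

Here σ is the logistic sigmoid and ⊙ denotes element-wise multiplication. The forget gate f_t controls how much of the previous cell state is kept, the input gate i_t controls how much new information is written, and the output gate o_t controls how much of the cell state is exposed as h_t.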

Long Short-term Memory Cell. Source [10].
Figure 3 shows a labeling example using a bidirectional LSTM network over the same sentence depicted in Figure 1. The Bi-LSTM network allows access to long-range context in both directions; that is, it can be trained using all available input information in the past (forward states) and future (backward states) of a specific time frame, unlike an LSTM network, which can only use the previous context. In Figure 3, filled boxes represent the LSTM cells (they are also called the hidden LSTM layer). The bidirectional LSTM connects two hidden layers to a single output layer, both the hidden forward sequence (

This section presents the datasets used, the hyperparameters of the models, and the evaluation measures.
Datasets
Two datasets were used in the experiments: the CoNLL-2002 Spanish corpus and the MX-News corpus. The CoNLL-2002 corpus contains four types of entities: organizations (ORG), persons (PER), locations (LOC), and miscellaneous (MISC). This corpus was tagged under the IOB schema, and it contains three partitions: TestA, TestB and Train. In our experiments we considered all these partitions as well as their union (referred to as Ensemble). Table 1 shows some statistics from this dataset. It is important to clarify that, for the RNN model, sentences longer than 50 tokens were divided into shorter sentences, and sentences shorter than 50 tokens were padded with the special “<pad> ” token.
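The fixed-length preparation for the RNN model can be sketched as follows. This is a minimal illustration, assuming that long sentences are cut into consecutive, non-overlapping chunks of 50 tokens; the exact splitting rule is not stated, and the function name `split_and_pad` is hypothetical.

```python
from typing import List

MAX_LEN = 50          # fixed sentence length stated in the text
PAD_TOKEN = "<pad>"   # padding token stated in the text


def split_and_pad(tokens: List[str], max_len: int = MAX_LEN,
                  pad: str = PAD_TOKEN) -> List[List[str]]:
    """Cut a sentence into chunks of max_len tokens and pad the last one.

    Assumption: chunks are consecutive and non-overlapping.
    """
    chunks = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
    if not chunks:            # an empty sentence still yields one padded row
        chunks = [[]]
    return [chunk + [pad] * (max_len - len(chunk)) for chunk in chunks]


short = split_and_pad("López Obrador viaja a Puebla".split())
long_ = split_and_pad([f"w{i}" for i in range(120)])
```

Under this assumption a 120-token sentence becomes three 50-token rows, which would explain why the RNN sentence counts differ from the original ones. The label sequences would need the same splitting and padding so that tokens and tags stay aligned.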
CoNLL-2002 dataset. Sentences * indicate the original number of sentences in each dataset, maintaining their original length and used by the CRF model. Sentences † indicate the number of sentences used by the RNN model, which correspond to 50-tokens length sentences. Tokens indicate the size of the vocabulary in each partition
The MX-News dataset was collected by the authors. It consists of 250 political news documents from Mexico, which were manually labeled with seventeen different types of entities as described in Table 2. The labeling is based on the IOBES tagging schema [28, 14], as it has shown better results than the traditional IOB schema [7, 8]. The boundaries of the named entities (NEs) are marked with the
Tags used for the annotation of entities in the MX-News corpus
To summarize, Figure 4 shows the class distribution of the CoNLL-2002 Spanish corpus under the IOB tagging schema. Similarly, Figure 5 shows the class distribution of the MX-News corpus under the IOBES tagging schema. As can be seen, both datasets present a high class imbalance, particularly the MX-News corpus.

CoNLL-2002 Corpus: Class distribution under the IOB schema.

Mexican News Corpus (MX-News): Class distribution under the IOBES schema.
The features used in the CRF model were: word suffixes, simplified POS tags, flags indicating the use of lowercase, uppercase and digits, marks for titles as well as for the beginning and end of sentences, and features of nearby words. The implementation of this model was done using the CRFsuite library 6 , with the following hyperparameters: algorithm = lbfgs (gradient descent using the L-BFGS method); c1 = 0.1 (coefficient for L1 regularization); c2 = 0.1 (coefficient for L2 regularization); max_iterations = 50 (the maximum number of iterations for the optimization algorithm); all_possible_transitions = True (CRFsuite generates transition features that associate all possible label pairs).
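A feature extractor along these lines could look like the sketch below. It is only an illustration: the suffix lengths (2 and 3), the "simplified" POS tag taken as a two-character prefix, the one-token context window, and the POS tags in the example are all assumptions, and the function names are hypothetical. The hyperparameter names are the ones listed above; the comment about passing them to a wrapper class is likewise an assumption about how the library was called.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag)

# Hyperparameters as listed in the text. With a scikit-learn style wrapper
# around CRFsuite they would typically be passed as keyword arguments
# (assumption: the exact wrapper used is not stated).
CRF_PARAMS = {
    "algorithm": "lbfgs",
    "c1": 0.1,
    "c2": 0.1,
    "max_iterations": 50,
    "all_possible_transitions": True,
}


def own_features(word: str, pos: str, prefix: str = "") -> Dict[str, object]:
    """Suffix, POS and casing features of one word (lengths are assumed)."""
    return {
        prefix + "suffix3": word[-3:],
        prefix + "suffix2": word[-2:],
        prefix + "pos_simple": pos[:2],   # assumed definition of "simplified"
        prefix + "is_lower": word.islower(),
        prefix + "is_upper": word.isupper(),
        prefix + "is_digit": word.isdigit(),
        prefix + "is_title": word.istitle(),
    }


def token_features(sent: List[Token], i: int) -> Dict[str, object]:
    """Features of token i plus those of its immediate neighbours."""
    feats = own_features(*sent[i])
    if i == 0:
        feats["BOS"] = True               # beginning-of-sentence mark
    else:
        feats.update(own_features(*sent[i - 1], prefix="-1:"))
    if i == len(sent) - 1:
        feats["EOS"] = True               # end-of-sentence mark
    else:
        feats.update(own_features(*sent[i + 1], prefix="+1:"))
    return feats


# Illustrative sentence; the POS tags are hypothetical.
example = [("López", "NP"), ("Obrador", "NP"), ("viaja", "VM"),
           ("a", "SP"), ("Puebla", "NP")]
X = [token_features(example, i) for i in range(len(example))]
```

Each sentence becomes a list of feature dictionaries, one per token, which is the input shape CRFsuite-style trainers generally expect.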
Implementation details of the Bi-LSTM model
To implement this model we used Keras 2.2.4 7 with TensorFlow 1.13.1 as backend. The implemented model consisted of four layers:
Input Layer. It receives the input sentences of the model. The sentence length was fixed to 50 words. Shorter sentences were filled with the “<pad> ” word and longer sentences were split into instances of 50 words.
Embedding Layer. Each word within the vocabulary was mapped to a vector of size 1024 using the ELMo embeddings [21]. The parameters of this layer were: input_dim = vocabulary size, output_dim = 1024, input_length = 50, weights = the ELMo embeddings matrix, and trainable = True.
Bidirectional LSTM Layer. This layer was fed by the embedding layer. The LSTM layer was set up with units = 200, dropout = 0.01, recurrent_dropout = 0.3 and return_sequences = True. The activation function was the hyperbolic tangent (tanh), and for the recurrent step it was the hard sigmoid function. A Bidirectional wrapper layer was used to learn high-level features in the forward and backward directions of the LSTM layer.
Output Layer. Time Distributed is a wrapper layer that applies the same dense layer (same weights) to the LSTM outputs, one time step at a time. In this way, the output layer only needs one connection to each LSTM unit (plus one bias). Its parameters were: units = the dimensionality of the output space (the number of tags), and activation = softmax.
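To give a sense of the relative size of these layers, the following sketch counts trainable weights from the settings above. It is only an arithmetic illustration under stated assumptions: the LSTM has no peephole connections, the forward and backward outputs are concatenated before the output layer, and the vocabulary size (10,000) and number of tags (9) are hypothetical values, not figures from the paper.

```python
from typing import Dict


def lstm_params(input_dim: int, units: int) -> int:
    """Weights of one LSTM direction.

    Three gates plus the cell candidate give four weight blocks, each with
    input weights, recurrent weights and a bias vector.
    """
    return 4 * (units * (input_dim + units) + units)


def model_params(vocab_size: int, n_tags: int,
                 emb_dim: int = 1024, units: int = 200) -> Dict[str, int]:
    """Per-layer weight counts for the four-layer model described above."""
    return {
        "embedding": vocab_size * emb_dim,
        "bilstm": 2 * lstm_params(emb_dim, units),   # forward + backward
        # Assumes concatenated directions: 2 * units inputs plus one bias
        # per tag, shared across all 50 time steps.
        "output": (2 * units + 1) * n_tags,
    }


# Hypothetical sizes, for illustration only.
sizes = model_params(vocab_size=10_000, n_tags=9)
```

With these illustrative values the embedding matrix dominates the model, while the time-distributed output layer stays small because its weights are shared across time steps.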
For training the model we employed the following configuration: the RMSprop algorithm as optimizer, with a learning rate of 0.001 and a learning rate decay of 0.0. For the evaluation stage we considered the following parameters: metrics = accuracy and loss = categorical_crossentropy. The remaining settings were batch_size = 50, epochs = 20, validation_split = 0.2 and shuffle = True.
In this work, we used ELMo embeddings [21] in Spanish. The embeddings were built from both corpora (CoNLL-2002 and MX-News), using the elmoformanylangs 0.0.2 Python Library based on [4], which built pre-trained ELMo representations for many languages. The embeddings length was 1024 for each token.
Evaluation
The comparison of the NER approaches was done based on two types of evaluation. The first type considered the individual tags for each word from the NEs 8 , in both tagging schemes, IOB and IOBES. Therefore, in the CoNLL-2002 corpus that has 4 NEs and 2 tags under the IOB schema (I and B), 8 (4x2) individual tags were evaluated. In contrast, in the MX-News corpus that has 17 NEs and 4 tags under the IOBES schema (I, B, E, and S), 65 individual tags were evaluated. We could not evaluate the 68 (17x4) tags because the corpus does not contain examples of all of them as shown in Figure 5. It is important to clarify that in both schemes, the tokens labeled with the letter “O” indicate that they are not named entities.
The second evaluation type considered the complete NEs, without taking into account the information from the IOB or IOBES schemes. That is, in this type of evaluation a named entity labeled as PER-B PER-I PER-E is simply taken as PER PER PER. In other words, a named entity is considered correctly recognized if all its individual tags correspond to the same class (PER in our example).
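The complete-NE criterion described above can be sketched as follows. This is a minimal illustration of the rule as worded in the text, not the authors' evaluation code; the helper names are hypothetical.

```python
from typing import List


def strip_schema(tag: str) -> str:
    """Drop the IOB/IOBES position marker: 'PER-B' -> 'PER', 'O' -> 'O'."""
    return tag if tag == "O" else tag.rsplit("-", 1)[0]


def entity_correct(gold: List[str], pred: List[str]) -> bool:
    """True if every predicted tag in the entity span has the gold class."""
    return len(gold) == len(pred) and all(
        strip_schema(g) == strip_schema(p) for g, p in zip(gold, pred)
    )


gold = ["PER-B", "PER-I", "PER-E"]
boundary_only = entity_correct(gold, ["PER-B", "PER-E", "PER-E"])  # True
class_error = entity_correct(gold, ["PER-B", "ORG-I", "PER-E"])    # False
```

A prediction that gets the class right but the position marker wrong counts as correct here, which is one reason complete-NE scores tend to be higher than individual-tag scores.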
For the two types of evaluation we used the same measures, the macro-average of precision, recall, and F1 9 . To be more precise, we first computed each measure for each one of the NEs and then we computed the average over all their types, as suggested in [20]. Furthermore, for the evaluation of individual tags, macro-average F1 is labeled as F1 i , while for the evaluation of complete NEs, the macro-average F1 is labeled as F1 c .
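The macro-averaging step can be illustrated with the following sketch. It assumes that the non-entity label "O" is excluded from the average and that macro F1 is the mean of the per-class F1 values; both are assumptions, and the toy label sequences are invented for illustration.

```python
from typing import Dict, List, Tuple


def macro_prf(gold: List[str], pred: List[str],
              ignore: Tuple[str, ...] = ("O",)) -> Dict[str, float]:
    """Compute per-class precision, recall and F1, then average over classes."""
    labels = sorted((set(gold) | set(pred)) - set(ignore))
    p_sum = r_sum = f_sum = 0.0
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        p_sum, r_sum, f_sum = p_sum + precision, r_sum + recall, f_sum + f1
    n = len(labels) or 1
    return {"precision": p_sum / n, "recall": r_sum / n, "f1": f_sum / n}


# Toy example (invented labels).
gold = ["PER", "PER", "O", "LOC", "ORG"]
pred = ["PER", "ORG", "O", "LOC", "LOC"]
scores = macro_prf(gold, pred)
```

Because every class contributes equally to the average, a sparse class that is never predicted correctly (ORG in the toy example) pulls the macro scores down as much as a frequent one would, which matters for imbalanced corpora such as MX-News.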
Experimental results
The main purpose of the experiments was to analyze the performance of the CRF, Bi-LSTM (BL), and Bi-LSTM-ELMo (BLE) models in the CoNLL-2002 and MX-News datasets.
General comparison of the models
This first experiment compares the general performance of the probabilistic and deep learning models in the two datasets, using their ensemble partitions which include all available data. Results from the three models are shown in Table 4. It shows the precision, recall and F1 for the two kinds of evaluation types, where index i is used to indicate the individual tags evaluation and index c is employed for the complete-NE evaluation.
MX-News dataset. Sentences * indicate the original number of sentences in each split, maintaining their original length and used by the CRF model. Sentences † indicate the number of sentences used by the RNN model, which correspond to 50-tokens length sentences. Tokens indicate the size of the vocabulary in each split
Results from the three models in the CoNLL-2002 (Cn) and MX-News (Mx) datasets. Best results are marked in bold
Table 4 shows that in general the complete-NE results outperform those from the individual-tags evaluation, because of the finer granularity of the latter. Furthermore, these results indicate that the Bi-LSTM-ELMo was the best model for the CoNLL-2002 dataset, regardless of the type of evaluation, whereas the CRF model obtained the best results for the MX-News dataset. These results allow us to formulate the following initial conclusions: for scenarios having large amounts of data and considering few NE classes, the Bi-LSTM-ELMo model is a suitable option, but for the opposite case, consisting of small training sets and many classes, the CRF is a better selection.
For this experiment, the MX-News dataset was reduced from seventeen to four classes (referred to hereafter as MX-News-4), and its annotation schema was changed from IOBES to IOB. Thus, both datasets had the same number of classes and used the same tagging schema. The MX-News-4 dataset preserved the PER, ORG and GPE 12 classes, while the rest were merged into the MISC class. Figure 6 shows the tag distributions in both datasets; for all classes except MISC, the CoNLL-2002 corpus has more data than the MX-News-4 corpus. However, for all classes both datasets show similar distributions of B and I labels.
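The reduction can be sketched as a per-tag mapping. This is a minimal illustration, assuming the usual IOBES-to-IOB conversion in which E becomes I and S becomes B; the class name DATE in the example is hypothetical, and the function name is invented.

```python
from typing import List

# Classes kept in the four-class version; GPE is renamed LOC to match CoNLL.
KEPT_CLASSES = {"PER": "PER", "ORG": "ORG", "GPE": "LOC"}

# Assumed IOBES -> IOB conversion of the position marker.
POSITION_MAP = {"B": "B", "I": "I", "E": "I", "S": "B"}


def to_four_class_iob(tag: str) -> str:
    """Map a 17-class IOBES tag such as 'GPE-S' to a 4-class IOB tag."""
    if tag == "O":
        return "O"
    cls, position = tag.rsplit("-", 1)
    return f"{KEPT_CLASSES.get(cls, 'MISC')}-{POSITION_MAP[position]}"


def convert_sentence(tags: List[str]) -> List[str]:
    return [to_four_class_iob(t) for t in tags]


converted = convert_sentence(["PER-B", "PER-E", "O", "O", "GPE-S"])
other = to_four_class_iob("DATE-S")   # hypothetical class name
```

For the example sentence used earlier, the tag sequence becomes PER-B, PER-I, O, O, LOC-B, and any class outside PER, ORG and GPE falls into MISC.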

Tags distributions in the CoNLL-2002 and MX-News-4 datasets.
Table 5 shows the results from this experiment, which allow us to compare the probabilistic and deep learning models on two collections sharing most features except the quantity of available training data. The obtained results confirm our previous observations; that is, the Bi-LSTM-ELMo is more effective than the other approaches when there is "enough" training data (e.g., it obtained the best results on the CoNLL corpus) and, on the other hand, the CRF model is very competitive, especially for scenarios having limited training data (as in the case of the MX-News-4 dataset). Once again the Bi-LSTM (BL) model showed the worst performance; we presume this is because this network does not use external word embeddings and, in consequence, has difficulty extracting general discriminative patterns.
Results from Table 5 also show that, in general, the complete-NE results outperform those from the individual-tags evaluation. They also show that the F1 c scores of the three models are higher for the MX-News-4 corpus than for the CoNLL-2002 dataset, which could be explained by the higher level of narrowness of the MX-News dataset; i.e., both datasets belong to the news domain, but the MX-News dataset exclusively contains political news.
Results from the three models in the CoNLL-2002 (Cn) and MX-News-4 (Mx) datasets
These experiments have the purpose of analyzing the robustness and scalability of the three models. To do that, we carried out two kinds of experiments: the first one evaluated the performance of the models on the different partitions of the two datasets, and the second one used several data subsets consisting of different numbers of classes and distributions. This second experiment was performed only on the MX-News dataset.
Table 6 shows the F1 results corresponding to the first experiment. Horizontal lines divide the results from the three models, namely, CRF, Bi-LSTM, and Bi-LSTM-ELMo. Inside each division, the first rows of results correspond to the complete-NE evaluation (c), whereas the second rows represent the individual tags evaluation (i). The results indicate that the Bi-LSTM-ELMo model is the most robust to the changes in the dataset, since its results show the lowest variations. In addition, all the models tend to be more stable or robust in the complete-NE evaluation than in the individual tags evaluation.
F1 scores in the different partitions of the CoNLL-2002 and MX-News-4 datasets
For the second experiment, we applied incremental/decremental procedures to generate different evaluation data subsets. Figure 7 illustrates these procedures, where C i indicates the class i, and classes are ordered and integrated from sparse to dense and vice versa. In the ascendant procedure, classes are aggregated from sparse to dense, one at a time. In contrast, in the descendant procedure, classes are included from dense to sparse. The experiments considered the application of the three models (CRF, Bi-LSTM, and Bi-LSTM-ELMo) over the ensemble partition of the MX-News dataset.
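The generation of these class subsets can be sketched as follows. The class names and counts are hypothetical, and the function name is invented. How tokens of classes outside a subset were treated (for example, relabeled as non-entities) is not specified in the text and is left out of the sketch.

```python
from typing import Dict, List


def incremental_subsets(class_counts: Dict[str, int],
                        ascending: bool = True) -> List[List[str]]:
    """Return the growing class subsets, adding one class at a time.

    ascending=True orders classes from sparse to dense (ascendant procedure);
    ascending=False orders them from dense to sparse (descendant procedure).
    """
    ordered = sorted(class_counts, key=lambda c: class_counts[c],
                     reverse=not ascending)
    return [ordered[:k] for k in range(1, len(ordered) + 1)]


# Hypothetical class frequencies, for illustration only.
counts = {"PER": 900, "ORG": 700, "GPE": 400, "TITLE": 50}
ascendant = incremental_subsets(counts, ascending=True)
descendant = incremental_subsets(counts, ascending=False)
```

Each procedure yields as many subsets as there are classes (seventeen for MX-News), and the last subset of both procedures contains all classes.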

Generation of data subsets for evaluating the robustness of models in the Spanish NER task.
The plots in Figure 8 show the performance of the three models in the incremental experimental setting, which consists of aggregating one class at a time, starting from the sparse to the dense classes (top plot) and vice versa (bottom plot). These plots include results from both types of evaluation: the individual-tags evaluations are represented by continuous lines, while the complete-NE evaluations are indicated by dotted lines. The numbers on the x-axis correspond to the number of classes considered in each experiment. The results indicate that the models are more sensitive, or less robust, to the number of examples from each class than to the number of classes. This is evident in the first part of the plots, where the results from the descendant procedure (which first considers dense classes) are much better than the results from the ascendant procedure (which first considers sparse classes). Moreover, in the descendant procedure, aggregating more classes (moving from 1 to 17 classes) did not cause a significant loss in effectiveness. It is also important to notice that the CRF model obtained the best results in both incremental procedures, which confirms that it is the most robust model for NER in scenarios with many classes and few examples per class.

F1 i scores of the three models in the different data subsets from the MX-News corpus.
This section focuses on the analysis of the number and kind of errors generated by the three models, with the aim of determining how complementary or redundant they are. Figures 9 and 10 describe the main kinds of errors generated by the three models on the CoNLL-2002 and MX-News-4 collections. The bottom part of these figures indicates the correct classes of the entities, and just above them are the most frequent erroneous classes predicted by each of the models. The heights of the bars were determined from the confusion matrices; the class “O” was included to show errors related to labeling a named entity as a non-entity or a common word as a named entity.
Figures 9 and 10 show some interesting patterns. First, the errors of the three models differed slightly between the two datasets, which suggests that they are explained mainly by the nature of the data rather than by the models themselves. For example, LOC entities were usually confused with ORG entities in the CoNLL-2002 corpus, but they were confused with PER and MISC entities in the MX-News-4 dataset. Second, the three models show different errors in general, but they share two of them: they tended to classify words not belonging to any named entity (labeled as O) as miscellaneous entities (MISC) and vice versa.

Errors in Models on CoNLL-2002 dataset.

Errors in Models on MX-News-4 dataset.
Finally, Tables 7, 8 and 9 list the most frequent mistakes, generated by any of the models, in the CoNLL-2002, MX-News-4 and MX-News-17 datasets respectively. In the heading of the tables,
Top mistakenly NE scores recognized for CoNLL-2002
Top mistakenly NE scores recognized for MX-News-4
Top mistakenly NE scores recognized for MX-News-17
This work presented an experimental study on the task of Named Entity Recognition. This study aimed at comparing probabilistic and deep learning models for recognizing named entities in a narrow domain scenario in the Spanish language. For the experiments, three different models were considered, one Conditional Random Field model and two Bidirectional Long Short-Term Memory networks, and two different collections were used: the CoNLL-2002 corpus, with four classes labeled under the IOB tagging schema, and the MX-News corpus, with seventeen classes labeled under the IOBES schema.
The obtained results allowed us to formulate the following conclusions: (i) the Bi-LSTM-ELMo model is a better option than the CRF model for scenarios having a small number of classes and a large number of training examples, whereas the CRF model turns out to be a better option for the opposite cases; (ii) both kinds of models, probabilistic and deep learning based, were robust to variations in the datasets when the evaluation was done at the NE level, but were not equally stable when the evaluation was done at the word level (i.e., individual tags); (iii) the number of training instances per class affects the effectiveness of the models more than the number of classes (types of NEs to be recognized), with the CRF model being the most robust to difficult conditions as well as the most scalable to a large number of classes; (iv) the number and kind of errors from the three models are considerably different, and therefore they tend to be complementary to each other, especially the CRF and the Bi-LSTM-ELMo models.
As future work, we plan to extend this study by considering other neural network models, such as CNNs and hybrid models (CNN + Bi-LSTM + CRF). We also plan to consider fine-grained named entities in other languages and domains, with the aim of verifying the performance of the models in different scenarios.
Footnotes
For example, the name of a person such as “Andrés Manuel López Obrador” has the tags PER-B, PER-I, PER-I, PER-I in the IOB schema.
For the first evaluation type we used the measures from the scikit-learn 0.20.3 Python library metrics, whereas for the second type of evaluation we used the seqeval 0.0.10 Python library.
It was renamed as LOC, to use the same labels as the CoNLL corpus.
