Abstract
Recently, the extraction of clinical events from unstructured medical texts has attracted much attention of the research community. Machine learning approaches are popular for this task, due to their ability to solve the problem of sequence tagging effectively. It has been suggested previously that simple features, such as word unigrams, part-of-speech tags, chunk tags, among others, are sufficient for this task. We show that more careful preprocessing and feature selection can significantly improve the results. We used conditional random field classifier with more linguistically oriented features and outperformed the current state-of-the-art approaches. We also show that the popular and much simpler Viterbi algorithm (hidden Markov model-based classification algorithm) can produce competitive results, when its parameters are tuned using specific optimization techniques. We evaluate these algorithms for the task of extraction of medical events from the corpus developed for SemEval shared Task 12: Clinical TempEval (Temporal Evaluation) 2016, namely, for its two subtasks: (i) event detection and (ii) event classification based on contextual modality.
Keywords
Introduction
In recent years, the analysis of clinical documents for decision making towards the improvement of patients’ treatment has gained attention of the research community. One of the important tasks of such analysis is to detect medical events in the texts. Several techniques such as machine learning, natural language processing, and medical entity recognition have been used in order to extract clinical knowledge and help physicians in decision making [13].
Studies related to clinical domain focus on medical entity recognition (MER), i.e., the process of extraction, detection and delimitation of phrasal information referring to medical entities from textual data, as well as semantic category identification of the detected entities [7, 32].
The motivation of this work is to consider more linguistically motivated features as compared to the state-of-the-art with the CRF classifier, which usually yields the best results. We also try optimization models, which do not require analytic expressions for improving the properties of a variant of hidden Markov models (Viterbi algorithm).
As the case study, we used the annotated corpus THYME [8] (Temporal Histories of Your Medical Events) provided for the SemEval Task 12: Clinical TempEval 2016, the main international evaluation forum in this field. We used supervised machine learning techniques to identify and classify clinical events in this corpus. In particular, our event classification is based on contextual modality, when an event can be any indicator, which helps in treatment and taking care of the patient or in detection of situations that are relevant on the clinical timeline, such as tumors, illnesses, procedures, symptoms, etc. [44].
The contribution of our study is two-fold. First, we show that with careful preprocessing and feature extraction and selection, our conditional random field (CRF) classifier outperforms the current state-of-the-art methods of SemEval 2016 using the Clinical TempEval 2016 corpus. Second, we show that with suitable parameter tuning, the popular and much simpler Viterbi algorithm, a variant of hidden Markov model, gives competitive results. For parameter tuning, we analyzed two different optimization techniques: differential evolution and hill climbing.
The paper is organized as follows. In Section 2, we give literature review on clinical event extraction. In Section 3, we describe our methodology for extracting clinical events using CRFs and the Viterbi algorithm; the latter one, with two optimization techniques: hill climbing and differential evolution. In Section 4, we present the experimental results obtained with these algorithms and compare them for the SemEval 2016 Task 12 using Clinical TempEval 2016 test corpus. Finally, Section 5 concludes the paper and presents directions of future work.
Related work
In this section, we provide a brief overview of the existing results on the SemEval Task 12: Clinical TempEval (Clinical Temporal Evaluation) 2016 corpus and other available corpora.
Sarath et al. [21, 46] have criticized both the dictionary-based and rules-based approaches for the lack of generalized applicability to situations not reflected in the knowledge base, since the available knowledge depends on the corpus used to extract it. Indeed, our analysis shows that the best results have been obtained with machine learning-based approaches.
Recently, machine learning approaches, such as conditional random field [1, 18], support vector machine [24, 47], and neural networks [3, 49] have demonstrated their ability to produce good prediction results and, thus, they have been widely used for different supervised tasks. These approaches require selection of features, such as word unigrams, syntactic and discourse-level features, as well as word representations, in combination with external resources. These features have been successfully used for several medical entity recognition tasks [24, 49].
Clinical TempEval
The Clinical TempEval 2016 shared task [8] is the main international competition on extracting temporal information from clinical notes. Approaches traditionally used for solving this task [46] are (i) dictionary-based approach, which can lead to the use of the context for detecting time expressions and words, using of ontologies such as MedLEE (a system for encoding clinical information in medical reports), MetaMap (a tool for detecting concepts in Unified Medical Language System), and CTake (a system with several dictionaries and machine learning approaches) [32]; (ii) rules-based approach to identify entities using heuristics [3, 16], and (iii) machine learning approaches, which provide powerful techniques for solving classification problems. Some of the results obtained in the Clinical TempEval 2016, as well as comparison with our own results, are given in Tables 5 and 6 below.
At Clinical TempEval 2016, the winning system UTHealth1, by Lee et al. [24] (see Tables 5 and 6), combined several features such as word unigrams (bag-of-words [38, 43]), syntactic and discourse-level features, and words representation in combination with dictionaries. It trained a hidden Markov model with support vector machine for identifying events and their types. They applied text preprocessing using several toolkits: Clamp for tokenization, OpenNLP for part-of-speech tagging and ClearNLP for dependency parsing. They reported an F-measure of 0.903 for event identification and 0.843 for detection of their contextual modality. Another their system, UTHealth2, that used slightly different features, showed the second best result, with F-measure of 0.895. Khalifa et al. [1] processed each section of the text separately. They extracted lexical and syntactic features such as lexical tokens, lemmas, part-of-speech tags, chunk labels, etc., using CTake, and word shape features such as lowercase, numeric, etc., using ClearTK. Their systems, UTAHBMI-CRF, based on CRF algorithm implemented in the CRF Suite toolkit [31], and UTAHBMI-SVM, based on SVM, both obtained an F-measure of 0.892 for detecting events, and 0.841 (UTAHBMI-CRF) and 0.836 (UTAHBMI-SVM) for identifying their types. Hansart et al. [18] in their system Cental-crf integrated statistical and linguistic approaches, combining CRF and linguistic knowledge for extracting temporal information. Specifically, they applied to clinical texts a state-of-the-art technique previously used for processing news in French, which yielded an F-measure of 0.885 for detecting events. They did not attempt to identify the types of the events. A complete description of other systems with worse performance can be found in [8].
Other corpora
López et al. [28] compared several supervised approaches, namely, expert-MLE, Naïve Bayes, Bayesian networks, Bayesian multivariate, logistic regression, neural networks, support vector machine, and random forest. They applied the Topaz system for detecting and extracting clinical concepts from medical records in the domain of influenza. This approach achieved the best performance when the Naïve Bayes algorithm was applied, with an Area Under the Curve (AUC) value of 0.930 for detecting the disease status. Liu et al. [27] trained CRF on token-level (such as bag-of-words, part-of-speech, affixes) and character-level (such as bag-of-characters, part-of-speech, section information) features for the Protected Health Information (PHI) identification. First, they used the system MedEx for tokenization and feature extraction; next, they used the CRF Suite [31] for classification. Their system achieved an F-measure result of 0.9051 using character-level features, followed by a token-level approach with an F-measure of 0.8370.
On the other hand, Li et al. [26] used CRF to identify and classify for biomedical named entity recognition task, based on several features, such as word, part-of-speech, chunk, boundary word, prefix/suffix and word shape. Moreover, they compared two label models (BIOEW and BIO), when BIOEW achieved the best performance with the F-score of 0.743.
Han-Cheol et al. [12] improved CRF by introducing multiple-segment representation for sequence labeling in the biomedical domain. First, preprocessing was used for tokenization. Then, features were extracted as tokens, lemmas, parts-of-speech, chunks and gazetteers. After this, their model was trained and tested. Furthermore, to verify the results of their algorithm, a re-sampling method was introduced.
Zhang et al. [19] provided an improvement to Hidden Markov model for a biomedical named entity recognition based on orthographic, morphological, part-of-speech and semantic trigger features. Their system used GENIA dataset and achieved the best result with the F-measure of 0.618.
In order to improve the performance of predictive models, optimization techniques have been used to facilitate the information extraction. Differential evolution is a stochastic method that allows maximization of the performance of the algorithms and reduces error rate [13]. Hill climbing is based on local search of the optimal solution. A difference between differential evolution and hill climbing is that the last works by making use of a step-size to find the best candidate solution [10]. An overview research on clinical domain using optimization techniques is presented as follows.
For evaluation of the automatic rule extraction from medical datasets, De Falco [13] introduced a novel classification tool, known as DEREx, which is based on differential evolution and rules for extracting items from clinical datasets. He presented a comparison by evaluating several supervised algorithms, where his system showed competitive results.
More recently, Kumar et al. [22] conducted a multi-objective differential evolution method for extraction of biomedical entities. Their system is subdivided into two sections: feature selection and ensemble learning. It was trained with a multi-objective differential evolution. The result was the F-measure of 0.7522, which outperformed other algorithms such as hidden Markov model, support vector machine, and CRF.
Methodology
In this study, two algorithms: CRF and Viterbi algorithm are trained, tested and compared. Moreover, two optimization techniques: differential evolution and hill climbing for tuning the Viterbi parameters are used.
These machine learning-based approaches are used for solving two tasks: (i) event detection and (ii) event classification. It is important to note that each experiment was performed separately. A brief description of the tasks considered in this study is provided below. Moreover, Fig. 1 illustrates the methodology used in this study and Fig. 2 fig:Heuristic shows the main procedure of optimization methods.

Our methodology.

Optimization of parameters for Viterbi algorithm.
Actual. It is used primarily to describe the events that have occurred or have been scheduled for the future. Hedged. The general idea of hedged is to identify all events with any sort of hedging (lexical or phrasal such as ‘seems’, ‘likely’, ‘possible’, ‘I suspect that...’, etc.). It should be noted that the events are strongly implied. However, they are not stated as fact by the physician due to lack of comprehensive evidence, safety, or liability. Generic. These are events that are related to situations, where a physician gives a justification for certain actions and decisions, medication administration, or covering his or her back. Hypothetical. The events in this category are based on assumptions.
The dataset used (previously reported in [8]) was collected under THYME project (Temporal Histories of Your Medical Events). This corpus consists of clinical and pathological notes of patients with colon cancer of Mayo clinic [8], which is a nonprofit organization committed to clinical practice, education and research. For a better interpretation, this annotated clinical corpus has 750 documents with approximately 211,347 tokens, where 18,989 tokens were detected as events. An overview of this corpus is presented in Table 1.
An overview of THYME corpus used in this study
An overview of THYME corpus used in this study

Detailed function description for training a Hidden Markov model.
In order to enhance the clinical corpus for feature extraction and the efficiency of overall rate classification, some preprocessing techniques were applied. A list of them is depicted as follows: Each document (clinical and pathological) was divided by sections (date, diseases, etc.) according to the information in the medical text. Also, the section names were replaced by a number tag with the goal of weighting all section names. Then, a NLTK package [9] (sent_tokenize) was used to divide the documents section by sentences. The words that are not correctly separated by a white space or combinations of words and numbers, etc. were separated by regular expressions with the aim of having correct tokens. We also used regular expressions to replace punctuation marks (.", etc.) and symbols (?,!,#, etc.) by a white space to take into account the position of these characters in the text. Sentences were divided into tokens using another NLTK package Tokenize. This package analyzes each word in a sentence. Typically, this is applied to find further part-of-speech tags. Furthermore, each sentence was separated into bigrams, which are added as additional features. This is useful because the combination of words can be also an event. We use the same sentence twice, first with words, then with bigrams. Note that we keep the same spans for words and bigrams (we take the span and POS tag of the first element).
Feature extraction
Per each token in the text, we used the following features: (i) wordform (the token as is); (ii) wordform in lowercase; (iii) lemma (we used the NLTK toolkit for lemmatization); (iv) part of speech (POS) of each token (we used the NLTK toolkit for tagging); (v) position in the chunk (event): Begin, Inside or Outside of an event, together with the part of speech of the token (such as Begin-Adj, Inside-NNP, etc.); (vi) type: numeric, word, or punctuation, and (vii) section identifier: the header section_id from the current section of the text.
Classifiers
Hidden Markov model: Viterbi algorithm
The Viterbi algorithm is distinguished for being a recursive optimal solution to find the best sequence of hidden states when a series of observations is given [4, 15, 48]. This method uses probabilistic operations of the Hidden Markov model for data training, where: N is the number of states, T is the number of observations, π is the initial probabilities, O is the sequence of observations, S is the sequence of states, A
ij
is the probability of transition from state
i
to state
j
, B
ij
is the probability of emission of an observation o
j
from state
i
.
Figure 3 shows the architecture of the hidden Markov model, and Fig. 4 gives an example of the Viterbi function when N = 2 and T = 3 using transition probability (A ij ) and emission probability (B ij ). This means P (state i |state j ) and P (observation j |state i ), respectively.

Simple example of three observations with two states using the Viterbi algorithm. The black arrows indicate the shortest path from a state i to state j for each observation.
Feature representation used in this work
In order to improve the results of the Viterbi method, differential evolution and hill climbing were applied over the Viterbi parameters for detecting and classifying events from cinical texts.
Differential evolution is a powerful algorithm for optimization, which was proposed by Storn and Price [50]. This method can be used specifically to solve problems that have several local minima or cost functions that are categorized such as non-differentiable, non-continuous, non-linear, or multi-dimensional [5, 40]. Moreover, differential evolution is more robust to detect the global optimization than the traditional gradient-based methods. The process of this algorithm is shown as follows. Initialization: The initial population of parents is formed by converting A
ij
and B
ij
into a vector called AB, and then, multiplying t random number f per AB vector, such that t represents the population size, A
ij
forms a k x k matrix that describes the transition probability between state s
i
and state s
j
, and B
ij
conforms a k x m matrix that indicates the observation probability o
j
of the state s
i
. Fitness function: According to the results of the Viterbi algorithm, the fitness function used in this study is the F-measure (F1) calculated using Equation (1):
Mutation: The purpose of mutation is to create a new parent, which might be better than the actual parents. For this reason, three random shuffle vectors from population were selected to create the mutant parent using Equation (2), multiplying a random number f by the difference of two random vectors: Crossover: This step is introduced to exchange information between the actual parent xi,j and mutant vector vi,j, as in Equation (3), where rand [0, 1]
j
represents a random number, and k describes a randomly chosen index to ensure the exchange of information [50]:
Stop condition: Differential evolution terminates when a stopping criteria such as the number of generations (q) or the error criteria are reached.

Hill climbing example: each element of vector is modified by adding o substracting random value f; it is evaluated by the Viterbi algorithm to find the best solution.
Hill climbing is a heuristic approach that finds an optimal local minimum [51]. Figure 5 shows the architecture of the hill climbing algorithm used in this study. The goal of this local search method is to improve a candidate solution by selecting a random variable iteration [10]. In this case the fitness function is computed by the Viterbi algorithm using F-measure metric. The hill climbing algorithm is described as follows: From the Viterbi parameters, the transition and emission probability matrices (A
ij
and B
ij
) are concatenated into a probability vector AB. The elements of the AB vector are randomly placed into an array called BA, in order to reduce the computation time. The algorithm takes the first element from the BA vector, adds a random value f and forms BA1. Simultaneously, it rests this vector element from f and produces BA2. The proposed methodology evaluates both produced vectors (BA1 and BA2) by using the Viterbi fitness function and chooses the vector that produces the least error. If BA1 or BA2 gives a least error, the algorithm continues adding or resting f random numbers to this element until the error stops decreasing. It continues with the rest of the vector elements. The hill climbing algorithm stops when all values in the vector BA are evaluated for finding the best optimal solution.
CRF is considered as a powerful algorithm for solving Natural Language Processing (NLP) tasks [23]. According to Agarwal [2], this algorithm has presented better results than the hidden Markov model due to independence assumption that can be relaxed.
We therefore trained and evaluated CRF model using CRF Suite toolkit [31], which is an implementation of CRF for labeling sequential data 1 .
Prediction results for the task of event detection
Prediction results for the task of event detection

Graphical representation of average F-measure of CRF, Viterbi algorithm with differential evolution, Viterbi algorithm with hill climbing, and Viterbi algorithm in task of event detection.
In this section, we describe our experimens on the THYME corpus using the CRF and Viterbi algorithm with two optimization techniques for tuning its parameters: differential evolution and hill climbing. We use three evaluation metrics: (i) precision, the percentage of retrieved items that are in fact labeled as “event”, (ii) recall, the percentage of items labeled as “event” that were retrieved, and (iii) F-measure, the harmonic mean of recall and precision. Usually, F-measure is used for evaluation of the classifiers’ performance in the tasks of detection and classification of events.
In this case study, the algorithms are trained using the subcorpora “train” and “dev”, and are evaluated by using the “test” subcorpus. We include as the baseline method the traditional Viterbi algorithm.
Throughout the empirical study, the differential evolution parameters are based on a population size of 150 individuals with 120 generations in both tasks (event detection and event classification).
Experimental results for event detection
Here, we present our results for the task of event detection. In general, the analysis consists of training and testing the algorithms: CRF, differential evolution and hill climbing for tuning the Viterbi parameters. Furthermore, these algorithms are compared with the baseline (the Viterbi algorithm). Table 3 shows the evaluation results of the proposed algorithms.
Evaluation results of the proposed algorithms for the task of event classification
Evaluation results of the proposed algorithms for the task of event classification

A comparison of average F-measure for the task of event classification achieved by CRF, Viterbi algorithm with differential evolution, Viterbi algorithm with hill climbing, and Viterbi algorithm.
As it is previously mentioned, the classifiers were trained using subcorpora “train” and “dev” (corresponding to training and development). Figure 6 shows that CRF achieved the F-measure of 0.999 when the model was trained and tested on the training corpus for verification of the overfitting. The baseline, the Viterbi algorithm, reported the worst result with an average of the F-measure of 0.762. The classification process using differential evolution and hill climbing provided the F-measure of 0.959 and 0.957, respectively.
Comparison of our results with those of SemEval: Task 12 Clinical TempEval 2016, subtask of event detection.
The rank is shown by F-measure. Some systems are omitted; see [8] for a complete list
Comparison of our results with those of SemEval: Task 12 Clinical TempEval 2016, subtask of event classification.
The rank is shown by accuracy. Some systems are omitted; see [8] for a complete list
When the model was evaluated on the “test” subcorpus (the proper evaluation), again, CRF achieved the best result with the average F-measure of 0.981, and outperformed the Viterbi algorithm with differential evolution and hill climbing, which had the F-measure of 0.928 and 0.907, respectively. The baseline, the Viterbi algorithm, had the F-measure of 0.7431.
In analysis of the performance of the proposed algorithms, it is clear that differential evolution and hill climbing obtained better results than the baseline method. Moreover, it is important to mention that all algorithms did not show overfitting, since the evaluation at the training and testing data has almost the same results for the task of event detection.
We tackled the second task (event classification) using CRF, the Viterbi algorithm tuned with differential evolution and hill climbing, and the baseline Viterbi algorithm. Table 4 shows the results in terms of precision, recall, and F-measure when the clinical corpus (training, dev and test) is used for training and testing.
As presented in Fig. 7 the average F-measure of CRF, Viterbi with differential evolution, and Viterbi with hill climbing were 0.985, 0.604, and 0.522, respectively. The baseline results using Viterbi algorithm without optimization technique was 0.482.
In our experiments with the test data, it has been observed that CRF is the best algorithm for classifying events, although the results demonstrated a low effectiveness of the classification with an average F-measure of 0.532. The second best algorithm is Viterbi and differential evolution with an average F-measure of 0.443. The third best algorithm is hill climbing where the average F-measure is 0.405. Finally, the baseline method, Viterbi, is the worst of the proposed algorithms with an average of 0.383 in F-measure.
As noticed, the performance of CRF on both training and test phases were 0.985 and 0.532, respectively. These results demonstrated that CRF algorithm had problems to generalize sufficiently to new test data in the classes: hypothetical, hedged and generic. Furthermore, we observed that although, the overall average of F-measure results are not quite high for the task of event classification, no occurred a significant difference between the results of differential evolution and hill climbing for training and test phases.
A comparison between this study and SemEval: Clinical TempEval 2016
As it was presented above, the best result in the tasks: (i) event detection and (ii) event classification was achieved by CRF, followed by Viterbi algorithm with differential evolution, and Viterbi algorithm with hill climbing. Finally, the worst algorithm was Viterbi when optimization techniques were not applied.
Based on the contributions of this study, Tables 5 and 6 show that CRF achieved a superior result according to the state-of-the-art in SemEval 2016: Task 12 Clinical TempEval 2016. The classification process was evaluated using SemEval software 2 . The results of the task of event detection are better than the task of event classification, since the classification problem when several classes are given is more difficult. Furthermore, it can be noted from the task of event classification that recall was always higher than precision, when the Viterbi algorithm was applied.
The results provided by the CRF are presented, as follows. CRF achieved the average F-measure of 0.965 for event detection, and the F-measure 0.885 for event classification. As we can see in both task, our proposal overcame the results reported for the winning approach [24], where their proposal presented the F-measure of 0.903 for the first task, and the F-measure of 0.855 for the second task.
Even though the best results were achieved by CRF, we see that Viterbi algorithms with differential evolution and with hill climbing demonstrated similar performance to the most common classifiers used in the state-of-the-art (SemEval Task 12: Clinical TempEval 2016) for extraction of clinical events. However, if the Viterbi algorithm is used without optimization techniques, the obtained results are very low as compared to the state-of-the-art. As we can see in Table 6, CRF achieved the F-measure of 0.885, whereas Viterbi algorithm with differential evolution and hill climbing got the F-measure of 0.598 and 0.478, respectively. These results are lower because the model did not include the robustness, weight assignment and data representation as it is used in conditional random fields, which optimizes the conditional probability of labels for an observation sequence [2].
Conclusions and future work
We have addressed two tasks of the SemEval shared Task 12: Clinical TempEval (Temporal Evaluation) 2016: (i) event detection and (ii) event classification. For evaluation, we used the corpora and the evaluation system officially provided by this competition.
We presented a method based on the CRF classifier applying the features described above (words, bigrams, POS tags, etc.). Our method outperformed all systems that participated in this competition on both tasks, thus showing the best currently known result.
For feature extraction, we used preprocessing techniques from natural language processing field. We extracted simple features such as word unigrams and part-of-speech tags, still outperforming more sophisticated systems that used the information extracted with the Unified Medical Language System (UMLS), Brown clustering, or CLAMP.
As a separate contribution, we studied the performance of another popular classification algorithm: the Viterbi algorithm, a variant of hidden Markov model. This algorithm is much simpler in implementation than CRF; however, it is known to produce results inferior to those of other classifiers such as CRF or SVM. However, we have shown that if its parameters are properly tuned, it produces competitive results. For parameter tuning, we tried two optimization techniques: differential evolution and hill climbing. While hill climbing is simpler, differential evolution gave slightly better results. These optimization techniques, however, are expensive in terms of processing time and, in case of differential evolution, memory.
As a future work, we expect to further refine our preprocessing technique and explore other types of features such as trigrams or syntactic n-grams [42]. As to the tuning of the Viterbi algorithm, we plan to explore different parameters of the optimization, such as pool size or crossover rate, as well as apply our tuning techniques to algorithms other than Viterbi. We also plan to try unsupervised pre-processing [36] and semi-supervised methods, such as fuzzy-based SVM [34], and deep-learning approaches with linguistic hierarchy and context modeling [29, 37].
Footnotes
Acknowledgments
This work was partially supported by the Mexican Government via the CONACYT project 240844, SNI, and Instituto Politécnico Nacional grants COFAA, SIP 20171813, SIP 20171344, and SIP 20172008). It was also partially funded by CONACYT under the Thematic Networks program (Language Technologies Thematic Network project 281795).
