Medical events extraction to analyze clinical records with conditional random fields

Abstract

The rapid growth in the extraction of clinical events from unstructured clinical records has raised considerable challenges. In this paper, we propose the use of different features with a statistical modeling method called conditional random fields (CRFs), which is considered an effective algorithm for sequence tagging problems. Our goal is to determine how feature selection affects performance on four subtasks presented in SemEval Task-12: Clinical TempEval 2016. After careful preprocessing, the proposed method was tested on real clinical records from Task-12: Clinical TempEval 2016. The comparative analysis indicates that our proposal achieves good results compared to the work presented in the Task-12: Clinical TempEval 2016 challenge.

Keywords

Clinical reports, medical information extraction, natural language processing, machine learning, feature selection, conditional random fields

1 Introduction

An important study in the clinical domain is event detection for decision making towards the improvement of the quality of life of patients. The SemEval (Semantic Evaluation) Task-12: Clinical TempEval (Temporal Evaluation) challenge presents nine subtasks based on time expression identification, event expression identification, and temporal relation identification. Our study focused on event expression identification: detecting an event and classifying it by its contextual modality, polarity, and type. We recommend reading [6] for a complete description of the tasks presented in Clinical TempEval. The two contributions of this paper are: (1) a presentation of the state of the art of Clinical TempEval, and (2) a demonstration of how our system outperforms the winning approach in the Clinical TempEval 2016 challenge on the polarity and type subtasks. The paper is organized as follows. Section 2 introduces the state of the art on event expression identification. Section 3 presents the process of conducting the analysis using CRFs with several features. Section 4 provides the experimental results to assess the effectiveness of our study, including a comparison between our results and those of Clinical TempEval 2016. Lastly, Section 5 presents our conclusions and the proposed future work.

2 Related work

The coverage of related work is divided into three parts. The first part is based on literature related to the Workshop on Clinical Temporal Evaluation (SemEval) 2016 [6], where participants used a corpus with information from patients with colon cancer. The second part is based on SemEval: Clinical TempEval 2017 [7], where the corpus has additional medical reports that include brain cancer. The last part is devoted to related work on event detection using other available corpora.

2.1 SemEval 2016: Task 12-Clinical TempEval

Task 12 of SemEval (Semantic Evaluation) 2016, Clinical TempEval, focuses on extracting temporal information from the medical records of real patients. Lee et al. [17] presented an approach based on supervised learning for the SemEval tasks. Their approach used several kinds of features: lexical, syntactic, discourse level, word representation, and external resources. Moreover, they used three toolkits for text preprocessing: CLAMP (tokenization), OpenNLP (part-of-speech tagging), and ClearNLP (dependency parsing). The system was trained using a hidden Markov model with support vector machines and ranked first with an F-measure of 0.903 for event identification, 0.855 for contextual modality, 0.887 for polarity, and 0.882 for type. Another approach, reported by the same authors, provided the second-best performance with F-measures of 0.895, 0.847, 0.880, and 0.871 for event detection, contextual modality, polarity and type, respectively. Khalifa et al. [3] introduced two systems, one using CRFs based on the CRFsuite toolkit [22] and one using support vector machines. In their systems, they used lexical features (tokens, lemmas, part-of-speech tags, chunk labels, etc.) extracted with the cTAKES toolkit, along with word shape features (lowercase, numeric, etc.) extracted with ClearTK. Their systems took third and fourth place: UTAHBMI-CRF obtained an F-measure of 0.892 on event detection, 0.841 on contextual modality, 0.876 on polarity, and 0.866 on type, while UTAHBMI-SVM achieved 0.892, 0.836, 0.874, and 0.849 for event detection, contextual modality, polarity and type, respectively. Hansart et al. [14] participated only in the event detection subtask by integrating techniques previously used for preprocessing news in French, along with CRFs and statistical and linguistic approaches. Their system achieved an F-measure of 0.885, which is competitive with the state of the art. A complete list of works presented in SemEval 2016: Task 12-Clinical TempEval can be found in [6].

2.2 SemEval 2017: Task 12-Clinical TempEval

In addition to reviewing the related work presented in the 2016 Task 12-Clinical TempEval, we provide a brief overview of the 2017 competition, where the committee provided more clinical records than in the 2016 competition. Tourille et al. [25] provided two approaches based on supervised and unsupervised domain adaptation. The first approach is based on recurrent neural networks and features such as character and word embeddings. This approach achieved the best performance with an F-measure of 0.72 for event detection, 0.64 for contextual modality, 0.69 for polarity, and 0.70 for type. The second approach used support vector machines along with features such as words and part-of-speech tags. It achieved an F-measure of 0.76 for event detection, 0.69 for contextual modality, 0.75 for polarity and 0.75 for type. MacAvaney et al. [21] trained conditional random fields, decision tree ensembles and rules along with features such as n-grams, words, word shapes, word clusters, word embeddings, part-of-speech tags, syntactic and dependency tree features, semantic roles, and UMLS concept types. They also evaluated their results under two varieties of domain adaptation: supervised and unsupervised. The first setting achieved an F-measure of 0.71, 0.56, 0.65 and 0.68 for event detection, contextual modality, polarity, and type, respectively. The second setting achieved good performance with an F-measure of 0.74 for event detection, 0.66 for contextual modality, 0.58 for polarity, and 0.72 for type.

Leeuwenberg and Moens [18] applied support vector machines to linguistic and word-form features. The system achieved good performance, with F-measures of 0.68, 0.62, 0.67, and 0.66 for event detection, contextual modality, polarity and type, respectively.

2.3 Other corpora

This section describes related work using different corpora to solve tasks that involve clinical events. Huesch et al. [15] evaluated chest imaging using free-text reports and natural language processing for pulmonary embolism. They applied commercial text mining and predictive analytics software to preprocess the clinical text. The approach achieved good accuracy relative to related work. Ben et al. [2] proposed a system using conditional random fields and linguistic features for drug name recognition and classification. Tokenization, part-of-speech tagging and sentence parsing were performed as preprocessing steps with the Stanford parser. The proposal outperformed all teams that participated in the DDI Detection and Classification task at SemEval 2013, with F-measures of 0.80 and 0.658 for detection and classification, respectively. Lauren et al. [16] trained a kernel-based extreme learning machine (ELM) with multiple discriminant analysis on skip-gram and paragraph vector-distributed bag of words. The experiments showed that the ELM performed well when compared with support vector machines (SVM) and the multilayer perceptron (MLP). Forsyth et al. [13] described conditional random field methods for extracting information about breast cancer symptoms from electronic health records. The system achieved an F-measure of 0.81. Kuteesa et al. [9] presented a survey of machine learning using HIV clinical records.

3 Methodology

3.1 Problem definition

In order to tackle this research, four tasks based on SemEval [6] were evaluated. We used the annotated corpus, where the events and their classification can be found, and each task was analyzed separately. A description of the classification is briefly outlined below.

Event detection. Given a word, the goal of this task is to decide whether it is an event or not. Examples of events include tumors, illnesses, procedures, and symptoms.

Contextual modality. Given a word, the goal of this task is to identify, based on its contextual modality, whether it is an event (actual, hedged, generic or hypothetical) or not an event.

“actual” indicates an event that has already happened or that is scheduled for the future.

“hedged” represents events that are strongly implied but are not stated as fact due to the lack of comprehensive evidence, safety, or liability.

“generic” consists of events that are related to situations where a physician gives a justification for certain actions and decisions.

“hypothetical” represents events which are based on assumptions.

Polarity. Given a word, this task is concerned with determining the polarity of an event, which is either positive or negative.

Type. Given a word, the goal is to determine the type of aspectual information. The type fits into the following categories: no event, n/a, aspectual and evidential.

“n/a” represents the default value of an event.

“aspectual” emphasizes the possibility of an event continuing or reappearing in the future.

“evidential” provides information based on demonstration, evidence, confirmation or revelation, which is related to tests, images, and human observation.
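The label inventories described above can be written down as a small data structure. The following is only an illustrative sketch: the dictionary layout, the function name `token_labels` and the shape of its `annotation` argument are assumptions made for this example, and the short polarity labels `pos` and `neg` follow the class names used in the result tables.

```python
# Label inventories for the four subtasks, as described in the text.
# "no event" marks tokens that are not part of any annotated event.
LABELS = {
    "event_detection": ["no event", "event"],
    "contextual_modality": ["no event", "actual", "hedged", "generic", "hypothetical"],
    "polarity": ["no event", "pos", "neg"],
    "type": ["no event", "n/a", "aspectual", "evidential"],
}


def token_labels(annotation):
    """Return one label per subtask for a single token.

    `annotation` is a hypothetical dict with the keys "modality",
    "polarity" and "type", or None when the token is not an event.
    """
    if annotation is None:
        return {task: "no event" for task in LABELS}
    labels = {
        "event_detection": "event",
        "contextual_modality": annotation["modality"].lower(),
        "polarity": annotation["polarity"].lower(),
        "type": annotation["type"].lower(),
    }
    # Reject values that are outside the inventories listed above.
    for task, value in labels.items():
        if value not in LABELS[task]:
            raise ValueError(f"unexpected label {value!r} for {task}")
    return labels


print(token_labels({"modality": "ACTUAL", "polarity": "POS", "type": "N/A"}))
```

Treating each subtask as its own label set mirrors the statement that each task was analyzed separately.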

3.2 Data

The Task 12 (Clinical TempEval) challenge for extracting temporal information from Mayo Clinic medical reports provided a corpus of 750 full-text documents, divided into clinical and pathological notes of patients with colon cancer [6]. Moreover, an XML file was provided for each document, from which we can extract the span, contextual modality, polarity and type of each event.
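As an illustration of how span, contextual modality, polarity and type might be read from such an annotation file, the sketch below parses a small invented XML snippet with Python's standard library. The element names (`entity`, `span`, `type`, `properties`, `ContextualModality`, `Polarity`, `Type`) are assumptions modeled on an Anafora-style layout and may differ from the files distributed with the corpus; the sample record is made up and is not taken from the clinical data.

```python
import xml.etree.ElementTree as ET

# Invented example; not taken from the clinical corpus.
SAMPLE_XML = """<data>
  <annotations>
    <entity>
      <id>1@e@doc001@gold</id>
      <span>10,15</span>
      <type>EVENT</type>
      <properties>
        <ContextualModality>ACTUAL</ContextualModality>
        <Polarity>POS</Polarity>
        <Type>N/A</Type>
      </properties>
    </entity>
  </annotations>
</data>"""


def read_events(xml_text):
    """Collect the character span and the three attributes of every EVENT entity."""
    root = ET.fromstring(xml_text)
    events = []
    for entity in root.iter("entity"):
        if entity.findtext("type") != "EVENT":
            continue  # skip other annotation types, e.g. time expressions
        start, end = (int(value) for value in entity.findtext("span").split(","))
        properties = entity.find("properties")
        events.append({
            "span": (start, end),
            "modality": properties.findtext("ContextualModality"),
            "polarity": properties.findtext("Polarity"),
            "type": properties.findtext("Type"),
        })
    return events


print(read_events(SAMPLE_XML))
```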

3.3 Data cleaning and preprocessing

Before applying conditional random fields, we applied some preprocessing and cleaning steps for a better understanding of the medical texts. First, the texts were converted into XML format, where each individual patient record was divided into sections. The section names were replaced by a number tag to avoid information loss and to weigh all section names. Furthermore, the exact position of each section within the clinical text was included. Figure 1 shows an example of the XML format used to parse each document. (This example was not taken from the original clinical records.)

Fig.1

Clinical record into XML format.

Second, we parsed the XML of each clinical document for the preprocessing step. The sections were divided into sentences using sent_tokenize from NLTK (Natural Language Toolkit) [8]. After that, each sentence was split into words using tokenize from NLTK [8], together with regular expressions to handle words that were not separated correctly due to missing white space and to replace punctuation marks and symbols with white space, the aim being to preserve the character positions in the text for later analysis. Finally, following the domain expert, an event can also be a word combination; thus, the data preparation process includes bigrams. Notice that we use each sentence twice: first to extract tokens, then to extract bigrams. (The previous study [12] used only the part-of-speech of the first word; in this study, we used the part-of-speech of each word.) A brief overview of the corpus after data cleaning and preprocessing is presented in Table 1.
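The idea of replacing punctuation with white space so that character positions are preserved, and of passing over a sentence a second time to build bigrams, can be sketched as follows. This is a simplified illustration that uses only regular expressions from the standard library instead of the NLTK tokenizers used in the study; the function names and the underscore used to join bigrams (as in the illustrative feature table) are assumptions of this sketch.

```python
import re


def blank_punctuation(text):
    """Replace every punctuation mark or symbol with one space.

    Each character is replaced by exactly one character, so the
    offsets of the remaining words are unchanged.
    """
    return re.sub(r"[^\w\s]", " ", text)


def tokens_with_offsets(text):
    """Return (token, start, end) triples with offsets into the original text."""
    cleaned = blank_punctuation(text)
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", cleaned)]


def bigrams(tokens):
    """Second pass: join adjacent tokens and keep the span covering both."""
    return [
        (f"{first}_{second}", first_start, second_end)
        for (first, first_start, _), (second, _, second_end) in zip(tokens, tokens[1:])
    ]


sentence = "She has a tumor."
unigrams = tokens_with_offsets(sentence)
print(unigrams)
print(bigrams(unigrams))
```

Because the cleaned text has the same length as the original, a predicted token can be mapped back to a character span in the source document without any extra alignment step.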

Table 1

An overview of THYME corpus used in this study

Dataset	Record	Token	Events
Training + Dev	597	202,803	17,741
Testing	153	221,547	18,946

3.4 Features

Several kinds of features were used in this study: linguistic (lemma and part-of-speech), word-forms (word and lowercase), discourse level (word size, type and section information), lexical (prefix and suffix), external resources (UMLS-Unified Medical Language System [1]), and word representation (word embeddings). Table 2 presents the feature representation used in this study. We applied two model representations: bag-of-words and bigrams. (The example in Table 2 is used for illustrative purposes only and was not taken from the clinical corpus.)

Table 2
Feature representation used in this work

Token position	Word_form	Lowercase	Lemma	POS	word_size	Type	Section_id	prefix	suffix	UMLS
1	She	she	She	PRP	3	WordToken	100	N/A	N/A	N/A
2	has	has	ha	VBZ	3	WordToken	100	N/A	N/A	N/A
3	a	a	a	DT	1	WordToken	100	N/A	N/A	N/A
4	tumor	tumor	tumor	NN	5	WordToken	20107	N/A	N/A	e0321587
1(2nd pass)	She_has	she_has	She_has	PRP_VBZ	7	WordToken	100	N/A	N/A	N/A
...	...	...	...	...	...	...	...	...	...	...

3.4.1 Combination of features

In the experimental validation results, we trained and tested conditional random fields using different sets of features in order to validate and compare the influence of features in clinical records.

LWD: Linguistic (lemma, and part-of-speech), word-forms (word and lowercase), and discourse level (word size, type and section information).

LWDLE: Linguistic (lemma, and part-of-speech), word-forms (word and lowercase), discourse level (word size, type and section information), lexical (prefix and suffix), and external resources (UMLS-Unified Medical Language System [1]).

WR: Word representation (word embeddings) with a dimension of 300.
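To make the difference between the LWD and LWDLE combinations concrete, the sketch below builds one feature dictionary per token and then keeps only the keys belonging to the requested combination. It is a hypothetical illustration: the function names, the key names and the affix length of three characters are assumptions, since the text does not state how prefixes and suffixes were computed.

```python
# Keys belonging to each feature combination described above.
LWD_KEYS = ("lemma", "pos", "word", "lower", "word_size", "type", "section_id")
LWDLE_KEYS = LWD_KEYS + ("prefix", "suffix", "umls")


def all_features(word, lemma, pos, token_type, section_id, umls="N/A", affix_len=3):
    """Build every hand-crafted feature for one token.

    `affix_len` is an assumed value; the text does not give the affix length.
    """
    return {
        "lemma": lemma,                # linguistic
        "pos": pos,                    # linguistic
        "word": word,                  # word-form
        "lower": word.lower(),         # word-form
        "word_size": len(word),        # discourse level
        "type": token_type,            # discourse level
        "section_id": section_id,      # discourse level
        "prefix": word[:affix_len],    # lexical
        "suffix": word[-affix_len:],   # lexical
        "umls": umls,                  # external resource
    }


def select(features, combination):
    """Keep only the features of the requested combination ("LWD" or "LWDLE")."""
    keys = {"LWD": LWD_KEYS, "LWDLE": LWDLE_KEYS}[combination]
    return {key: features[key] for key in keys}


tumor = all_features("tumor", "tumor", "NN", "WordToken", 20107, umls="e0321587")
print(select(tumor, "LWD"))
print(select(tumor, "LWDLE"))
```

Keeping the full feature dictionary and selecting a subset afterwards makes it easy to train the same model on each combination and compare the results.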

3.4.2 Word embedding

Word embeddings are a popular approach for capturing meaningful syntactic and semantic properties of words using real-valued vectors of configurable dimension. Word2vec is the most popular tool for word embedding; it was introduced by Tomas Mikolov in order to provide semantic word embeddings and is based on unsupervised learning from a selected training corpus [5, 20, 24]. In this experimental layout, we trained word2vec on clinical records from SemEval 2016 [6].

3.5 CRFs classifier

CRFs are efficient supervised classifiers that are widely used for solving natural language processing (NLP) tasks [12], mainly named entity recognition. CRFs are trained by maximizing the conditional probability P(y|x) of a sequence of labels, y = y₁, y₂, . . . , y_n, given an observed sequence of tokens, x = x₁, x₂, . . . , x_n. The time complexity of the training process [26] is shown in Equation (1): $O(m N T Q^{2} n S)$ (1), where m is the number of training iterations, N is the number of training data sequences, T is the average length of training sequences, Q is the number of class labels, n is the number of features, and S is the search time of the optimization algorithm.
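For reference, the conditional probability that training maximizes is usually written, for a linear-chain CRF, in the form below. This is the standard textbook formulation rather than a formula given in this paper: the feature functions f_k, the weights λ_k and the normalizer Z(x) are notation introduced here, n is the sequence length as in the definitions of y and x above, and y₀ denotes a fixed start label.

```latex
P(y \mid x) = \frac{1}{Z(x)}
  \exp\left( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

Training chooses the weights λ_k that maximize the (usually regularized) log-likelihood of the labeled training sequences, and tagging selects the label sequence y with the highest P(y|x), typically with the Viterbi algorithm.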

We experimented with conditional random fields using CRFsuite [22] for event detection, and Scikit-learn [23] for contextual modality, polarity and type. Table 3 shows a simple example of the data format used by CRFsuite for training and tagging.

Table 3

A simple example format for CRFsuite

NO	w [0] = She	w [1] = has	pos [0] = PRP	pos [1] = VBZ	BOS
NO	w [-1] = She	w [0] = has	w [1] = a	pos [-1] = PRP	pos [0] = VBZ	pos [1] = DT
NO	w [-1] = has	w [0] = a	w [1] = tumor	pos [-1] = VBZ	pos [0] = DT	pos [1] = NN
EVENT	w [-1] = a	w [0] = tumor	pos [-1] = DT	pos [0] = NN	EOS
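The rows of Table 3 can be generated mechanically from a tagged sentence. The sketch below is one possible way to do so and is not the authors' code: the function name and the `window` parameter are assumptions, and the attributes are written in the compact form `w[0]=She`, separated by tabs, with `BOS` and `EOS` marking the sentence boundaries as in the table.

```python
def crfsuite_lines(words, pos_tags, labels, window=1):
    """Format one sentence as tab-separated lines: label first, then attributes."""
    lines = []
    last = len(words) - 1
    for i, label in enumerate(labels):
        fields = [label]
        # Word and part-of-speech attributes for positions i-window .. i+window.
        for name, sequence in (("w", words), ("pos", pos_tags)):
            for offset in range(-window, window + 1):
                j = i + offset
                if 0 <= j <= last:  # drop attributes that fall outside the sentence
                    fields.append(f"{name}[{offset}]={sequence[j]}")
        if i == 0:
            fields.append("BOS")  # beginning of sentence
        if i == last:
            fields.append("EOS")  # end of sentence
        lines.append("\t".join(fields))
    return lines


words = ["She", "has", "a", "tumor"]
pos_tags = ["PRP", "VBZ", "DT", "NN"]
labels = ["NO", "NO", "NO", "EVENT"]
for line in crfsuite_lines(words, pos_tags, labels):
    print(line)
```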

4 Experimental results

We present experimental validation in this section and discuss the conditional random fields model, which was trained using the “train” and “dev” subcorpora and evaluated on the “test” subcorpus. Experiments were performed using several types of features (LWD, LWDLE, and WR). Moreover, the best results were compared with the systems participating in the SemEval challenge [6].

4.1 Subtask 1: Event detection

Figure 2 shows that, on the test corpus, the LWDLE features achieved the best performance with an average F-measure of 0.926, followed by LWD with an average F-measure of 0.925 and WR with an average F-measure of 0.915. We also trained and tested on the training corpus to check for overfitting; there, the combination LWD provided an average F-measure of 0.998, followed by WR with an average F-measure of 0.994 and LWDLE with an average F-measure of 0.993.

Fig.2

Average F-measure achieved by CRFs for the subtask: event detection.

According to the results in Table 4, the LWDLE feature combination gives the best performance, reaching an average F-measure of 0.926. However, LWDLE did not yield a significant improvement when compared with LWD or with WR with 300 dimensions.

Table 4

Prediction results for the task of event detection

Features	Class	Train + Dev			Test
		Precision	Recall	F-measure	Precision	Recall	F-measure
LWD	event	0.985	0.981	0.998	0.875	0.851	0.863
	no event	0.998	0.998	0.998	0.985	0.987	0.986
	av	0.991	0.990	0.998	0.930	0.919	0.925
LWDLE	event	0.985	0.982	0.983	0.881	0.851	0.866
	no event	0.998	0.998	0.998	0.985	0.988	0.986
	av	0.991	0.990	0.991	0.933	0.920	0.926
WR-300	event	0.955	0.995	0.994	0.980	0.990	0.985
	no event	0.994	0.995	0.994	0.889	0.805	0.845
	av	0.974	0.995	0.994	0.934	0.897	0.915
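As a worked check of how the columns of Table 4 relate to each other, the sketch below recomputes the F-measure from precision and recall and averages it over the two classes. It assumes that the F-measure is the balanced harmonic mean and that the "av" rows are unweighted means over classes; both assumptions are consistent with the LWDLE test values used here, but neither is stated explicitly in the text.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)


# LWDLE on the test set, precision and recall copied from Table 4.
event_f = f_measure(0.881, 0.851)
no_event_f = f_measure(0.985, 0.988)
average_f = macro_average([round(event_f, 3), round(no_event_f, 3)])

print(round(event_f, 3))     # 0.866
print(round(no_event_f, 3))  # 0.986
print(round(average_f, 3))   # 0.926
```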

4.2 Subtask 2: Contextual modality

As presented in Table 5, the best results for contextual modality were obtained using LWDLE features, with an average F-measure of 0.508, followed by WR-300 and LWD with average F-measures of 0.501 and 0.500, respectively. Regarding recall, we found that CRFs predicted the “actual” class significantly better than the “hypothetical”, “hedged”, and “generic” classes.

Table 5
Prediction results for the task of contextual modality

Features	Class	Train + Dev			Test
		Precision	Recall	F-measure	Precision	Recall	F-measure
LWD	no event	0.998	0.998	0.998	0.984	0.988	0.986
	actual	0.981	0.980	0.980	0.828	0.842	0.835
	hypothetical	0.980	0.952	0.966	0.679	0.224	0.337
	hedged	0.976	0.948	0.961	0.442	0.253	0.094
	generic	0.973	0.980	0.977	0.512	0.164	0.249
	av	0.982	0.972	0.977	0.690	0.454	0.500
LWDLE	no event	0.998	0.998	0.998	0.984	0.989	0.987
	actual	0.981	0.980	0.980	0.835	0.841	0.838
	hypothetical	0.980	0.952	0.966	0.666	0.229	0.340
	hedged	0.973	0.953	0.963	0.465	0.062	0.110
	generic	0.973	0.980	0.950	0.517	0.180	0.268
	av	0.981	0.973	0.971	0.545	0.460	0.508
WR-300	no event	0.994	0.995	0.995	0.980	0.990	0.985
	actual	0.936	0.935	0.935	0.845	0.791	0.816
	hypothetical	0.896	0.730	0.804	0.534	0.236	0.328
	hedged	0.865	0.614	0.718	0.238	0.092	0.133
	generic	0.927	0.811	0.865	0.365	0.180	0.241
	av	0.924	0.817	0.863	0.592	0.458	0.501

As can be seen in Figure 3, CRFs produced more robust predictions on the training set, with average F-measures of 0.977, 0.971 and 0.863 using LWD, LWDLE and WR, respectively. However, they did significantly worse on the test set. Such a difference in precision and recall can be due to the skewness of the dataset, since the classes are unbalanced. Moreover, CRFs show an overfitting problem: the model does not generalize well to the test set.

Fig.3

Average F-measure achieved by CRFs for the subtask: contextual modality.

4.3 Subtask 3: Polarity

For polarity classification, Figure 4 shows that the best results on the test set were achieved using LWD and LWDLE rather than WR. LWD and LWDLE features improved the average F-measure for event polarity by 3.6 percentage points over WR-300 (0.867 versus 0.831). Table 6 shows the precision, recall, and F-measure of polarity classification.

Fig.4

Average F-measure achieved by CRFs for subtask: polarity.

Table 6

Prediction results for the task of polarity

Features	Class	Train + Dev			Test
		Precision	Recall	F-measure	Precision	Recall	F-measure
LWD	no event	0.998	0.998	0.998	0.984	0.988	0.986
	pos	0.982	0.981	0.981	0.858	0.834	0.846
	neg	0.990	0.977	0.984	0.862	0.693	0.768
	av	0.990	0.984	0.988	0.900	0.840	0.867
LWDLE	no event	0.998	0.998	0.998	0.984	0.988	0.986
	pos	0.982	0.980	0.981	0.854	0.833	0.843
	neg	0.991	0.975	0.983	0.862	0.699	0.772
	av	0.990	0.985	0.988	0.901	0.839	0.867
WR-300	no event	0.994	0.996	0.995	0.980	0.990	0.985
	pos	0.943	0.931	0.937	0.863	0.780	0.820
	neg	0.932	0.876	0.903	0.772	0.621	0.688
	av	0.956	0.935	0.945	0.871	0.797	0.831

4.4 Subtask 4: Type

A careful analysis of these results indicates that WR features provided better results than the simpler features, LWD and LWDLE. In fact, this is the only task where word representation produced the best classification rate, with an average F-measure of 0.778.

Analyzing the results in Table 7, we see that the difference in performance between the training and test sets was larger, due to overfitting: CRFs fit the training set very well but fail to generalize to the test set (see Fig. 5).

Fig.5

Average F-measure achieved by CRFs for subtask: type.

Table 7

Prediction results for the task of type

Features	Class	Train + Dev			Test
		Precision	Recall	F-measure	Precision	Recall	F-measure
LWD	no event	0.998	0.998	0.998	0.984	0.988	0.986
	n/a	0.982	0.977	0.980	0.858	0.832	0.845
	aspectual	0.991	0.974	0.982	0.690	0.222	0.337
	evidential	0.956	0.979	0.967	0.750	0.690	0.702
	av	0.982	0.982	0.982	0.820	0.683	0.718
LWDLE	no event	0.998	0.998	0.998	0.984	0.988	0.986
	n/a	0.982	0.977	0.980	0.864	0.831	0.847
	aspectual	0.991	0.978	0.984	0.661	0.204	0.312
	evidential	0.956	0.980	0.968	0.751	0.661	0.704
	av	0.982	0.983	0.983	0.815	0.671	0.712
WR-300	no event	0.994	0.996	0.995	0.980	0.990	0.985
	n/a	0.950	0.929	0.940	0.878	0.783	0.828
	aspectual	0.950	0.905	0.927	0.733	0.436	0.547
	evidential	0.898	0.927	0.912	0.738	0.767	0.752
	av	0.948	0.939	0.944	0.832	0.744	0.778

4.5 Comparison between our results and SemEval 2016: Task 12-Clinical TempEval

In order to directly evaluate and compare the performance of our proposals with the official results from SemEval 2016: Task 12 (Clinical TempEval), each clinical record was converted into XML format based on the CRF results. We used the evaluation system provided by the SemEval challenge to compute precision, recall and F-measure based on the number of events correctly detected. In SemEval 2016 Task 12: Clinical TempEval, the top system was UTHealth [17], which combined different features: (a) lexical features, (b) syntactic features, (c) discourse-level features, (d) word representation features, and (e) features from external resources. These features were adapted from state-of-the-art techniques for entity recognition and were used with a hidden Markov model and support vector machines. From Table 8, it can be seen that the top system, UTHealth [17], achieved an F-measure of 0.903 for event detection, whereas the proposed approach obtained F-measures of 0.865 with LWDLE, 0.862 with LWD, and 0.844 with WR-300. Our prediction model therefore did not improve on the state of the art for the event detection task. However, it provided a reasonable result, with an F-measure 11 percentage points above that of the lowest-ranked team [4] (0.865 versus 0.755). We believe that the lack of improvement on this task can be explained by the skewness of the data and by the feature extraction used.
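The scoring step can be illustrated with a simplified span-based evaluation. The sketch below is not the official SemEval scorer: it assumes that a predicted event counts as correct only when its character span matches a gold span exactly, and the function names and the example spans are invented for illustration.

```python
def labels_to_spans(tokens, labels, positive="EVENT"):
    """Turn token-level CRF output into a set of (start, end) character spans.

    `tokens` is a list of (text, start, end) triples.
    """
    return {
        (start, end)
        for (_, start, end), label in zip(tokens, labels)
        if label == positive
    }


def span_scores(predicted, gold):
    """Precision, recall and F-measure over exactly matching spans."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented example: the model tags "has" and "tumor", but only "tumor" is a gold event.
tokens = [("She", 0, 3), ("has", 4, 7), ("a", 8, 9), ("tumor", 10, 15)]
predicted = labels_to_spans(tokens, ["NO", "EVENT", "NO", "EVENT"])
gold = {(10, 15)}
print(span_scores(predicted, gold))
```

Scoring on character spans rather than on token labels is what makes the preserved character offsets from preprocessing necessary.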

Table 8
Comparison of our results with those of SemEval: Task 12 Clinical TempEval 2016, subtask of event detection. The rank is shown by F-measure. Some systems are omitted; see [6] for a complete list

TempEval 2016 rank	System	Precision	Recall	F-measure
1	UTHealth1 [17]	0.915	0.891	0.903
2	UTHealth2 [17]	0.903	0.886	0.895
3	UTAHBMI-SVM [3]	0.897	0.886	0.892
...	...	...	...	...
9	UTA-5 [19]	0.900	0.850	0.874
–	CRFs: LWDLE	0.881	0.849	0.865
–	CRFs: LWD	0.875	0.849	0.862
10	VUACLTL-1 [10]	0.868	0.828	0.847
–	CRFs: WR-300	0.889	0.803	0.844
...	...	...	...	...
–	CDE-IIITH-crf	0.835	0.797	0.815
...	...	...	...	...
16	Brundlefly [4]	0.883	0.660	0.755

In Table 9, we can see that the UTHealth team [17] also holds first place in the competition for classifying events by contextual modality. However, its advantage over our best result is slight: our LWD features obtained an F-measure of 0.854, which is not significantly different from that of the first-ranked system (F-measure = 0.855).

Table 9

Comparison of our results with those of SemEval: Task 12 Clinical TempEval 2016, subtask of contextual modality. The rank is shown by F-measure. Some systems are omitted; see [6] for a complete list

TempEval 2016 rank	System	Precision	Recall	F-measure
1	UTHealth1 [17]	0.866	0.843	0.855
–	Our: CRFs: LWD	0.936	0.784	0.854
–	Our: CRFs: LWDLE	0.936	0.784	0.853
2	UTHealth2 [17]	0.855	0.839	0.847
3	UTAHBMI-CRF [3]	0.850	0.832	0.841
...	...	...	...	...
7	GUIR-1 [11]	0.830	0.817	0.824
–	Our: CRFs: WR-300	0.927	0.739	0.822
8	UTA-4 [19]	0.842	0.780	0.810
...	...	...	...	...
16	Brundlefly [4]	0.819	0.612	0.701

As presented in Table 10, the best result for polarity classification was obtained using LWDLE features. This feature combination achieved an F-measure of 0.889, whereas the winning approach [17] achieved an F-measure of 0.887.

Finally, Table 11 compares the experimental results of our proposals with the official results from the SemEval challenge [6] for detecting event type. LWDLE and LWD features achieved F-measures of 0.885 and 0.884, respectively. These results outperformed the UTHealth team [17], which obtained an F-measure of 0.882.

Table 10

Comparison of our results with those of SemEval: Task 12 Clinical TempEval 2016, subtask of polarity. The rank is shown by F-measure. Some systems are omitted; see [6] for a complete list

TempEval 2016 rank	System	Precision	Recall	F-measure
–	Our: CRFs: LWDLE	0.973	0.819	0.889
–	Our: CRFs: LWD	0.972	0.819	0.889
1	UTHealth1 [17]	0.900	0.875	0.887
2	UTHealth2 [17]	0.888	0.872	0.880
3	UTAHBMI-SVM [3]	0.879	0.869	0.874
4	UTAHBMI-CRF [3]	0.885	0.867	0.876
...	...	...	...	...
7	GUIR-1 [11]	0.869	0.855	0.862
–	Our: CRFs: WR-100	0.958	0.762	0.849
8	UTA-4 [19]	0.876	0.812	0.842
...	...	...	...	...
16	Brundlefly [4]	0.856	0.640	0.733

Table 11

Our results in comparison to those of SemEval: Task 12 Clinical TempEval 2016, subtask of type. The rank is shown by F-measure. Some systems are omitted; see [6] for a complete list

TempEval 2016 rank	System	Precision	Recall	F-measure
–	Our: CRFs: LWDLE	0.969	0.814	0.885
–	Our: CRFs: LWD	0.969	0.813	0.884
1	UTHealth1 [17]	0.894	0.870	0.882
2	UTHealth2 [17]	0.880	0.863	0.871
3	UTAHBMI-CRF [3]	0.875	0.857	0.866
–	Our: CRFs: WR-100	0.973	0.774	0.862
4	UTAHBMI-SVM [3]	0.854	0.843	0.849
...	...	...	...	...
16	Brundlefly [4]	0.829	0.620	0.709

The results obtained in this section show that our CRF approach achieved a better F-measure than the top system [17] on the polarity and type subtasks, and came within 0.001 of it on the contextual modality subtask.

5 Conclusions and future work

As presented earlier, the best results in subtask 1 (event detection), subtask 2 (contextual modality), and subtask 3 (polarity) were obtained with LWDLE (linguistic, word-forms, discourse level, lexical and external resources) features. Thus, we found that simple features can be more effective in such subtasks, while word embeddings are not as helpful. Word representation achieved the best results in subtask 4 (type), but when its results were compared with the works in SemEval 2016, they were not competitive. This can be explained by the fact that it was not possible to predict events correctly. A contribution of this study is to show that CRFs with feature extraction can predict better than the works presented in SemEval 2016 Task-12: Clinical TempEval for two subtasks, 3 (polarity) and 4 (type), where LWDLE features play a key role in improving performance. However, we could not improve on the results of the winning approach for subtask 1 (event detection) and subtask 2 (contextual modality) using the proposed feature extraction. For this reason, future work will involve several directions, such as (a) use of more clinical data, (b) deep semantic analysis, (c) evaluation of CRFs against several machine learning algorithms, and (d) exploration of other types of features.

Acknowledgements

This work was partially supported by the Mexican Government via the CONACYT project 240844.

References

1. Unified medical language system (UMLS). https://www.nlm.nih.gov/research/umls/aboutumls.html.

2. Abacha A.B., Chowdhury M.F.M., Karanasiou, Mrabet, Lavelli and Zweigenbaum, Text mining for pharmacovigilance: Using machine learning for drug name recognition and drug–drug interaction extraction and classification, Journal of Biomedical Informatics 58 (2015), 122–132.

3. Abdulrahman, Velupillai and Meystre, UtahBMI at SemEval-Task 12: Extracting temporal information from clinical text, In Proc. of the 10th International Workshop on Semantic Evaluation (2016), 1256–1262.

4. Alan Fries, Brundlefly at SemEval-Task 12: Recurrent neural networks vs. joint inference for clinical temporal information extraction, In Proc. of the 10th International Workshop on Semantic Evaluation (2016), 1274–1279.

5. Banerjee, Chen M.C., Lungren M.P. and Rubin D.L., Radiology report annotation using intelligent word embeddings: Applied to multi-institutional chest CT cohort, Journal of Biomedical Informatics 77 (2018), 11–20.

6. Bethard, Guergana, Che W.-T., Derczynski, Pustejovsky and Verhagen, SemEval-Task 12: Clinical TempEval, In Proceedings of NAACL-HLT 2016 (2016), 820–830.

7. Bethard, Guergana, Palmer and Pustejovsky, SemEval-Task 12: Clinical TempEval, In Proceedings 11th International Workshop on Semantic Evaluations (SemEval-2017) (2017), 565–572.

8. Bird, Klein and Loper, Natural Language Processing with Python, 1st edition, O’Reilly Media Inc. (2009).

9. Bisaso K.R., Anguzu G.T., Karungi S.A., Kiragga and Castelnuovo, A survey of machine learning applications in HIV clinical research and care, Computers in Biology and Medicine 91 (2017), 366–371.

10. Caselli and Morante, VUACLTL at SemEval Task 12: A CRF pipeline to, In Proc. of the 10th International Workshop on Semantic Evaluation (2016), 1241–1247.

11. Cohan, Meurer and Goharian, GUIR at SemEval-Task 12: Temporal information processing for clinical narratives, In Proc. of the 10th International Workshop on Semantic Evaluation (2016), 1248–1255.

12. Focil-Arias, Sidorov, Gelbukh and Arce, Extracting medical events using conditional random fields and hidden Markov model parameter tuning, Journal of Intelligent & Fuzzy Systems (2018).

13. Forsyth A.W., Barzilay, Hughes K.S., Lui, Lorenz K.A., Enzinger, Tulsky J.A. and Lindvall, Machine learning methods to extract documentation of breast cancer symptoms from electronic health records, Journal of Pain and Symptom Management (2018).

14. Hansart, De Meyere, Watrin, Bittar and Fairon, CENTAL at SemEval-Task 12: A linguistically fed CRF model for medical and temporal information extraction, In Proc. of the 10th International Workshop on Semantic Evaluation (2016), 1286–1291.

15. Huesch M.D., Cherian, Labib and Mahraj, Evaluating report text variation and informativeness: Natural language processing of CT chest imaging for pulmonary embolism, Journal of the American College of Radiology 15(3, Part B) (2018), 554–562. Data Science: Big Data Machine Learning and Artificial Intelligence.

16. Lauren, Qu, Zhang and Lendasse, Discriminant document embeddings with an extreme learning machine for classifying clinical narratives, Neurocomputing 277 (2018), 129–138. Hierarchical Extreme Learning Machines.

17. Lee H.-J., Xu, Wang, Zhang, Moon, Xu and Wu, UTHealth at SemEval-Task 12: An end-to-end system for temporal information extraction from clinical notes, In Proc. of the 10th International Workshop on Semantic Evaluation (2016), 1292–1297.

18. Leeuwenberg and Moens M.-F., KULeuven-LIIR at SemEval-Task 12: Cross-domain temporal information extraction from clinical records, In Proc. of the 11th International Workshop on Semantic Evaluation (SemEval-2017) (2017), 1027–1031.

19. and Huang, UTA DLNLP at SemEval-Task 12: Deep learning based natural language processing system for clinical information identification from clinical notes and pathology reports, In Proc. of the 10th International Workshop on Semantic Evaluation (2016), 1268–1273.

20. Luo, Recurrent neural networks for classifying relations in clinical notes, Journal of Biomedical Informatics 72 (2017), 85–95.

21. MacAvaney, Cohan and Goharian, GUIR at SemEval-Task 12: A framework for cross-domain clinical temporal information extraction, In Proc. of the 11th International Workshop on Semantic Evaluation (SemEval-2017) (2017), 1021–1026.

22. Okazaki, CRFsuite: A fast implementation of conditional random fields (CRFs) (2007).

23. Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot and Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011), 2825–2830.

24. Tao, Filannino and Uzuner Özlem, Prescription extraction using CRFs and word embeddings, Journal of Biomedical Informatics 72 (2017), 60–66.

25. Tourille, Ferret, Tannier and Névéol, LIMSI-COT at SemEval-Task 12: Neural architecture for temporal information extraction from clinical narratives, In Proc. of the 11th International Workshop on Semantic Evaluation (SemEval-2017) (2017), 595–600.

26. XuanHieu Phan L.-M.N. and Nguyeno C.-T., FlexCRFs: Flexible Conditional Random Fields (2004).