Uncertainty query sampling strategies for active learning of named entity recognition task

Abstract

Active learning is a well-known approach for labeling large un-annotated datasets with minimal effort and in a cost-efficient way. It iteratively selects the most informative instances and adds them to the training set so that the performance of the learner improves with each iteration. Named entity recognition (NER) is a key information extraction task in which the entities present in a sequence are labeled with the correct class. Traditional query sampling strategies for active learning consider only the final probability value of the model when selecting the most informative instances. In this paper, we propose a new active learning algorithm based on a hybrid query sampling strategy that considers sentence similarity along with the final probability value of the model, and we compare it with four well-known pool-based uncertainty query sampling strategies for active learning of NER: least confident sampling, margin of confidence sampling, ratio of confidence sampling, and entropy sampling. The experiments are performed on three biomedical NER datasets from different domains and on a Spanish language NER dataset. We find that all of these approaches reach the performance of the supervised learning approach while requiring much less annotated training data. The proposed active learning algorithm performs well and, in most cases, further reduces the annotation cost in comparison to the active learning algorithms based on the other sampling strategies.

Keywords

Active learning, named entity recognition, uncertainty query sampling

1. Introduction

The term ‘Named Entity’ was first introduced at the Sixth Message Understanding Conference (MUC-6), which focused on information extraction tasks, including the identification of locations, organizations, people, and various numerical expressions [1, 2]. Named entity recognition (NER) is an important part of information extraction systems: it identifies the named entities that contribute to extracting meaningful structured information from large amounts of unstructured text. The main task of NER is to identify the named entities present in unstructured text and to classify them into appropriate pre-defined classes [3]. Factors such as a growing vocabulary, different naming conventions, the polysemy of named entities, and the non-availability of annotated datasets, among others, make NER difficult and challenging [3]. The amount of textual data on social media platforms is rising exponentially, which calls for data mining approaches [4, 5]. Named entity recognition is widely used as an initial subtask in various natural language processing applications.

Supervised learning approaches are often used to solve machine learning tasks, including named entity recognition (NER), and they require a large annotated dataset for training. The performance of supervised models depends mainly on the availability of annotated data, so these models are suitable only when such data exist. In practice, far less annotated data is available than un-annotated data, since annotation is a costly and time-consuming task. It also requires domain experts who know the field that the un-annotated dataset comes from.

Active learning approaches are widely adopted in such cases since they greatly reduce the manual annotation required to create an annotated dataset for training machine learning models. By selecting the most informative instances from an un-annotated dataset to be labeled by an annotator (or oracle), active learning minimizes the amount of data required to train the model. It uses a query sampling strategy to select these instances from the un-annotated data pool so that the model learns the most from training on them once they are annotated. In this paper, a new similarity-based hybrid query sampling strategy is proposed for active learning of the named entity recognition task. The existing pool-based uncertainty query sampling strategies, namely least confidence, margin of confidence, entropy, and ratio of confidence, are also studied and applied to the named entity recognition task. The proposed strategy is compared with these existing strategies on the basis of annotation cost, which depends on how well the most informative instances are selected from the data pool, so that the classifier needs the least amount of annotated training data to reach a given performance level.

The contribution of our study can be summarized as follows.

•
To deal with the unavailability of labeled datasets, the efficiency of different uncertainty-based query sampling strategies for active learning is validated in terms of annotation cost through experiments that simulate a real-world scenario for active learning of the named entity recognition task.
•
A new hybrid query sampling strategy is proposed for active learning, which also considers sentence similarity information when selecting the most informative instances from the un-annotated data pool for annotation. The proposed strategy is combined with the least confidence query sampling strategy for active learning of the named entity recognition task. The other existing uncertainty-based query sampling strategies consider only the probability values of the active learner when selecting instances for annotation.
•
The performance of the proposed hybrid query sampling strategy is compared with that of the other query sampling strategies in terms of the amount of annotated data each requires to train the active learner for the named entity recognition task. The results show that the proposed hybrid query sampling strategy further improves the performance of the active learner and is more consistent than any of the existing query sampling strategies.

The rest of this paper is organized as follows. Section 2 gives the background on named entity recognition and active learning. Section 3 describes the general active learning approach in more detail. Section 4 presents the four existing uncertainty query sampling strategies for active learning, and Section 5 presents the proposed query sampling approach. Section 6 covers the experimental details: the datasets, the feature extraction and baseline classifier, and the experimental setting, evaluation metrics, and criteria for the active learning experiments on the named entity recognition task. These are followed by the results, discussion, and conclusion.

2. Background

Named entity recognition is used in application areas such as machine translation [6], sentiment analysis [7], named entity linking [8], web information mining [9, 10], question answering [11], and recommender systems [12]. This work, however, concentrates on active learning as a way to deal with insufficient annotated data, specifically for the named entity recognition task. This section therefore discusses existing research on active learning for named entity recognition and on sentence similarity, which is related to and has inspired the work presented in this paper.

A framework that combines active learning and conditional random fields (CRF) is proposed by [13] for entity recognition in low-resource open-domain text corpora. They achieve a good F1 score while keeping annotation costs low. Three types of sampling strategies, namely baseline sampling, diversity-based sampling, and uncertainty-based sampling, are used for active learning experiments on clinical entity recognition by [14]. The results are compared with a passive learning approach that applies random sampling, and the authors find that the uncertainty query strategies outperform all the other query sampling strategies. Therefore, we also study uncertainty-based query strategies and propose a new similarity-based hybrid uncertainty query sampling strategy that further improves the performance of active learning for the named entity recognition task. Similar work is done by [15, 16] for active learning-based clinical concept extraction and for social media tweets. In other work, [17] propose two methods for active learning of the named entity recognition task: one based on the Support Vector Machine (SVM), and one that uses an ensemble of SVM and Conditional Random Field (CRF). These methods are evaluated on English, Bengali, Hindi, and biomedical domain datasets. More recently, a deep active learning approach is proposed for named entity recognition by [18]. To reduce the time spent on iterative retraining in deep learning-based active learning, the authors propose a lightweight and computationally efficient CNN-CNN-LSTM architecture (where CNN stands for Convolutional Neural Network and LSTM for Long Short-Term Memory) for fast, incremental active learning. In addition, [19] propose an active learning architecture for named entity recognition named LUSTRE, whose query sampling strategy is based on term frequency-inverse document frequency (TF-IDF) and which can learn the structure of entities iteratively while using very few labeled mentions.

A new uncertainty-based query sampling strategy, named lowest token probability, is proposed by [20]; it considers intermediate results along with the final output, whereas traditional sampling strategies depend only on the final probability value of the model. The authors compare their approach with the least confidence and normalized least confidence query sampling strategies, applying all of them with a combined Bidirectional Encoder Representations from Transformers-Conditional Random Field (BERT-CRF) model. Almost all active learning methods for named entity recognition assume that the annotation cost of every sample is the same. A cost-aware query sampling strategy based on the CAUSE algorithm (Clustering And Uncertainty Sampling Engine) is proposed by [21]; it selects less costly but more informative sentences by allowing the annotation cost to differ between samples. Another query sampling strategy, proposed by [22], uses k-means clustering, stratified sampling, and an entropy criterion to select the most informative sentences for training the active learner for the named entity recognition task.

Various sentence similarity measures have been used by researchers for natural language processing tasks such as text summarization. A fuzzy graph based sentence similarity score, along with two other algorithms, is used by [23] to find highly correlated sentences in a paragraph in order to improve a key term weightage algorithm for text summarization. For the selection of the top-k similar words, a fuzzy similarity measure is used by [24] and combined with association rules to measure word similarity at a global level. A similar sentence similarity task using a fuzzy set similarity measure is presented by [25].

Figure 1.

Flowchart for active learning algorithm.

In the past, different experiments have been conducted to decide when it is appropriate to stop an active learning experiment. In this direction, [26] compare three stopping conditions for active learning of named entity recognition. One of the stopping criteria, based on the gradient, stops the algorithm reliably and achieves near-optimal results. In similar work, [27] propose a new stopping criterion that uses least confident uncertainty sampling for text classification and named entity recognition with different machine learning models, including Bayesian Logistic Regression, Maximum Entropy (ME), and Support Vector Machine (SVM).

3. Active learning

The active learning approach is very popular and is used in a wide range of applications, including image recognition [28], recommender systems [29], sentiment analysis [30], and text summarization [31]. Its main objective is to let the learning algorithm iteratively select the most informative data instances from the unlabeled data pool, so that the model learns faster and performs better than a traditional supervised machine learning model while using considerably less training data. The active learning task begins by randomly selecting some instances from a large pool of unlabeled data $U$. These selected sentences are labeled by the oracle (or labeling source) and passed to the machine learning algorithm (here a conditional random field, i.e., CRF) for training. Using a query sampling strategy, the active learning algorithm then selects a few of the most appropriate data instances from the pool of unlabeled data $U$. The selected sentences are removed from the pool and labeled with the correct labels by the oracle. The correctly labeled sentences are added to the labeled training dataset $L$, which is then used by the machine learning algorithm. The performance of the machine learning algorithm often improves after training on these selected sentences. This cycle of selecting the most appropriate instances, removing them from the unlabeled data pool $U$, having them annotated by the labeling source, adding the annotated instances to the labeled training dataset $L$, and retraining the machine learning model on the new training data is repeated until a stopping criterion is reached [32]. This general workflow of the active learning approach is presented as a flowchart in Fig. 1 and is followed in this work.
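The workflow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: the callables `oracle`, `train`, and `score`, the parameter names, and the fixed round budget used as a stopping rule are all assumptions introduced here.

```python
import random


def active_learning_loop(pool, oracle, train, score,
                         seed_size, batch_size, max_rounds, seed=0):
    """Pool-based active learning; returns (model, labeled, unlabeled).

    oracle(x) returns the label of x, train(labeled) returns a model, and
    score(model, x) returns an informativeness score (higher = more uncertain).
    All three are hypothetical callables supplied by the caller.
    """
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)
    # Random seed batch taken from the unlabeled pool U.
    selected, unlabeled = unlabeled[:seed_size], unlabeled[seed_size:]
    labeled = []  # the labeled set L, initially empty
    rounds = 0
    while True:
        # The oracle annotates the selected instances, which then join L.
        labeled.extend((x, oracle(x)) for x in selected)
        model = train(labeled)
        rounds += 1
        # Assumed stopping rule: a fixed round budget or an exhausted pool.
        if rounds >= max_rounds or not unlabeled:
            return model, labeled, unlabeled
        # Rank the pool from most to least informative and take the top batch.
        ranked = sorted(unlabeled, key=lambda x: score(model, x), reverse=True)
        selected, unlabeled = ranked[:batch_size], ranked[batch_size:]
```

Any of the query sampling strategies discussed in this paper can be plugged in through `score`; only the ranking step changes between strategies.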

4. Existing uncertainty sampling based active learning approach

In this paper, a new similarity-based hybrid uncertainty query sampling strategy is proposed for the active learning algorithm for the named entity recognition task. The proposed strategy is applied and compared with four existing uncertainty sampling approaches for selecting the most appropriate data from the pool of un-annotated data: least confidence, margin of confidence, entropy, and ratio of confidence. In the notation used here, $U$ denotes the pool of unlabeled sentences, $L$ denotes the set of labeled sentences, which is initially empty, $x^{*}\in U$ is the most uncertain instance according to the sampling strategy, and $y^{*}$ is the label assigned by the oracle to the instance $x^{*}$. Each of these strategies is discussed below [14, 15, 32, 33]:

4.1 Least confident

This strategy is the most basic uncertainty sampling strategy that selects the unlabeled instances for which the classifier has the least confidence in its classification. It can also be said that the most informative instances are those instances that have the highest uncertainty. The mathematical representation of this approach can be given as:

$\displaystyle{x}^{*}=\mathop{\operatorname{arg\;min}}\limits_{x}P(\hat{y}|x)=\mathop{\operatorname{arg\;max}}\limits_{x}\left[1-P(\hat{y}|x)\right]$ (1)

where $\hat{y}=\operatorname{arg\;max}_{y}P(y|x)$ is the most likely label according to the baseline classifier, so that $P(\hat{y}|x)$ is the maximum posterior probability. This strategy takes only the best prediction of the classifier into account. It is one of the most effective uncertainty sampling strategies and is widely used by researchers, including [14, 15, 20]. Algorithm 4.1 presents the least confidence query sampling strategy based active learning algorithm for named entity recognition, which is used in our experiment.

Algorithm 4.1: Least confidence query sampling strategy based active learning algorithm for named entity recognition (NER)

Input: empty labeled set $L$; unlabeled train data set (or pool) $U$; classifier $C$

1. Randomly select seed sentences from $U$, amounting to 2% of the total sentences in the train set.
2. Use the labeling source to correctly annotate the selected sentences and place them in $L$.
3. Using the sentences in the labeled set $L$, train the classifier $C$.
4. For every sentence in $U$:
   i. For every word of the sentence:
      a. Predict the probability of the tags of each category using the trained classifier $C$.
      b. Using Eq. (1), calculate the uncertainty value $1-P(\hat{y}|x)$ from the probability of the tag that is most likely according to classifier $C$.
   ii. Compare the values of all the words and store the largest one as the least confidence score of the sentence.
5. Compare the least confidence scores of the sentences, select the top 10 most uncertain sentences (those with the largest least confidence scores), and remove them from $U$.
6. If the stop condition is not met, repeat from step 2; otherwise stop.
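The sentence-level scoring in the steps above can be sketched as follows. The sketch assumes that each sentence is represented by its per-word tag distributions, written here as a list of probability lists (`token_marginals`, a hypothetical name); how these marginals are obtained from the classifier is left out.

```python
def least_confidence_score(token_marginals):
    """Sentence score: the largest 1 - max_y P(y | word) over its words."""
    return max(1.0 - max(probs) for probs in token_marginals)


def select_most_uncertain(pool_marginals, k=10):
    """Indices of the k sentences with the largest least confidence scores."""
    scores = [least_confidence_score(m) for m in pool_marginals]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

A sentence is therefore ranked by its single most doubtful word, so one hard token is enough to send a long, otherwise easy sentence to the annotator.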

4.2 Entropy

This strategy utilizes the Shannon entropy $H$ [34] as a measure to determine the most uncertain samples as follows:

$\displaystyle{x}^{*}=\mathop{\operatorname{arg\;max}}\limits_{x}H(Y|x)=\mathop{\operatorname{arg\;max}}\limits_{x}\left[-\sum_{y}P(y|x)\log_{2}P(y|x)\right]$ (2)

where $y$ ranges over all the possible labels of $x$, i.e., $y\in Y$. Entropy measures the average information of a variable, which makes it usable as an uncertainty query sampling approach for the active learning task. This sampling strategy is also used by [14, 15] and in much other research on active learning for named entity recognition. Algorithm 4.2 presents the entropy query sampling strategy based active learning algorithm for named entity recognition that is used in our experiment.

Algorithm 4.2: Entropy query sampling strategy based active learning algorithm for named entity recognition (NER)

Input: empty labeled set $L$; unlabeled train data set (or pool) $U$; classifier $C$

1. Randomly select seed sentences from $U$, amounting to 2% of the total sentences in the train set.
2. Use the labeling source to correctly annotate the selected sentences and place them in $L$.
3. Using the sentences in the labeled set $L$, train the classifier $C$.
4. For every sentence in $U$:
   i. For every word of the sentence:
      a. Predict the probability of the tags of each category using the trained classifier $C$.
      b. Using Eq. (2), calculate the Shannon entropy from the probability values of the tags of all the classes.
   ii. Compare the entropy values of all the words and store the largest one as the entropy score of the sentence.
5. Compare the entropy scores of the sentences, select the top 10 most uncertain sentences (those with the largest entropy scores), and remove them from $U$.
6. If the stop condition is not met, repeat from step 2; otherwise stop.
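A minimal sketch of the entropy score of Eq. (2), again assuming that a sentence is given as a list of per-word tag distributions (the name `token_marginals` is hypothetical). Zero probabilities are skipped, following the usual convention that $0\log_{2}0=0$.

```python
import math


def token_entropy(probs):
    """Shannon entropy (base 2) of one word's tag distribution."""
    return abs(-sum(p * math.log2(p) for p in probs if p > 0))


def entropy_score(token_marginals):
    """Sentence score: the largest per-word entropy in the sentence."""
    return max(token_entropy(probs) for probs in token_marginals)
```

Unlike the least confidence score, this value changes when probability mass moves between the lower-ranked tags, which is what lets entropy use the whole output distribution.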

4.3 Margin of confidence

This strategy takes the margin between the first and the second most likely classifier predictions into consideration when selecting the most uncertain instances. The mathematical representation of this approach can be given as:

$\displaystyle{x}^{*}=\mathop{\operatorname{arg\;min}}\limits_{x}[P(\hat{y}_{1}|x)-P(\hat{y}_{2}|x)]=\mathop{\operatorname{arg\;max}}\limits_{x}[P(\hat{y}_{2}|x)-P(\hat{y}_{1}|x)]$ (3)

where $\hat{y}_{1}$ and $\hat{y}_{2}$ are the top two most likely predictions of the baseline classifier. Unlike the least confidence strategy, this strategy also considers a prediction other than the best prediction of the classifier. It is very commonly used for active learning of entity recognition and has been used by [14, 15, 17] and many others. The margin query sampling strategy based active learning algorithm for named entity recognition is presented in Algorithm 4.3.

Algorithm 4.3: Margin query sampling strategy based active learning algorithm for named entity recognition (NER)

Input: empty labeled set $L$; unlabeled train data set (or pool) $U$; classifier $C$

1. Randomly select seed sentences from $U$, amounting to 2% of the total sentences in the train set.
2. Use the labeling source to correctly annotate the selected sentences and place them in $L$.
3. Using the sentences in the labeled set $L$, train the classifier $C$.
4. For every sentence in $U$:
   i. For every word of the sentence:
      a. Predict the probability of the tags of each category using the trained classifier $C$.
      b. Using Eq. (3), calculate the margin value from the probability values of the top two classes for which classifier $C$ is most confident.
   ii. Compare the margin values of all the words and store the smallest one as the margin score of the sentence.
5. Compare the margin scores of the sentences, select the top 10 most uncertain sentences (those with the smallest margin scores), and remove them from $U$.
6. If the stop condition is not met, repeat from step 2; otherwise stop.
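The margin score of Eq. (3) can be sketched in the same style, assuming per-word tag distributions as input (`token_marginals` is a hypothetical name). Note that the direction is reversed relative to the other strategies: a small margin means high uncertainty, so sentences are ranked in ascending order of this score.

```python
def token_margin(probs):
    """Difference between the two largest tag probabilities of one word."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 - top2


def margin_score(token_marginals):
    """Sentence score: the smallest per-word margin in the sentence."""
    return min(token_margin(probs) for probs in token_marginals)
```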

4.4 Ratio of confidence

This query sampling strategy is similar to the margin of confidence sampling strategy. The difference is that it takes the ratio between the first and the second most likely classifier predictions into consideration for the selection of most uncertain instances. The mathematical representation of this approach can be given as:

$\displaystyle{x}^{*}=\mathop{\operatorname{arg\;max}}\limits_{x}\left[\frac{P(\hat{y}_{2}|x)}{P(\hat{y}_{1}|x)}\right]$ (4)

where $\hat{y}_{1}$ and $\hat{y}_{2}$ are the top two most likely predictions using the baseline classifier. This query sampling strategy is lesser-known and is used by [35]. Algorithm 4.4 presents the ratio of confidence sampling approach within the active learning algorithm for the entity recognition task.

Algorithm 4.4: Ratio of confidence strategy based active learning algorithm for named entity recognition (NER)

Input: empty labeled set $L$; unlabeled train data set (or pool) $U$; classifier $C$

1. Randomly select seed sentences from $U$, amounting to 2% of the total sentences in the train set.
2. Use the labeling source to correctly annotate the selected sentences and place them in $L$.
3. Using the sentences in the labeled set $L$, train the classifier $C$.
4. For every sentence in $U$:
   i. For every word of the sentence:
      a. Predict the probability of the tags of each category using the trained classifier $C$.
      b. Using Eq. (4), calculate the ratio value from the probability values of the top two classes for which classifier $C$ is most confident.
   ii. Compare the ratio values of all the words and store the largest one as the ratio score of the sentence.
5. Compare the ratio scores of the sentences, select the top 10 most uncertain sentences (those with the largest ratio scores), and remove them from $U$.
6. If the stop condition is not met, repeat from step 2; otherwise stop.
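A corresponding sketch for the ratio score of Eq. (4), under the same assumption that each sentence is a list of per-word tag distributions (`token_marginals` is a hypothetical name). The ratio lies between 0 and 1 and approaches 1 when the two best tags are nearly tied.

```python
def token_ratio(probs):
    """Second-largest tag probability divided by the largest one."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top2 / top1  # top1 > 0 whenever probs is a valid distribution


def ratio_score(token_marginals):
    """Sentence score: the largest per-word ratio in the sentence."""
    return max(token_ratio(probs) for probs in token_marginals)
```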

The drawback of the least confidence strategy is that it considers only the probability of the most likely class when calculating the uncertainty of an instance. This shortcoming is addressed by the ratio of confidence and margin of confidence sampling strategies, which also consider the second-best class probability. However, both approaches still ignore the rest of the output distribution, which can be a problem when the number of classes is large. The entropy sampling strategy solves this problem by considering the probability values of all the classes when calculating the uncertainty of an instance [32]. All four uncertainty query sampling strategies select the unlabeled data instances for which the trained classifier is most uncertain. In the next section, the proposed similarity-based hybrid uncertainty query sampling strategy for active learning of the named entity recognition task is discussed.

5. Proposed active learning based approach

A new similarity-based hybrid uncertainty query sampling strategy is proposed for active learning of the named entity recognition task. According to [14], uncertainty query sampling strategies perform best in most cases in comparison to other sampling strategies. Many researchers have applied uncertainty-based query sampling in the past but have discussed its details only briefly, so we study the known uncertainty query sampling strategies in depth. We then improve the uncertainty query strategy by incorporating similarity information about the sentences that are most uncertain according to the least confidence query sampling approach. In the proposed approach, instead of selecting ten sentences, we select a total of 30 sentences using the least confidence scores of the unlabeled sentences. For each of the 30 sentences, an average similarity score is computed by averaging its similarity scores with the other 29 sentences. The similarity score is a fuzzy string matching score, which uses the Levenshtein distance to compute the similarity between sequences. For ease of implementation, we use the popular Python package ‘FuzzyWuzzy’, which provides a number of scorers for computing the similarity score between sequences. We use the ‘WRatio’ scorer in our implementation, which handles differences in letter case, among other variations. Using the library, we compare the sequences with each other and obtain a similarity score out of 100. A similarity score of 100 means that the two sequences are identical according to the similarity index.

In each iteration of the proposed algorithm, ten sentences having the least average similarity scores have been selected from the 30 most uncertain sentences according to the least confidence query sampling strategy. So the ten sentences selected in each iteration are the least similar sentences out of 30 most uncertain sentences. The proposed algorithm is presented below:

Similarity based hybrid uncertainty query sampling strategy for active learning algorithm for named entity recognition (NER)

Input: empty labeled set $L$; unlabeled train data set (or pool) $U$; classifier $C$

1. Randomly select seed sentences from $U$, amounting to 2% of the total sentences in the train set.
2. Use the labeling source to correctly annotate the selected sentences and place them in $L$.
3. Using the sentences in the labeled set $L$, train the classifier $C$.
4. Compare the least confidence scores of the sentences in $U$ and select the top 30 most uncertain sentences (those with the largest least confidence scores).
5. For each of the 30 selected sentences:
   i. Compute the similarity score of the sentence with respect to each of the remaining 29 sentences.
   ii. Compute and store the average of these 29 similarity scores.
6. Sort the sentence_ids by average similarity score in ascending order.
7. Select the first 10 sentence_ids, i.e., those with the minimum average similarity scores, and remove the corresponding sentences from $U$.
8. If the stop condition is not met, repeat from step 2; otherwise stop.
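The selection step of the proposed strategy can be sketched as below. To keep the sketch self-contained, the similarity function is a stand-in built on `difflib.SequenceMatcher` from the Python standard library, scaled to the range 0 to 100; it is not the ‘WRatio’ scorer used in the paper and will give different values. With the FuzzyWuzzy package installed, `fuzz.WRatio` could be passed as `sim` instead. The function and parameter names are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 100]; NOT the WRatio scorer from the paper."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def hybrid_select(sentences, lc_scores, n_candidates=30, n_select=10,
                  sim=similarity):
    """Return indices of the least mutually similar uncertain sentences.

    sentences: sentence strings from the unlabeled pool.
    lc_scores: least confidence score of each sentence (higher = more uncertain).
    """
    # Take the n_candidates most uncertain sentences.
    order = sorted(range(len(sentences)), key=lambda i: lc_scores[i],
                   reverse=True)
    candidates = order[:n_candidates]
    # Average similarity of each candidate to all the other candidates.
    avg_sim = {}
    for i in candidates:
        others = [j for j in candidates if j != i]
        total = sum(sim(sentences[i], sentences[j]) for j in others)
        avg_sim[i] = total / len(others) if others else 0.0
    # Keep the candidates with the smallest average similarity.
    return sorted(candidates, key=lambda i: avg_sim[i])[:n_select]
```

Filtering by uncertainty first and by dissimilarity second means the similarity computation only touches 30 sentences per iteration (435 unordered pairs), regardless of the size of the pool.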

Table 1
Details of the named entity recognition datasets used in the active learning experiments

Dataset	Train #abstracts	Train #sent	Train #tokens	Validation #abstracts	Validation #sent	Validation #tokens	Test #abstracts	Test #sent	Test #tokens	#entity types (excluding others)
Disease and adverse effect	320 (approx.)	3491	90012	–	–	–	80 (approx.)	871	22622	2
NCBI disease	593	5424	135701	100	923	23969	100	940	24497	1
JNLPBA	1800	18607	446890	200	1939	47661	404	4260	101443	5
CoNLL 2002 Spanish	–	8323	264715	–	1915	52923	–	1517	51533	4

In the next section, the experimental details are discussed for the above active learning experiment based on different query sampling strategies for named entity recognition task.

6. Experimental details

6.1 Datasets

In this paper, we study and apply four uncertainty query sampling strategies for active learning of the named entity recognition task. We also propose a new similarity-based hybrid uncertainty query sampling strategy and compare its performance with these four existing approaches on three biomedical text datasets and one Spanish language dataset. The first is the disease and adverse effect dataset [36], a biomedical text dataset with two annotated entity classes: disease names and adverse effects. It consists of 400 abstracts randomly selected with the PubMed query ‘Disease OR Adverse effect’, and has a total of 1428 tokens that belong to the disease class and 813 tokens that fall in the adverse effect class. The second is also a biomedical dataset, in the disease domain: the NCBI disease dataset, which contains a total of 793 PubMed abstracts with a total of 6892 tokens that are disease mentions [37]. This dataset has only one annotated entity class, disease. It is one of the benchmark datasets widely used for disease named entity recognition.

The third text dataset is the JNLPBA dataset, which was derived from the GENIA Term dataset for the BioNLP/JNLPBA shared task 2004 organized by the GENIA project, and which deals with the classification of molecular biology entities [39]. The dataset contains a total of 2404 MEDLINE abstracts whose tokens are annotated with five entity classes: protein, RNA, DNA, cell line, and cell type. It is one of the largest benchmark named entity recognition datasets and has been widely used by researchers, particularly in the biomedical domain. For our experiment, the JNLPBA dataset is divided into a train set, a validation set, and a test set in the same way as in [40]. The fourth text dataset is taken from the CoNLL-2002 shared task and contains four annotated entity classes in the Spanish language: miscellaneous, organization, location, and person [41]. This dataset is divided into a train set, a validation (or testa) set, and a test (or testb) set, which consist of 8323, 1915, and 1517 sentences respectively. Further details of the datasets are presented in Table 1.

All the above text datasets are publicly available and are in the IOB2 (Inside-Outside-Begin) format [42]. Further details on how these datasets are used to evaluate active learning of the named entity recognition (NER) task with the different uncertainty-based sampling approaches are discussed in later sections.
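As a brief illustration of the IOB2 format, in which the first token of every entity is tagged `B-<class>`, the following tokens of the same entity are tagged `I-<class>`, and all other tokens are tagged `O`, the sketch below converts a tag sequence into entity spans. The function name and the example tags are hypothetical and are not taken from the datasets.

```python
def iob2_to_spans(tags):
    """Convert IOB2 tags to (label, start, end) spans; end is exclusive."""
    spans = []
    start = label = None
    # A trailing "O" sentinel closes an entity that ends the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i))
            start = label = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


# Example: two disease mentions, the first spanning two tokens.
example_tags = ["O", "B-Disease", "I-Disease", "O", "B-Disease"]
example_spans = iob2_to_spans(example_tags)
```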

6.2 Feature details and baseline classifier used

The performance of a machine learning model on a classification task depends mainly on the features it uses. We extracted various features from the text and tried different combinations of them. In this paper, we report and use one of the combinations on which the baseline machine learning model (i.e., the Conditional Random Field model) performed better. The features extracted and used in this paper include the following [43, 44]:

•
Word type: This feature stores the type of the word, i.e., whether all the characters of the word are alphabetic, all are digits, or the word contains both.
•
Word case: The information related to the case of the word is also stored, i.e., whether the word is in title case or upper case or lower case.
•
Word pattern: All the characters present in the word are replaced with certain predefined characters such that new word formed by this replacement contains the pattern information of the word [2]. All the uppercase characters are replaced by ‘ $U$ ’, lowercase characters are replaced by ‘ $L$ ’, digits are replaced by ‘ $D$ ’, comma and full stop characters are replaced by full stop character, symbols (‘;’, ‘:’, ‘?’, ‘!’) are replaced by ‘;’, symbols (‘ $+$ ’, ‘ $$ ’, ‘/’, ‘ $=$ ’, ‘|’, ‘_’) are replaced by ‘#’, brackets (‘(’, ‘{’, ‘[’, ‘ $<$ ’) are replaced by ‘(’ and brackets (‘)’, ‘}’, ‘]’, ‘ $>$ ’) are replaced by ‘)’ respectively to form the word pattern for each word.
•
Suffix and prefix: To capture the morphological information of the word, its suffixes and prefixes of up to 4 characters, where available, are stored as features.
•
Part-of-speech tag: The part-of-speech (POS) tags are one of the important features which are also passed to the baseline machine learning model while training. We have used NLTK (Natural Language Tool Kit) POS tagger to tag each word with the respective part-of-speech (POS) tags for all the three biomedical datasets. The part-of-speech tags were already present in the Conference on Natural Language Learning (CoNLL) 2002 Spanish named entity dataset.
•
Context information: The information of the previous and next word is also stored along with the information of the current word and used as a feature.
•
Begin and end of sentence: The first word of a sentence is often an entity, and the last word of a sentence also carries its own importance. We therefore set the Begin of Sentence (BOS) and End of Sentence (EOS) flags to True for the appropriate words of each sentence so that they can be used as features. If a sentence contains only a single word, both BOS and EOS are True for that word.
•
Other*: The other information includes the lowercase version of the word and the word length, which is also used as a feature. Apart from that, hasHyphen() function is also implemented to be used as a feature that returns ‘True’ if the particular token in consideration contains a hyphen, i.e. ‘-’; otherwise, it returns ‘False.’
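
The feature set listed above can be sketched as a per-token feature dictionary of the kind accepted by CRF toolkits. This is a minimal illustration under stated assumptions, not the exact extractor used in the experiments: the function and key names (`word_type`, `word_case`, `word_pattern`, `token_features`) are hypothetical, unlisted characters are assumed to be kept unchanged in the word pattern, and one symbol of the '#' group that is illegible in the text above is omitted.

```python
def word_type(word):
    """Coarse word type: alphabetic, digits, both, or other."""
    if word.isalpha():
        return "alpha"
    if word.isdigit():
        return "digit"
    if any(c.isalpha() for c in word) and any(c.isdigit() for c in word):
        return "mixed"
    return "other"


def word_case(word):
    """Case of the word: title, upper, lower, or other."""
    if word.istitle():
        return "title"
    if word.isupper():
        return "upper"
    if word.islower():
        return "lower"
    return "other"


def word_pattern(word):
    """Replace each character by its class symbol, following the rules above."""
    groups = [(",.", "."), (";:?!", ";"), ("+/=|_", "#"), ("({[<", "("), (")}]>", ")")]
    out = []
    for ch in word:
        if ch.isupper():
            out.append("U")
        elif ch.islower():
            out.append("L")
        elif ch.isdigit():
            out.append("D")
        else:
            # Assumption: characters outside the listed groups stay unchanged.
            out.append(next((rep for chars, rep in groups if ch in chars), ch))
    return "".join(out)


def token_features(sent, i):
    """Features for token i of a sentence given as a list of (word, pos) pairs."""
    word, pos = sent[i]
    feats = {
        "word.lower": word.lower(),
        "word.length": len(word),
        "word.type": word_type(word),
        "word.case": word_case(word),
        "word.pattern": word_pattern(word),
        "word.hasHyphen": "-" in word,
        "pos": pos,
        "BOS": i == 0,
        "EOS": i == len(sent) - 1,
    }
    # Prefixes and suffixes of up to 4 characters, where available.
    for n in range(1, min(4, len(word)) + 1):
        feats[f"prefix{n}"] = word[:n]
        feats[f"suffix{n}"] = word[-n:]
    # Context information: previous and next word.
    if i > 0:
        feats["prev.lower"] = sent[i - 1][0].lower()
    if i < len(sent) - 1:
        feats["next.lower"] = sent[i + 1][0].lower()
    return feats
```

For a one-word sentence such as `[("IL-2", "NN")]`, both `BOS` and `EOS` are True for the single token, as described in the list above.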

Different supervised machine learning models have been used in the past for named entity recognition, including the Support Vector Machine (SVM), Hidden Markov Model (HMM), Maximum Entropy (ME), Conditional Random Field (CRF), and various neural network-based models [42, 45]. Among them, the Conditional Random Field (CRF) model is the most widely used and is commonly employed as a baseline classifier for entity recognition. The conditional random field is a discriminative, undirected graphical model that connects labels with observations and makes no assumption about the underlying distribution of the observations. It models the conditional probability of the label sequence given the input sequence rather than the joint probability of the label sequence and the input sequence [46]. It therefore allows arbitrary, non-independent features of the input sequence, and the probability of a transition between labels can depend on past and future observations. For this research work, sklearn-crfsuite has been used, which is a scikit-learn compatible wrapper around python-crfsuite that provides an implementation of the Conditional Random Field (CRF). We have used the same baseline CRF model for both the traditional supervised approach and the active learning-based approach so that the two can be compared fairly in terms of performance and of the amount of annotated training data required. For training the CRF, we have used the same parameters as given in [43], i.e., L-BFGS as the training algorithm, ${c}_{1}=0.1$, ${c}_{2}=0.1$, and a maximum of 100 iterations; the remaining parameters are left at their default values.
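
The training parameters stated above can be written down as a small configuration. The sketch below is illustrative only: `crf_params` and `build_crf` are hypothetical names, and the model is constructed only if the `sklearn_crfsuite` package happens to be installed; all parameters not listed are left at the library defaults, as in the text.

```python
# Hyperparameters as stated in the text: L-BFGS, c1 = 0.1, c2 = 0.1, 100 iterations.
crf_params = {
    "algorithm": "lbfgs",
    "c1": 0.1,              # L1 regularization coefficient
    "c2": 0.1,              # L2 regularization coefficient
    "max_iterations": 100,
}


def build_crf(params=crf_params):
    """Return a sklearn_crfsuite.CRF with the given parameters, or None if unavailable."""
    try:
        import sklearn_crfsuite  # optional dependency
    except ImportError:
        return None
    return sklearn_crfsuite.CRF(**params)
```
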

6.3 Experimental setting

For the active learning of the named entity recognition (NER) task, the labels have been removed from all the datasets so that they can be treated as a pool of un-annotated sentences in our experiment. In place of a human oracle, the removed labels serve as the labeling source: we keep track of the indexes of the sentences in $U$ and programmatically reassign their actual labels, so that the sentences are annotated correctly before they are added to the labeled dataset $L$. The disease names and adverse effects dataset has a train set and a test set only. The remaining three datasets are divided into a train set, a validation set, and a test set. In the experiment, the train set is used as the un-annotated data pool from which sentences are chosen for training using the query sampling strategies, and both the validation set and the test set are used only for testing; we present their results separately.

For the active learning-based approach, we first randomly select 2% of the sentences in the un-annotated data pool $U$ for annotation by the labeling source (or oracle) and, after their annotation, add them to the initially empty train set $L$; this is done for all the datasets. Afterwards, in each iteration until the stopping criterion is met, ten correctly annotated sentences are added to the labeled training set $L$. The ten selected sentences are those containing the words about whose annotation class the baseline classifier is most uncertain. The experiment is conducted for each of the four uncertainty strategies for active learning of named entity recognition. Sentences containing the word for which the classifier is most uncertain are treated as the most uncertain sentences; they are removed from the un-annotated data pool and, after being annotated by the oracle, placed in the annotated training set $L$. Hence, in each iteration, the baseline classifier is retrained on an updated training set that contains more of the uncertain sentences selected by the respective uncertainty sampling strategy.
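
The pool-based procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the experimental code: `train`, `evaluate`, `oracle`, and `uncertainty` are hypothetical callables supplied by the caller, the seed size is obtained by truncating 2% of the pool size to an integer (which is consistent with the seed sizes listed in Table 3, e.g., 69 of 3491), and the sentence-level least confidence score is taken as the uncertainty of the least confident word in the sentence. Per-token label marginals are assumed to be given as one dictionary per token.

```python
import random


def least_confidence(token_marginals):
    """Sentence uncertainty: 1 - P(best label) for its least confident token.

    token_marginals: list of {label: probability} dicts, one per token.
    """
    return max(1.0 - max(tok.values()) for tok in token_marginals)


def active_learning(pool, oracle, train, evaluate, uncertainty, target_f1,
                    seed_frac=0.02, batch_size=10, rng=None):
    """Run pool-based active learning until the target F1 is reached.

    pool        : list of un-annotated sentences (U)
    oracle      : index -> labels for pool[index] (stands in for the annotator)
    train       : list of (sentence, labels) -> model
    evaluate    : model -> F1 on the held-out evaluation set
    uncertainty : (model, sentence) -> float, higher means more uncertain
    Returns the final model and the number of labeled sentences in L.
    """
    rng = rng or random.Random(0)
    unlabeled = list(range(len(pool)))
    rng.shuffle(unlabeled)
    n_seed = max(1, int(seed_frac * len(pool)))  # random 2% seed set
    labeled, unlabeled = unlabeled[:n_seed], unlabeled[n_seed:]
    while True:
        model = train([(pool[i], oracle(i)) for i in labeled])
        if evaluate(model) >= target_f1 or not unlabeled:
            return model, len(labeled)
        # Move the batch_size most uncertain sentences from U to L.
        unlabeled.sort(key=lambda i: uncertainty(model, pool[i]), reverse=True)
        labeled += unlabeled[:batch_size]
        unlabeled = unlabeled[batch_size:]
```

The other uncertainty measures can be plugged in through the `uncertainty` argument without changing the loop.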

6.4 Evaluation metrics and criteria

The F1-score is the standard metric used for the evaluation of named entity recognition experiments [47]. It is the harmonic mean of precision (P) and recall (R), where 0 is the worst value and 1 is the best value. The formula for the F1-score is given by [17]:

$\displaystyle\text{F1-score}=\frac{2\times\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}$ (5)

where,

$\displaystyle\text{Precision}=\frac{\text{number of correctly tagged entities}}{\text{total number of tagged entities}}$ (6)

$\displaystyle\text{Recall}=\frac{\text{number of correctly tagged entities}}{\text{total number of entities}}$ (7)
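
Equations (5) to (7) can be illustrated with a short entity-level computation over IOB2 tag sequences. The sketch assumes exact-match scoring, i.e., a predicted entity counts as correct only if both its span and its type agree with the gold annotation, and it ignores stray `I-` tags that do not continue an entity; the function names are hypothetical and the averaging used for the reported results may differ.

```python
def iob2_entities(tags):
    """Return the set of (start, end, type) entity spans in one IOB2 sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def precision_recall_f1(gold_seqs, pred_seqs):
    """Entity-level precision, recall, and F1 as in Eqs. (5)-(7)."""
    correct = tagged = total = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = iob2_entities(gold), iob2_entities(pred)
        correct += len(g & p)   # correctly tagged entities
        tagged += len(p)        # total tagged entities
        total += len(g)         # total entities
    precision = correct / tagged if tagged else 0.0
    recall = correct / total if total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, if the gold sequence contains two entities and the prediction recovers exactly one of them and tags nothing else, precision is 1, recall is 0.5, and the F1-score is 2/3.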

The main objective of this research work is to demonstrate the effectiveness of the existing uncertainty sampling strategies and to enhance their performance further by incorporating similarity features for the active learning of the named entity recognition task. The proposed and existing approaches are compared with each other in terms of annotation cost, i.e., the amount of labeled training data required by each approach to reach the performance level attained by the supervised approach. We use the traditional supervised approach only to determine the performance level (F1-score) that the baseline Conditional Random Field (CRF) model reaches with a completely annotated training dataset, so that this level can serve as the stopping criterion for each of the active learning approaches. According to [27], this is possible only if the correct labels for the training set are available in advance, which is the case here. Each active learning algorithm is stopped once it reaches the performance achieved by the traditional supervised approach using the baseline CRF classifier, and its annotation cost is measured by recording the amount of labeled data it required to reach this stop condition. Therefore, the experiment is first conducted with the baseline CRF classifier on all the datasets in the traditional supervised way. The results obtained by the traditional supervised approach are reported in Table 2 for all four named entity recognition datasets of different domains.

Table 2

Final results obtained by the baseline classifier in traditional supervised way for four named entity recognition datasets of different domains

Dataset	Evaluation set	Precision	Recall	F1-score
Disease and adverse effect	Test set	0.698	0.501	0.580
NCBI disease	Validation set	0.873	0.797	0.833
	Test set	0.859	0.783	0.819
JNLPBA	Validation set	0.796	0.761	0.776
	Test set	0.749	0.732	0.737
CoNLL 2002 Spanish	Validation set	0.782	0.754	0.764
	Test set	0.813	0.797	0.804

Table 3

Amount of annotated training data required by the different uncertainty sampling strategies for active learning of the named entity recognition (NER) task on three biomedical datasets and one Spanish language dataset

Dataset	Total sentences in train set	#Seed sentences in $L$	Evaluation set	Labeled sentences in $L$ when classifier reached stop criteria	Least confident	Entropy	Margin of confidence	Ratio of confidence	Proposed algorithm
Disease and adverse effect	3491	69	Test set	No of sentences	1049	689	1099	799	719
				Percentage of sentences	30.05	19.74	31.48	22.88	20.59
NCBI disease	5424	108	Validation set	No of sentences	898	1518	2358	2128	968
				Percentage of sentences	16.56	27.98	43.47	39.24	17.85
			Test set	No of sentences	1258	1708	1088	1548	1098
				Percentage of sentences	23.19	31.49	20.06	28.54	20.24
JNLPBA	18607	372	Validation set	No of sentences	5222	5422	7322	5282	4982
				Percentage of sentences	28.06	29.14	39.35	28.38	26.77
			Test set	No of sentences	6872	6372	9302	7382	6732
				Percentage of sentences	36.93	34.25	49.99	39.67	36.18
CoNLL 2002 Spanish	8323	166	Validation set	No of sentences	2646	2456	2656	2376	2096
				Percentage of sentences	31.79	29.51	31.92	28.55	25.18
			Test set	No of sentences	3276	3046	3436	3456	2526
				Percentage of sentences	39.36	36.59	41.28	41.53	30.35

To measure the annotation cost (i.e., the amount of labeled training data required by each active learning-based approach until it satisfies the stop criterion), the active learning-based approaches are run until they reach the performance achieved by the baseline classifier in the traditional supervised way. The final annotation cost of the active learning approach under the different uncertainty sampling strategies, which depends on the number of annotated sentences in the labeled train set $L$, is reported in Table 3 for all four named entity recognition datasets of different domains, using the baseline Conditional Random Field (CRF) classifier. The best results obtained for the validation set and test set of the above datasets are highlighted in bold. In addition, performance has been plotted against the size of the training set, i.e., how the F1 score varies as the number of annotated sentences in the labeled training set $L$ increases for each uncertainty sampling strategy [14, 48]. For better visualization, an F1 score is plotted only if it increases with the number of annotated sentences in the labeled train set $L$. These plots are shown in Fig. 2 for all the named entity recognition datasets of different domains. Based on the above results, Fig. 3 compares the query sampling strategies in terms of final annotation cost, i.e., the number of annotated sentences each of them required for training, over all four datasets.
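
As a worked check of how the two rows per evaluation set in Table 3 relate to each other, the percentage is the number of labeled sentences divided by the size of the train set, and the number of query iterations follows from the seed size and the batch of ten sentences added per iteration. The helper below is a hypothetical illustration; the last digit of some table cells may differ depending on how the percentages were rounded.

```python
def annotation_cost(n_labeled, n_train, n_seed, batch_size=10):
    """Return (percentage of train set labeled, number of query iterations)."""
    percentage = round(100.0 * n_labeled / n_train, 2)
    iterations = (n_labeled - n_seed) // batch_size
    return percentage, iterations


# Least confident strategy on the disease and adverse effect test set (Table 3):
# 1049 labeled sentences out of 3491, starting from 69 seed sentences.
pct, iters = annotation_cost(1049, 3491, 69)
print(pct, iters)  # 30.05 98
```
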

Figure 2.

Performance of different uncertainty strategies based active learning approach with increasing number of annotated sentences in the labeled training set $L$ for four named entity recognition datasets of different domains. (a) Disease and adverse effect test set; (b) NCBI disease validation set; (c) NCBI disease test set; (d) JNLPBA validation set; (e) JNLPBA test set; (f) CoNLL 2002 Spanish validation set; (g) CoNLL 2002 Spanish test set.

Figure 3.

Amount of annotated training data (sentences) required by the active learning algorithms based on the different uncertainty sampling strategies to reach the stop condition over four named entity recognition (NER) datasets of different domains.

7. Results and discussion

As discussed above, Table 2 presents the performance of the supervised approach using the baseline Conditional Random Field (CRF) model, which is used as the stop condition in the active learning experiment. The performance of the proposed query sampling strategy and of the existing uncertainty query sampling strategies is measured in terms of annotation cost, i.e., the percentage of annotated sentences that the active learner needs under a given query sampling strategy to reach the stop condition. A lower annotation cost means that the approach needs fewer annotated sentences for training to reach the stop condition, and vice versa. The final amount of annotated data required by the different query sampling strategies is presented in Table 3. The F1-score is recorded at each iteration of every active learning experiment so that the strategies can be visualized and compared at any iteration. The resulting learning curves are shown in Fig. 2. The final annotation cost of the query sampling strategies over the four named entity recognition datasets of different domains is compared in Fig. 3.

The results are discussed mostly in terms of the final annotation cost of each query sampling strategy; however, the performance at each iteration, i.e., for different numbers of annotated sentences, can be seen in Fig. 2 for each of the datasets.

For the disease names and adverse effect dataset, the entropy sampling strategy performed best, followed by the proposed algorithm and the ratio of confidence sampling strategy. These three took 19.74% to 22.88% of the total sentences to reach the stop condition. The least confidence and margin of confidence sampling strategies needed the most sentences in the labeled training set $L$ (i.e., the maximum annotation cost) to achieve the stop criterion, namely 30.05% and 31.48%, respectively. The F1 score at each iteration of the respective query sampling strategies for this dataset can be seen in the performance curve shown in Fig. 2a.

For the NCBI disease validation set, the least confidence strategy and the proposed algorithm performed best, reaching the stop criterion with just 16.56% and 17.85% of annotated sentences in the labeled training set $L$, respectively. They are followed by the entropy sampling strategy, which reached the stop criterion with 27.98% of the sentences in the labeled train set $L$. The ratio of confidence and margin of confidence strategies required the most labeled sentences to reach the stop condition (39.24% and 43.47%, respectively). For the NCBI disease test set, the margin of confidence strategy and the proposed algorithm performed best and used the fewest annotated sentences to reach the stop criterion, followed by the least confident sampling strategy. These three reached the stop condition with 20.06% to 23.19% of annotated sentences in the labeled train set $L$. They are followed by the ratio of confidence and entropy sampling strategies, which took 28.54% and 31.49% of labeled sentences, respectively, to reach the stop condition. The corresponding learning curves are shown in Fig. 2b and c, respectively.

The JNLPBA is the largest dataset, having a total of 24806 sentences. For the JNLPBA validation set, all the sampling strategies except margin of confidence performed well and reached the stop criterion with 26.77% to 29.14% of the total train set in the labeled train set $L$. The margin of confidence sampling strategy required 39.35% of the labeled sentences to achieve the stop condition, i.e., it has the maximum annotation cost. Similarly, for the JNLPBA test set, the entropy strategy, the proposed algorithm, and the least confident strategy performed well and achieved the stop criterion with 34.25% to 36.93% of labeled sentences from the total train set. They are followed by the ratio of confidence strategy, which took 39.67% of the labeled sentences to reach the stop condition. The margin of confidence strategy required the largest amount of labeled sentences in the labeled train set $L$, i.e., 49.99%, to reach the stop condition. The performance curves are presented in Fig. 2d and e, respectively.

The CoNLL Spanish dataset is the second-largest dataset and the only non-biomedical dataset among those considered in the experiment. For the CoNLL Spanish validation set, the proposed algorithm performed best and reached the stop condition with just 25.18% of labeled sentences for training. It is followed by the ratio of confidence and entropy-based active algorithms, which took 28.55% and 29.51% of labeled sentences, respectively, to reach the stop condition. The least confident and margin of confidence based active algorithms required the most labeled sentences to reach the stop condition, i.e., 31.79% and 31.92%, respectively. For the CoNLL Spanish test set, the proposed algorithm again performed best and reached the stop condition using just 30.35% of labeled data for training the active learner. It is followed by the entropy-based active algorithm, which required 36.59% of labeled sentences. The least confidence, margin of confidence, and ratio of confidence strategies required 39.36% to 41.53% of labeled sentences from the total train set to train the active learner. The learning curves are shown in Fig. 2f and g, respectively.

It is important to note that, in all our experiments, the active learning approach reaches the performance of the baseline Conditional Random Field (CRF) classifier for every uncertainty measure with far less annotated training data, i.e., it reduces the annotation cost significantly. The proposed algorithm performed well: according to Table 3, it required the fewest labeled sentences in three of the seven evaluation settings (JNLPBA validation, CoNLL Spanish validation, and CoNLL Spanish test) and the second-fewest in the remaining four. Its result is therefore either the best or very competitive with the best performing active learning algorithm in every case. The main reason behind this competitive result is that, in each iteration of the active learning experiment, the proposed strategy selects the 10 least similar sentences out of the 30 most uncertain sentences chosen by the least confidence query sampling strategy. The proposed strategy thus builds on the least confidence strategy, and its performance also depends on the performance of that strategy. In terms of annotation cost, the proposed algorithm is better than the least confidence approach in six of the seven settings and very competitive in the remaining one (the NCBI disease validation set). These results show that combining sentence similarity scores with the least confident sampling strategy further improves its performance, i.e., the proposed approach further reduces the annotation cost. Moreover, if we compare only the existing query sampling strategies, none of them performs well on all the datasets, whereas the proposed algorithm consistently either provides the best result itself or is very competitive with the best performing query sampling strategy on all the datasets.
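
The selection step discussed above (keep the 10 least similar of the 30 most uncertain sentences) can be sketched as follows. Several details are assumptions made for illustration only: similarity is measured with `difflib.SequenceMatcher` from the Python standard library as a stand-in for the fuzzy string matching score, "least similar" is interpreted as the lowest mean similarity to the other candidates, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for the fuzzy matching score."""
    return SequenceMatcher(None, a, b).ratio()


def select_batch(sentences, uncertainty, n_candidates=30, batch_size=10):
    """Return indices of the batch_size least similar sentences among the
    n_candidates most uncertain ones.

    sentences   : list of sentence strings from the un-annotated pool U
    uncertainty : list of least confidence scores, one per sentence
    """
    ranked = sorted(range(len(sentences)), key=lambda i: uncertainty[i], reverse=True)
    candidates = ranked[:n_candidates]

    def mean_similarity(i):
        # Assumption: compare each candidate with the other candidates.
        others = [j for j in candidates if j != i]
        if not others:
            return 0.0
        return sum(similarity(sentences[i], sentences[j]) for j in others) / len(others)

    return sorted(candidates, key=mean_similarity)[:batch_size]
```

With this interpretation, near-duplicate uncertain sentences compete for the same slot in the batch, so each annotated batch covers more varied contexts than pure least confidence selection.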

In summary, the proposed algorithm performed well and required the fewest, or close to the fewest, labeled sentences for training the active learner in comparison with the active learning algorithms based on the other sampling strategies. In the proposed work, the similarity score of the sentences is combined with the least confidence sampling strategy to select, from the unlabeled train set $U$, the sentences from which the active learner learns the most after they are labeled. Hence, the proposed active learning algorithm can further improve on the existing uncertainty sampling strategies and can help reduce the annotation cost without compromising the performance of the machine learning model. It can therefore be used in place of the active learning algorithms based on the existing uncertainty sampling strategies.

8. Conclusion

The performance of traditional supervised machine learning algorithms depends highly on the size of the annotated training data, and large annotated datasets are needed for supervised models to perform well on different tasks. To reduce this dependency on large annotated datasets, a new uncertainty-based query sampling strategy has been proposed for the active learning approach and evaluated on the named entity recognition (NER) task. The proposed query sampling strategy is compared with other pool-based uncertainty sampling strategies, i.e., the least confident, margin of confidence, ratio of confidence, and entropy-based query sampling strategies, on the basis of annotation cost, which depends on the data instances selected for training the active learner. For this purpose, various features are extracted, and a baseline Conditional Random Field (CRF) classifier is used in all the experiments. To compare the proposed query sampling strategy with the existing uncertainty query sampling strategies, each active learning algorithm keeps adding sentences to the training set until it reaches the performance of the supervised model. It has been found that, in our experiments, the active learning algorithms based on the different uncertainty sampling strategies always reach the performance of the traditional supervised baseline CRF model while reducing the annotation cost significantly.

The existing uncertainty-based sampling strategies consider only the probability value of the model to determine which data instances to select. The proposed hybrid uncertainty-based active learning algorithm combines a fuzzy string matching score with the least confidence sampling strategy to select the most informative sentences, from which the classifier can learn most about unseen sentences. In most cases the proposed algorithm performs better than or competitively with the other query sampling strategies, and it outperforms the least confidence-based approach in six of the seven evaluation settings, which suggests that considering the similarity of sentences helps in selecting appropriate data for active learning and can further reduce the labeling cost. The proposed active learning algorithm is easy to implement and can be used for the classification of important entities in textual named entity recognition datasets of any domain.

In this paper, the sentence similarity score is combined only with the least confidence query sampling strategy for active learning of the named entity recognition task. In the future, the sentence similarity score can be combined with the other uncertainty-based query sampling strategies for this task, and different sentence similarity algorithms can also be explored.

References

1. Grishman, Sundheim. Message understanding conference-6: A brief history. in: Proceedings of the 16th Conference on Computational Linguistics [Internet]. Copenhagen, Denmark: Association for Computational Linguistics. 1996; 466-471.

2. Nadeau, Sekine. A survey of named entity recognition and classification. Lingvisticae Investig. 2007; 30(1): 3-26.

3. Lee, Hwang, Kim, Rim. Biomedical named entity recognition using two-phase model based on SVMs. J Biomed Inform [Internet]. 2004; 37(6): 436-447.

4. Krouska, Troussas, Virvou. A literature review of social networking-based learning systems using a novel ISO-based framework. Intell Decis Technol. 2019; 13: 23-39.

5. Lee, Yoon, Kim, et al. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics [Internet]. 2019; 36(4): 1234-1240.

6. Yeniterzi, Oflazer. Turkish Named-Entity Recognition. in: Turkish Natural Language Processing [Internet]. Oflazer, Saraçlar, eds. Cham: Springer International Publishing. 2018; 115-132.

7. Mehta, Chandra. NICFS: A novel feature selection method applied to lexicon based sentiment analysis. Intell Decis Technol. 2019; 13: 41-48.

8. Waitelonis, Jörges, Sack. Remixing entity linking evaluation datasets for focused benchmarking. Semant Web. 2019; 10: 385-412.

9. Anoop, Asharaf. Conceptualized phrase clustering with distributed k-means. Intell Decis Technol. 2019; 13: 153-160.

10. Prakash, Saha. A study on use of the web for automatic answering of remedy finding questions of common users. Technol Heal Care. 2019; 27: 23-35.

11. Abdi, Hasan, Arshi, Shamsuddin, Idris. A question answering system in hadith using linguistic knowledge. Comput Speech Lang [Internet]. 2019; 101023.

12. Karacapilidis, Malefaki, Charissiadis. A novel framework for augmenting the quality of explanations in recommender systems. Intell Decis Technol. 2017; 11: 187-197.

13. Gao, Karampatziakis, Potharaju, Cucerzan. Active entity recognition in low resource settings. in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2019; 2261-2264.

14. Chen, Lasko, Mei, Denny. A study of active learning methods for named entity recognition in clinical text. J Biomed Inform [Internet]. 2015; 58: 11-18.

15. Kholghi, Sitbon, Zuccon, Nguyen. Active learning reduces annotation time for clinical concept extraction. Int J Med Inform [Internet]. 2017; 106: 25-31.

16. Nguyen, Nez, Trawiński. A named entity recognition approach for tweet streams using active learning. J Intell Fuzzy Syst. 2017; 32(2): 1277-1287.

17. Ekbal, Saha, Sikdar. On active annotation for named entity recognition. Int J Mach Learn Cybern [Internet]. 2016; 7(4): 623-640.

18. Shen, Yun, Lipton, Kronrod, Anandkumar. Deep active learning for named entity recognition. CoRR [Internet]. 2017; abs/1707.0.

19. Bhutani, Qian, Jagadish HV, Hernandez, Vasa. Exploiting structure in representation of named entities using active learning. in: Proceedings of the 27th International Conference on Computational Linguistics [Internet]. Santa Fe, New Mexico, USA: Association for Computational Linguistics. 2018; 687-699.

20. Liu, Wang. LTP: A new active learning strategy for bert-crf based named entity recognition. 2020.

21. Wei, Chen, Salimi, Denny, Mei, Lasko, et al. Cost-aware active learning for named entity recognition in clinical text. J Am Med Informatics Assoc [Internet]. 2019; 26(11): 1314-1322.

22. Huang, Wang, Jin. A low-cost named entity recognition research based on active learning. Sci Program. 2018; 2018: 10.

23. Vetriselvi, Gopalan. An improved key term weightage algorithm for text summarization using local context information and fuzzy graph sentence score. J Ambient Intell Humaniz Comput [Internet]. 2020.

24. Liu, Huang, Xuan, Zhang, Gao. A fuzzy word similarity measure for selecting top-k similar words in query expansion. IEEE Trans Fuzzy Syst. 2020; 1.

25. Cross, Mokrenko, Crockett, Adel. Using fuzzy set similarity in sentence similarity measures. IEEE. 2020.

26. Laws, Schütze. Stopping criteria for active learning of named entity recognition. in: Proceedings of the 22nd International Conference on Computational Linguistics – Volume 1 [Internet]. Stroudsburg, PA, USA: Association for Computational Linguistics. 2008; 465-472. (COLING ’08).

27. Vlachos. A stopping criterion for active learning. Comput Speech Lang [Internet]. 2008; 22(3): 295-312.

28. Johns, Leutenegger, Davison. Pairwise decomposition of image sequences for active multi-view recognition. in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.

29. Rubens, Elahi, Sugiyama, Kaplan. Active learning in recommender systems. in: Recommender Systems Handbook [Internet]. Ricci, Rokach, Shapira, eds. Boston, MA: Springer US. 2015; 809-846.

30. Kranjc, Smailović, Podpečan, Grčar, Žnidaršič, Lavrač. Active learning for sentiment analysis on data streams: Methodology and workflow implementation in the ClowdFlows platform. Inf Process Manag [Internet]. 2015; 51(2): 187-203.

31. Sun, Han. A survey on deep learning for named entity recognition. CoRR [Internet]. 2018; abs/1812.0.

32. Settles. Active learning. Synth Lect Artif Intell Mach Learn [Internet]. 2012; 6(1): 1-114.

33. Settles, Craven. An analysis of active learning strategies for sequence labeling tasks. in: Proceedings of the Conference on Empirical Methods in Natural Language Processing [Internet]. Stroudsburg, PA, USA: Association for Computational Linguistics. 2008; 1070-1079. (EMNLP ’08).

34. Shannon. A mathematical theory of communication. Bell Syst Tech J. 1948; 27: 379-423, 623-656.

35. Munro. Human-in-the-loop machine learning [Internet]. Manning. 2019.

36. Gurulingappa, Klinger, Hofmann-Apitius, Fluck. An empirical evaluation of resources for the identification of diseases and adverse effects in biomedical literature. in: 2nd Workshop on Building and evaluating resources for biomedical text mining (7th edition of the Language Resources and Evaluation Conference). Valletta, Malta. 2010.

37. Doğan, Leaman. NCBI disease corpus: A resource for disease name recognition and concept normalization. J Biomed Inform [Internet]. 2014; 47: 1-10.

38. Collier, Kim. Introduction to the bio-entity recognition task at JNLPBA. in: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP) [Internet]. Geneva, Switzerland: COLING. 2004; 73-78.

39. GENIA Project. BioNLP/JNLPBA shared task 2004 [Internet]. 2004.

40. Crichton, Pyysalo, Chiu, Korhonen. A neural network multi-task learning approach to biomedical named entity recognition. BMC Bioinformatics [Internet]. 2017; 18(1): 368.

41. Tjong Kim Sang. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. in: Proceedings of the 6th Conference on Natural Language Learning – Volume 20 [Internet]. Stroudsburg, PA, USA: Association for Computational Linguistics. 2002; 1-4. (COLING-02).

42. Goyal, Gupta, Kumar. Recent named entity recognition and classification techniques: A systematic review. Comput Sci Rev [Internet]. 2018; 29: 21-43.

43. Korobov. Sklearn-crfsuite docs [Internet]. 2015. [cited 2019 Apr 11].

44. Okazaki. CRFsuite: A fast implementation of Conditional Random Fields (CRFs) [Internet]. 2007.

45. Wang, Yang, Guan. A comparative study for biomedical named entity recognition. Int J Mach Learn Cybern [Internet]. 2018; 9(3): 373-382.

46. Lafferty, McCallum, Pereira FCN. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. in: Proceedings of the Eighteenth International Conference on Machine Learning [Internet]. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. 2001; 282-289. (ICML ’01).

47. TeamHG-Memex. Scikit-learn inspired API for CRFsuite [Internet].

48. Tran, Nguyen, Fujita, Hoang, Hwang. A combination of active learning and self-learning for named entity recognition on Twitter using conditional random fields. Knowledge-Based Syst [Internet]. 2017; 132: 179-187.

Table 1
Detail of named entity recognition datasets used in active learning experiments