Abstract
Active learning is a well-known approach for labeling large un-annotated datasets with minimal effort and in a cost-efficient way. It iteratively selects the most informative instances and adds them to the training set so that the performance of the learner improves with each iteration. Named entity recognition (NER) is a key information extraction task in which the entities present in sequences are labeled with the correct class. Traditional query sampling strategies for active learning consider only the final probability value of the model when selecting the most informative instances. In this paper, we propose a new active learning algorithm based on a hybrid query sampling strategy that considers sentence similarity along with the final probability value of the model. We compare it with four well-known pool-based uncertainty query sampling strategies for active learning of NER: least confidence sampling, margin of confidence sampling, ratio of confidence sampling, and entropy sampling. The experiments were performed on three biomedical NER datasets from different domains and one Spanish-language NER dataset. We found that all of these approaches reach the performance of the supervised approach while requiring much less annotated training data. In most cases, the proposed active learning algorithm performs well and further reduces the annotation cost compared with the active learning algorithms based on the other sampling strategies.
Introduction
The term ‘Named Entity’ was first introduced at the Sixth Message Understanding Conference (MUC-6), which focused on the task of information extraction, including the identification of locations, organizations, people, and various numerical expressions [1, 2]. Named entity recognition (NER) is an important part of information extraction systems: it identifies named entities, which in turn supports the extraction of meaningful structured information from large amounts of unstructured text. The main task of NER is to identify the named entities present in unstructured text and to classify them into appropriate pre-defined classes [3]. Factors such as a growing vocabulary, differing naming conventions, the polysemy of named entities, and the non-availability of annotated datasets make NER difficult and challenging [3]. The amount of textual data on social media platforms is rising exponentially, which calls for data mining approaches [4, 5]. NER is widely used as an initial subtask in various natural language processing applications.
Supervised learning approaches are often used to solve machine learning tasks, including named entity recognition (NER), and they require a large annotated dataset for training. The performance of supervised models depends mainly on the availability of annotated data, so these models are suitable only when such data exist. Annotated data are often scarce in comparison with un-annotated data, since annotating a dataset is a costly and time-consuming task. Annotation also requires domain experts who know the field to which the un-annotated dataset belongs.
Active learning approaches are widely adopted in such cases since they greatly reduce the manual annotation needed to create a training dataset for machine learning models. By selecting the most informative instances from an un-annotated dataset to be labeled by an annotator (or oracle), active learning helps to minimize the amount of data required to train the model. It uses a query sampling strategy to select the most informative instances from the un-annotated data pool so that the model learns the most from these instances once they are annotated. In this paper, a new similarity-based hybrid query sampling strategy is proposed for active learning of the named entity recognition task. The existing pool-based uncertainty query sampling strategies, namely least confidence, margin of confidence, entropy, and ratio of confidence, are also studied and applied to the named entity recognition task. The proposed strategy is compared with these existing strategies on the basis of annotation cost, which depends on how well the strategy selects informative instances from the data pool, so that the classifier needs the least amount of annotated training data to reach a given performance level.
The contribution of our study can be summarized as follows.
To deal with the unavailability of labeled datasets, the efficiency of different uncertainty-based query sampling strategies for active learning has been validated in terms of annotation cost through experiments that simulate a real-world scenario for active learning of the named entity recognition task. A new hybrid query sampling strategy is proposed for active learning which also considers sentence similarity information while selecting the most informative instances from the un-annotated data pool for annotation. The proposed hybrid strategy is combined with the least confidence query sampling strategy for active learning of the named entity recognition task. The other existing uncertainty-based query sampling strategies consider only the probability value of the active learner when selecting the most informative instances for annotation. The performance of the proposed hybrid query sampling strategy is compared with that of the other query sampling strategies in terms of the amount of annotated data each requires to train the active learner for the named entity recognition task. The results show that the proposed hybrid query sampling strategy further improves the performance of the active learner and is more consistent than any of the existing query sampling strategies.
The rest of this paper is organized as follows. After the introduction and the background on named entity recognition and active learning, the general active learning approach is described in more detail, followed by the four existing uncertainty query sampling strategies for active learning. The proposed query sampling approach is then presented. Next, the experimental details are discussed: the datasets, the feature extraction and baseline classifier, the experimental setting, and the evaluation metrics and criteria for the active learning experiments on the named entity recognition task. These are followed by the results, discussion, and conclusion.
Named entity recognition is used in various application areas such as machine translation [6], sentiment analysis [7], named entity linking [8], web information mining [9, 10], question answering [11], and recommender systems [12]. This work, however, concentrates on active learning as a way to deal with insufficient annotated data, specifically for the named entity recognition task. This section therefore discusses existing research on active learning for named entity recognition and on sentence similarity. The work presented in this paper is related to, and inspired by, the research discussed in this section.
A new framework that combines active learning and conditional random fields (CRF) is proposed by [13] for entity recognition in low-resource open-domain text corpora. The authors achieved a good F1 score while keeping annotation costs low. Three types of sampling strategies, namely baseline sampling, diversity-based sampling, and uncertainty-based sampling, are used for active learning experiments on clinical entity recognition by [14]. The results were compared with a passive learning approach that applies random sampling, and the authors found that the uncertainty query strategies outperformed all the other query sampling strategies. Therefore, we have also studied the uncertainty-based query strategies and propose a new similarity-based hybrid uncertainty query sampling strategy that further improves the performance of active learning for the named entity recognition task. Similar work is reported by [15, 16] for active learning-based clinical concept extraction and for social media tweets. In other work, [17] proposed two methods for active learning of the named entity recognition task. The first method is based on the Support Vector Machine (SVM), and the second uses an ensemble of Support Vector Machine (SVM) and Conditional Random Field (CRF). These methods are evaluated on English, Bengali, Hindi, and biomedical domain datasets. More recently, a deep active learning approach was proposed for named entity recognition (NER) by [18]. To reduce the time spent on iterative retraining in deep learning-based active learning, the authors proposed a lightweight and computationally efficient CNN-CNN-LSTM architecture (where CNN stands for Convolutional Neural Network and LSTM stands for Long Short-Term Memory) for fast and incremental active learning.
In addition, [19] proposed an active learning-based architecture for named entity recognition named LUSTRE, which uses a query sampling strategy based on term frequency-inverse document frequency (TF-IDF) and can learn the structure of the entities iteratively while using very few labeled mentions.
A new uncertainty-based query sampling strategy that considers the intermediate results along with the final output is proposed by [20]. The traditional sampling strategies depend only on the final probability value of the respective model. The proposed strategy is named the lowest token probability, and the authors compared it with the least confidence and normalized least confidence query sampling strategies. The proposed and the existing active learning approaches were applied with the combined Bidirectional Encoder Representations from Transformers-Conditional Random Field (BERT-CRF) model. Almost all active learning methods for named entity recognition assume that the cost of annotating each sample is the same. A new cost-aware query sampling strategy based on the CAUSE algorithm (Clustering And Uncertainty Sampling Engine) is proposed by [21]; it selects less costly but more informative sentences by taking into account that the annotation cost can differ between samples. Another query sampling strategy is proposed by [22], which uses k-means clustering, stratified sampling, and an entropy criterion to select the most informative sentences for training the active learner for the named entity recognition task.
Various sentence similarity measures have been used by researchers for different natural language processing tasks such as text summarization. A fuzzy-graph-based sentence similarity score, along with two other algorithms, is used by [23] to find highly correlated sentences of a paragraph in order to improve a key term weightage algorithm for the text summarization task. For the selection of the top-k similar words, a fuzzy similarity measure is used by [24] and further combined with association rules to measure word similarity at a global level. A similar approach to sentence similarity using a fuzzy set similarity measure is presented by [25].
Flowchart for active learning algorithm.
In the past, different experiments have been conducted to decide when it is appropriate to stop an active learning experiment. In this direction, [26] compared three stopping conditions for active learning of the named entity recognition task. One of the stopping criteria, based on the gradient, stopped the algorithm reliably and achieved near-optimal results. In similar work, a new stopping criterion is proposed by [27], which uses the least confidence uncertainty sampling approach for text classification and named entity recognition with different machine learning models, including Bayesian Logistic Regression, Maximum Entropy (ME), and Support Vector Machine (SVM).
The active learning approach is very popular and is used in a wide range of applications including image recognition [28], recommender systems [29], sentiment analysis [30], and text summarization [31]. The main idea behind active learning is that the learning algorithm iteratively selects the most informative data instances from the unlabeled data pool so that the model can learn faster and perform better than a traditional supervised machine learning model while using considerably less training data. The active learning task begins with randomly selecting some of the instances from a large pool of unlabeled dataset
Existing uncertainty sampling based active learning approach
In this paper, a new similarity-based hybrid uncertainty query sampling strategy has been proposed for the active learning algorithm for the named entity recognition task. The proposed strategy is applied and compared with four existing uncertainty sampling approaches, namely: least confidence, the margin of confidence, entropy, and the ratio of confidence query sampling strategies for selecting the most appropriate data from the pool of un-annotated data. Various notations used here include
Least confident
This strategy is the most basic uncertainty sampling strategy that selects the unlabeled instances for which the classifier has the least confidence in its classification. It can also be said that the most informative instances are those instances that have the highest uncertainty. The mathematical representation of this approach can be given as:
where
Least confidence query sampling strategy based active learning algorithm for named entity recognition (NER)
Empty labeled set –
Randomly select seed sentences from
Use labeling source to correctly annotate the selected sentences and place it in
Using sentences in labeled set
For every sentence in
The least confidence scores of the sentences are compared so that the top 10 most uncertain sentences, i.e., those having the largest least confidence scores, are selected and removed from
If the stop condition is not met, repeat from step 2; otherwise stop
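The pool-based loop described by these steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `active_learning_loop` and the `oracle`, `train`, `score`, and `evaluate` callables are hypothetical placeholders for the labeling source, the classifier training routine, the query sampling score (larger meaning more uncertain), and the F1 evaluation. The batch size of 10 follows the algorithm text.

```python
import random
from typing import Callable, List, Optional, Tuple


def active_learning_loop(
    pool: List[str],
    oracle: Callable[[str], object],
    train: Callable[[List[Tuple[str, object]]], object],
    score: Callable[[object, str], float],
    evaluate: Callable[[object], float],
    target_f1: float,
    seed_size: int,
    batch_size: int = 10,
    rng: Optional[random.Random] = None,
) -> Tuple[List[Tuple[str, object]], List[Tuple[int, float]]]:
    """Run a pool-based active learning loop until target_f1 is reached.

    All callables are hypothetical stand-ins supplied by the caller.
    Returns the labeled set and a history of (labeled size, F1) pairs.
    """
    rng = rng or random.Random(0)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)

    # Randomly chosen seed sentences are annotated by the labeling source.
    labeled = [(s, oracle(s)) for s in unlabeled[:seed_size]]
    unlabeled = unlabeled[seed_size:]

    history: List[Tuple[int, float]] = []
    while True:
        # Train on the current labeled set and measure performance.
        model = train(labeled)
        f1 = evaluate(model)
        history.append((len(labeled), f1))

        # Stop condition: target reached, or nothing left to annotate.
        if f1 >= target_f1 or not unlabeled:
            return labeled, history

        # Rank the remaining sentences, most uncertain first.
        order = sorted(
            range(len(unlabeled)),
            key=lambda i: score(model, unlabeled[i]),
            reverse=True,
        )
        picked = order[:batch_size]

        # Annotate the selected batch and move it to the labeled set.
        labeled += [(unlabeled[i], oracle(unlabeled[i])) for i in picked]
        picked_set = set(picked)
        unlabeled = [s for i, s in enumerate(unlabeled) if i not in picked_set]
```

For a strategy where smaller scores mean higher uncertainty (such as the margin of confidence), the caller would pass the negated score so that the same descending sort applies.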
This strategy utilizes the Shannon entropy
where
Entropy query sampling strategy based active learning algorithm for named entity recognition (NER)
Empty labeled set –
Classifier – C
Randomly select seed sentences from
Use labeling source to correctly annotate the selected sentences and place it in
Using sentences in labeled set
For every sentence in
The entropy scores of the sentences are compared so that the top 10 most uncertain sentences, i.e., those having the largest entropy scores, are selected and removed from
If the stop condition is not met, repeat from step 2; otherwise stop
Margin query sampling strategy based active learning algorithm for named entity recognition (NER)
Empty labeled set –
Randomly select seed sentences from
Use labeling source to correctly annotate the selected sentences and place it in
Using sentences in labeled set
For every sentence in
The margin scores of the sentences are compared so that the top 10 most uncertain sentences, i.e., those having the smallest margin scores, are selected and removed from
If the stop condition is not met, repeat from step 2; otherwise stop
This strategy takes the margin between the first and the second most likely classifier predictions into consideration for the selection of most uncertain instances. The mathematical representation of this approach can be given as:
where
This query sampling strategy is similar to the margin of confidence sampling strategy. The difference is that it takes the ratio between the first and the second most likely classifier predictions into consideration for the selection of most uncertain instances. The mathematical representation of this approach can be given as:
where
Ratio of confidence strategy based active learning algorithm for named entity recognition (NER)
Empty labeled set -
Randomly select seed sentences from
Use labeling source to correctly annotate the selected sentences and place it in
Using sentences in labeled set
For every sentence in
The ratio scores of the sentences are compared so that the top 10 most uncertain sentences, i.e., those having the largest ratio scores, are selected and removed from
If the stop condition is not met, repeat from step 2; otherwise stop
The drawback of the least confidence strategy is that it considers only the probability of the most likely class when calculating the uncertainty of an instance. Both the ratio of confidence and the margin of confidence sampling strategies address this shortcoming, as they also consider the second-best class probability. However, both approaches still ignore the rest of the output distribution, which can be a problem when the number of classes is large. The entropy sampling strategy addresses this problem by considering the probability values of all the classes when calculating the uncertainty of an instance [32]. All four uncertainty query sampling strategies select the unlabeled data instances for which the trained classifier is most uncertain. In the next section, the proposed similarity-based hybrid uncertainty query sampling strategy for active learning of the named entity recognition task is discussed.
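The four scores compared above can be sketched for a single predicted class distribution as follows. This is an illustrative sketch in the standard textbook form, with hypothetical function names. How the per-token probabilities of a CRF are aggregated into one sentence-level score is not specified in this section, so that step is left out. The ratio is written as second-best over best so that, as in the algorithm text, larger ratio scores and smaller margin scores indicate higher uncertainty.

```python
import math
from typing import Sequence


def least_confidence(probs: Sequence[float]) -> float:
    """1 minus the probability of the most likely class (larger = more uncertain)."""
    return 1.0 - max(probs)


def margin_of_confidence(probs: Sequence[float]) -> float:
    """Gap between the two most likely classes (smaller = more uncertain)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def ratio_of_confidence(probs: Sequence[float]) -> float:
    """Second-best probability over best probability (larger = more uncertain)."""
    top = sorted(probs, reverse=True)
    return top[1] / top[0]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits over all classes (larger = more uncertain)."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    peaked = [0.90, 0.05, 0.05]   # classifier is fairly sure
    flat = [0.40, 0.35, 0.25]     # classifier is unsure
    for name, fn in [
        ("least confidence", least_confidence),
        ("margin", margin_of_confidence),
        ("ratio", ratio_of_confidence),
        ("entropy", entropy),
    ]:
        print(name, round(fn(peaked), 4), round(fn(flat), 4))
```

Only the entropy score changes when probability mass moves among the lower-ranked classes, which is the point made about the rest of the output distribution.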
A new similarity-based hybrid uncertainty query sampling strategy is proposed for active learning of the named entity recognition task. According to [14], the uncertainty query sampling strategies perform best in most cases in comparison with the other sampling strategies. Many researchers have also applied uncertainty-based query sampling in the past, but discussed the details only briefly. We have therefore studied the known uncertainty query sampling strategies in depth. Further, we have improved the uncertainty query strategy by incorporating similarity information about the sentences that are most uncertain according to the least confidence query sampling approach. In the proposed approach, instead of selecting ten sentences, we first select a total of 30 sentences using the least confidence scores of the unlabeled sentences. For each of the 30 sentences, an average similarity score is computed by averaging its similarity scores with the other 29 sentences. The similarity score is a fuzzy string matching score, which uses the Levenshtein distance to compute the similarity between sequences. For ease of implementation, we have used the popular Python package ‘FuzzyWuzzy’ for this task, which provides a number of scorers for computing the similarity score between sequences. We have used the ‘WRatio’ scorer in our implementation, which handles differences in letter case among various other parameters. Using the library, we can compare the sequences with each other and obtain a similarity score out of 100. A similarity score of 100 means that the two sequences are identical according to the similarity index.
In each iteration of the proposed algorithm, ten sentences having the least average similarity scores have been selected from the 30 most uncertain sentences according to the least confidence query sampling strategy. So the ten sentences selected in each iteration are the least similar sentences out of 30 most uncertain sentences. The proposed algorithm is presented below:
Similarity based hybrid uncertainty query sampling strategy for active learning algorithm for named entity recognition (NER)
Empty labeled set –
Classifiers – C
Randomly select seed sentences from
Use labeling source to correctly annotate the selected sentences and place it in
Using sentences in labeled set
For every sentence in
The least confidence scores of the sentences are compared so that the top 30 most uncertain sentences, i.e., those having the largest least confidence scores, are selected
For each of the 30 selected sentences: i. Compute the similarity score of the sentence with respect to each of the remaining 29 sentences ii. Compute and store the average of the similarity scores of the current sentence with the other 29 selected sentences
Sort the sentence_ids according to the average similarity score in ascending order
Select the first 10 sentence_ids having the minimum average similarity score and remove them from
If the stop condition is not met, repeat from step 2; otherwise stop
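The selection step of the proposed strategy (30 most uncertain candidates, then the 10 with the lowest average similarity) can be sketched as follows. This is a hedged illustration, not the authors' code. The function names are hypothetical, and `default_similarity` is a swapped-in stand-in built on the standard library's `difflib`, used here only so that the sketch runs without extra dependencies. It is not FuzzyWuzzy's WRatio scorer, although it also returns a score out of 100.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Sequence


def default_similarity(a: str, b: str) -> float:
    """Stand-in similarity on a 0-100 scale (NOT FuzzyWuzzy's WRatio)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def hybrid_select(
    sentences: Sequence[str],
    lc_scores: Sequence[float],
    n_candidates: int = 30,
    n_select: int = 10,
    similarity: Callable[[str, str], float] = default_similarity,
) -> List[int]:
    """Return indices of the sentences to send for annotation.

    sentences and lc_scores are parallel sequences; lc_scores holds the
    least confidence score of each unlabeled sentence.
    """
    # Step 1: candidates are the sentences with the largest least confidence scores.
    candidates = sorted(
        range(len(sentences)), key=lambda i: lc_scores[i], reverse=True
    )[:n_candidates]

    # Step 2: average similarity of each candidate to the other candidates.
    avg_sim: Dict[int, float] = {}
    for i in candidates:
        others = [
            similarity(sentences[i], sentences[j]) for j in candidates if j != i
        ]
        avg_sim[i] = sum(others) / len(others) if others else 0.0

    # Step 3: keep the candidates that are least similar to the rest.
    return sorted(candidates, key=lambda i: avg_sim[i])[:n_select]
```

With the `fuzzywuzzy` package installed, `fuzz.WRatio` from `fuzzywuzzy` could be passed as the `similarity` argument, since it takes two strings and returns a score out of 100.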
Detail of named entity recognition datasets used in active learning experiments
In the next section, the experimental details are discussed for the above active learning experiment based on different query sampling strategies for named entity recognition task.
Datasets
In this paper, we have studied and applied four uncertainty query sampling strategies for active learning of the named entity recognition task. We have also proposed a new similarity-based hybrid uncertainty query sampling strategy and compared its performance with these four existing strategies on three biomedical text datasets and one Spanish language dataset. The first dataset is a biomedical text dataset with two annotated entity classes, disease names and adverse effects [36]. It consists of 400 abstracts randomly selected with the PubMed query ‘Disease OR Adverse effect’. The dataset has a total of 1428 tokens that belong to the disease class and 813 tokens that fall in the adverse effect class. The second dataset is also a biomedical dataset but in the disease domain, i.e., the NCBI disease dataset, which contains 793 PubMed abstracts with a total of 6892 tokens that are disease mentions [37]. This dataset has only one annotated entity class, disease. It is one of the benchmark datasets widely used for disease named entity recognition.
The third text dataset is the JNLPBA dataset, which was derived from the GENIA Term dataset for the BioNLP/JNLPBA shared task 2004 organized by the GENIA project and deals with the classification of molecular biology entities [39]. The dataset contains a total of 2404 MEDLINE abstracts whose tokens are annotated with five entity classes, i.e., protein, RNA, DNA, cell line, and cell type. It is one of the largest benchmark named entity recognition datasets and has been widely used by researchers, particularly in the biomedical domain. For our experiment, the JNLPBA dataset is divided into a train set, a validation set, and a test set in the same way as in [40]. The fourth text dataset is taken from the CoNLL-2002 shared task and contains four annotated entity classes in the Spanish language, namely miscellaneous, organization, location, and person [41]. This dataset is divided into a train set, a validation (or testa) set, and a test (or testb) set, which consist of 8323, 1915, and 1517 sentences, respectively. Further details of the datasets are presented in Table 1.
All the above text datasets are publicly available and are provided in the IOB2 (Inside-Outside-Begin) format [42]. Further details on how these datasets are used to evaluate active learning of the named entity recognition (NER) task with the different uncertainty-based sampling approaches are discussed in later sections.
Feature details and baseline classifier used
The performance of a machine learning model on a classification task depends mainly on the features it uses. We extracted various features from the text and tried different combinations of them. In this paper, we report and use one of these combinations, for which the baseline machine learning model (i.e., the Conditional Random Field model) performed better. The features extracted and used in this paper include the following [43, 44]:
Word type: Whether all the characters of the word are alphabetic, all are digits, or the word contains both.
Word case: Whether the word is in title case, upper case, or lower case.
Word pattern: All the characters of the word are replaced with certain predefined characters so that the resulting string captures the pattern of the word [2]. All the uppercase characters are replaced by ‘
Suffix and prefix: To store the morphological information of the word, its suffixes and prefixes of up to 4 characters, where available, are stored as features.
Part-of-speech tag: The part-of-speech (POS) tag is an important feature that is also passed to the baseline machine learning model during training. We have used the NLTK (Natural Language Toolkit) POS tagger to tag each word for all three biomedical datasets. The part-of-speech tags were already present in the Conference on Natural Language Learning (CoNLL) 2002 Spanish named entity dataset.
Context information: The information about the previous and next words is stored along with that of the current word and used as a feature.
Begin and end of sentence: The first word of a sentence is often an entity, and the last word of a sentence also has its own importance. We therefore set Begin of Sentence (BOS) and End of Sentence (EOS) to True for the appropriate word of each sentence so that they can be used as features. If a sentence contains only a single word, both BOS and EOS are True for that word.
Other: The lowercase version of the word and the word length are also used as features.
In addition, a hasHyphen() function is implemented as a feature; it returns ‘True’ if the token under consideration contains a hyphen, i.e. ‘-’, and ‘False’ otherwise.
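The feature set described above can be sketched as a per-token feature dictionary, the input form commonly used with CRF toolkits. This is an illustrative sketch with hypothetical names (`word_pattern`, `has_hyphen`, `word2features`, and all feature keys). In particular, the replacement characters 'A', 'a', and '0' in `word_pattern` are assumptions, because the characters actually used are not preserved in the text. POS tags are taken as given input rather than computed.

```python
from typing import Dict, List, Tuple


def word_pattern(word: str) -> str:
    """Map characters to pattern symbols; 'A', 'a', '0' are assumed symbols."""
    out = []
    for ch in word:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append(ch)  # punctuation such as '-' is kept unchanged
    return "".join(out)


def has_hyphen(word: str) -> bool:
    """True if the token contains a hyphen."""
    return "-" in word


def word2features(sent: List[Tuple[str, str]], i: int) -> Dict[str, object]:
    """Build features for token i of a sentence given as (word, POS) pairs."""
    word, pos = sent[i]
    feats: Dict[str, object] = {
        # Word type
        "word.isalpha": word.isalpha(),
        "word.isdigit": word.isdigit(),
        "word.mixed": word.isalnum() and not word.isalpha() and not word.isdigit(),
        # Word case
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "word.islower": word.islower(),
        # Pattern, hyphen, and other features
        "word.pattern": word_pattern(word),
        "word.has_hyphen": has_hyphen(word),
        "word.lower": word.lower(),
        "word.length": len(word),
        "pos": pos,
        # Sentence boundaries: both are True for a one-word sentence
        "BOS": i == 0,
        "EOS": i == len(sent) - 1,
    }
    # Prefixes and suffixes of up to 4 characters, where available
    for k in range(1, 5):
        if len(word) >= k:
            feats["prefix%d" % k] = word[:k]
            feats["suffix%d" % k] = word[-k:]
    # Context: previous and next word
    if i > 0:
        feats["-1:word.lower"] = sent[i - 1][0].lower()
        feats["-1:pos"] = sent[i - 1][1]
    if i < len(sent) - 1:
        feats["+1:word.lower"] = sent[i + 1][0].lower()
        feats["+1:pos"] = sent[i + 1][1]
    return feats
```

A sentence would then be converted with `[word2features(sent, i) for i in range(len(sent))]`.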
Different supervised machine learning models have been used by researchers for named entity recognition, including the Support Vector Machine (SVM), Hidden Markov Model (HMM), Maximum Entropy (ME), Conditional Random Field (CRF), and various neural network-based models [42, 45]. The most widely used and popular among them is the Conditional Random Field (CRF) model, which is commonly used as a baseline classifier for entity recognition. The conditional random field is a discriminative undirected graphical model that makes no assumption about the underlying distribution and models an undirected graph connecting labels with observations. In this model, the conditional probability of the label sequence given the input sequence is considered instead of the joint probability of the label sequence and the input sequence [46]. It allows arbitrary and non-independent features of the input sequences. The probability of the transitions between the labels depends on the past and future observations. For this research work, sklearn-crfsuite has been used, which is a scikit-learn compatible python-crfsuite wrapper that includes an implementation of the Conditional Random Field (CRF). We have used the baseline Conditional Random Field (CRF) model for both the traditional supervised approach and the active learning-based approach for the named entity recognition (NER) task so that the two approaches can be compared fairly in terms of performance and of the amount of annotated training data required. For training the Conditional Random Field (CRF) algorithm, we have used the same parameters as given in [43] i.e. L-BFGS as training algorithm,
For the active learning of the named entity recognition (NER) task, the labels have been removed from all the datasets so that they were treated as a pool of un-annotated sentences in our experiment. Also, in place of the oracle, we have assigned the actual labels which were removed from those sentences programmatically by keeping track of the indexes of the sentences in
For the active learning-based approach, initially we have randomly selected 2% of the total sentences present in un-annotated data pool
Evaluation metrics and criteria
The F1-score is the standard metric used for the evaluation of named entity recognition experiments [47]. It is the harmonic mean of precision (P) and recall (R), where a value of 0 represents the worst score and a value of 1 represents the best. The formula for the F1-score is given by [17]:
$$F1 = \frac{2 \times P \times R}{P + R}$$
where P is the fraction of predicted entities that are correct and R is the fraction of gold-standard entities that are correctly predicted.
The main objective of this research work is to demonstrate the effectiveness of the existing uncertainty sampling strategies and to enhance their performance further by incorporating similarity features for active learning of the named entity recognition task. The proposed and existing approaches are compared with each other in terms of annotation cost, i.e., the amount of labeled training data each approach requires to reach the performance level attained by the supervised approach. We use the traditional supervised approach only to find the performance level (F1-score) that the baseline Conditional Random Field (CRF) model can reach with a completely annotated training dataset, so that this level can serve as the stopping criterion for each of the active learning approaches. According to [27], this is possible only if the correct labels for the training set are available in advance, which is true in our case. Each active learning algorithm is stopped once it reaches the performance achieved by the traditional supervised approach with the baseline CRF classifier, and the annotation cost of each query sampling strategy is then measured by recording the amount of labeled data it required to reach this stop condition. So first, the experiment is conducted with the baseline CRF classifier on all the datasets in the traditional supervised way. The results obtained by the traditional supervised approach are reported in Table 2 for all four named entity recognition datasets of different domains.
Final results obtained by the baseline classifier in traditional supervised way for four named entity recognition datasets of different domains
Comparison of the annotated training data required by different uncertainty sampling strategies for active learning of the named entity recognition (NER) task on three biomedical datasets and one Spanish language dataset
To measure the annotation cost (i.e., the amount of labeled training data required by each active learning-based approach until it satisfies the stop criterion), the active learning-based approaches were allowed to run until they reached the performance achieved by the baseline classifier in the traditional supervised way. The performance of the active learning approach for different uncertainty sampling strategies is compared in terms of final annotation cost (depends on number of annotated sentences in labeled train set
Performance of the active learning approach based on different uncertainty strategies with an increasing number of annotated sentences in the labeled training set
Amount of annotated training data (sentences) required by the active learning algorithms based on different uncertainty sampling strategies to reach the stop condition over four named entity recognition (NER) datasets of different domains.
As discussed above, Table 2 presents the performance of the supervised approach using the baseline Conditional Random Field (CRF) model, which is used as the stop condition in the active learning experiment. The performance of the proposed query sampling strategy and of the existing uncertainty query sampling strategies is measured in terms of annotation cost. The annotation cost is the percentage of annotated sentences that the active learner, under a given query sampling strategy, uses to reach the stop condition. A lower annotation cost therefore means that the approach needs fewer annotated sentences for training to reach the stop condition, and vice versa. The final amount of annotated data required by the different query sampling strategies is presented in Table 3. The F1-score is recorded at each iteration of the active learning experiment for every strategy, so that performance can be visualized and compared at any iteration. The resulting learning curves are shown in Fig. 2. In addition, the uncertainty query sampling strategies are compared on the basis of the final annotation cost, i.e., the total amount of annotated training data they used to reach the stop criterion. This comparison over the four named entity recognition datasets of different domains is presented in Fig. 3.
The results are discussed mainly in terms of the final annotation cost of each query sampling strategy; the performance curves in Fig. 2, however, show the F1-score at each iteration, i.e., for each number of annotated sentences, for every dataset.
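The stop criterion and the annotation cost described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation used in the experiments: the function name `run_active_learning`, the callback `evaluate`, and the seed and batch sizes are hypothetical, and the toy learning curve stands in for retraining a CRF model.

```python
from typing import Callable, List, Tuple


def run_active_learning(
    pool_size: int,
    seed_size: int,
    batch_size: int,
    target_f1: float,
    evaluate: Callable[[int], float],
) -> Tuple[float, List[Tuple[int, float]]]:
    """Grow the labeled set until the F1-score reaches target_f1.

    evaluate(n_labeled) is a hypothetical callback that retrains the model
    on n_labeled sentences and returns its F1-score on the evaluation set.
    Returns the annotation cost (percentage of the training pool that was
    labeled) and the learning curve as (n_labeled, f1) pairs.
    """
    n_labeled = min(seed_size, pool_size)
    curve: List[Tuple[int, float]] = []
    while True:
        f1 = evaluate(n_labeled)
        curve.append((n_labeled, f1))
        # Stop once the supervised baseline is matched or the pool is exhausted.
        if f1 >= target_f1 or n_labeled >= pool_size:
            break
        n_labeled = min(n_labeled + batch_size, pool_size)
    annotation_cost = 100.0 * n_labeled / pool_size
    return annotation_cost, curve


# Toy learning curve (an assumption for illustration only):
# F1 rises by 0.01 per 5 sentences and saturates at 0.80.
def toy_evaluate(n_labeled: int) -> float:
    return min(80, 40 + n_labeled // 5) / 100.0


cost, curve = run_active_learning(
    pool_size=1000, seed_size=50, batch_size=10,
    target_f1=0.80, evaluate=toy_evaluate,
)
print(cost)  # 20.0 for this toy curve
```

The recorded `curve` corresponds to the kind of per-iteration learning curve plotted in Fig. 2, and `cost` to the per-strategy percentage reported in Table 3.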
For the disease names and adverse effect dataset, the entropy sampling strategy outperformed the other sampling strategies. It is followed by the proposed algorithm and the ratio of confidence based sampling strategy. These strategies took 19.74% to 22.88% of the total sentences to reach the stop condition. The least confidence and margin of confidence sampling strategies took the maximum number of sentences in the labeled training set
For the NCBI disease validation dataset, the least confidence strategy and the proposed algorithm both performed well. They took just 16.56% and 17.85% of annotated sentences in the labeled training set
JNLPBA is the largest dataset, with a total of 24806 sentences. For the JNLPBA validation dataset, all the sampling strategies except the margin of confidence sampling strategy performed well and reached the stop criterion by taking 26.77% to 29.14% of the labeled sentences in the labeled train set
The CoNLL Spanish dataset is the second-largest dataset and the only non-biomedical dataset considered in the experiment. For the CoNLL Spanish validation set, the proposed algorithm performed best and reached the stop condition with just 25.18% of labeled sentences for training. It is followed by the ratio of confidence and entropy-based active learning algorithms, which took 28.55% to 29.51% of labeled sentences to train the active learner until it achieved the stop condition. The least confidence and margin of confidence based algorithms took the largest amount of labeled sentences to reach the stop condition, i.e., 31.79% and 31.92%, respectively. For the CoNLL Spanish test set, the proposed algorithm again performed best and reached the stop condition using just 30.35% of labeled data for training the active learner. It is followed by the entropy-based algorithm, which required 36.59% of labeled sentences. The least confidence, margin of confidence, and ratio of confidence strategies required 39.36% to 41.53% of labeled sentences from the total train set to train the active learner. The learning curves for the validation and test sets are shown in Fig. 2f and g, respectively.
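For reference, the four existing uncertainty measures compared above can be written compactly. The sketch below uses the commonly cited forms of these scores, applied to a probability distribution over candidate outputs (for a CRF, for example, over candidate label sequences); the exact formulations used in the experiments are not restated in this section, so the function names and the normalization choices here are assumptions. In every case a higher score means a more uncertain, and hence more informative, sentence.

```python
import math
from typing import Sequence, Tuple


def _top_two(probs: Sequence[float]) -> Tuple[float, float]:
    """Return the largest and second-largest probabilities."""
    ordered = sorted(probs, reverse=True)
    return ordered[0], ordered[1]


def least_confidence(probs: Sequence[float]) -> float:
    # Uncertainty is one minus the probability of the most likely output.
    return 1.0 - max(probs)


def margin_of_confidence(probs: Sequence[float]) -> float:
    # A small gap between the top two outputs means high uncertainty.
    p1, p2 = _top_two(probs)
    return 1.0 - (p1 - p2)


def ratio_of_confidence(probs: Sequence[float]) -> float:
    # Ratio of the second-best to the best output; close to 1 when uncertain.
    p1, p2 = _top_two(probs)
    return p2 / p1


def entropy(probs: Sequence[float]) -> float:
    # Shannon entropy in bits over the whole distribution.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


example = [0.5, 0.25, 0.25]
scores = {
    "least_confidence": least_confidence(example),
    "margin_of_confidence": margin_of_confidence(example),
    "ratio_of_confidence": ratio_of_confidence(example),
    "entropy": entropy(example),
}
print(scores)
```

All four scores are computed from the model's output probabilities alone, which is the limitation that the hybrid strategy addresses by also taking sentence similarity into account.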
It is important to note that the active learning approach always reaches the performance of the baseline Conditional Random Field (CRF) classifier for all the uncertainty measures while requiring far less annotated data for training, i.e., it reduces the annotation cost significantly. The proposed algorithm performed well and required the fewest labeled sentences in most of the cases. Its result is either the best or very competitive with the best-performing active learning algorithm in almost all the cases. The main reason for this competitive result is that, in each iteration of the active learning experiment, the proposed strategy selects the 10 least similar sentences out of the 30 most uncertain sentences chosen by the least confidence query sampling strategy. The proposed strategy thus refines the selection made by the least confidence strategy, and its performance consequently also depends on the performance of the least confidence sampling strategy. The above results show that the proposed algorithm is better than the least confidence approach in almost all the cases and very competitive in the remaining case in terms of annotation cost. These experiments indicate that combining sentence similarity scores with the least confidence sampling strategy further improves its performance, i.e., the proposed approach further reduces the annotation cost. Also, if only the existing query sampling strategies are compared across all the datasets, none of them performs well on every dataset.
The proposed algorithm, in contrast, is consistently either very competitive with the best-performing query sampling strategy or the best itself across all the datasets.
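The two-stage selection described above (30 most uncertain sentences, then the 10 least similar among them) can be illustrated with the following sketch. It is not the authors' implementation. Two points are assumptions made for illustration: `difflib.SequenceMatcher` from the Python standard library stands in for the fuzzy string matching score, and "least similar" is interpreted as having the lowest mean similarity to the other candidates. The function name `select_hybrid` and its parameters are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def select_hybrid(
    sentences: List[str],
    best_label_probs: List[float],
    n_uncertain: int = 30,
    n_select: int = 10,
) -> List[int]:
    """Return indices of the sentences to send for annotation.

    best_label_probs[i] is the model's probability for its most likely
    labeling of sentences[i]; least confidence is 1 minus this value.
    """
    # Stage 1: the most uncertain sentences have the lowest top probability.
    ranked = sorted(range(len(sentences)), key=lambda i: best_label_probs[i])
    candidates = ranked[:n_uncertain]

    # Stage 2 (ASSUMPTION): score each candidate by its mean string
    # similarity to the other candidates and keep the least similar ones.
    def mean_similarity(i: int) -> float:
        others = [j for j in candidates if j != i]
        if not others:
            return 0.0
        total = sum(
            SequenceMatcher(None, sentences[i], sentences[j]).ratio()
            for j in others
        )
        return total / len(others)

    return sorted(candidates, key=mean_similarity)[:n_select]


pool = [
    "the patient has fever",
    "the patient has fever",
    "the patient has a fever",
    "aspirin reduced headache symptoms",
    "gene expression in T cells",
]
probs = [0.20, 0.30, 0.25, 0.10, 0.95]
selected = select_hybrid(pool, probs, n_uncertain=4, n_select=2)
print(selected)
```

In this toy pool the confidently labeled sentence is never a candidate, and the near-duplicate sentences compete with each other in the second stage, so the batch sent for annotation is less redundant than one chosen by least confidence alone.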
In most of the cases, therefore, the proposed algorithm performed well and required the least amount of labeled data for training the active learner in comparison to the active learning algorithms based on the other sampling strategies. In the proposed work, the similarity score of the sentences is combined with the least confidence sampling strategy and considered for selecting the most appropriate unlabeled sentences from the unlabeled train set
Conclusion
The performance of traditional supervised machine learning algorithms depends highly on the size of the annotated training data. Large annotated datasets are needed for supervised models to perform well on different tasks. To reduce this dependency on large annotated datasets, a new uncertainty-based query sampling strategy has been proposed for active learning and evaluated on the named entity recognition (NER) task. The proposed query sampling strategy is compared with other pool-based uncertainty sampling strategies, i.e., least confidence, margin of confidence, ratio of confidence, and entropy-based query sampling, on the basis of annotation cost, which depends on how appropriately the instances are selected for training the active learner. For this purpose, various features are extracted, and a baseline Conditional Random Field (CRF) classifier is used in all the experiments. To compare the proposed query sampling strategy with the existing uncertainty query sampling strategies, each active learning algorithm keeps adding sentences to the training set until it reaches the performance of the supervised model. It has been found that the active learning algorithms based on the different uncertainty sampling strategies always reach the performance of the traditional supervised baseline CRF model while reducing the annotation cost significantly.
The existing uncertainty-based sampling strategies consider only the probability value of the model when selecting instances. The proposed hybrid uncertainty-based active learning algorithm combines a fuzzy string matching score with the least confidence sampling strategy to select the most informative sentences, i.e., those from which the classifier can learn the most about unseen sentences. In most of the cases the proposed algorithm performs better than the other query sampling strategies (including the least confidence-based approach), which suggests that considering sentence similarity helps in selecting appropriate data for active learning and can further reduce the labeling cost. The proposed active learning algorithm is easy to implement and can be used to classify important entities in textual named entity recognition datasets of any domain.
In this paper, the sentence similarity score is combined only with the least confidence query sampling strategy for active learning of the named entity recognition task. In the future, the sentence similarity score can be combined with the other uncertainty-based query sampling strategies for the same task. Different sentence similarity algorithms can also be explored.
