Abstract
In recent years, information extraction from tweets has been challenging for researchers in the fields of knowledge discovery and data mining. Unlike formal text, such as news articles and pieces of longer content, tweets are of a specific nature: short, noisy, and with dynamic content. Thus, it is difficult to apply the traditional natural language processing algorithms to analyze them. Active learning is well-suited to many problems in natural language processing, especially when unlabeled data may be abundant, but labeled data is limited. The method proposed here aims to minimize annotation costs while maximizing the desired performance from the model. The method recognizes named entities from tweet streams on Twitter by using an active learning method with different query strategies. The tweets are queried for labeling by a human annotator based on query-by-committee, uncertainty-based sampling, and diversity-based sampling. The experimental evaluations of the proposed method on tweet data achieved better results than random sampling.
Introduction
Information extraction is the task of automatically extracting structured information from unstructured or semi-structured documents [15]. The extracted data may be used directly to display a list of named entities mentioned in a document, or to store them in a database for other information processing tasks. Information extraction over microblogs has only recently become an active research topic. It attracts researchers in the fields of knowledge discovery and data mining [6, 24] owing to the up-to-date information shared and their noisy, short nature.
Named entity recognition (NER), also known as entity identification and entity extraction, is a key information extraction task and the core of natural language processing (NLP) systems. NER includes two tasks: first, identifying proper nouns in a document, and second, classifying these proper nouns into predefined categories of interest such as personal names, locations, organizations, expressions of time, quantity, monetary values, percentages, etc. [19]. The extracted named entities can be utilized for various purposes in NLP applications, such as entity relation extraction [21], document summarization [9], speech recognition [8], term indexing in information retrieval systems [14], etc.
The popular NER methods use supervised learning such as maximum entropy, hidden Markov models, support vector machines, and conditional random fields (CRF) [18], which require input samples and the desired output provided by an expert in the related domain. The goal is to learn a general rule that maps input to output. The advantage of a supervised learning method is an achieved high performance if it is applied to well-formatted text with proper samples in terms of grammar and lexicons. However, the achieved results are not satisfying for short and noisy messages, such as tweets from Twitter. Another limitation of supervised learning is that it requires a large number of annotated data in order to obtain high-performance classifiers.
Twitter is a social network service that provides access to large volumes of data in real-time, but its tweets are notoriously noisy and hard to tackle with NLP. Unlike well-formatted text, processing tweets brings some challenges due to the short, noisy, and dynamic nature of the content. The current systems, trained on well-formatted text, perform poorly on tweets. For example, a performance by the Stanford NER that uses the CRF model to train a classifier for CoNLL03 data dropped from 90.8% to 45.8% when it was applied to tweets [27].
One big limitation with NER on tweets is the lack of information for interpretation in a tweet. The informal nature makes conventional features, such as grammar, part-of-speech, and capitalization, less reliable in tweets. The other is the limited length of a tweet (i.e., 140 characters at most) and without restrictions on writing style, leading to excessive abbreviations and more shorthand. Tweets exhibit much more language variation, tend to be less grammatical than longer posts, contain unorthodox capitalization, and make frequent use of emoticons, abbreviations, hashtags, and user mentions, which form an important part of the meaning [16].
Another limitation for NLP systems is the availability and massive amount of streaming data in online social networks and social media [5, 12]. Currently, Twitter has more than 320 million monthly active users, and 500 million tweets are sent per day 1 . One challenge for mining streaming data is the lack of labeled data due to rapid changes in distribution and the high cost of labeling. The traditional NER methods require prior labeling of the training data, and the NER process can be done offline. This is not suitable for online services. To deal with the brevity and dynamic content of tweets in terms of time, different methods have been proposed for NER in recent years. The majority of previous works has primarily focused on recognizing named entities in a pool of tweets [1, 27]. However, the challenge is how to effectively conduct NER in dynamic data streams.
Unlabeled data sources for NLP are often easily obtained. In sequence modeling, like an NER task, it is more difficult to obtain labeled data, since annotating these texts can be rather tedious and time-consuming. The shortage of labeled data and suitable sample selections are the obstacles to supervised learning methods in developing application systems. The problem of how to reduce the amount of training data but still obtain the desired performance is an interesting issue for researchers. Active learning (AL) is an attractive method, which addresses the shortage and incomprehensiveness of training data. Instead of relying on the labeled samples select randomly from a large corpus, AL methods choose samples to label via optimal query algorithms. By using different query strategies, AL methods may determine a much smaller subset, but a most informative one, from a large amount of unlabeled data.
This paper presents an AL method for classifying named entities from tweet streams with three different query strategies (query-by-committee, uncertainty-based sampling, and diversity-based sampling) to select the most informative tweets for the training phase. First, two classifiers trained on the CRF model and the maximum entropy model are used to classify unlabeled data in data streams. Dissimilar results are selected in order to ask a human annotator to correct them. Second, a single classifier (i.e., trained on the CRF model) is learned from the labeled data and is subsequently used to examine the unlabeled data. Samples, where the classifier is least certain about the label are queried for human annotation. Third, the tweets that contain proper nouns for which the context is the least similar when compared to the existing training data are queried for labeling. As a case study, experiments were conducted on tweet data to assess the performance of the proposed query strategies. Compared to the passive learning method, the proposed method significantly reduces the training data and gets better results.
The organization of this paper is as follows. Section 2 briefly presents related works. An introduction to the AL method is given in Section 3. Section 4 presents the proposed method, and the experiments are discussed in the subsequent section. The conclusion and suggestions for future work are presented in Section 6.
Related work
Generally, manual annotation of training data for information extraction is time-consuming and expensive. AL methods are an effective way to reduce the annotation effort. The key idea behind AL is that a machine-learning algorithm can achieve better performance with fewer labeled training data if it is allowed to select the data for training. The first work worthy of mention here was contributed by Li et al. [7], where the authors proposed a novel two-step unsupervised NER approach to recognize named entities in Twitter data based on gregarious properties of named entities in targeted tweet streams. The method dealt with streams, however, it did not determine the class of the identified entity, determining only if a phrase is an entity or not.
Settles and Craven [3] surveyed previous query selection strategies and proposed several novel algorithms for probabilistic sequence models to address shortcomings in training data. They also conducted a large-scale empirical comparison using multiple corpora and demonstrated that the proposed methods obtained results better than state-of-the-art methods. Yao et al. [17] presented an alternative AL strategy and combined this method with semi-supervised learning to reduce the labeling effort for a Chinese NER task. They utilized a strategy based on information density for the sample selection in a sequential labeling problem, which is suitable for both AL and self-training. The proposed method significantly reduced the training effort and got similar (and even better) results, in comparison to the conventional supervised learning method. By using 15,000 training sentences of Sighan bakeoff 2006 MSRA NER corpus, they achieved an F1 score of 77.4% with the proposed hybrid method.
Chen et al. [28] developed and evaluated AL methods for a clinical NER task to identify concepts in medical problems and in treatments from clinical notes. The AL experiments used a number of existing and novel algorithms in three different categories, including uncertainty-based, diversity-based, and baseline sampling strategies. The system was applied to the annotated NER corpus from the 2010 i2b2/VA NLP challenge that contained 349 clinical documents with 20,423 unique sentences. Based on the learning curve of the F1 score and the number of sentences, uncertainty sampling algorithms outperformed all other methods, and most diversity-based methods also performed better than random sampling.
Marcheggiani and Artieres [10] investigated the behavior of several AL strategies for sequence-labeling tasks tailored for Partially-Labeled CRF in a pool-based scenario on four sequence labeling tasks: phrase chunking, part-of-speech tagging, named entity recognition, and bio-entity recognition. Several AL strategies were proposed based on measures of uncertainty adapted for AL with a partially labeled sequences scenario and tailored on Partially-Labeled CRFs. In another study [13], Hassanzadeh and Keyvanpour presented a variance-based AL method for an NER task that chose informative entities to minimize the variance of the classifier currently built from labeled data. By finding entities where labeling by the current model was certain, they used self-training to resolve unlabeled samples. The CRF model was chosen as the underlying model for this experiment. The experiment applied to the CoNLL03 English corpus showed that the method used considerably fewer numbers of manually labeled samples to produce same results obtained when samples were selected in a random manner.
In previous work [22], we also had an experiment with a semi-supervised learning approach for an NER task. The proposed method was a combination of the CRF model, the hand-made rule, and the co-occurrence coefficient of feature words surrounding entities. The experimental results showed that the method achieved high performance with only a small labeled training dataset, and significantly improved the recall of the CRF model.
This paper presents an AL method to extract named entities from tweet streams. We will show that the proposed method is an effective way to resolve the NER task on Twitter.
1: Label initial data L and train classifier C
2: Apply C to unlabeled data U to obtain R
3: Select the most informative samples, R′ from R
4: Ask an annotator to correct/label samples in R′
5: Add R′ to L
6: Retrain classifier C on L
7: Repeat steps 2 through 6, until the stopping criteria are met
8: Output a classifier that is trained on L
Active learning-based approach
Active learning
Active learning is a supervised learning method in which the learner controls the selection of necessary data for the training phase. The most important issue is how to query necessary data in order to ask a human annotator to label them. The learner asks an expert in the related domain about labels of samples for which the learned model has so far made unreliable predictions. The main purpose of AL is to create a classifier as good as possible without supplying more labeled samples and incurring more human effort in annotating the data. Active learning has been successfully applied to a number of NLP tasks, such as information extraction, named entity recognition, text categorization, and so on [11]. The prototypical AL algorithm is outlined in Algorithm 1.
Scenarios
Different scenarios that have been considered in the literature. The learning system can make queries are (i) membership query synthesis, (ii) stream-based sampling, and (iii) pool-based sampling [4].
i) Membership query synthesis: The learning system asks whether a particular domain sample belongs to the unknown concept or not. The model itself generates some instances, rather than using real-world unlabeled instances [20].
ii) Stream-based sampling: In this scenario, the unlabeled samples arrive sequentially. The size of the data stream is often large in real-world applications and it is impractical to store all the unlabeled data. The data can first be sampled from the actual distribution one at a time from the data source, and then the learning system can decide whether to request labels or not. Stream-based sampling is appropriate in some special cases, such as a limitation in memory or processing power or up-to-date information processing.
iii) Pool-based sampling: Assume there is a huge pool of unlabeled data, which is usually assumed to be closed, and samples are selected for queries. The samples are queried according to an informativeness measuring technique so they can be evaluated. There is the main difference between this scenario and stream-based selective sampling. Stream-based selective sampling queries the samples sequentially at the time the data arrive, and makes the query decisions individually, whereas pool-based sampling queries on the pool of available samples; therefore, it can evaluate the entire dataset before selecting the most informative samples.
Proposed method for NER
Active learning-based framework
This section presents a brief description of the proposed AL method. This work experiments with a practical stream-based AL scenario. The workflow of the idea is illustrated in Fig. 1, and the pseudo-code of the proposed algorithm for the training phase is described in Algorithm 2. The main steps in the algorithm are as follows:
1: Set L as the set of initial labeled samples;
2: Train L giving C;
3:
4: Get arrival data - U;
5: Preprocess U;
6: Apply query strategies to unlabeled data U to obtain D;
7: Ask the annotator for labeling samples in D to obtain D′;
8: Add D′ to L;
9: Retrain classifiers C on L;
10:
11:
Initiation
The model parameters and data are initialized for the system. The model parameters are set to default and other parameters in query strategies are set as presented in Section 5.2. Initially, a set of samples is selected to annotate as the initial training data for classification models. Two models are utilized to train the classifiers: the CRF model and the maximum entropy model. This work uses the CRF model provided by Stanford 2 and the maximum entropy model provided by OpenNLP 3 .
Preprocessing
The unlabeled data are processed before being selected by the query strategies in order to eliminate elements that are not necessary. Elements such as mentions, hyperlinks, hashtags, symbols and emoticons are removed from the tweets.
Querying
To increase the training dataset, new samples are selected from data streams based on query strategies such as query-by-committee, uncertainty-based sampling, and diversity-based sampling. The tweets are examined individually and selected if they satisfy one of the following criteria: i) the classification results of the models are dissimilar, ii) the CRF model that is learned from current labeled data is least certain about samples, and iii) the context of the proper noun is the least similar to the current training data. Then they are subjected to labeling by a human annotator and added to the training data to retrain themodels.
Training
The two models (i.e., the CRF model and the maximum entropy model) are applied to the updated training data to train the two individual classifiers. They are used for mapping input samples to predefined categories. The query-by-committee strategy uses both of them whereas the uncertainty-based sampling strategy only uses the classifier of the CRF model.
Iteration
The process is iterated until the stopping criterion is met. In this experiment, the algorithm is finished when the annotator stops working or the creation time of the tweets is beyond the considered time (i.e., the considered time of the experiment for training and testing as described in Section 5.1).
Context feature vector
A part-of-speech (POS) tagger assigns the parts of speech to each word or token, such as noun, verb, adjective, etc. Some tags of POS tagging are shown in Table 1. The proper nouns are extracted based on the results of POS tagger (i.e., this work uses a publicly available POS tagger developed by Stanford 4 ).
With each proper noun in the unlabeled data and the named entity in the labeled data, a context vector is constructed to represent its context information. The size of the context vector is the size of the considered window (i.e., the number of words are examined on both sides of the proper noun). The elements of the vector are the POS tagging of words in the considered window. Proper noun PN is represented by the context vector of the POS tagging as follows:
Once a classifier is trained on the current training data, it leads to the challenging issue of how to update and evolve the model on demand. The training data can be updated to guarantee higher performance and cover wider domains in the classifier. This section presents three proposed query strategies for an active learning approach [23].
Query by committee
Query by committee is based on using a multi-classifier to select the most informative samples. The classifiers are trained on the same training data by the individual models (i.e., the CRF model and maximum entropy model). Then, they are used to classify the unlabeled data. Disagreement between the classifiers with respect to the value and category of a named entity is utilized to decide whether that sample is to be labeled by the human annotator.
Uncertainty-based sampling
The class of each word in the text is decided by the classifier based on the probability of each possible label. The confidence of each class belongs to the interval [0, 1], where 0 denotes that the classifier is not at all confident about its decision and 1 denotes that it is completely confident. The sum of the classes’ confidence scores is 1. We can get confidence levels for all predefined classes of a word and the class with the highest confidence is the output class. The samples are selected based on the confidence score and a predefined threshold. If the classifier gives high confidence to a certain class (near 1), it is not necessary to update the training data with the sample. Otherwise, if the confidence of the class is less than the threshold, the classifier is quite uncertain about its decision. The samples with low confidence should be selected for updating the model in order to increase the certainty of the classifier dealing with those samples. The threshold is defined to select low confidence samples, hence it is set low (near 0) within an interval of values [a, b]. The classifier’s performance may be directly proportional to the number of selected samples. If the threshold is a narrow interval, then the annotation effort is low. If the threshold is a wide interval, much more samples may be queried, and the annotation effort also increases. A trade-off between the number of selected samples and the system performance should be investigated in setting the confident threshold.
Diversity-based sampling
The diversity-based sampling method selects samples where the training data is the least similar. A vector model is used to measure the similarity between an unlabeled tweet and all labeled tweets in the training data. The tweets that contain a proper noun are examined for similar context between the proper noun and named entities that exist in the current training data. Each proper noun and named entity are represented by a context vector, as presented in Section 4.2. The context vectors of proper nouns are then compared with all context vectors of named entities. The tweets where similar scores are less than or equal to the maximum similar threshold are selected. They will be subjected to labeling by the human annotator and then added to the training data.
Evaluation and baseline
The performance of this task was calculated following #MSM2013’s measures [2]. Precision, recall, and F-measure were calculated for each entity category, and the final results for overall entity categories are the average performance of defined categories.
Assume that we have a test set denoted by TS and a gold standard set denoted by GS. The named entity is represented in a tuple of the form: (entity value, entity category). Let (x, y) be a set of tuples for named entities that are extracted from TS and GS. A set of true positives (TP), false positives (FP) and false negatives (FN) are defined as follows:
Strict matching is performed between results from the system and answers in the GS for both correct detection of the entity value and the entity category.
Precision (P
t
) and recall (R
t
) for a given entity category t are defined as follows.
The average values of the precision and the recall of all entity categories are and , respectively, for all entity categories.
The F-measure (also called F1 score) is the harmonic mean of and , defined as follows.
Three algorithms were implemented, which are the query-by-committee (QBC) algorithm, the uncertainty-based sampling (UBS) algorithm, and the diversity-based sampling (DBS) algorithm. The following combinations of query strategies were also applied: QBC and DBS, QBC and UBS, UBS and DBS. A random sampling strategy (i.e., passive learning) was also conducted as the baseline strategy.
Dataset
The proposed system was applied to tweets of 20 users that were collected on Twitter over two years (from January 1, 2014, to December 31, 2015) by using the public Java library for the Twitter API 5 . Data was filtered by removing unnecessary elements of the tweets, such as hashtags, mentions, hyperlinks, emoticons, or symbols, etc. Finally, 11,966 unlabeled tweets were selected. The dataset was divided into two parts: data to be queried for training and a test set for evaluation. There were 10,813 tweets selected for querying. They were created from January 1, 2014, to September 30, 2015, and sorted by creation time. The test set included 1,153 tweets created between October 1, 2015, and December 31, 2015, and annotated as the gold standard to assess performance. In addition, there were 4,716 labeled tweets used as initial labeled data. These labeled tweets were from #MSM2013’s training data and our previous work [22].
The dataset was annotated with three named entity categories: person, location, and organization. The training data of the CRF model was annotated following the IO format 6 and also converted to OpenNLP’s format 7 for the maximum entropy model. The named entity distribution over the test set is represented in Fig. 2.
Settings
In the experiments, two classification models were used in the training phase: the CRF model and the maximum entropy model. In the evaluation, however, only the CRF model was used. The parameters of the models were set to default values. For each round of the AL algorithm, the system queried tweets in a given time interval set to 20 (i.e., the tweets were queried during the next 20 days for each iteration). For the uncertainty-based sampling algorithm, the confidence scores of classes for each word were provided by the CRF model. The confidence threshold for tweet selection was set within the interval [0.1, 0.4]. If the tweet contained a word where its highest confidence score for the class belonged to [0.1, 0.4], then it was selected for labeling by the human annotator. For the diversity-based sampling algorithm, the window size of the context vector was set to 6 (i.e., three words on the left and three words on the right of the proper noun or the named entity were examined). The maximum similar threshold for measuring the similarity of the context vectors was set to 3.
Results
We implemented an AL method to compare differences between AL query strategies and the random sampling strategy for NER tasks on Twitter. Two scenarios were single-query methods (i.e., QBC, UBS, DBS) and combined-query methods (i.e., QBC+UBS, QBC+DBS, UBS+DBS), and they were conducted separately. The systems were run on the tweet data as mentioned in Section 5.1. The results for each entity category and the overall results from systems are shown in Tables 2–4. All systems were tested in the same AL framework (i.e., the same initialtraining data, model, default parameters for the models, time intervals, and test set).
The performance of all query methods outperformed the RS method, in which the selected tweets of the AL method were less than or equal to those of the RS method. The F1 score of QBC+DBS was the best in comparison with the others. Concretely, it was 67.5% better than the baseline by 7.9%. In the single-query methods, the UBS method outperformed the others. Though it required the fewest unlabeled tweets, its F1 score was the highest (i.e., the F1 score of UBS is 64.8% better than QBC, DBS, RS by 0.8%, 1.2%, 5.2%, respectively). The F1 scores of combined-query methods were better than those of single-query methods (except for UBS+DBS, its F1 score is not better than UBS).
We also computed the performance and the number of tweets selected at the end of the process for each method, as shown in Table 5. The UBS method needed only 19.7% of the unlabeled data to achieve an F1 score higher than other single-query methods. Meanwhile, the QBC method and the DBS method needed 20% and 30.1% to achieve 64% and 63.6% F1 scores, respectively. Although the F1 score of the QBC+DBS method is higher than other combined-query methods, it needed 41.4% of the unlabeled data whereas UBS+DBS needed 40.8% and QBC+UBS only needed 31.4% (i.e., to achieve 64.6% and 66.2% F1 scores, respectively).
To further demonstrate the evolution of different querying methods, the learning curves that plot the F1 score versus the number of selected tweets were generated. It is too difficult to distinguish all learning curves in one diagram. Therefore, each diagram includes a learning curve and its trend line. The learning curve and the trend line for single-query methods are shown in Figs. 3–6 and combined-query methods are shown in Figs. 7–9. The comparisons between trend lines of single-query methods and combined-query methods are shown Figs. 10–11. In general, the F1 scores increased along with intervals of time and received the best performance at the end of the process. The UBS method and the QBC+DBS method had the best F1 score at the last stages of the learning curve. The DBS method was surpassed by other single methods at every stage and the UBS+DBS method only surpassed other combined methods at early stages. The performance of the QBC method and the QBC+UBS method were not better than the others at the last stages.
Discussion
In the scenario of single-query methods, the UBS algorithm outperformed others, because it queried the most informative samples using knowledge of the trained models. The results of these methods were nearly similar (i.e., 64% for QBC, 64.8% for UBS, and 63.6% for DBS), but the numbers of selected tweets were quite different (i.e., 20% for QBC, 19.7% for UBS, and 30.1% for DBS). One concern with applying the QBC algorithm and the UBS algorithm to real tasks is that they rely on updated models, which require time for training. In the experiments, it took several minutes to fully train a model; this is also sui| in reality for the stream-based scenario. The DBS method is independent of the training models, but the experiment did not outperform the others.
Figure 12 presents a comparison between the F1 score and the number of selected tweets in the systems. The F1 scores of combined-query methods were high, but they required more tweets than single-query methods. In general, the UBS method is the best (i.e., to achieve a 64.8% F1 score, it only queries 19.7% of the unlabeled tweets). The combined-query methods were selected more tweets, but the F1 score was not as desired. Though the UBS method only selected a half of unlabeled tweets of other combined-query methods, its performance is good (i.e., the F1 score of QBC+DBS was better than UBS by 2.7%, but its selected tweets were more than UBS by 21.7%). The combined-query methods are not suitable for a task where the annotation costs are expensive.
Conclusion and future work
This paper proposed an AL method for an NER task that deals with tweet streams on Twitter.The tweets were queried for labeling by a human annotator based on different query strategies: query-by-committee, uncertainty-based sampling, and diversity-based sampling. The experiments demonstrated that the proposed AL algorithms have the potential to reduce annotation costs in building NER models and they maximized the performance of the model. When the annotation costs of samples are the same, the proposed methods achieved better performance than random sampling. The performance of the uncertainty-based sampling method is the best in terms of balancing F1 score and the selected samples.
In future work, we plan to propose more query strategies based on entropy measures of tweets and the similarity of tweets by using word embedding for NER on Twitter. In addition, a combination of active and semi-supervised learning methods will be applied to increase labeled samples.
Acknowledgment
This work was supported by the 2014 Yeungnam University Research Grant.
Footnotes
https://about.twitter.com/company, accessed 2016/5/29.
http://nlp.stanford.edu/software/CRF-NER.shtml
https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html
http://nlp.stanford.edu/software/
http://twitter4j.org
http://nlp.stanford.edu/software/crf-faq.shtml#a
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind
