Abstract
Distant supervision has become the leading method for training large-scale information extractors. It can be encoded in the form of labeling functions, which employ knowledge bases to provide labels for the data. However, most previous works use only simple labeling functions, which introduces considerable noise into the training data and leaves the knowledge bases underexploited. In this paper, to improve the labeling quality of the training data for distantly supervised relation extraction, we propose to use existing knowledge bases to learn labeling functions. Specifically, labeling functions are represented in Markov logic, which can naturally integrate various resources into a unified model. Experimental results show that the training data produced by the learned labeling functions is significantly improved in quality. Different distantly supervised relation extraction models trained on this data also achieve better performance.
Introduction
For some information extraction applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning. Distant supervision, also known as knowledge-based weak supervision, therefore proposes a paradigm for the heuristic creation of training sets. The basic idea is to match relation instances from a knowledge base against a text corpus. A relation instance
In general, distant supervision consists of two main tasks: training data labeling and extraction model training. To improve the final performance, existing works on distant supervision handle noisy data in different tasks. They can be grouped into three categories:
Here, high quality training data means training data that contains less noise than the training data produced by the labeling methods used in the related works of categories 1 and 2.
For the training data labeling task, works in different categories explore different types of domain knowledge to label the training data. Works in category 1 use only relation instances in the knowledge base. Works in category 2 use relation instances and some constraints, e.g., entity type constraints or syntactic structure constraints. Works in category 3 use more domain knowledge. For the extraction model training task, they can share the same models for relation classification. In this paper, we focus on the former task, i.e., learning high quality labeling functions for distant supervision.
Unlike the information extractors which address the problem of deterministically generating a set of candidate logical forms from natural language sentences, labeling functions address the inverse problem. Given a set of candidate logical forms, i.e., relation instances in the form
The proposed method is able to make full use of existing knowledge bases (e.g., Freebase [7], FrameNet
In our experiments, relation instance knowledge bases such as Freebase, text corpora such as New York Times (NYT) corpus [38] and Wikipedia
The major contributions of this paper are as follows. 1) We propose to learn labeling functions for distant supervision, which can improve the quality of the labeled training data while reducing manual work in the task. 2) We represent labeling functions as MLNs, which can integrate various knowledge bases and language processing technologies into the model naturally.
Distant supervision was first introduced by [9]. They aligned a knowledge base with the paper abstracts in PubMed, and then used the extracted sentences to train a naive Bayes extractor. Later, [23] employed distant supervision in relation extraction. They used a set of frequent relations in Freebase to train relation extractors over Wikipedia without labeled data.
Since then, many works have focused on relation extraction using distant supervision. They can be grouped into three categories according to how they handle noisy data.
Most previous works employ heuristic approaches to label training data and then handle noise when training the relation extractors. [33, 14, 39] proposed a series of graphical models to solve the problem. [1, 37] used hierarchical topic models to handle noise in training data. [44] proposed a tagging scheme that can jointly extract entities and relations to tackle the problem of erroneous delivery. [6, 2] combined active learning with distant supervision to select meaningful distantly labeled instances. [34, 10] considered the relation extraction task as a matrix factorization problem. Recent works have begun to explore additional information to facilitate the training of the relation extractors, such as side information about rare entities [35], the prior of positive bags [22], fine-grained entity types [16, 43, 19], indirect supervision knowledge [13], human labeled data [26], and document structure [5, 20]. However, the improvement achieved by these methods is limited by the noise in the training data.
A variety of strategies have been proposed for correcting wrong labels before feeding the training data to the relation extractors. [40] presented a generative model to reduce wrong labels given the labeled corpus created by distant supervision. [15] used mention frequency, pointwise mutual information and mention centroids to remove noise. [41] tried to correct the most likely false negatives in the training data based on the ranking of pseudo-relevance feedback. [3] detected highly ambiguous entity pairs and removed them from the training data. [36] proposed a Path Ranking Algorithm to identify possible false negatives in the training data. Although reduced, the noise in the training data still affects the performance of these methods. Deep reinforcement learning has also been applied to distant supervision relation extraction. [11] proposed an instance selector that casts sentence selection as a reinforcement learning problem in order to choose high-quality training sentences for a relation classifier, and [29] proposed a false-positive indicator to automatically recognize false positive labels and then redistribute them into negative examples.
Recently, researchers have proposed to avoid noise when labeling training data. [24] encoded the world knowledge of domain experts into rules to identify positive training data. Using these rules, they can automatically generate new positive training data that simulates human experts’ annotations. The works [31, 30] proposed a paradigm for the programmatic creation of training sets called data programming, in which domain experts provide a set of labeling functions, i.e., programs that heuristically label large subsets of data points. [18] proposed an embedding framework to resolve the conflicts generated by different handcrafted labeling functions. These works can create training data effectively. However, they rely mainly on expert-designed labeling functions, which are expensive to obtain and whose quality is difficult to guarantee across domains.
Our work falls into the third category. The main difference between our work and existing ones is that we learn labeling functions from general rules that are easily derived from world knowledge, instead of designing and selecting specific rules by trial and error. Furthermore, our method can make full use of existing knowledge bases and language processing technologies to facilitate the learning of labeling functions.
Labeling function learning algorithm
This section introduces the proposed algorithm Labeling Function Learning (LFL), i.e., Algorithm 1, which consists of three main steps: observed fact extraction, initial labeling function generation and weighted labeling function learning.
Given a set
Step 1 (observed fact extraction, line 6): This step uses the method Parse to extract observed facts from each sentence in
In our experiments, we employ Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/) to do POS tagging, dependency parsing and NER tagging.
Step 2 (initial labeling function generation, lines 7 and 8): This step encodes world knowledge as first-order logic rules. The method FactExtractor extracts a set
Step 3 (weighted labeling function learning, lines 9 and 10): This step initializes an MLN using the rule set constructed in the previous step (line 9), and then learns the weights of these rules using the method WeightLearning based on
Given the above sentence, which is labeled as a training example of the relation “FOUNDER”, a set of observed facts can be extracted by the method Parse which are shown as follows:
where
Various knowledge sources can be incorporated into the labeling functions to facilitate the generation of annotated data. Knowledge bases (e.g., Freebase) which contain known relation instances can provide the knowledge for entity names and entity types. Knowledge bases (e.g., FrameNet, WordNet) which contain semantic information between words can provide the knowledge for relation indicators. Syntactical parsing information can provide the knowledge for dependency relations between words. We first introduce the pre-defined predicates that represent these knowledge sources, and then detail the generation of initial labeling functions.
Predicate for entity name knowledge
Inspired by the works in distant supervision, we assume that if the entity names of a relation instance
The predicate
Predicate for entity type knowledge
The entity types of the recognized entities in
The predicate
Predicates for relation indicator knowledge
Unlike entity matching, words that describe relation names (e.g., “FOUNDER”, “PLACE_OF_BIRTH”) in a relation instance usually cannot be directly matched with the relation mentions in the sentences. We observed that some relations can be identified based on indicative words. For example, to express the relation “FOUNDER”, a sentence usually contains an indicative word such as “founded”, “establish”, “founder”, or “co-founders”.
Descriptions and alias of some Freebase relations
Examples of relation indicator seeds and extensions of some Freebase relations
Inspired by previous works [4, 42], we use indicators in verb and noun forms to identify relation mentions in the sentences. Specifically, we employ a self-expansion method to extract the set of relation indicative words. Relation indicator seeds are extracted from the descriptions and aliases of Freebase relations. Note that Freebase has been migrated to Wikidata
Two predicates are used to represent the above knowledge.
Recent works have shown that dependency relations between entities are effective for relation classification [8, 4], and they might also be useful to identify relation mentions.
The predicate
Examples of dependency relations defined in Stanford CoreNLP
Various dependency relation patterns can be used to extract facts of
where
The method RuleGenerator in Algorithm 1 produces a set of initial rules based on the pre-defined predicates. The antecedent of each rule is a combination of these predicates and contains at least one of them; the consequent of each rule is the predicate
Each labeling function combines one or more knowledge sources to label the data. However, it is unknown which combinations of knowledge sources are more or less effective in labeling the data, or how to handle conflicts between different rules. We propose to employ MLNs to handle these problems. The details are introduced in the following section.
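The rule generation step described above can be sketched as an enumeration of all non-empty combinations of antecedent predicates. This is only an illustrative sketch: the predicate names, the consequent name `Label`, and the argument list `(s, r)` are hypothetical placeholders, not the symbols used in Algorithm 1.

```python
from itertools import combinations

# Hypothetical predicate names standing in for the pre-defined predicates
# (entity name, entity type, relation indicators, dependency relation).
PREDICATES = [
    "HasEntityNames",
    "HasEntityTypes",
    "HasVerbIndicator",
    "HasNounIndicator",
    "HasDependency",
]
CONSEQUENT = "Label"  # hypothetical name for the labeling predicate


def generate_initial_rules(predicates, consequent):
    """Return one rule string per non-empty combination of antecedent predicates."""
    rules = []
    for size in range(1, len(predicates) + 1):
        for combo in combinations(predicates, size):
            antecedent = " ^ ".join(f"{p}(s, r)" for p in combo)
            rules.append(f"{antecedent} => {consequent}(s, r)")
    return rules


rules = generate_initial_rules(PREDICATES, CONSEQUENT)
print(len(rules))  # 2**5 - 1 = 31 candidate rules
print(rules[0])
```

With n predicates this yields 2^n - 1 candidate rules, which is why a weighting mechanism is needed to decide which combinations are actually useful.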
Weighted labeling function learning
We first introduce Markov Logic briefly, and then detail weight learning and label inference using MLNs.
Markov logic
The main representation challenges of the studied problem are how to encode a wide variety of knowledge bases into a unified framework, and how to model the uncertainties of the knowledge. One of the most powerful representation languages is Markov logic [32], which is a probabilistic extension of first-order logic.
A Markov logic network (MLN) is a set of weighted first-order clauses. It defines a Markov network with one node per ground atom and one feature per ground clause. The probability of a state
where
The learning and inference algorithms provided in the open-source tool Tuffy
We observed that the most effective rules for each relation are different. To explore which rules are more effective for each relation, and also consider the efficiency of the learning and inference algorithms, we learn a separate MLN for each relation instead of learning a unified MLN for all relations.
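For reference, the distribution over possible worlds defined by an MLN is usually written as follows in the Markov logic literature [32]. The symbols below follow that common convention and are an assumption here; they may differ from the notation used elsewhere in this paper.

```latex
P(X = x) = \frac{1}{Z} \exp\Big( \sum_{i} w_i \, n_i(x) \Big),
\qquad
Z = \sum_{x'} \exp\Big( \sum_{i} w_i \, n_i(x') \Big)
```

Here $w_i$ is the weight of the $i$-th clause, $n_i(x)$ is the number of true groundings of that clause in the world $x$, and $Z$ is the normalizing constant. A higher weight makes worlds that satisfy the clause more probable, which is what allows weight learning to rank the labeling rules of each relation.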
Given a set of initial rules, we first generate an initial MLN
where
For example, the MLN learned by Algorithm 1 for the relation “FOUNDER” contains weighted rules like:
The MLN learned by Algorithm 1 for the relation “CONTAINS” contains the following weighted rules:
We can see that the most effective rules for the two relations differ a lot.
Given an MLN for each relation, we calculate the marginal probability of
Computing marginal probabilities in an MLN subsumes probabilistic inference, which is #P-complete, and logical inference, which is NP-complete even in finite domains, so exact inference is intractable in general. The query can instead be approximated using the MC-SAT algorithm, which is a “slice sampling” Markov chain Monte Carlo [12] algorithm. It uses a combination of satisfiability testing and simulated annealing to sample from the slice.
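To illustrate what the marginal query computes, the following toy sketch evaluates it exactly by enumerating all assignments to the query atoms. It is not MC-SAT and not what Tuffy does; exhaustive enumeration is only feasible for a handful of ground atoms. The atom names, rules and weights are made up for illustration.

```python
import math
from itertools import product


def implies(antecedent, consequent):
    """Truth value of the implication antecedent => consequent."""
    return (not antecedent) or consequent


def marginal(query_atoms, evidence, weighted_rules, target):
    """Exact marginal P(target = True | evidence) by enumerating query atoms."""
    total = 0.0
    target_mass = 0.0
    for values in product([False, True], repeat=len(query_atoms)):
        world = dict(evidence)
        world.update(zip(query_atoms, values))
        # Unnormalized weight of the world: exp(sum of weights of satisfied rules).
        score = math.exp(sum(w for w, rule in weighted_rules if rule(world)))
        total += score
        if world[target]:
            target_mass += score
    return target_mass / total


# Observed facts for one sentence (hypothetical atoms).
evidence = {"NameMatch": True, "Indicator": True, "TypeMatch": False}
# Ground rules with made-up weights.
rules = [
    (1.5, lambda w: implies(w["NameMatch"] and w["Indicator"], w["Label"])),
    (0.5, lambda w: implies(w["NameMatch"] and w["TypeMatch"], w["Label"])),
    (-0.2, lambda w: implies(w["NameMatch"], w["Label"])),
]

p = marginal(["Label"], evidence, rules, "Label")
print(round(p, 4))
```

In this example the two worlds differ by a total weight of 1.3 (rules one and three are violated when the label is false), so the marginal equals the logistic function evaluated at 1.3.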
For example, given an MLN for relation
Experiments
In this section, we conduct two experiments to evaluate the performance of the proposed method.
Exp 1: Different labeling methods are evaluated based on their generated training data for real-world relations in NYT and Wikipedia respectively. Exp 2: Different relation extraction models trained using the labeled data generated by these labeling methods are then evaluated on both human labeled and automatically labeled test data.
The hypothesis of Exp 1 is that the proposed method is capable of improving the quality of the labeled training data. The hypothesis of Exp 2 is that the performance of relation extraction models can be further improved by training on high quality data.
Here we introduce the experiment datasets, evaluation metrics and baseline approaches.
Dataset for extracting relation instances
Following the previous works in distant supervision [23, 33], we use Freebase [7] as the source of relation instance knowledge base. Relation instances from the latest version of Freebase are extracted as the source of supervision. The entity type constraints of these relations are collected from Freebase and represented in logical formulas.
Datasets for extracting relation indicators
We extract relation indicators from the lexical units of frames in FrameNet and from the word synsets in WordNet. FrameNet is a knowledge base that maps meaning to semantic frames, each of which can be thought of as a conceptual structure describing an event, relation, or object and the participants in it. The words that evoke a frame are called lexical units. We extract indicative words of relations from the lexical units of frames. WordNet is a large lexical knowledge base of English, in which nouns, verbs, adjectives and adverbs are grouped into sets of synonyms (synsets). We extract indicative words of relations from the synsets of WordNet. Other resources could also be used for extracting indicative words; we choose FrameNet and WordNet because most of the indicative words in our case are verbs and nouns.
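One plausible reading of the seed-and-extension process is an iterative expansion of a seed set through a lexical resource. The sketch below assumes this reading; the function name, the round limit, and the tiny in-memory lexicon (standing in for FrameNet lexical units or WordNet synsets) are all hypothetical.

```python
def expand_indicators(seeds, lexicon, max_rounds=2):
    """Grow a set of indicative words by repeatedly following lexicon links."""
    indicators = set(seeds)
    frontier = set(seeds)
    for _ in range(max_rounds):
        new_words = set()
        for word in frontier:
            new_words.update(lexicon.get(word, ()))
        new_words -= indicators  # keep only words not seen before
        if not new_words:
            break  # fixed point reached
        indicators |= new_words
        frontier = new_words
    return indicators


# Toy lexicon: each word maps to words that share a frame or synset with it.
TOY_LEXICON = {
    "found": {"establish", "launch"},
    "establish": {"found", "set up"},
    "founder": {"co-founder"},
}

expanded = expand_indicators({"found", "founder"}, TOY_LEXICON)
print(sorted(expanded))
```

Limiting the number of rounds is a design choice in this sketch: each additional round increases coverage but also the risk of drifting away from the meaning of the original seeds.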
Datasets for labeling and extraction models
We use NYT and Wikipedia as text corpora for the different labeling functions and relation extraction models. Both corpora are widely used by the distant supervision community. In total, the training set contains 81,168 sentences and 20,426 observed facts, and the test set contains 29,607 sentences and 11,331 observed facts. Additionally, labeled sentences are needed to train our MLN models. Since there is no existing human-labeled data, we labeled the data ourselves.
Due to time and labor limitations, we labeled 1,760 sentences for 10 representative relations, including 827 positive sentences and 933 negative sentences. We select the 10 relations because the most effective knowledge sources for them are quite different, which will be further analyzed in Section 6.2.1. The detailed information of the labeled data is shown in Table 4. Take
Detailed information of the labeled dataset. #lab/#pos/#neg is the number of all/positive/negative labeled sentences. Noise ratio is calculated by #neg/#lab. In the relation /A/B/C, A and B are namespaces, C is the relation name
In Exp 1, we adopt standard precision (P), recall (R) and F
Baseline approaches
In Exp 1, we compare with three baseline approaches, which represent three different methods for training data labeling.
StandardDS employs a standard distant supervision method which is implemented using the labeling function based on entity name knowledge, i.e., rule
TypeAware employs a type-aware distant supervision method which is implemented using the labeling function based on entity name and entity type knowledge, i.e., rule
NoiseReduce employs a noise reducing method which is implemented using the labeling function based on entity name and syntactic constraint knowledge, i.e., rule
Given a set of sentences, we use the proposed method and three baseline methods to generate four different sets of labeled sentences. Then we evaluate the qualities of the generated data on the human labeled test set. Sentences of NYT and Wikipedia are evaluated separately.
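To make the difference between the looser baselines concrete, the sketch below gives simplified versions of a name-matching labeling function (in the spirit of StandardDS) and a type-constrained one (in the spirit of TypeAware). It is an approximation under stated assumptions: the class, function names, type labels and example sentences are hypothetical, and the exact rules used in the experiments are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationInstance:
    relation: str
    head: str
    tail: str


def standard_ds_label(sentence, instance):
    """Label as positive when both entity names occur in the sentence."""
    text = sentence.lower()
    return instance.head.lower() in text and instance.tail.lower() in text


def type_aware_label(sentence, instance, entity_types, type_constraints):
    """Additionally require the entity types to satisfy the relation's constraint."""
    if not standard_ds_label(sentence, instance):
        return False
    expected = type_constraints.get(instance.relation)
    observed = (entity_types.get(instance.head), entity_types.get(instance.tail))
    return expected is None or observed == expected


instance = RelationInstance("FOUNDER", "Microsoft", "Bill Gates")
constraints = {"FOUNDER": ("ORGANIZATION", "PERSON")}
types = {"Microsoft": "ORGANIZATION", "Bill Gates": "PERSON"}

relevant = "Bill Gates founded Microsoft in 1975."
noisy = "Bill Gates stepped down from the Microsoft board."

print(standard_ds_label(relevant, instance), standard_ds_label(noisy, instance))
print(type_aware_label(noisy, instance, types, constraints))
```

Both functions label the second sentence as positive although it does not express the relation, which is the kind of noise that motivates adding relation indicator and dependency knowledge.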
In Exp 2, we compare two models for relation extraction.
MultiR [14] is a typical work based on a probabilistic graphical model for multi-instance multi-label learning. It uses the perceptron algorithm for learning and a greedy search algorithm for inference. We implemented this model using the publicly available code.
MIMLRE [39] is a typical work based on a two-layer graphical model. It is trained using hard discriminative Expectation-Maximization. We use the publicly available code provided by the authors (nlp.stanford.edu/software/mimlre.shtml).
Given a set of labeled sentences as the training data, which is generated by one of the compared methods in Exp 1, we use MultiR and MIMLRE to train relation extraction models respectively. Then we evaluate the trained models on two different test sets. One is the human labeled test set, and the other is the automatically labeled test set.
Results for Exp 1
The experimental results of Exp 1 are shown in Tables 5 and 6. Specifically, Table 5 shows the precisions, recalls and F
We observe from the tables that the average F
From Table 5, we can see that the average recalls of StandardDS and TypeAware are higher than those of NoiseReduce and LFL (Ours). It is reasonable because the restrictions of StandardDS and TypeAware are loose when labeling the data, resulting in high recalls. The average precisions of the methods StandardDS and TypeAware are lower than those of NoiseReduce and LFL (Ours) because stricter restrictions are used in NoiseReduce and LFL (Ours) to avoid potential noise when labeling the data. Although stricter restrictions are used in both methods, the performance of LFL (Ours) is significantly improved compared with NoiseReduce because LFL (Ours) can learn the most effective labeling functions from a set of labeling functions.
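The trade-off discussed above can be reproduced on toy data. The sketch below assumes that the F-score is the usual harmonic mean of precision and recall (F1); the label vectors are invented and do not come from the experiments.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 for binary labels given as booleans."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [True, True, True, False, False, False]
loose = [True, True, True, True, True, True]      # labels everything positive
strict = [True, True, False, False, False, False]  # labels only confident cases

print(precision_recall_f1(loose, gold))
print(precision_recall_f1(strict, gold))
```

The loose labeler reaches perfect recall at a precision of 0.5, while the strict labeler trades some recall for perfect precision and obtains the higher F1 in this toy case.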
Precision, recall and F1-score of the methods StandardDS, TypeAware, NoiseReduce and LFL (Ours) evaluated on NYT
From Table 6, we can see that the average recalls of TypeAware and LFL (Ours) are much better than those of StandardDS and NoiseReduce. It is also reasonable because TypeAware and LFL (Ours) use short and approximate names to match entities while StandardDS and NoiseReduce do not. For some relations (e.g., relation
Furthermore, we can also observe from Table 5 that the most effective knowledge source for relations
P@K of MultiR and MIMLRE trained on 4 labeled datasets generated by 4 labeling methods. Sub-figures (a) and (b) are evaluated on the human labeled test data. Sub-figures (c) and (d) are evaluated on the automatically labeled test data by LFL (Ours).
Additionally, as shown in Table 4, among the 10 relations, the labeled positive sentences for each relation are different, e.g., relation
The experimental results of Exp 2 are shown in Fig. 1. Sub-figures (a) and (b) show the P@K results of different relation extraction models trained on the data labeled by StandardDS, TypeAware, NoiseReduce and LFL (Ours) respectively, and evaluated on the human labeled test data. Sub-figures (c) and (d) show the P@K results evaluated on the automatically generated test data. We use the test data labeled by LFL (Ours) as it is the best method according to the results of Exp 1.
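For readers unfamiliar with the metric, P@K is the fraction of correct extractions among the K most confident predictions. The sketch below assumes each prediction is a (confidence, is_correct) pair and divides by K; the scores are invented for illustration.

```python
def precision_at_k(scored_predictions, k):
    """Precision among the k highest-confidence predictions."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(scored_predictions, key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    correct = sum(1 for _, is_correct in top_k if is_correct)
    # Dividing by k (not len(top_k)) penalizes models that return fewer than k results.
    return correct / k


predictions = [(0.95, True), (0.40, False), (0.90, True), (0.70, False), (0.60, True)]

for k in (1, 3, 5):
    print(k, precision_at_k(predictions, k))
```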
From the figures, we observe that, once the results become steady, LFL (Ours) outperforms the other methods with both models. The performance of LFL (Ours) is stable in all cases, while that of the other methods varies from case to case.
It is interesting to see from sub-figures (a) and (b) that both MultiR and MIMLRE perform better using the training data labeled by LFL (Ours) than using that labeled by StandardDS. Note that the original MultiR and MIMLRE methods use the StandardDS method to label the training data. We can conclude that our learned labeling functions are able to improve the quality of the training data, and thus improve the final performance of the relation extraction task.
We can draw the same conclusions from sub-figures (c) and (d). The main difference is that the performance rankings of the three baseline methods differ between the two test sets. However, the trend of each performance line in sub-figures (c) and (d) still mirrors that in sub-figures (a) and (b). That is, the data labeled automatically by our proposed method can be used as a replacement for human labeled test data when human labeled data is expensive and difficult to obtain.
In summary, the proposed approach can produce much better labeled data than existing labeling methods by learning high quality labeling functions using MLNs based on various knowledge sources.
Conclusion
In this paper, we proposed the algorithm LFL for labeling function learning, which aims to reduce the manual work in creating high quality training data for distantly supervised relation extraction. We also explored the problem of integrating various knowledge bases and language processing technologies into a unified framework by representing a labeling function as an MLN. Experimental results show that the training data produced by our approach is significantly improved in quality. In addition, different distantly supervised relation extraction models trained on the produced training data achieve better performance. However, our approach is still limited by the coverage of the rules that are used to initialize the MLNs. This limitation means that data with unknown patterns cannot be labeled.
In future work, we will explore how to learn effective rules automatically for identifying entities and relations. We also plan to further study the problem of identifying indicative words for different relations.
Acknowledgments
This work is partially funded by the National Science Foundation of China under Grant 61170165, Grant 61702279, Grant 61602260, and Grant 61502095.
