Learning labeling functions in distantly supervised relation extraction

Abstract

Distant supervision has become the leading method for training large-scale information extractors. It can be encoded in the form of labeling functions, which employ knowledge bases to provide labels for the data. However, most previous works use only simple labeling functions, which introduces too much noise into the training data and leaves the knowledge bases far from fully exploited. In this paper, to improve the labeling quality of the training data for distantly supervised relation extraction, we propose to make use of existing knowledge bases to effectively learn labeling functions. Specifically, labeling functions are represented in Markov Logic, which can naturally integrate various resources into a unified model. Experimental results show that the training data produced by the learned labeling functions is significantly improved in quality. Different distantly supervised relation extraction models trained on the produced training data also achieve better performance.

Keywords

Relation extraction, distant supervision, labeling functions, Markov logic networks

1. Introduction

For some information extraction applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning. Distant supervision, also known as knowledge-based weak supervision, therefore proposes a paradigm for the heuristic creation of training sets. The basic idea is to match relation instances from a knowledge base against a text corpus. A relation instance $r(e_{1},e_{2},\ldots,e_{n})$ denotes some type of relation $r$ between multiple entities. For a binary relation instance $r(e_{1},e_{2})$, distant supervision labels a sentence with the relation $r$ if the sentence contains the same entities $e_{1}$ and $e_{2}$ as in $r(e_{1},e_{2})$. This method is straightforward and creates large training sets quickly. However, as noted by [33], the created training data contains many false positives, and this noise hurts the performance of information extraction.
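The matching heuristic described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: entities are matched by case-insensitive substring search, the function name `distant_label` and the second sentence are hypothetical, and the single knowledge-base entry is the founder(YouTube, Chad Hurley) instance used later in the paper.

```python
from typing import Iterable, List, Tuple

# (relation, entity 1, entity 2); the tuple layout is an assumption of this sketch.
RelationInstance = Tuple[str, str, str]


def distant_label(sentence: str, kb: Iterable[RelationInstance]) -> List[str]:
    """Return every relation whose two entities both occur in the sentence."""
    text = sentence.lower()
    return [r for (r, e1, e2) in kb if e1.lower() in text and e2.lower() in text]


# Toy knowledge base with a single relation instance.
kb = [("founder", "YouTube", "Chad Hurley")]

# The sentence expresses the relation, so the label is correct.
labels_pos = distant_label("Chad Hurley, co-founder and executive of YouTube.", kb)

# Hypothetical sentence: both entities occur, but the relation is not expressed,
# so the heuristic produces a false positive label.
labels_noise = distant_label("Chad Hurley spoke about YouTube at a conference.", kb)
```

Both calls return `["founder"]`, which is exactly the false positive noise that the rest of the paper tries to avoid at labeling time.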

Distant supervision consists of two main tasks: training data labeling and extraction model training. To improve the final performance, existing works handle noisy data at different stages. They can be grouped into three categories:

Category 1: Handle noisy data when training the extraction models. Most of the previous works in distant supervision fall into this category [33, 14, 39, 1, 37, 44, 6, 2, 34, 10]. They first employ heuristic methods and relation instances in the knowledge bases to label the training data, which inevitably introduces much noise. Then they focus on how to make use of the noisy data to train classification or tagging models. However, the noise mixed into the training data becomes the performance bottleneck of these methods.

Category 2: Filter out noisy data before training the extraction models. Existing works in this category [40, 15, 41, 3, 36, 11, 29] employ heuristic methods and limited knowledge to label the training data. Then they filter out the noise based on some constraints before feeding the data to the extraction models. The performance of these methods can be improved since the noise is reduced. However, because heuristic labeling methods impose only loose restrictions, the noise in the labeled data is of various kinds and is not easy to remove with filters.

Category 3: Avoid noisy data when labeling the training data. Existing works in this category [24, 31, 30, 18] focus on creating high quality training data (that is, training data with less noise than that produced by the labeling methods used in categories 1 and 2) by avoiding potential noise when labeling. They first encode the world knowledge of domain experts into rules, also called labeling functions. Then they use these labeling functions to generate new positive training data that simulates human experts’ annotations. However, these methods rely on high quality labeling functions, which are usually designed manually and tested by trial and error. It is difficult to guarantee their quality under different circumstances. Furthermore, they cannot make full use of existing knowledge bases and language processing technologies to facilitate the generation of labeling functions.

For the training data labeling task, works in different categories explore different types of domain knowledge to label the training data. Works in category 1 use only relation instances in the knowledge base. Works in category 2 use relation instances and some constraints, e.g., entity type constraints or syntactic structure constraints. Works in category 3 use more domain knowledge. For the extraction model training task, they can share the same models for relation classification. In this paper, we focus on the former task, i.e., learning high quality labeling functions for distant supervision.

Unlike the information extractors which address the problem of deterministically generating a set of candidate logical forms from natural language sentences, labeling functions address the inverse problem. Given a set of candidate logical forms, i.e., relation instances in the form $r(e_{1},e_{2})$ in our case, labeling functions align $r(e_{1},e_{2})$ to a sentence $s$ , such that 1) $s$ is concerned with the entities $e_{1}$ and $e_{2}$ in $r$ , and 2) $s$ represents the relation $r$ between $e_{1}$ and $e_{2}$ in $r(e_{1},e_{2})$ . For example, given a set $\mathcal{S}$ of natural language sentences, information extractors generate relation facts from $\mathcal{S}$ in the form $r(e_{1},e_{2})$ . Inversely, given a set of relation instances in the form $r(e_{1},e_{2})$ , labeling functions determine whether each sentence in $\mathcal{S}$ mentions $r(e_{1},e_{2})$ .

The proposed method is able to make full use of existing knowledge bases (e.g., Freebase [7], FrameNet (https://framenet.icsi.berkeley.edu/fndrupal/) and WordNet (http://wordnet.princeton.edu/wordnet/)) and language processing technologies (e.g., part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing) to produce high quality training data. Specifically, we represent labeling functions using Markov Logic Networks (MLNs) [32], which consist of a set of weighted first-order formulas and can easily integrate various knowledge bases into a unified framework. The weight naturally reflects the importance of a labeling function for labeling the data of a relation.

In our experiments, relation instance knowledge bases such as Freebase and text corpora such as the New York Times (NYT) corpus [38] and Wikipedia (https://en.wikipedia.org) are used to produce the training data for distantly supervised relation extraction. The proposed method is evaluated on two tasks of distant supervision. Firstly, the learned labeling functions and three baseline methods are evaluated based on the labeled data they generate for ten real-world relations in NYT and Wikipedia respectively. Secondly, two existing relation extraction models, each trained using the labeled datasets generated by the different labeling methods, are evaluated on both human labeled and automatically labeled test data. The effectiveness of the proposed method is demonstrated in both tasks.

The major contributions of this paper are as follows. 1) We propose to learn labeling functions for distant supervision, which can improve the quality of the labeled training data while reducing manual work in the task. 2) We represent labeling functions as MLNs, which can integrate various knowledge bases and language processing technologies into the model naturally.

2. Related works

Distant supervision was first introduced by [9]. They aligned a knowledge base with the paper abstracts in PubMed, and then used the extracted sentences to train a naive Bayes extractor. Later, [23] employed distant supervision in relation extraction. They used a set of frequent relations in Freebase to train relation extractors over Wikipedia without labeled data.

Since then, many works have focused on relation extraction using distant supervision. They can be grouped into three categories according to how they handle noisy data.

Most of the previous works employ heuristic approaches to label training data and then handle noise when training the relation extractors. [33, 14, 39] proposed a series of graphical models to solve the problem. [1, 37] used hierarchical topic models to handle noise in training data. [44] proposed a tagging scheme that can jointly extract entities and relations to tackle the problem of erroneous delivery. [6, 2] combined active learning with distant supervision to select meaningful distantly labeled instances. [34, 10] considered the relation extraction task as a matrix factorization problem. Recent works have begun to explore additional information to facilitate the training of the relation extractors, such as side information about rare entities [35], the prior of positive bags [22], fine-grained entity types [16, 43, 19], indirect supervision knowledge [13], human labeled data [26], and document structure [5, 20]. However, the improvement of these methods is limited by the noise in the training data.

A variety of strategies have been proposed for correcting wrong labels before feeding the training data to the relation extractors. [40] presented a generative model to reduce wrong labels given the labeled corpus created by distant supervision. [15] used mention frequency, pointwise mutual information and mention centroids to remove noise. [41] tried to correct the most likely false negatives in the training data based on the ranking of pseudo-relevance feedback. [3] detected highly ambiguous entity pairs and removed them from training data. [36] proposed a Path Ranking Algorithm to identify possible false negatives in the training data. Although reduced, the noise in the training data still affects the performance of these methods. Deep reinforcement learning has also been applied to distantly supervised relation extraction. [11] proposed an instance selector that casts the sentence selection task as a reinforcement learning problem in order to choose high-quality training sentences for a relation classifier, and [29] proposed a false-positive indicator to automatically recognize false positive labels and then redistribute them into negative examples.

Recently, researchers proposed to avoid noise when labeling training data. [24] encoded the world knowledge of domain experts into rules to identify positive training data. Using these rules, they can automatically generate new positive training data that simulates human experts’ annotations. The works [31, 30] proposed a paradigm for the programmatic creation of training sets called data programming, in which domain experts provide a set of labeling functions, i.e., programs that heuristically label large subsets of data points. [18] proposed an embedding framework to resolve the conflicts generated by different handcrafted labeling functions. These works can create training data effectively. However, they rely mainly on expert-designed labeling functions, which are expensive to obtain and whose quality is difficult to guarantee across domains.

Our work falls into the third category. The main difference between our work and existing ones is that we learn labeling functions from general rules that are easily derived from world knowledge, instead of designing and selecting specific rules by trial and error. Furthermore, our method can make full use of the existing knowledge bases and language processing technologies to facilitate the learning of labeling functions.

3. Labeling function learning algorithm

This section introduces the proposed algorithm Labeling Function Learning (LFL), i.e., Algorithm 1, which consists of three main steps: observed fact extraction, initial labeling function generation and weighted labeling function learning.

Given a set $\mathcal{S}$ of sentences, a set $\mathcal{I}$ of relation instances in the knowledge bases, and a set $\mathcal{K}$ of pre-defined predicates that represent various knowledge sources (lines 1–3), Algorithm 1 outputs an MLN $\mathcal{G}$ , which contains a set of weighted labeling functions (line 4).

Algorithm 1: Labeling Function Learning Algorithm
Require:
1: $\mathcal{S}$, a small set of labeled sentences for training;
2: $\mathcal{I}$, a set of relation instances in the knowledge bases;
3: $\mathcal{K}$, a set of pre-defined predicates.
Ensure:
4: $\mathcal{G}$, an MLN, which is a set of weighted labeling functions.
5: function LFL($\mathcal{S},\mathcal{I},\mathcal{K}$)
6:   $\mathcal{O}_{s}\leftarrow\mathrm{\textsc{Parse}}(\mathcal{S})$
7:   $\mathcal{O}_{k}\leftarrow\mathrm{\textsc{FactExtractor}}(\mathcal{O}_{s},\mathcal{I},\mathcal{K})$
8:   $\Lambda\leftarrow\mathrm{\textsc{RuleGenerator}}(\mathcal{K})$
9:   $\mathcal{G}_{init}\leftarrow\mathrm{\textsc{InitMLN}}(\Lambda)$ // initialize an MLN
10:   $\mathcal{G}\leftarrow\mathrm{\textsc{WeightLearning}}(\mathcal{G}_{init},\mathcal{O}_{k},\mathcal{S})$
11:   return $\mathcal{G}$
12: end function

Step 1 (observed fact extraction, line 6): This step uses the method Parse to extract observed facts from each sentence in $\mathcal{S}$. The Parse method is built on a set of NLP tools for POS tagging, NER and dependency parsing (in our experiments, we employ Stanford CoreNLP, https://stanfordnlp.github.io/CoreNLP/). It transforms the parsing results into logical forms, which are called observed facts and denoted as $\mathcal{O}_{s}$.

Step 2 (initial labeling function generation, lines 7 and 8): This step encodes world knowledge as first-order logic rules. The method FactExtractor extracts a set $\mathcal{O}_{k}$ of facts for the pre-defined predicates in $\mathcal{K}$ , which will be used in step 3 (line 7). Based on these predicates in $\mathcal{K}$ , the method RuleGenerator produces a set $\Lambda$ of rules as initial labeling functions, which can be applied to all relations (line 8).

Step 3 (weighted labeling function learning, lines 9 and 10): This step initializes an MLN using the rule set constructed in the previous step (line 9), and then learns the weights of these rules using the method WeightLearning based on $\mathcal{O}_{k}$ and $\mathcal{S}$ (line 10). The output MLN constitutes the final set of labeling functions.

Example 1: Chad Hurley, co-founder and executive of YouTube. (FOUNDER)

Given the above sentence, which is labeled as a training example of the relation “FOUNDER”, a set of observed facts can be extracted by the method Parse which are shown as follows:

$\displaystyle f_{1}:\ \textit{Pos}(s,\textit{co-founder},\textit{NN})$
$\displaystyle f_{2}:\ \textit{Pos}(s,\textit{executive},\textit{NN})$
$\displaystyle f_{3}:\ \textit{Person}(s,\textit{Chad\ Hurley})$
$\displaystyle f_{4}:\ \textit{Organization}(s,\textit{YouTube})$
$\displaystyle f_{5}:\ \textit{Dep}(s,\textit{appos},\textit{Chad\ Hurley},\textit{co-founder})$
$\displaystyle f_{6}:\ \textit{Dep}(s,\textit{nmod},\textit{co-founder},\textit{YouTube})$

where $f_{1}$ and $f_{2}$ are lexical facts, $f_{3}$ and $f_{4}$ are named entity facts, and $f_{5}$ and $f_{6}$ are syntactic dependency relation facts. $f_{1}$ means that the word “co-founder” is a noun (NN) in $s$. $f_{3}$ means that “Chad Hurley” is recognized as a PERSON entity in $s$. $f_{5}$ means that the entity “Chad Hurley” and the word “co-founder” have the dependency relation $\textit{appos}$ (appositional modifier) in $s$. See Section 4.4 for more dependency relations. The details of steps 2 and 3 are introduced in the following two sections.

4. Initial labeling function generation

Various knowledge sources can be incorporated into the labeling functions to facilitate the generation of annotated data. Knowledge bases (e.g., Freebase) which contain known relation instances can provide the knowledge for entity names and entity types. Knowledge bases (e.g., FrameNet, WordNet) which contain semantic information between words can provide the knowledge for relation indicators. Syntactical parsing information can provide the knowledge for dependency relations between words. We first introduce the pre-defined predicates that represent these knowledge sources, and then detail the generation of initial labeling functions.

4.1 Predicate for entity name knowledge

Inspired by the works in distant supervision, we assume that if the entity names of a relation instance $i$ are matched with entity mentions in a sentence $s$, then the relation in $i$ is a candidate label for $s$. Most of the previous works first recognize entities using NER techniques, then use exact string matching to match entity names. However, this mainly captures entities mentioned by their full names. We capture entities by full names, short names and approximate names based on semantic distances between strings, e.g., “Chad Hurley” is linked to “Chad Meredith Hurley”.

The predicate $EntSat(s,r,e_{1},e_{2})$ is used to represent this knowledge. It is true iff the sentence $s$ contains possible names of the entities $e_{1}$ and $e_{2}$ in the given relation instance $r(e_{1},e_{2})$ . For example, if there is a relation instance founder(YouTube, Chad Hurley) in $\mathcal{I}$ , and the entities “YouTube” and “Chad Hurley” are matched with the entity mentions in Example 1, denoted as $s$ , then we have the fact EntSat(s,founder,YouTube,Chad Hurley).
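A rough sketch of how facts for EntSat could be derived is given below. The paper does not specify its string distance, so this sketch swaps in a much simpler stand-in: a mention matches a knowledge-base name if its tokens appear in the same order among the tokens of that name. The helper names `name_matches` and `ent_sat` are hypothetical.

```python
from typing import List


def name_matches(mention: str, kb_name: str) -> bool:
    """True if the mention's tokens occur, in order, among the KB name's tokens.

    Stand-in for the paper's (unspecified) semantic string distance.
    """
    tokens = mention.lower().split()
    remaining = iter(kb_name.lower().split())
    # `tok in remaining` advances the iterator, which enforces token order.
    return bool(tokens) and all(tok in remaining for tok in tokens)


def ent_sat(mentions: List[str], e1: str, e2: str) -> bool:
    """EntSat holds if some mention matches e1 and some mention matches e2."""
    return (any(name_matches(m, e1) for m in mentions)
            and any(name_matches(m, e2) for m in mentions))


# Entity mentions recognized in Example 1.
mentions = ["Chad Hurley", "YouTube"]
holds = ent_sat(mentions, "YouTube", "Chad Meredith Hurley")
```

With this rule, the short name “Chad Hurley” is linked to “Chad Meredith Hurley”, while a reordered string such as “Hurley Chad” is rejected.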

4.2 Predicate for entity type knowledge

The entity types that NER techniques assign to the recognized entities in $\textit{EntSat}(s,r,e_{1},e_{2})$ are usually basic or coarse-grained, so other types such as “PROFESSION” might not be captured. We propose to link entities to fine-grained entity types by employing the pre-defined entity types in the relation instance knowledge bases, e.g., “philosopher” is recognized as “PROFESSION”.

The predicate $\textit{TypeSat}(s,r,e_{1},e_{2})$ is used to represent this knowledge. It is true iff the entity types of $e_{1}$ and $e_{2}$ in $s$ are the same as those in $r(e_{1},e_{2})\in\mathcal{I}$ .

4.3 Predicates for relation indicator knowledge

Unlike entity matching, words that describe relation names (e.g., “FOUNDER”, “PLACE_OF_BIRTH”) in a relation instance usually cannot be directly matched with the relation mentions in the sentences. We observed that some relations can be identified based on indicative words. For example, to express the relation “FOUNDER”, a sentence usually contains an indicative word such as “founded”, “establish”, “founder”, or “co-founders”.

Table 1
Descriptions and aliases of some Freebase relations

Relation names	Descriptions	Also known as (aliases)
/people/person/place_of_birth	Place where a person is born	POB, birthplace, location of birth, birth location
/organization/organization/founders	Person who creates an institution intended to perpetuate itself after the founder’s association ends	Founder
/people/person/children	First-degree relative, either son or daughter	Son or daughter
/people/deceased_person/place_of_death	Place where a person died	POD, location of death, deathplace

Table 2
Examples of relation indicator seeds and extensions of some Freebase relations

Relation names	Indicator seeds	Indicator extensions
/people/person/place_of_birth	Born, birthplace, birth	Bear
/organization/organization/founders	Founder, create	Found, start, develop, establish
/people/person/children	Child, son, daughter	Kid, boy, girl, junior, jr, jnr, stepson, stepdaughter
/people/deceased_person/place_of_death	Die, death, deathplace	Decease, demise, asphyxiate, drown, pass, expire, starvation, perish, end, starve, suffocation

Inspired by previous works [4, 42], we use indicators in verb and noun forms to identify relation mentions in the sentences. Specifically, we employ a self-expansion method to extract the set of relation indicative words. Relation indicator seeds are extracted from the descriptions and aliases of Freebase relations. Note that Freebase has been migrated to Wikidata (https://www.wikidata.org/) and a mapping has been established between them (https://developers.google.com/freebase/). Table 1 shows the descriptions and aliases of some Freebase relations in Wikidata. We also manually map the relation names of the relation instances in $\mathcal{I}$ to FrameNet frames (e.g., the relation “PLACE_OF_BIRTH” can be mapped to the “Being_born” frame in FrameNet), and then collect the words in the lexical units of these frames as relation indicator seeds. Furthermore, we extend the set of relation indicative words by selecting the words most similar to the relation indicator seeds with the help of WordNet [25]. Table 2 shows examples of relation indicator seeds and their extensions for some Freebase relations.

Two predicates are used to represent the above knowledge. $\textit{KeyInSent}(s,r)$ is true iff an indicative word of relation $r$ exists in the sentence $s$. $\textit{KeyInPath}(s,r,e_{1},e_{2})$ is true iff an indicative word of relation $r$ exists on the dependency path of the sentence $s$ between the entities $e_{1}$ and $e_{2}$.
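The two predicates can be illustrated with a small Python sketch. It assumes that lemmatization and extraction of the dependency path between the two entities have already been done upstream. The indicator set is a hypothetical, lower-cased mix of the “founders” row of Table 2 and the indicative words listed earlier in this section, and the function names are not taken from the authors' code.

```python
from typing import Dict, Iterable, Set

# Hypothetical indicator lexicon for a single relation.
INDICATORS: Dict[str, Set[str]] = {
    "founders": {"founder", "co-founder", "create", "found",
                 "start", "develop", "establish"},
}


def key_in_sent(sentence_lemmas: Iterable[str], relation: str) -> bool:
    """KeyInSent(s, r): some indicator of r occurs anywhere in the sentence."""
    words = INDICATORS.get(relation, set())
    return any(lemma in words for lemma in sentence_lemmas)


def key_in_path(path_lemmas: Iterable[str], relation: str) -> bool:
    """KeyInPath(s, r, e1, e2): the same test, restricted to the lemmas on the
    dependency path between e1 and e2 (the path is computed by the caller)."""
    return key_in_sent(path_lemmas, relation)


# Example 1, with lemmas and the entity-to-entity path supplied by hand.
sentence_lemmas = ["chad", "hurley", ",", "co-founder", "and",
                   "executive", "of", "youtube", "."]
path_lemmas = ["co-founder"]

in_sentence = key_in_sent(sentence_lemmas, "founders")
on_path = key_in_path(path_lemmas, "founders")
```

KeyInPath is the stricter of the two: an indicator that occurs in the sentence but not between the two entities satisfies KeyInSent only.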

4.4 Predicate for syntactic constraint knowledge

Recent works have shown that dependency relations between entities are effective for relation classification [8, 4], and they might also be useful to identify relation mentions.

The predicate $\textit{DepSat}(s,e_{1},e_{2})$ is used to represent the dependency relation constraint knowledge. It is true iff there is a dependency path of $s$ between entities $e_{1}$ and $e_{2}$ according to the dependency parsing results.

Table 3
Examples of dependency relations defined in Stanford CoreNLP

Relation names	Descriptions
Nmod	A noun (or noun phrase) functioning as a non-core (oblique) argument or adjunct
Appos	A nominal immediately following the first noun that serves to define or modify that noun
Amod	An adjectival phrase that serves to modify noun phrases
Prep	A prepositional phrase that serves to modify a verb, adjective, or noun
Nsubj	A noun phrase which is the syntactic subject of a clause
Csubj	A clausal syntactic subject of a clause
Xsubj	The relation between the head of an open clausal complement and the external subject of that clause
Dobj	The noun phrase which is the direct object of the verb
Iobj	The noun phrase which is the indirect object of the verb
Conj	Connected by a conjunction

Various dependency relation patterns can be used to extract facts of $\textit{DepSat}(s,e_{1},e_{2})$. The work [4] found that eight lexico-syntactic formulas cover 95% of relation phrases in their corpus. Thus, we use patterns such as the following to generate $\textit{DepSat}(s,e_{1},e_{2})$ (see https://github.com/guiyaocheng/DistantSupervision-LabelingFunction/tree/master/logic for all patterns used in this paper):

$\displaystyle\textit{Dep}(s,\textit{Nmod},w,e_{1})\wedge\textit{Dep}(s,\textit{Appos},e_{2},w)\Rightarrow\textit{DepSat}(s,e_{1},e_{2})$
$\displaystyle\textit{Dep}(s,\textit{Appos},e_{1},e_{2})\Rightarrow\textit{DepSat}(s,e_{1},e_{2})$

where $\textit{Dep}(s,d,w_{1},w_{2})$ is true iff the words $w_{1}$ and $w_{2}$ have the dependency relation $d$ in sentence $s$. For example, according to the facts $f_{5}$, $f_{6}$ and the above rules, we can get the fact $\textit{DepSat}(s,\textit{YouTube},\textit{Chad\ Hurley})$. We use the dependency relations defined in Stanford CoreNLP (see http://universaldependencies.org/docsv1/u/dep/index.html for descriptions of all of them); some examples are shown in Table 3.
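To make the derivation of $\textit{DepSat}$ concrete, the two example patterns can be applied to the facts $f_{5}$ and $f_{6}$ with a short Python sketch. Only these two patterns are implemented, dependency facts are represented as plain tuples $(d,w_{1},w_{2})$ for a single sentence, and the function name `dep_sat` is hypothetical.

```python
from typing import List, Set, Tuple

# (dependency relation d, word w1, word w2), mirroring Dep(s, d, w1, w2).
Dep = Tuple[str, str, str]


def dep_sat(deps: List[Dep], e1: str, e2: str) -> bool:
    """Check the two example patterns for DepSat(s, e1, e2)."""
    dep_set: Set[Dep] = set(deps)
    # Pattern: Dep(s, appos, e1, e2) => DepSat(s, e1, e2)
    if ("appos", e1, e2) in dep_set:
        return True
    # Pattern: Dep(s, nmod, w, e1) ^ Dep(s, appos, e2, w) => DepSat(s, e1, e2)
    return any(d == "nmod" and w2 == e1 and ("appos", e2, w1) in dep_set
               for (d, w1, w2) in deps)


# Facts f5 and f6 extracted from Example 1.
deps = [
    ("appos", "Chad Hurley", "co-founder"),
    ("nmod", "co-founder", "YouTube"),
]

satisfied = dep_sat(deps, "YouTube", "Chad Hurley")
```

Here `satisfied` is `True` through the first pattern of the text with $w$ bound to “co-founder”, whereas swapping the two entity arguments matches neither pattern.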

4.5 Labeling function generation

The method RuleGenerator in Algorithm 1 produces a set of initial rules based on the pre-defined predicates. The antecedent of each rule is a combination of the predicates and contains at least one of them. The consequent of each rule is the predicate $\textit{Label}(s,r)$, which means that $s$ is labeled with the relation $r$. Examples of the initial rules are shown as follows:

$\displaystyle\lambda_{1}:\ \textit{EntSat}(s,r,e_{1},e_{2})\Rightarrow\textit{Label}(s,r)$
$\displaystyle\lambda_{2}:\ \textit{EntSat}(s,r,e_{1},e_{2})\wedge\textit{TypeSat}(s,r,e_{1},e_{2})\Rightarrow\textit{Label}(s,r)$
$\displaystyle\lambda_{3}:\ \textit{EntSat}(s,r,e_{1},e_{2})\wedge\textit{KeyInSent}(s,r)\Rightarrow\textit{Label}(s,r)$
$\displaystyle\lambda_{4}:\ \textit{EntSat}(s,r,e_{1},e_{2})\wedge\textit{KeyInPath}(s,r,e_{1},e_{2})\Rightarrow\textit{Label}(s,r)$
$\displaystyle\lambda_{5}:\ \textit{EntSat}(s,r,e_{1},e_{2})\wedge\textit{DepSat}(s,e_{1},e_{2})\Rightarrow\textit{Label}(s,r)$

Each labeling function combines one or more knowledge sources to label the data. However, it is not known in advance which combinations of knowledge sources are more or less effective in labeling the data, or how to handle conflicts between different rules. We propose to employ MLNs to handle these problems. The details are introduced in the following section.
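One plausible reading of the rule generation step is to enumerate every non-empty combination of the pre-defined predicates as a rule body. The Python sketch below does exactly that; it is an assumption about RuleGenerator rather than the authors' implementation (the rule set actually used is in their repository and may be a subset), and the textual rule syntax is only illustrative.

```python
from itertools import combinations
from typing import List

# The five pre-defined predicates of Section 4, written as plain strings.
PREDICATES = [
    "EntSat(s,r,e1,e2)",
    "TypeSat(s,r,e1,e2)",
    "KeyInSent(s,r)",
    "KeyInPath(s,r,e1,e2)",
    "DepSat(s,e1,e2)",
]


def generate_rules(predicates: List[str]) -> List[str]:
    """Build one rule 'body => Label(s,r)' per non-empty predicate combination."""
    rules = []
    for size in range(1, len(predicates) + 1):
        for body in combinations(predicates, size):
            rules.append(" ^ ".join(body) + " => Label(s,r)")
    return rules


rules = generate_rules(PREDICATES)
```

Five predicates give $2^{5}-1=31$ candidate rules, among them the five examples $\lambda_{1}$ to $\lambda_{5}$; weight learning then decides which of the candidates matter for a given relation.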

5. Weighted labeling function learning

We first introduce Markov Logic briefly, and then detail weight learning and label inference using MLNs.

5.1 Markov logic

The main representation challenges of the studied problem are how to encode a wide variety of knowledge bases into a unified framework, and how to model the uncertainties of the knowledge. One of the most powerful representation languages is Markov logic [32], which is a probabilistic extension of first-order logic.

A Markov logic network (MLN) is a set of weighted first-order clauses. It defines a Markov network with one node per ground atom and one feature per ground clause. The probability of a state $x$ in such a network is given by

$\displaystyle P(x)=\frac{1}{Z}\exp\left(\sum_{i}w_{i}f_{i}(x)\right),$ (1)

where $Z$ is a normalization constant, $w_{i}$ is the weight of the $i$ th clause, $f_{i}=1$ if the $i$ th clause is true, and $f_{i}=0$ otherwise. MLNs have been successfully applied to many different tasks, such as supervised coreference resolution [28], relational pattern clustering [17], etc.
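Equation (1) can be evaluated by brute force on a very small network, which may help build intuition. The sketch below assumes three ground atoms for a single sentence (EntSat, KeyInPath and Label, abbreviated E, K and L) and two ground clauses with purely illustrative weights; real MLN engines avoid this exhaustive enumeration.

```python
import math
from itertools import product

# Ground clauses over the atoms (E, K, L) as (weight, truth function) pairs.
# The weights are illustrative only.
clauses = [
    (-0.9, lambda E, K, L: (not E) or L),          # EntSat => Label
    (7.1, lambda E, K, L: (not (E and K)) or L),   # EntSat ^ KeyInPath => Label
]


def score(world):
    """Unnormalized probability exp(sum_i w_i f_i(x)) of one world."""
    return math.exp(sum(w for w, f in clauses if f(*world)))


worlds = list(product([False, True], repeat=3))
Z = sum(score(x) for x in worlds)               # normalization constant
prob = {x: score(x) / Z for x in worlds}        # P(x) for all 8 worlds


def conditional_label(E, K):
    """P(Label = true | EntSat = E, KeyInPath = K)."""
    num = prob[(E, K, True)]
    return num / (num + prob[(E, K, False)])


p_with_key = conditional_label(True, True)      # about 0.998
p_without_key = conditional_label(True, False)  # about 0.289
```

With these weights, entity matching alone makes the label unlikely, while an indicative word on the dependency path makes it almost certain.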

5.2 Weight learning

The learning and inference algorithms provided in the open-source tool Tuffy (http://i.stanford.edu/hazy/tuffy/home) are used in this paper. Specifically, we perform inference using the MC-SAT algorithm [27] and weight learning using the Diagonal Newton algorithm [21].

We observed that the most effective rules for each relation are different. To explore which rules are more effective for each relation, and also consider the efficiency of the learning and inference algorithms, we learn a separate MLN for each relation instead of learning a unified MLN for all relations.

Given a set of initial rules, we first generate an initial MLN $\mathcal{G}_{init}$, then learn weights for each rule using the method $\mathrm{\textsc{WeightLearning}}(\mathcal{G}_{init},\mathcal{O}_{k},\mathcal{S})$. Note that $\mathcal{S}$ can be considered as a development set, i.e., a small set of labeled sentences used to learn the weights of the labeling functions. Compared with works that use large training sets or rely on human experts to write specific rules for each relation or domain, a small development set is economical and much easier to obtain. The output MLN of $\mathrm{\textsc{WeightLearning}}(\mathcal{G}_{init},\mathcal{O}_{k},\mathcal{S})$ using Tuffy consists of a set of weighted rules in the form:

$\displaystyle w:\textit{body}(b_{1},\ldots,b_{n})\Rightarrow\textit{head}(h_{1},\ldots,h_{m})$

where $w$ is the weight of the rule that reflects how strong a constraint is: the higher the weight, the greater the difference in log probability between a world that satisfies the rule and one that does not, other things being equal.
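As a short check of this reading against Eq. (1), consider two states $x$ and $x^{\prime}$ that agree on every ground clause except a single grounding of the rule with weight $w$, which $x$ satisfies and $x^{\prime}$ does not (the two states are hypothetical, introduced only for illustration). The normalization constant $Z$ cancels, leaving

$\displaystyle\log P(x)-\log P(x^{\prime})=\sum_{i}w_{i}\left(f_{i}(x)-f_{i}(x^{\prime})\right)=w,$

so $x$ is $e^{w}$ times as probable as $x^{\prime}$. A negative weight therefore makes the satisfying state the less probable one.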

For example, the MLN learned by Algorithm 1 for the relation “FOUNDER” contains weighted rules like:

$\displaystyle-0.9:\textit{EntSat}(s,r,e_{1},e_{2})\Rightarrow\textit{Label}(s,r)$
$\displaystyle 7.1:\textit{EntSat}(s,r,e_{1},e_{2})\wedge\textit{KeyInPath}(s,r,e_{1},e_{2})\Rightarrow\textit{Label}(s,r)$
$\displaystyle 0.7:\textit{EntSat}(s,r,e_{1},e_{2})\wedge\textit{DepSat}(s,e_{1},e_{2})\Rightarrow\textit{Label}(s,r)$

The MLN learned by Algorithm 1 for the relation “CONTAINS” contains the following weighted rules:

$\displaystyle 0.2:\textit{EntSat}(s,r,e_{1},e_{2})\wedge\textit{TypeSat}(s,r,e_{1},e_{2})\Rightarrow\textit{Label}(s,r)$
$\displaystyle-0.1:\textit{EntSat}(s,r,e_{1},e_{2})\wedge\textit{KeyInSent}(s,r)\Rightarrow\textit{Label}(s,r)$
$\displaystyle 0.7:\textit{EntSat}(s,r,e_{1},e_{2})\wedge\textit{DepSat}(s,e_{1},e_{2})\Rightarrow\textit{Label}(s,r)$

We can see that the most effective rules for the two relations differ a lot.

5.3 Label inference

Given an MLN for each relation, we calculate the marginal probability of $\textit{Label}(s,r)$ . It is then used as the confidence of labeling results.

Computing marginal probabilities subsumes probabilistic inference, which is #P-complete, and logical inference, which is NP-complete even in finite domains, so exact inference is intractable in general. The query is therefore approximated using the MC-SAT algorithm, a “slice sampling” Markov chain Monte Carlo [12] algorithm that uses a combination of satisfiability testing and simulated annealing to sample from the slice.

For example, given an MLN for relation $r$ and a set of sentences ($s_{1}$ to $s_{5}$), the marginal probabilities for these sentences are shown as follows. We set the confidence threshold to 0.5, so sentences $s_{1}$ to $s_{4}$ are labeled as training examples of $r$.

$\displaystyle 1.00\quad\textit{Label}(s_{1},r)$
$\displaystyle 0.79\quad\textit{Label}(s_{2},r)$
$\displaystyle 0.69\quad\textit{Label}(s_{3},r)$
$\displaystyle 0.51\quad\textit{Label}(s_{4},r)$
$\displaystyle 0.42\quad\textit{Label}(s_{5},r)$
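The selection step in this example amounts to a simple filter over the inferred marginals, sketched below in Python. The dictionary layout and function name are hypothetical, and the sketch assumes a strict comparison against the threshold, which the text does not specify.

```python
from typing import Dict, List

# Marginal probabilities of Label(s_i, r) from the example above.
marginals: Dict[str, float] = {
    "s1": 1.00, "s2": 0.79, "s3": 0.69, "s4": 0.51, "s5": 0.42,
}
THRESHOLD = 0.5


def select_training_examples(marginals: Dict[str, float],
                             threshold: float) -> List[str]:
    """Keep the sentences whose marginal exceeds the confidence threshold."""
    return [s for s, p in marginals.items() if p > threshold]


selected = select_training_examples(marginals, THRESHOLD)
```

The result keeps $s_{1}$ to $s_{4}$ and drops $s_{5}$; raising the threshold trades coverage for label precision.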

6. Experiments

In this section, we set up two experiments to evaluate the performance of the proposed method.

• Exp 1: Different labeling methods are evaluated based on the training data they generate for real-world relations in NYT and Wikipedia respectively.
• Exp 2: Different relation extraction models trained using the labeled data generated by these labeling methods are then evaluated on both human labeled and automatically labeled test data.

The hypothesis of Exp 1 is that the proposed method is capable of improving the quality of the labeled training data. The hypothesis of Exp 2 is that the performance of relation extraction models can be further improved based on high quality training data.

6.1 Experimental setting

Here we introduce the experiment datasets, evaluation metrics and baseline approaches.

6.1.1 Dataset for extracting relation instances

Following the previous works in distant supervision [23, 33], we use Freebase [7] as the source of relation instance knowledge base. Relation instances from the latest version of Freebase are extracted as the source of supervision. The entity type constraints of these relations are collected from Freebase and represented in logical formulas.

6.1.2 Datasets for extracting relation indicators

We extract relation indicators from the lexical units of frames in FrameNet and from the word synsets in WordNet. FrameNet is a knowledge base with information on the mapping of meaning to semantic frames, each of which can be thought of as a conceptual structure describing an event, relation, or object and the participants in it. The words that evoke a frame are called lexical units. We extract indicative words of relations from the lexical units of frames. WordNet is a large lexical knowledge base of English, in which nouns, verbs, adjectives and adverbs are grouped into sets of synonyms (synsets). We extract indicative words of relations from the synsets of WordNet. Note that other resources can also be used for extracting indicative words. We choose FrameNet and WordNet because most of the indicative words in our case are verbs and nouns.

6.1.3 Datasets for labeling and extraction models

We use NYT and Wikipedia as the text corpora for the different labeling functions and relation extraction models. Both corpora are widely used in the distant supervision community. In total, the training set contains 81,168 sentences and 20,426 observed facts, and the test set contains 29,607 sentences and 11,331 observed facts. Additionally, labeled sentences are needed to train our MLN models. Since no human-labeled data existed, we labeled the data ourselves.¹¹

¹¹
See https://github.com/guiyaocheng/DistantSupervision-LabelingFunction/tree/master/data for the labeled training data.

Due to time and labor limitations, we labeled 1,760 sentences for 10 representative relations, comprising 827 positive and 933 negative sentences. We selected these 10 relations because the most effective knowledge sources for them differ considerably, as analyzed further in Section 6.2.1. Detailed information about the labeled data is shown in Table 4. Taking $r_{1}$ as an example, 207 sentences (#lab) are labeled as training examples of $r_{1}$; however, only 81 of them (#pos) are true positive training examples, so the noise ratio is 61% (#neg/#lab). Half of the labeled sentences are used to train our MLN models (weight learning), and the remaining half are used for testing.

Table 4

Detailed information about the labeled dataset. #lab/#pos/#neg is the number of all/positive/negative labeled sentences. Noise ratio is calculated as #neg/#lab. In the relation /A/B/C, A and B are namespaces and C is the relation name.

ID	Relations	#lab	#pos	#neg	Noise ratio
$r_{1}$	/people/deceased_person/cause_of_death	207	81	126	61%
$r_{2}$	/location/location/contains	193	95	98	51%
$r_{3}$	/organization/organization/founders	193	52	141	73%
$r_{4}$	/people/person/place_of_birth	187	36	151	81%
$r_{5}$	/people/person/profession	183	135	48	26%
$r_{6}$	/people/deceased_person/place_of_death	174	27	147	84%
$r_{7}$	/people/person/children	172	113	59	34%
$r_{8}$	/organization/organization/place_founded	164	60	104	63%
$r_{9}$	/people/place_lived/location	155	107	48	31%
$r_{10}$	/location/location/nearby_airports	132	121	11	8%
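The noise ratios in Table 4 follow directly from the counts. The short check below is an illustrative sketch rather than part of the original experiments: it recomputes #neg/#lab for each relation and confirms that the totals match the 1,760 labeled, 827 positive and 933 negative sentences stated in the text. The names `TABLE_4` and `noise_ratio` are introduced here for illustration only.

```python
# (relation id, #lab, #pos, #neg), copied from Table 4.
TABLE_4 = [
    ("r1", 207, 81, 126),
    ("r2", 193, 95, 98),
    ("r3", 193, 52, 141),
    ("r4", 187, 36, 151),
    ("r5", 183, 135, 48),
    ("r6", 174, 27, 147),
    ("r7", 172, 113, 59),
    ("r8", 164, 60, 104),
    ("r9", 155, 107, 48),
    ("r10", 132, 121, 11),
]


def noise_ratio(num_neg: int, num_lab: int) -> float:
    """Share of labeled sentences that are negative (#neg/#lab)."""
    return num_neg / num_lab


for rel, lab, pos, neg in TABLE_4:
    # Every labeled sentence is either positive or negative.
    assert pos + neg == lab
    print(f"{rel}: noise ratio = {noise_ratio(neg, lab):.0%}")

total_lab = sum(row[1] for row in TABLE_4)
total_pos = sum(row[2] for row in TABLE_4)
total_neg = sum(row[3] for row in TABLE_4)
print(total_lab, total_pos, total_neg)
```

Over all 10 relations, the same formula gives an overall noise ratio of 933/1760, i.e., roughly 53%.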

6.1.4 Evaluation metrics

In Exp 1, we adopt standard precision (P), recall (R) and $F_{1}$-score ($F_{1}$) to evaluate the quality of the data generated by each labeling method against the human-labeled test dataset. In Exp 2, the relation extraction models, which are trained on the datasets generated by the different labeling methods, are evaluated by precision at top $K$ (P@$K$) over the ranked lists that the models produce on a held-out set of facts.
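As a minimal sketch of these metrics (not the evaluation scripts used in the experiments), the functions below compute precision, recall and $F_{1}$ for a set of sentences labeled positive by a labeling method, and P@$K$ for a ranked list of extracted facts. The function names, the string identifiers and the convention of dividing by $K$ in P@$K$ are assumptions made for illustration.

```python
from typing import Sequence, Set, Tuple


def precision_recall_f1(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """P, R and F1 of the items labeled positive against the gold positives."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def precision_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of the top-k ranked items that are correct (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for item in ranked[:k] if item in gold) / k


# Toy example: four sentences labeled positive, three truly positive.
p, r, f1 = precision_recall_f1({"s1", "s2", "s3", "s4"}, {"s1", "s2", "s5"})
print(p, r, f1)

# Toy ranked list of extracted facts, most confident first.
ranked_facts = ["f1", "f2", "f3", "f4"]
print(precision_at_k(ranked_facts, {"f1", "f3"}, k=2))
```

In the toy example, two of the four predicted sentences are correct, so precision is 1/2, recall is 2/3 and $F_{1}$ is 4/7.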

6.1.5 Baseline approaches

In Exp 1, we compare with three baseline approaches, which represent three different methods for training data labeling.

StandardDS employs a standard distant supervision method, implemented with the labeling function based on entity name knowledge, i.e., rule $\lambda_{1}$ . The difference is that StandardDS uses only full names to match entities. It represents the training data labeling methods used in related works of category 1.

TypeAware employs a type-aware distant supervision method, implemented with the labeling function based on entity name and entity type knowledge, i.e., rule $\lambda_{2}$ . It represents the training data labeling methods used in related works of category 2.

NoiseReduce employs a noise-reducing method, implemented with the labeling function based on entity name and syntactic constraint knowledge, i.e., rule $\lambda_{5}$ . The difference is that NoiseReduce uses only full names to match entities. It represents the training data labeling methods used in related works of category 3. We do not compare with the more recent work [30] because it requires relation-specific labeling functions written by human experts, which are not publicly available.

Given a set of sentences, we use the proposed method and the three baseline methods to generate four different sets of labeled sentences. We then evaluate the quality of the generated data on the human-labeled test set. Sentences from NYT and Wikipedia are evaluated separately.
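To make the difference between the first two baselines concrete, the sketch below labels a sentence in the StandardDS style (both full entity names of a knowledge base fact occur in the sentence) and in the TypeAware style (the mentions must additionally satisfy the type constraints of the relation). It is a simplified illustration, not the Markov logic rules $\lambda_{1}$ and $\lambda_{2}$ themselves: the `Mention` class, the function names, the coarse type labels and the toy entities are all hypothetical, and for brevity the TypeAware sketch keeps exact full-name matching.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Mention:
    text: str      # surface string of the entity mention
    ent_type: str  # hypothetical coarse entity type, e.g. "PERSON"


# A knowledge base fact r(e1, e2) as (relation, e1 full name, e2 full name).
Fact = Tuple[str, str, str]


def standard_ds(mentions: List[Mention], facts: List[Fact]) -> Set[str]:
    """Label the sentence with r if both full names of some r(e1, e2) occur."""
    names = {m.text for m in mentions}
    return {rel for rel, e1, e2 in facts if e1 in names and e2 in names}


def type_aware(mentions: List[Mention], facts: List[Fact],
               type_constraints: Dict[str, Tuple[str, str]]) -> Set[str]:
    """Like standard_ds, but the mention types must match the relation's constraints."""
    types = {m.text: m.ent_type for m in mentions}
    labels = set()
    for rel, e1, e2 in facts:
        if e1 in types and e2 in types and rel in type_constraints:
            type1, type2 = type_constraints[rel]
            if types[e1] == type1 and types[e2] == type2:
                labels.add(rel)
    return labels


# Toy data: "Springfield" is tagged as an organization in this sentence.
facts = [("/people/person/place_of_birth", "Alice Smith", "Springfield")]
constraints = {"/people/person/place_of_birth": ("PERSON", "LOCATION")}
sentence = [Mention("Alice Smith", "PERSON"), Mention("Springfield", "ORGANIZATION")]

print(standard_ds(sentence, facts))              # name match only
print(type_aware(sentence, facts, constraints))  # name match plus type check
```

In this toy case the name-only function labels the sentence with the relation, while the type-aware function rejects it because the second mention is not a location. This is the kind of false positive that type constraints are meant to filter out.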

In Exp 2, we compare two models for relation extraction.

MultiR [14] is a representative model based on a probabilistic graphical model for multi-instance multi-label learning. It uses the perceptron algorithm for learning and a greedy search algorithm for inference. We use the publicly available code for this model.¹²

¹²
www.cs.washington.edu/ai/raphaelh/mr/.

MIMLRE [39] is a representative model based on a two-layer graphical model. It is trained using hard discriminative Expectation-Maximization. We use the publicly available code provided by the authors.¹³

¹³
nlp.stanford.edu/software/mimlre.shtml.

Given a set of labeled sentences as the training data, which is generated by one of the compared methods in Exp 1, we use MultiR and MIMLRE to train relation extraction models respectively. Then we evaluate the trained models on two different test sets. One is the human labeled test set, and the other is the automatically labeled test set.

6.2 Experimental results

6.2.1 Results for Exp 1

The experimental results of Exp 1 are shown in Tables 5 and 6. Specifically, Table 5 shows the precision, recall and $F_{1}$-score of the four compared approaches, i.e., StandardDS, TypeAware, NoiseReduce and LFL (Ours), evaluated on NYT. Table 6 shows the results evaluated on Wikipedia.

We observe from the tables that the average $F_{1}$-score of LFL (Ours) is markedly better than those of the other methods (0.84 versus at most 0.62 on NYT, and 0.82 versus at most 0.55 on Wikipedia), and that its $F_{1}$-score for each individual relation is the best or within 0.01 of the best. This demonstrates the effectiveness of our approach in labeling high-quality data by selecting proper knowledge sources for different relations in different data.

From Table 5, we can see that the average recalls of StandardDS and TypeAware are higher than those of NoiseReduce and LFL (Ours). This is reasonable because StandardDS and TypeAware impose only loose restrictions when labeling the data, which results in high recall. Conversely, the average precisions of StandardDS and TypeAware are lower than those of NoiseReduce and LFL (Ours), because NoiseReduce and LFL (Ours) use stricter restrictions to avoid potential noise when labeling the data. Although both of these methods use stricter restrictions, LFL (Ours) performs significantly better than NoiseReduce because it can learn the most effective labeling functions from a set of candidate labeling functions.

Table 5

Precision, recall and $F_{1}$-score of the methods StandardDS, TypeAware, NoiseReduce and LFL (Ours) evaluated on NYT

ID	StandardDS			TypeAware			NoiseReduce			LFL (Ours)
	P	R	$F_{1}$	P	R	$F_{1}$	P	R	$F_{1}$	P	R	$F_{1}$
$r_{1}$	0.44	1.00	0.61	0.44	1.00	0.61	0.43	0.48	0.45	0.89	0.81	0.85
$r_{2}$	0.33	1.00	0.50	0.34	0.75	0.47	0.71	0.75	0.73	0.94	0.75	0.83
$r_{3}$	0.21	1.00	0.34	0.35	0.90	0.50	0.23	0.50	0.31	0.67	1.00	0.80
$r_{4}$	0.02	1.00	0.04	0.02	1.00	0.04	0.12	0.40	0.18	0.83	1.00	0.91
$r_{5}$	0.80	1.00	0.89	0.87	1.00	0.93	0.93	0.85	0.89	0.85	1.00	0.92
$r_{6}$	0.28	1.00	0.44	0.32	1.00	0.49	0.25	0.18	0.21	0.83	0.91	0.87
$r_{7}$	0.51	1.00	0.68	0.53	1.00	0.69	0.53	0.47	0.50	0.92	0.58	0.71
$r_{8}$	0.61	1.00	0.76	0.63	0.89	0.74	0.72	0.68	0.70	0.68	0.89	0.77
$r_{9}$	0.76	1.00	0.86	0.77	0.94	0.85	0.88	0.60	0.71	0.76	1.00	0.86
$r_{10}$	0.86	1.00	0.92	0.86	1.00	0.92	0.89	0.44	0.59	0.86	1.00	0.92
avg	0.48	1.00	0.60	0.51	0.95	0.62	0.57	0.54	0.53	0.82	0.89	0.84
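The avg row of Table 5 appears to be the unweighted mean of the per-relation scores (a macro-average), not an $F_{1}$-score recomputed from the averaged precision and recall. The check below, an illustrative sketch using the StandardDS columns copied from the table, shows that the two readings give different numbers; the variable names are introduced here for illustration.

```python
# StandardDS scores for r1..r10 on NYT, copied from Table 5.
P = [0.44, 0.33, 0.21, 0.02, 0.80, 0.28, 0.51, 0.61, 0.76, 0.86]
R = [1.00] * 10
F1 = [0.61, 0.50, 0.34, 0.04, 0.89, 0.44, 0.68, 0.76, 0.86, 0.92]


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


macro_p = mean(P)
macro_r = mean(R)
macro_f1 = mean(F1)  # mean of the per-relation F1 scores

# Alternative reading: harmonic mean of the averaged P and R.
f1_of_means = 2 * macro_p * macro_r / (macro_p + macro_r)

print(round(macro_p, 2), round(macro_r, 2), round(macro_f1, 2))
print(round(f1_of_means, 2))
```

The mean of the per-relation $F_{1}$-scores reproduces the reported 0.60, whereas the harmonic mean of the averaged precision (0.48) and recall (1.00) would be about 0.65. This is why an average $F_{1}$ in these tables can be lower than both reading the P and R averages would suggest.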

Table 6

Precision, recall and $F_{1}$-score of the methods StandardDS, TypeAware, NoiseReduce and LFL (Ours) evaluated on Wikipedia

ID	StandardDS			TypeAware			NoiseReduce			LFL (Ours)
	P	R	$F_{1}$	P	R	$F_{1}$	P	R	$F_{1}$	P	R	$F_{1}$
$r_{1}$	0.25	1.00	0.41	0.13	1.00	0.23	0.08	0.15	0.11	0.73	0.85	0.79
$r_{2}$	0.71	0.97	0.82	0.38	0.97	0.54	0.88	0.66	0.75	0.72	0.97	0.83
$r_{3}$	0.31	1.00	0.47	0.17	1.00	0.29	0.38	0.31	0.34	0.63	0.94	0.75
$r_{4}$	0.34	1.00	0.51	0.16	1.00	0.28	0.50	0.44	0.47	0.80	1.00	0.89
$r_{5}$	0.83	0.15	0.26	0.35	1.00	0.52	1.00	0.09	0.17	0.88	0.85	0.86
$r_{6}$	0.11	0.50	0.17	0.07	1.00	0.14	0.20	0.14	0.17	0.83	0.71	0.77
$r_{7}$	0.69	1.00	0.82	0.36	1.00	0.53	0.64	0.29	0.40	0.77	0.97	0.86
$r_{8}$	0.70	0.50	0.58	0.22	1.00	0.36	1.00	0.21	0.35	0.54	0.93	0.68
$r_{9}$	0.93	0.52	0.67	0.32	0.96	0.48	0.91	0.40	0.56	0.69	0.96	0.80
$r_{10}$	0.94	0.73	0.82	0.36	0.98	0.53	0.86	0.30	0.44	0.95	0.98	0.96
avg	0.58	0.74	0.55	0.25	0.99	0.39	0.65	0.30	0.38	0.75	0.92	0.82

From Table 6, we can see that the average recalls of TypeAware and LFL (Ours) are much higher than those of StandardDS and NoiseReduce. This is also reasonable because TypeAware and LFL (Ours) use short and approximate names to match entities, while StandardDS and NoiseReduce do not. For some relations (e.g., relation $r_{5}$ ), matching entities by short or approximate names is required because entity mentions vary a lot in the sentences of Wikipedia. The average precision of LFL (Ours) is higher than that of TypeAware because entity type restrictions alone are not enough to label data precisely; more information should be integrated into the model.

Furthermore, Table 5 shows that the most effective knowledge source for relations $r_{8}$ , $r_{9}$ and $r_{10}$ in NYT is entity name knowledge, since the $F_{1}$-scores of StandardDS, which uses only entity name knowledge, are almost the same as those of LFL (Ours), which selects proper knowledge sources from multiple ones. Similarly, we can conclude that the most effective knowledge source for relations $r_{5}$ and $r_{7}$ in NYT is entity type knowledge, the most effective knowledge source for relation $r_{2}$ in NYT is syntactic constraint knowledge, and the most effective knowledge source for the rest of the 10 relations in NYT is relation indicator knowledge. According to the results in Table 6, the most effective knowledge source for each relation in Wikipedia differs from that in NYT. The proposed approach, LFL (Ours), achieves the best or nearly the best $F_{1}$-score for every relation on both datasets.

Figure 1.

P@K of MultiR and MIMLRE trained on 4 labeled datasets generated by 4 labeling methods. Sub-figures (a) and (b) are evaluated on the human labeled test data. Sub-figures (c) and (d) are evaluated on the automatically labeled test data by LFL (Ours).

Additionally, as shown in Table 4, the number of labeled positive sentences differs across the 10 relations; e.g., relation $r_{5}$ has 135 positive sentences while relation $r_{6}$ has only 27. As the results in Tables 5 and 6 show, the proposed approach remains effective even for relations with few positive training sentences.

6.2.2 Results for Exp 2

The experimental results of Exp 2 are shown in Fig. 1. Sub-figures (a) and (b) show the P@K results of different relation extraction models trained on the data labeled by StandardDS, TypeAware, NoiseReduce and LFL (Ours) respectively, and evaluated on the human labeled test data. Sub-figures (c) and (d) show the P@K results evaluated on the automatically generated test data. We use the test data labeled by LFL (Ours) as it is the best method according to the results of Exp 1.

From the figures, we observe that LFL (Ours) outperforms the other methods for both models once the curves become steady. The performance of LFL (Ours) is stable in all cases, whereas the performance of the other methods varies from case to case.

It is interesting to see from sub-figures (a) and (b) that both MultiR and MIMLRE perform better with the training data labeled by LFL (Ours) than with the data labeled by StandardDS. Note that the original MultiR and MIMLRE methods use the StandardDS method to label the training data. We can conclude that the proposed labeling functions improve the quality of the training data and thus the final performance on the relation extraction task.

We can draw the same conclusions from sub-figures (c) and (d). The main difference is that the performance ranking of the three baseline methods changes with the test data. However, the trend of each curve in sub-figures (c) and (d) still resembles that in sub-figures (a) and (b). That is, the data labeled automatically by the proposed method can serve as a substitute for human-labeled test data when human-labeled data is expensive and difficult to obtain.

In summary, the proposed approach produces much better labeled data than existing labeling methods by learning high-quality labeling functions with MLNs based on various knowledge sources.

7. Conclusion

In this paper, we proposed the LFL algorithm for labeling function learning, which aims to reduce the manual work of creating high-quality training data for distantly supervised relation extraction. We also explored the problem of integrating various knowledge bases and language processing technologies into a unified framework by representing a labeling function as an MLN. Experimental results show that the training data produced by our approach is significantly improved in quality. In addition, different distantly supervised relation extraction models trained on the produced training data achieve better performance. However, the approach is still limited by the coverage of the rules used to initialize the MLNs, which means that it fails to label data with unknown patterns.

In future work, we will explore how to automatically learn effective rules for identifying entities and relations. We also plan to further study the problem of identifying indicative words for different relations.

Acknowledgments

This work is partially funded by the National Science Foundation of China under Grant 61170165, Grant 61702279, Grant 61602260, and Grant 61502095.

References

1. Alfonseca, Filippova, Delort J.-Y. and Garrido, Pattern learning for relation extraction with a hierarchical topic model, In Proceedings of ACL-IJCNLP ’12, 2012, pp. 54–59.

2. Angeli, Tibshirani, Wu J.Y. and Manning C.D., Combining distant and partial supervision for relation extraction, In Proceedings of EMNLP ’14, 2014, pp. 1556–1567.

3. Augenstein, Seed selection for distantly supervised web-based relation extraction, In Proceedings of Workshop on SWAIE ’14, 2014, pp. 17–24.

4. Banko and Etzioni, The tradeoffs between open and traditional relation extraction, In Proceedings of ACL-HLT ’08, 2008, pp. 28–36.

5. Bing, Ling, Wang R.C. and Cohen W.W., Distant IE by bootstrapping using lists and document structure, In Proceedings of AAAI ’16, 2016, pp. 2899–2905.

6. Bobic and Klinger, Committee-based selection of weakly labeled instances for learning relation extraction, In Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics, 2013, pp. 187–197.

7. Bollacker, Evans, Paritosh, Sturge and Taylor, Freebase: A collaboratively created graph database for structuring human knowledge, In Proceedings of the ACM SIGMOD ’08, 2008, pp. 1247–1250.

8. Bunescu and Mooney, Learning to extract relations from the web using minimal supervision, In Proceedings of ACL ’07, 2007, pp. 576–583.

9. Craven and Kumlien, Constructing biological knowledge bases by extracting information from text sources, In Proceedings of ISMB ’99, 1999, pp. 77–86.

10. Fan, Zhao, Zhou, Liu, Zheng T.F. and Chang E.Y., Distant supervision for relation extraction with matrix completion, In Proceedings of ACL ’14, 2014, pp. 839–849.

11. Feng, Huang, Zhao, Yang and Zhu, Reinforcement learning for relation classification from noisy data, In Proceedings of AAAI ’18, 2018.

12. Gilks, Richardson and Spiegelhalter, Markov Chain Monte Carlo in Practice, Chapman & Hall, 1996.

13. Han and Sun, Global distant supervision for relation extraction, In Proceedings of AAAI ’16, 2016, pp. 2950–2956.

14. Hoffmann, Zhang, Ling, Zettlemoyer and Weld D.S., Knowledge-based weak supervision for information extraction of overlapping relations, In Proceedings of ACL ’11, 2011, pp. 541–550.

15. Intxaurrondo, Surdeanu, de Lacalle O.L. and Agirre, Removing noisy mentions for distant supervision, Procesamiento Del Lenguaje Natural 51 (2013), 41–48.

16. Koch, Gilmer, Soderland and Weld D.S., Type-aware distantly supervised relation extraction with linked arguments, In Proceedings of EMNLP ’14, 2014, pp. 1891–1901.

17. Kok and Domingos, Extracting semantic networks from text via relational clustering, In Proceedings of ECML PKDD ’08, 2008, pp. 624–639.

18. Liu, Ren, Zhu, Zhi, Gui and Han, Heterogeneous supervision for relation extraction: A representation learning approach, In Proceedings of EMNLP ’17, 2017, pp. 46–56.

19. Liu and Zhao, Exploring fine-grained entity type constraints for distantly supervised relation extraction, In Proceedings of COLING ’14, 2014, pp. 2107–2166.

20. Lockard, Dong X.L., Einolghozati and Shiralkar, Ceres: distantly supervised relation extraction from the semi-structured web, In Proceedings of VLDB ’18, 2018, pp. 1084–1096.

21. Lowd and Domingos, Efficient weight learning for Markov logic networks, In Proceedings of PKDD ’07, 2007, pp. 200–211.

22. Min, Grishman, Wan, Wang and Gondek, Distant supervision for relation extraction with an incomplete knowledge base, In Proceedings of NAACL-HLT ’13, 2013, pp. 777–782.

23. Mintz, Bills, Snow and Jurafsky, Distant supervision for relation extraction without labeled data, In Proceedings of ACL-IJCNLP ’09, 2009, pp. 1003–1011.

24. Natarajan, Picado, Khot, Kersting and Shavlik J.W., Effectively creating weakly labeled training examples via approximate domain knowledge, In Proceedings of ILP ’14, 2014, pp. 92–107.

25. Pedersen, Patwardhan and Michelizzi, WordNet::Similarity – measuring the relatedness of concepts, In National Conference on Artificial Intelligence, 2004.

26. Pershina, Min and Grishman, Infusion of labeled data into distant supervision for relation extraction, In Proceedings of ACL ’14, 2014, pp. 732–738.

27. Poon and Domingos, Sound and efficient inference with probabilistic and deterministic dependencies, In National Conference on Artificial Intelligence, 2006, pp. 458–463.

28. Poon and Domingos, Joint unsupervised coreference resolution with Markov logic, In Proceedings of EMNLP ’08, 2008, pp. 650–659.

29. Qin and Wang W.Y., Robust distant supervision relation extraction via deep reinforcement learning, arXiv preprint arXiv:1805.09927, 2018.

30. Ratner, Bach S.H., Ehrenberg, Fries and Ré, Snorkel: Rapid training data creation with weak supervision, Proceedings of the VLDB Endowment 11(3) (2017), 269–282.

31. Ratner, C.D., Selsam and Ré, Data programming: Creating large training sets, quickly, Neural Information Processing Systems, 2016, pp. 3567–3575.

32. Richardson and Domingos P.M., Markov logic networks, Machine Learning 62(1) (2006), 107–136.

33. Riedel, Yao and McCallum, Modeling relations and their mentions without labeled text, 2010, pp. 148–163.

34. Riedel, Yao, McCallum and Marlin B.M., Relation extraction with matrix factorization and universal schemas, In Proceedings of NAACL-HLT ’13, 2013, pp. 74–84.

35. Ritter, Zettlemoyer, Etzioni et al., Modeling missing data in distant supervision for information extraction, Transactions of the Association for Computational Linguistics 1 (2013), 367–378.

36. Roller, Agirre, Soroa and Stevenson, Improving distant supervision using inference learning, In Proceedings of the ACL-IJCNLP ’15, July 2015, pp. 273–278.

37. Roth and Klakow, Combining generative and discriminative model scores for distant supervision, In Proceedings of EMNLP ’13, 2013, pp. 24–29.

38. Sandhaus, The New York Times annotated corpus, Linguistic Data Consortium, Philadelphia 6(12) (2008).

39. Surdeanu, Tibshirani, Nallapati and Manning C.D., Multi-instance multi-label learning for relation extraction, In Proceedings of EMNLP ’12, 2012, pp. 455–465.

40. Takamatsu, Sato and Nakagawa, Reducing wrong labels in distant supervision for relation extraction, In Proceedings of ACL ’12, 2012, pp. 721–729.

41. Hoffmann, Zhao and Grishman, Filling knowledge base gaps for distant supervision of relation extraction, In Proceedings of ACL ’13, 2013, pp. 665–670.

42. Yahya, Whang, Gupta and Halevy, Renoun: Fact extraction for nominal attributes, In Proceedings of EMNLP ’14, 2014, pp. 325–335.

43. Zhang, Zeng, Yan, Chen and Sui, Towards accurate distant supervision for relational facts extraction, In Proceedings of ACL ’13, 2013, pp. 810–815.

44. Zheng, Wang, Bao, Hao, Zhou and Xu, Joint extraction of entities and relations based on a novel tagging scheme, In Proceedings of ACL ’17, 2017, pp. 1227–1236.
