Abstract
Distant supervision for relation extraction automatically obtains a large number of relational facts as training data, but it often introduces noisy labels. In this paper, we propose a latent-label denoising method based on self-directed confidence learning for distantly supervised relation extraction. Concretely, we design a self-directed algorithm that combines the semantic information of the model prediction with distant supervision to predict the confidence scores of latent labels. Because this mechanism uses the latent labels obtained for easy examples to produce the latent labels of hard examples step by step, the learning process is robust and reliable. It also enables dynamic exploration of the confidence space, which leads to better denoising performance. Moreover, to cope with the common imbalance problem in large corpora, where negative instances account for a much larger share, we introduce a discriminative loss function that targets the misclassification between non-relational and relational instances. To verify the generality of the proposed denoising method, we use different neural models (CNN, PCNN and BiLSTM) for representation learning. Experimental results show that our method corrects noisy labels with high accuracy and outperforms state-of-the-art relation extraction systems.
Introduction
The goal of Relation Extraction (RE) is to identify semantic relations between entity pairs in plain text, which benefits many Natural Language Processing (NLP) applications such as knowledge graph construction [1, 2] and question answering [3, 4]. Because supervised RE methods require large amounts of costly annotated training data, a distant supervision strategy [5] was introduced to label large-scale training data automatically. Distantly supervised RE aligns triples in Knowledge Bases (KBs) with sentences in a text corpus, and learns a relation extractor from the alignment. Although the distant supervision strategy scales easily to larger corpora, it tends to produce noisy labels (false negatives and false positives) due to the incompleteness and biases of the exploited KBs. As illustrated in Fig. 1, the relation between Helmut Panke and BMW is missing from the KB, so their relation in the text mention is labeled as NA; this is a false negative. In a false positive case, the relation between George W. Bush and Texas is labeled as /business/person/company because the triple exists in the KB, although the sentence does not actually express the relation. These relation labels are regarded as noisy labels, and many researchers have made significant efforts to deal with this wrong labeling issue.
A case of our label-level denoising method.
Recent work [6, 7] focuses on label-level noise, while most previous work pays attention to sentence-level noise. Sentence-level denoising methods [8, 9, 10] treat the distantly supervised labels as the ground-truth classification target, ignoring the effect of noisy labels. Subsequently, a soft-label method [6] was introduced to address the label-level denoising problem by replacing a noisy label with a soft label, which is obtained by presetting a static confidence for the distantly supervised label. The key idea of label-level denoising is to identify relation labels between entity pairs through similar contextual patterns in raw text. As shown in Fig. 1, given relational triples for the /business/person/company relation in the KB, our label-level denoising method uses the contextual patterns (grey fonts) of true positives (orange box) to correct the noisy labels of false positives (blue box) and false negatives (green box). Concretely, the true positive instances tend to have similar relation patterns such as “A, the chief executive at/of B” and “A, B’s chairman”. Thus it is possible to correct the noisy labels of false negatives and false positives depending on whether or not the relation patterns are matched. A basic premise of label-level denoising is that the overwhelming majority of distantly supervised labels are correct [6]; otherwise it is difficult to capture valid contextual patterns.
One of the biggest challenges of label-level denoising is how to correct the noisy labels without any human involvement or external prior knowledge. Rather than searching for the optimal static confidence for distantly supervised labels, as proposed by previous work [6], we aim to find a general method that gradually guides the label correction process. We are inspired by self-paced learning [11, 12], whose core idea is to use the relatively easy examples of a task at the beginning, and then gradually deal with the hard examples according to what the model has learned. This suggests a heuristic: correct the noisy labels of easy relation patterns first, so that they can serve the subsequent corrections of hard patterns. We therefore bridge noisy label correction and self-paced learning, and design a self-directed confidence learning method that memorizes the gradual prediction of latent labels at different training steps. Self-directed learning makes our model more robust because it gives priority to easy relation patterns, which have high prediction probabilities. In addition, instead of relying on a static confidence parameter, our method explores more of the confidence space to achieve optimal model performance.
Another key challenge is that non-relational (NA) instances far outnumber positive relational instances, so positive labels are easily predicted as the NA label. A discriminative loss [13], which emphasizes discriminative feature learning between non-relational and relational instances, proves to be effective for our task. Regardless of the differences among positive classes, the loss focuses on the misclassification between the positive classes as a whole and the negative class. In our experiments, we found that the discriminative loss effectively improves latent label prediction.
The main contributions of our paper are as follows:
1) We propose a latent-label denoising method for relation extraction with self-directed confidence learning. The self-directed learning mechanism of label confidence models the gradual prediction of latent labels at different training steps, which not only corrects noisy labels by gradually exploiting easy patterns to serve later hard patterns, but also achieves dynamic exploration of the confidence space.
2) We introduce a discriminative loss to emphasize the misclassification between NA and positive labels. The mechanism helps to solve the dominance problem of non-relational instances, and thus improves latent label prediction.
3) We conduct experiments on different neural networks (PCNN, CNN and BiLSTM) to verify the generality of our proposed method. The experimental results show that our approach is insensitive to model specifics and outperforms state-of-the-art methods.
The remainder of our paper is structured as follows. Section 2 introduces the related work. Section 3 presents our latent-label denoising method in detail. Section 4 shows our experimental results, analysis and case study. Section 5 gives a conclusion and the future work.
To address the issue of large-scale training data annotation in supervised RE tasks, a distant supervision strategy [5] was originally proposed. It generates a training set automatically by matching triples in a KB to sentences in a text corpus, and trains an extractor to predict the relation between entities by collecting features from all sentences. The strong assumption that all such sentences express the relation inevitably leads to wrong labels. To solve this problem, multi-instance learning [14] was introduced to predict the relation between an entity pair from all the sentences that mention them. A relaxed at-least-one assumption [14, 15, 16] is applied in multi-instance learning, which holds that at least one sentence containing the two entities expresses their relation. Subsequently, multi-instance multi-label learning [15, 17] was presented to address the multi-relation classification issue. However, these methods based on handcrafted features are limited by the labour and time needed to design fine-grained features, and they easily propagate errors from existing NLP tools.
To handle these challenges, deep learning has been proposed as a promising approach for automatically extracting features. The Piecewise Convolutional Neural Network (PCNN) [8] was introduced to learn the sentence representation for relation extraction. Following the at-least-one assumption, this method [8] only uses the information of the most valid sentence to predict the relation, and thus neglects the information of the other sentences. Afterwards, different attention mechanisms [9, 10] were proposed to use the information of multiple valid sentences. More recently, researchers have successfully applied generative adversarial networks [18] and reinforcement learning [19, 20, 21] to reduce noisy sentences by removing or redistributing them. Jat et al. [22] use two word attention models to achieve word-level denoising. Vashishth et al. [23] encode syntactic information of plain text with Graph Convolution Networks (GCN), and utilize side information including entity type and relation alias to improve RE performance.
In addition to the sentence-level denoising methods above, label-level denoising methods [6, 7] are presented to replace the labels from distant supervision with new labels as the target of classification. Liu et al. [6] propose to combine the relation patterns learned by model and distant labels to reduce noisy labels. More specifically, they learn a soft label by presetting a confidence vector for distantly supervised labels during training, which is effective but not flexible. Then with respect to the noise of different levels in distantly supervised RE, Sun et al. [7] propose three impacting factors for label denoising. This work shows that partial confidence for distantly supervised labels is an effective way to improve the RE task.
To make the label denoising process more dynamic, we draw on the method of Luo et al. [24]. It characterizes the noise in distantly supervised data with a dynamic transition matrix, which can be trained with a curriculum learning based method. Hence we borrow some ideas from curriculum learning [25] and self-paced learning [11, 12]. These learning strategies deal with the easy examples first, and then introduce more complex examples to train the model, which can speed up the convergence of the training process [25].
As for the problem that negative instances outnumber positive instances in the RE task, we study a series of work on class imbalance solutions [26, 27, 28], which pay more attention to minority positive classes. A positive-sharing loss [13], which ignores the differences among positive classes and only emphasizes the misclassification between the positive classes as a whole and the negative class, proves to benefit our proposed denoising method significantly.
Our proposed latent-label denoising architecture.
As illustrated in Fig. 2, we propose a latent-label denoising architecture for relation extraction with self-directed confidence learning. Instead of the noisy label obtained by distant supervision, a latent label is learned as the ground truth, relying on three modules: confidence, discriminative and representation learning. After initialization with distantly supervised labels, the neural network acquires a stable ability to capture contextual patterns of relation mentions through representation learning. A self-directed confidence learning algorithm is then used to model the gradual latent-label prediction at different training steps. In addition, we use a discriminative loss function to focus on the misclassification between NA and positive labels. Our proposed model allows the three modules to interact with each other for better denoising performance.
Preliminary
To describe the goal of our approach more clearly, we first give some definitions. In the multi-instance learning paradigm, given an entity pair, the set of all sentences that mention both entities is called the relation mention [29]. Concretely, the relation mention of an entity pair includes one or more sentences, and each sentence is called an instance; the relation type between the two entities in the KB is called the relation. In other words, the objective of relation extraction is to predict the relation of an entity pair given its relation mention.
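The definitions above can be illustrated with a small sketch that groups sentence-level instances into relation mentions keyed by entity pair. The function name `group_relation_mentions` and the placeholder sentences are hypothetical and chosen only for illustration; they are not taken from the dataset or from the authors' implementation.

```python
from collections import defaultdict


def group_relation_mentions(instances):
    """Group (head, tail, sentence) instances into relation mentions.

    Each key is an entity pair; each value is the list of sentences
    (instances) that mention both entities.
    """
    mentions = defaultdict(list)
    for head, tail, sentence in instances:
        mentions[(head, tail)].append(sentence)
    return dict(mentions)


# Hypothetical placeholder instances, for illustration only.
instances = [
    ("Helmut Panke", "BMW", "placeholder sentence 1"),
    ("Helmut Panke", "BMW", "placeholder sentence 2"),
    ("George W. Bush", "Texas", "placeholder sentence 3"),
]
bags = group_relation_mentions(instances)
```

Under this view, a relation extractor receives one relation mention (a list of instances) per entity pair and predicts a single relation for it.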
A label-level denoising method for relation extraction uses a new label as the ground truth of the relation extractor, and thereby corrects the noisy labels generated by distant supervision. The relation label
Given a training set containing
Next, we provide a brief description of the symbols in the paper. Matrices and vectors are bold capital and lowercase respectively, scalars and indexes are common lowercase. For example, given a vector
Confidence learning
To obtain a latent label as the ground truth for label denoising, we need to exploit the information of distant supervision effectively. The previous method [6] directly sets a static confidence for distant labels, which is difficult to extend and fails to explore more of the confidence space. Thus, instead of an explicit confidence hyper-parameter for distant labels, we introduce a new concept, label confidence, which denotes the confidence score for latent labels. Concretely, a latent label can be gradually obtained based on the label confidence at each training step. In addition to distant labels, we can make full use of the relation patterns predicted by the relation extraction model. Therefore, the core issue becomes how to dynamically trade off the distant label against the relation patterns through label confidence learning.
We are inspired by curriculum and self-paced learning [11, 12], and intend to handle the relation patterns in an orderly manner. Following the general terms, easy patterns are the common contextual patterns of relation mentions, which tend to have relatively high prediction probabilities and are easy to identify. Correspondingly, hard patterns are the special and difficult patterns with relatively low prediction probabilities. In general, we first deal with the latent labels of the easy patterns, and then use what the model has learned to gradually handle the hard patterns.
Following these intuitions, we design an iterative self-directed learning strategy for label confidence, which incorporates relation patterns to obtain the latent labels at each training step. As shown in Fig. 3, we first learn the latent labels of the examples with easy patterns, and then gradually correct the distant labels of the examples with relatively hard patterns. At the same time, the learning direction of latent labels stays consistent with the direction in which label confidence changes, from high to low, so this self-directed mechanism makes exploration of the confidence space possible. Concretely, the latent labels of easy relation patterns tend to be retained or modified early in training due to their high prediction probability. Later iterations then introduce more hard patterns, depending on what the model has already learned from the preceding iteration. As training proceeds, extra noise may be generated by the model prediction, so the dynamic exploration of the confidence space helps to ensure optimal performance.
An overview of self-directed confidence learning process.
Next, we describe the proposed confidence learning algorithm in detail. At the beginning of training, we initialize the model with distant labels and obtain the prediction ability of relation patterns. Once the model has stable prediction ability, it means there is no new information we can learn from the distant labels, because noisy labels will hinder the model performance. At this time, we take the model prediction into account to gradually correct the noisy labels. More specifically, label confidence at current training step can be viewed as a combination of preceding confidence and current model prediction. Let the number of labels
where
To be specific, we simply use a linear combination as the self-directed function, which reduces the effect of noisy labels while exploring the label confidence space. Confidence score of the
The initial confidence
As a whole, the self-directed learning retains the memory of latent-label prediction during training. By replacing confidence
Further, the self-directed function
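The update described in this section can be sketched as follows. This is a minimal illustration, not the authors' exact formulation: it assumes that the confidence is initialized as a one-hot vector of the distant label, that the linear combination takes the form `factor * previous_confidence + (1 - factor) * prediction`, and that the latent label is the label with the highest confidence. The names `update_confidence`, `latent_label` and `factor`, the value 0.8, and the fixed prediction vector are all hypothetical.

```python
import numpy as np


def update_confidence(prev_conf, pred_prob, factor=0.8):
    """Blend the preceding confidence with the current model prediction.

    Assumed linear form; `factor` stands in for the self-directed factor.
    """
    return factor * prev_conf + (1.0 - factor) * pred_prob


def latent_label(conf):
    """Assumed rule: the latent label is the label with highest confidence."""
    return int(np.argmax(conf))


num_labels = 3
distant_label = 1
# Assumption: initial confidence is the one-hot vector of the distant label.
conf = np.eye(num_labels)[distant_label]
# Hypothetical model prediction that consistently prefers label 2.
pred = np.array([0.05, 0.15, 0.80])

history = []
for step in range(8):
    conf = update_confidence(conf, pred, factor=0.8)
    history.append(latent_label(conf))
```

In this toy run the latent label stays at the distant label for the first few steps and switches only after the prediction has accumulated enough weight, which mirrors the gradual correction described above. In actual training the prediction would change from step to step, and a less confident (harder) pattern would take longer to switch.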
The goal of training a neural network is to maximize the probability of the correct class, which is generally achieved by minimizing the cross-entropy loss. Given an entity pair, the cross-entropy loss
where
However, the cross-entropy loss penalizes the classification error of every class equally, which is not appropriate for learning discriminative features between relational and non-relational instances. In our case, the loss caused by confusing a positive label with the negative label deserves more attention, because negative instances account for a large proportion of our corpus. That is to say, the losses for the positive labels are shared among positive instances, and we do not treat positive labels as entirely distinct classes. Thus, we employ a discriminative loss [13] by adding a new regularized term
where
Let
where
The experimental results show that using discriminative loss can help to explore more discriminative features than cross-entropy loss function. It makes the latent-label learning more effective and increases model discriminative power.
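The idea of adding a term that only separates NA from the positive classes as a whole can be sketched as below. The exact regularized term of [13] is not reproduced here; this sketch assumes a binary negative log-likelihood on the NA probability versus the summed positive probability, weighted by a hypothetical coefficient `alpha`. The names `discriminative_loss` and `na_index` are also illustrative assumptions.

```python
import numpy as np

EPS = 1e-12  # numerical guard against log(0)


def cross_entropy(probs, label):
    """Standard cross-entropy for a single example."""
    return -np.log(probs[label] + EPS)


def discriminative_loss(probs, label, na_index=0, alpha=1.0):
    """Cross-entropy plus an assumed NA-versus-positive binary term.

    The binary term treats all positive classes as one shared class, so
    confusing a positive label with NA costs more than confusing two
    positive labels with each other.
    """
    p_na = probs[na_index]
    p_side = p_na if label == na_index else 1.0 - p_na
    return cross_entropy(probs, label) + alpha * (-np.log(p_side + EPS))


# Both predictions give the true positive class (index 1) probability 0.3.
confused_with_na = np.array([0.6, 0.3, 0.1])        # most mass on NA
confused_with_positive = np.array([0.1, 0.3, 0.6])  # most mass on another positive class

loss_na = discriminative_loss(confused_with_na, label=1)
loss_pos = discriminative_loss(confused_with_positive, label=1)
```

The two predictions have the same cross-entropy, but the example whose probability mass sits on NA receives the larger total loss, which is the behaviour the section motivates.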
In our paper, we use an end-to-end model training procedure, which is shown in Algorithm 3.3. All the model parameters are updated with the Adam optimizer [30].
Latent-label Denoising Algorithm
Training set
As a key component of relation extraction, representation learning provides the basis for automatically extracting features from the relation mentions of entity pairs. The representation learning procedure, shown in Fig. 2, includes 1) a word and position encoder, which encodes the word and entity position information into a basic representation unit; 2) a sentence encoder, which uses different neural networks to learn a sentence representation; and 3) a relation-mention encoder, which combines the representations of multiple sentences through an attention mechanism to form the relation-mention representation.
Word and position encoder
In order to capture semantic contextual information of a word, pre-trained word embeddings [31, 32] are used to map a word into a real-valued vector. Besides, we indicate entity information with the widely used position embeddings [33, 34], which encode the relative distance between a word and each of the two entities as a vector respectively. Then the word embedding of the
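As an illustration of the position features, the sketch below computes the relative distance from every token to each of the two entities. It assumes that each entity occupies a single token and that distances are clipped to a maximum value; the function name `relative_positions`, the clipping bound `max_dist` and the example tokenization are hypothetical. In a full model each distance would then index a trainable position embedding table.

```python
def relative_positions(num_tokens, head_idx, tail_idx, max_dist=30):
    """Return (distance to head, distance to tail) for every token position.

    Distances are clipped to [-max_dist, max_dist]; the clipping bound is
    an assumption made for this sketch.
    """
    def clip(distance):
        return max(-max_dist, min(max_dist, distance))

    return [(clip(i - head_idx), clip(i - tail_idx)) for i in range(num_tokens)]


# Hypothetical tokenization of the pattern "A, the chief executive of B".
tokens = ["Helmut_Panke", ",", "the", "chief", "executive", "of", "BMW"]
positions = relative_positions(len(tokens), head_idx=0, tail_idx=6)
```

For the token "chief" at index 3, the pair is (3, -3): three tokens after the first entity and three tokens before the second.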
Sentence encoder
Many kinds of neural networks can be used to encode sentences into fixed-size vector representations, but there is no clear conclusion as to which one performs best for our task. To verify the generality of our proposed approach, we compare three different neural networks for learning the sentence representations: the standard Convolutional Neural Network (CNN) [33], Piecewise CNN (PCNN) [8] and Bidirectional Long Short-Term Memory (BiLSTM) [35].
For standard CNN, we use convolution filters with an activation function tanh to obtain the feature maps, then the output of max-pooling layer is regarded as the sentence representation. In addition to CNN, we also make an experiment on the widely used architecture PCNN [8]. To capture the structure and other latent information, each feature map of convolution layer is divided into three pieces by the positions of two entities. Then a piecewise max-pooling layer is applied by selecting the maximum value of each piece, instead of the traditional max-pooling operation over the whole sentence. The sentence representation is the concatenation of the three pieces.
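The piecewise max-pooling step can be sketched with NumPy as follows. The sketch assumes that the feature maps are given as a `(sequence_length, num_filters)` array, that each entity occupies a single position, and that each entity position closes the piece it belongs to; the handling of an empty piece (filled with zeros) is also an assumption. The function name `piecewise_max_pool` is hypothetical.

```python
import numpy as np


def piecewise_max_pool(feature_maps, head_idx, tail_idx):
    """Max-pool each of the three pieces delimited by the two entities.

    feature_maps: array of shape (sequence_length, num_filters).
    Returns a vector of length 3 * num_filters.
    """
    feature_maps = np.asarray(feature_maps, dtype=float)
    first, second = sorted((head_idx, tail_idx))
    pieces = [
        feature_maps[: first + 1],            # up to and including the first entity
        feature_maps[first + 1 : second + 1],  # up to and including the second entity
        feature_maps[second + 1 :],            # remainder of the sentence
    ]
    pooled = []
    for piece in pieces:
        if piece.shape[0] == 0:
            # Assumption: an empty piece contributes zeros.
            pooled.append(np.zeros(feature_maps.shape[1]))
        else:
            pooled.append(piece.max(axis=0))
    return np.concatenate(pooled)


# Toy feature maps: 6 positions, 2 filters.
maps = np.arange(12).reshape(6, 2)
sentence_rep = piecewise_max_pool(maps, head_idx=1, tail_idx=3)
```

Compared with a single max over the whole sentence, the output keeps one maximum per filter for each of the three pieces, so the representation is three times as long.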
Basically, LSTM uses several gates to control the memory cell of the current word to pass and forget information to the next word in a sentence. Further, BiLSTM [35] considers both the past and future contexts of a word, including forward LSTM and backward LSTM. For a sentence of
To generate a fix-size feature vector which does not depend on the varying sentence length
In this layer, to reduce the impact of noisy sentences within relation mentions for an entity pair, a selective attention mechanism [9] is applied to arrange higher weight for more valid sentences. More specifically, we will compute the weighted average vector
where
Where
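A minimal sketch of selective attention over the sentences of one relation mention is given below. It assumes the commonly used bilinear scoring form in which each sentence representation is matched against a relation query vector through a diagonal weight matrix, followed by a softmax; the names `selective_attention`, `relation_query` and `weight_diag` are hypothetical, and the exact scoring function used in this paper may differ.

```python
import numpy as np


def selective_attention(sentence_reps, relation_query, weight_diag):
    """Weighted average of sentence representations.

    sentence_reps: (num_sentences, dim); relation_query: (dim,);
    weight_diag: (dim,), the diagonal of an assumed weight matrix.
    Returns (relation-mention representation, attention weights).
    """
    sentence_reps = np.asarray(sentence_reps, dtype=float)
    scores = sentence_reps @ (np.asarray(weight_diag, dtype=float) * relation_query)
    scores = scores - scores.max()  # stabilize the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ sentence_reps, weights


# Toy relation mention with three sentences of dimension 2.
reps = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
query = np.array([1.0, 0.0])
mention_rep, attn = selective_attention(reps, query, weight_diag=np.ones(2))
```

Sentences whose representation matches the relation query receive higher weights, so noisy sentences in the mention contribute less to the final representation.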
In this section, we will evaluate and compare the performance of our proposed approach with the state-of-the-art distantly supervised RE methods. Besides, we analyze the effects of three learning modules in detail. At last, we study some typical cases of the latent-label denoising method.
Dataset and evaluation
We evaluate our method on a popular benchmark dataset developed by Riedel et al. [14], which is widely used for the distantly supervised RE task. The dataset was generated by aligning relational triples in Freebase with sentences in the New York Times (NYT) corpus, and it contains 52 relation classes and an NA class. There are 522,611 sentences, 281,270 entity pairs and 18,252 relational triples in the training set, and 172,448 sentences, 96,678 entity pairs and 1,950 relational triples in the test set. Following previous work [9, 36, 6, 24], we employ the held-out evaluation with aggregate precision-recall curves and top N Precision (P@N) results.
Parameter settings
We adopt cross-validation to determine the model parameters. For detailed settings, we adopt the 50-dimensional pre-trained word embeddings released by Lin et al. [9] and 5-dimensional position embeddings. The window size and the number of feature maps for CNN and PCNN are set to 3 and 230, respectively. For BiLSTM, the number of hidden units is set to 230. The self-directed factor
Results and discussion
To evaluate our proposed method, we compare it with a variety of methods. Mintz [5] is a traditional distantly supervised RE method, MultiR [15] introduces multi-instance learning and MIMLRE [17] presents multi-instance multi-label learning. All of the above methods are based on feature engineering, and their results are taken from the original papers. PCNN [9] is regarded as the baseline of neural network based systems. PCNN
Comparison with previous methods
Figure 4 presents the precision-recall curves of our proposed PCNN
Precisions of various methods for different recalls
Top N precision (P@N) for relation extraction in the entity pairs with different number of sentences
Performance comparison of various methods with precision-recall curves.
Table 1 shows precisions of various models for different recalls
Following previous work [9, 6], we evaluate top N precision (P@N) for the One, Two and All settings. In these settings, we randomly choose one, two and all sentences, respectively, from the entity pairs with more than one sentence for testing. Table 2 shows the top 100/200/300 results and their mean values for the different settings. We observe that using latent labels effectively improves the baselines of the different neural models, and that the PCNN-based denoising model obtains better results than the other two models.
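For reference, P@N can be computed as in the sketch below: predictions are ranked by confidence, and precision is measured over the N highest-ranked ones. The function name `precision_at_n` and the toy scores are hypothetical and unrelated to the reported results.

```python
def precision_at_n(predictions, n):
    """Precision among the n most confident predictions.

    predictions: list of (confidence, is_correct) pairs.
    """
    if n <= 0 or not predictions:
        raise ValueError("n must be positive and predictions non-empty")
    ranked = sorted(predictions, key=lambda item: item[0], reverse=True)
    top = ranked[:n]
    return sum(1 for _, is_correct in top if is_correct) / len(top)


# Toy predictions: (confidence, whether the predicted fact is correct).
toy = [(0.9, True), (0.8, False), (0.7, True), (0.6, True), (0.2, False)]
p_at_2 = precision_at_n(toy, 2)
p_at_4 = precision_at_n(toy, 4)
```

In the held-out setting, correctness is judged against the KB facts of the test set, so the measured precision can underestimate the true precision when the KB is incomplete.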
Precision-recall curves for different learning modules.
Figure 5 presents the precision-recall curves of our different learning modules: Confidence Learning (CL), Discriminative Learning (DL) and Representation Learning (RL). The RL module is based on the CNN, PCNN and BiLSTM models respectively. Clearly, our latent-label mechanism combining the CL and DL modules yields a significant improvement on the different RL modules. Specifically, only adding CL module for PCNN has higher precision than PCNN
The left part of Fig. 6 depicts the precision-recall curves of different training epochs for PCNN model, from which we can verify the impact of the CL module. As the latent labels are used after 5 epochs, we will analyze later 5 epochs
Precision-recall curves of different training epochs for PCNN and PCNN
If we add the DL module, as in the right part of Fig. 6, the model corrects 2,710/5,876/7,683 distant labels for the same epochs. We find that the epoch with the fewest corrected distant labels performs best. As the situation is similar for the other neural networks, it is clear that the DL module accelerates the process of label correction. As training proceeds, adding the DL module further improves the results. The self-directed denoising process ensures that the latent labels of relatively easy relational patterns can be exploited to gradually correct the labels of the hard patterns; the case study is presented in Section 4.3.3.
To verify whether DL can alleviate the misclassification problem between negative and positive classes, we present the number of wrong predictions (false positives/false negatives) for extracting top N relation mentions in Table 3. For top 100/500/1,000/2,000 ranked relation mentions, we find the number of false positives and false negatives is reduced when adding the DL module to PCNN model. Here we ignore the wrong predictions between positive classes due to the discriminative learning. As a result, the number of wrong predictions reduces more sharply with respect to PCNN
The number of wrong predictions (false positives/false negatives) for extracting top N relation mentions
We manually examine latent-label corrections of 200 random relation mentions of our PCNN
Case study
First, we focus on the correct label corrections in the upper part of Table 4. We observe that some false positives can be corrected by latent labels, such as the distant labels location_contains and place_lived. In the first case, there is a wrongly labeled triple location_contains (Iverson, Minnesota) in the KB, because Iverson and Minnesota are regarded as two locations, whereas they actually have no direct relation. For the entity pair (Shohei Imamura, Tokyo), the sentence fails to express the place_lived relation. With the latent label, we can correct the relation to place_of_death according to the relation pattern “A …died …in B”. In addition, we can use latent labels to correct false negatives. For example, while the relation between Tom Arnold and TerraPass is missing from the KB, we can discover their latent relation company based on the patterns mentioned before. During the first few training epochs, the noisy labels of these examples are corrected immediately, and we treat their relation patterns as easy patterns.
As for the entity pair (Rem Koolhaas, Rotterdam), it is obvious that the sentence does not indicate the relation place_lived. In this case, we find that its latent label is corrected gradually owing to the self-directed learning mechanism. At first, the latent label remains the same as its distant label. Then the relation is corrected to place_of_birth in the next training epoch, and is finally identified as NA. The relation patterns of such examples, whose latent labels need to be learned gradually, are relatively special and are regarded as hard patterns. This case shows that our proposed self-directed learning is effective for latent-label denoising.
As shown in the lower part of Table 4, the few wrong corrections mainly come from the strong pattern-consistency assumption. In practice, the contextual patterns of relation mentions are very diverse, so it is a big challenge to recognize special patterns and to distinguish similar ones. For instance, as the relation mention of the entity pair (Rolling Stone, Annie Leibovitz) has a different pattern from the common patterns of company in Fig. 1, our model fails to identify the relation. In another case, for the entity pair (Schengen, North America), our model matches the pattern of their mention to “A in B” and regards both entities as locations, and thus wrongly corrects their relation to location_contains. The wrong corrections are relatively few compared with the correct ones. On the whole, our approach remains effective and robust.
In this paper, we propose a latent-label denoising method for distantly supervised relation extraction with self-directed confidence learning. This mechanism exploits the latent labels obtained for relatively easy patterns to predict the later latent labels of hard patterns, and it also performs dynamic exploration of the confidence space. In addition, we use a discriminative loss to address the misclassification between NA and positive labels, which greatly benefits the heuristic latent label prediction. Experimentally, we verify the generality of the proposed label denoising method on PCNN, CNN and BiLSTM based models. The results and case study demonstrate that our approach outperforms state-of-the-art systems on the popular evaluation dataset and corrects the noisy labels effectively.
In the future, we plan to utilize prior knowledge from external knowledge sources or some manual annotations to measure the label confidence with respect to each relation mention. Moreover, we will explore latent multi-label learning.
Acknowledgments
This work is supported by National Natural Science Foundation of China under Grant No. 61421061, No. 61602048, No. 61601046 and No. 61520106007, Project of Hainan Passenger Behavior Intelligence Analysis Platform and Precise Service Mining Prediction under Grant No. ZDKJ 201808, and BUPT-SICE Excellent Graduate Students Innovation Funds, 2016.
