A latent-label denoising method for relation extraction with self-directed confidence learning

Abstract

Distant supervision for relation extraction automatically obtains a large number of relational facts as training data, but it often suffers from the noisy-label problem. In this paper, we propose a latent-label denoising method based on self-directed confidence learning for distantly supervised relation extraction. Concretely, a self-directed algorithm that combines the semantic information of model prediction and distant supervision is designed to predict the confidence score of latent labels. Because this mechanism uses the latent labels obtained for easy examples to produce the latent labels of hard examples step by step, the learning process is robust and reliable. It also facilitates dynamic exploration of the confidence space to achieve better denoising performance. Moreover, to cope with the common imbalance problem in large corpora, where negative instances account for a much larger proportion, we introduce a discriminative loss function to reduce misclassification between non-relational and relational instances. To verify the generality of the proposed denoising method, we use different neural models (CNN, PCNN and BiLSTM) for representation learning. Experimental results show that our method corrects noisy labels with high accuracy and outperforms state-of-the-art relation extraction systems.

Keywords

Distant supervision, relation extraction, latent label, confidence learning, discriminative loss

1. Introduction

The goal of Relation Extraction (RE) is to identify semantic relations between entity pairs in plain text, which benefits many Natural Language Processing (NLP) applications such as knowledge graph construction [1, 2] and question answering [3, 4]. Because supervised RE methods require a large amount of costly annotated training data, a distant supervision strategy [5] was introduced to label large-scale training data automatically. Distantly supervised RE aligns triples in Knowledge Bases (KBs) with sentences in a text corpus and learns a relation extractor from the alignment. Although distant supervision scales easily to larger corpora, it tends to produce noisy labels (false negatives and false positives) because the exploited KBs are incomplete and biased. As illustrated in Fig. 1, the relation between Helmut Panke and BMW is missing from the KB, so the corresponding text mention is labeled NA, which is a false negative. Conversely, the relation between George W. Bush and Texas is labeled /business/person/company because the triple exists in the KB, although the sentence does not express that relation, which is a false positive. Such relation labels are regarded as noisy labels, and many researchers have made significant efforts to deal with the wrong-labeling issue.

Figure 1.

A case of our label-level denoising method.

Recent work [6, 7] focuses on label-level noise, whereas most previous work pays attention to sentence-level noise. Sentence-level denoising methods [8, 9, 10] treat the distantly supervised labels as the ground-truth classification target, ignoring the effect of noisy labels. Subsequently, a soft-label method [6] was introduced to address label-level denoising by replacing a noisy label with a soft label, which is obtained by presetting a static confidence for the distantly supervised label. The key idea of label-level denoising is to identify relation labels between entity pairs through similar contextual patterns in raw text. As shown in Fig. 1, given relational triples for the /business/person/company relation in the KB, our label-level denoising method uses the contextual patterns (grey fonts) of true positives (orange box) to correct the noisy labels of false positives (blue box) and false negatives (green box). Concretely, true positive instances tend to share relation patterns such as “A, the chief executive at/of B” and “A, B ’s chairman”. It is therefore possible to correct the noisy labels of false negatives and false positives depending on whether the relation patterns are matched. A basic premise of label-level denoising is that the overwhelming majority of distantly supervised labels are correct [6]; otherwise it is difficult to capture valid contextual patterns.

One of the biggest challenges of label-level denoising is how to correct noisy labels without human involvement or external prior knowledge. Rather than searching for the optimal static confidence for distantly supervised labels as in previous work [6], we seek a general method that gradually guides the label correction process. We are inspired by self-paced learning [11, 12], whose core idea is to use the relatively easy examples of a task at the beginning and then gradually deal with the hard examples according to what the model has learned. This suggests a heuristic: correct the noisy labels of easy relation patterns first, so that they serve the subsequent correction of hard patterns. We therefore bridge noisy-label correction and self-paced learning, and design a self-directed confidence learning method that memorizes the gradual prediction of latent labels at different training steps. Self-directed learning makes our model more robust because it preferentially handles easy relation patterns, which have high prediction probability. In addition, instead of relying on a static confidence parameter, our method explores a larger confidence space to improve model performance.

Another key challenge is that non-relational (NA) instances far outnumber positive relational instances, so positive instances are easily predicted as NA. A discriminative loss [13], which emphasizes discriminative feature learning between non-relational and relational instances, proves effective for our task. Regardless of the differences among positive classes, the loss focuses on misclassification between the positive classes as a whole and the negative class. Our experimental results show that the discriminative loss makes latent-label prediction more effective.

The main contributions of our paper are to:

• propose a latent-label denoising method for relation extraction with self-directed confidence learning. The self-directed learning mechanism of label confidence can model the gradual prediction of latent label at different training steps, which not only corrects noisy labels by gradually exploiting easy patterns to serve later hard patterns, but also achieves dynamic exploration of the confidence space.

• introduce a discriminative loss to emphasize the misclassification between NA and positive labels. The mechanism helps to solve the dominance problem of non-relational instances, and thus improve the latent label prediction more effectively.

• conduct experiments on different neural networks – PCNN, CNN and BiLSTM to verify the generality of our proposed method. The experimental results show that our approach is insensitive to the model specificities and outperforms the state-of-the-art methods.

The remainder of our paper is structured as follows. Section 2 introduces the related work. Section 3 presents our latent-label denoising method in detail. Section 4 shows our experimental results, analysis and case study. Section 5 gives a conclusion and the future work.

2. Related work

In order to address the issue of large-scale training data annotation in supervised RE tasks, a distant supervision strategy [5] was originally proposed. It generates training set automatically by matching triples in KB to sentences in text corpus, and trains an extractor to predict the relation between entities by collecting features from all sentences. The strong assumption that all sentences can express the relation inevitably suffers from the wrong labels. To solve this problem, multi-instance learning [14] is introduced to predict the relation between an entity pair within all the sentences that mention them. And a relaxed at-least-one assumption [14, 15, 16] in multi-instance learning is applied, which holds that at least one sentence containing two entities can express their relation. Subsequently, multi-instance multi-label learning [15, 17] is presented to address the multi-relation classification issue. However, these handcrafted feature based methods are restricted by labour and time for designing fine features, and easily lead to error propagation from existing NLP tools.

Fortunately, to handle these challenges above, deep learning is proposed as a promising approach for automatically extracting features. Piecewise Convolutional Neural Network (PCNN) [8] is introduced to learn the sentence representation for relation extraction. Following at-least-one assumption, this method [8] just utilizes the information of the most valid sentence to predict the relation, which obviously neglects the information of other sentences. Afterwards, different attention mechanisms [9, 10] are proposed to use the information of multiple valid sentences. More recently, researchers have successfully applied generative adversarial network [18] and reinforcement learning [19, 20, 21] to reduce the noisy sentences by removing or redistributing them. Jat et al. [22] use two word attention models to achieve word-level denoising. And Vashishth et al. [23] encode syntactic information of plain text with Graph Convolution Networks (GCN), and utilize side information including entity type and relation alias to improve RE performance.

In addition to the sentence-level denoising methods above, label-level denoising methods [6, 7] are presented to replace the labels from distant supervision with new labels as the target of classification. Liu et al. [6] propose to combine the relation patterns learned by model and distant labels to reduce noisy labels. More specifically, they learn a soft label by presetting a confidence vector for distantly supervised labels during training, which is effective but not flexible. Then with respect to the noise of different levels in distantly supervised RE, Sun et al. [7] propose three impacting factors for label denoising. This work shows that partial confidence for distantly supervised labels is an effective way to improve the RE task.

To make the label denoising process more dynamic, we are influenced by the method of Luo et al. [24]. It characterizes the noise in distantly supervised data with dynamic transition matrix, which can be trained with a curriculum learning based method. Hence we borrow some ideas from curriculum learning [25] and self-paced learning [11, 12]. These learning strategies deal with the easy examples at first, then introduce more complex examples to learn the model, which can speed the convergence of the training process [25].

As for another problem that negative instances are more than positive instances in RE task, we study a series of work about class imbalance solutions [26, 27, 28], which put more attention on minority positive classes. Luckily, a positive-sharing loss [13], which ignores the difference of different positive classes and only emphasizes the discriminative misclassification between whole positive classes and negative class, is proved to benefit our proposed denoising method significantly.

Figure 2.

Our proposed latent-label denoising architecture.

3. Methodology

As illustrated in Fig. 2, we propose a latent-label denoising architecture for relation extraction with self-directed confidence learning. Instead of noisy label obtained by distant supervision, a latent label is learned as the ground-truth relying on three modules – confidence, discriminative and representation learning. Through initialization of distantly supervised labels, the neural network has stable prediction ability to capture contextual patterns of relation mentions by representation learning. Then a self-directed confidence learning algorithm is proposed to model the gradual latent-label prediction at different training steps. Besides, we use a discriminative loss function to focus on the misclassification between NA and positive labels. Our proposed model allows the three modules to interact with each other for better denoising performance.

3.1 Preliminary

To describe the goal of our approach more clearly, we give some definitions. In the multi-instance learning paradigm, given an entity pair, the set of all sentences that mention both entities is called a relation mention [29]. Concretely, the relation mention of an entity pair includes one or more sentences, each of which is called an instance, and the relation type between the two entities in the KB is called a relation. In other words, the objective of relation extraction is to predict the relation of an entity pair given its relation mention.

A label-level denoising method for relation extraction uses a new label as the ground truth of the relation extractor, and thereby corrects the noisy labels generated by distant supervision. The relation label $y$ obtained from the KB by distant supervision is called the distant label. A wrongly assigned distant label is regarded as a noisy label, and we define the label corrected by our model as the latent label $\hat{y}$.

The training set contains $N$ examples, $\mathscr{D}=\{(x_{i},y_{i})\}_{i=1}^{N}$, where $x_{i}$ is the $i$-th input observation, denoting all the sentences (the relation mention) collected for an entity pair, and $y_{i}\in\{0,1,\ldots,K\}$ is the relation label of this entity pair in the KB. If $y_{i}=0$, then $x_{i}$ consists of non-relational negative instances and $y_{i}$ is the negative (NA) label. If $y_{i}=k>0$, then $x_{i}$ consists of relational positive instances and $y_{i}$ is the $k$-th positive label. In the rest of this paper, we omit the subscript and write an example as $(x, y)$.

Next, we briefly describe the notation used in the paper. Matrices are written in bold capital letters and vectors in bold lowercase letters; scalars and indices are in regular lowercase. For example, given a vector $\mathbf{x}$, $\mathbf{x}_{i}$ is its $i$-th element.

3.2 Confidence learning

To obtain a latent label as the ground truth for label denoising, we need to exploit the information of distant supervision effectively. The previous method [6] directly sets a static confidence for distant labels, which is difficult to extend and fails to explore a larger confidence space. Thus, instead of an explicit confidence hyper-parameter for distant labels, we introduce a new concept, label confidence, which denotes the confidence score for latent labels. Concretely, a latent label is gradually obtained from the label confidence at each training step. In addition to distant labels, we make full use of the relation patterns predicted by the relation extraction model. The core issue therefore becomes how to dynamically trade off the distant label against the predicted relation patterns through label confidence learning.

We are inspired by curriculum and self-paced learning [11, 12], and intend to handle the relation patterns in an orderly manner. Following the general terms, easy patterns are the common contextual patterns of relation mentions, which tend to have relatively high prediction probabilities and are easy to identify. Correspondingly, hard patterns are special and difficult patterns with relatively low prediction probabilities. In general, we first deal with the latent labels of the easy patterns, then use what the model has learned to gradually handle the hard patterns.

According to these intuitions, we design an iterative self-directed learning strategy for label confidence, which combines relation patterns to obtain the latent labels at each training step. As shown in Fig. 3, we first learn the latent labels of examples with easy patterns, then gradually correct the distant labels of examples with relatively hard patterns. At the same time, the learning direction of latent labels stays consistent with the direction in which label confidence changes, from high to low, so this self-directed mechanism makes exploration of the confidence space possible. Concretely, the latent labels of easy relation patterns are readily retained or modified during training because of their high prediction probability. Later iterations then introduce harder patterns, depending on what the model has learned in the preceding iterations. As training proceeds, extra noise may be generated by the model prediction, so dynamic exploration of the confidence space helps the model reach better performance.

Figure 3.

An overview of self-directed confidence learning process.

Next, we describe the proposed confidence learning algorithm in detail. At the beginning of training, we initialize the model with distant labels so that it acquires the ability to predict relation patterns. Once the model predicts stably, little new information can be learned from the distant labels, and the noisy labels among them begin to hinder model performance. At this point, we take the model prediction into account to gradually correct the noisy labels. More specifically, the label confidence at the current training step can be viewed as a combination of the preceding confidence and the current model prediction. Let the number of labels be $K+1$, including $K$ positive labels and an NA label, and let $\mathbf{c}^{(\tau)}$ denote the label confidence vector of $K+1$ dimensions at the $\tau$-th training step. The latent label $\hat{y}^{(\tau)}$ at the $\tau$-th training step is then determined as follows:

$\displaystyle\begin{aligned}\hat{y}^{(\tau)}&={\mathrm{arg}\,{\max}}\{\mathbf{c}^{(\tau)}_{0},\mathbf{c}^{(\tau)}_{1},\ldots,\mathbf{c}^{(\tau)}_{K}\}\\ \mathbf{c}^{(\tau)}&=H(\mathbf{c}^{(\tau-1)},\mathbf{p}^{(\tau)})\end{aligned}$ (1)

where $\mathbf{p}^{(\tau)}$ is the relation probability predicted by the neural network at the $\tau$-th training step, computed as described in Section 3.4. $H(\cdot)$ denotes a self-directed function, which models the correlation between label confidence scores at different training steps and memorizes the effect of latent labels from preceding training steps. This self-directed learning algorithm gradually corrects noisy labels during training and guides our model to explore a larger confidence space for better generality.

To be specific, we simply use a linear combination as the self-directed function, which reduces the effect of noisy labels while exploring the label confidence space. The confidence score $\mathbf{c}^{(\tau)}_{j}$ of the $j$-th label at the $\tau$-th training step is calculated iteratively as follows:

$\displaystyle\begin{aligned}\mathbf{c}^{(\tau)}_{j}&=H(\mathbf{c}^{(\tau-1)}_{j},\mathbf{p}^{(\tau)}_{j})=\frac{\alpha*\mathbf{c}^{(\tau-1)}_{j}+(1-\alpha)*\mathbf{p}^{(\tau)}_{j}}{\sum_{k=0}^{K}(\alpha*\mathbf{c}^{(\tau-1)}_{k}+(1-\alpha)*\mathbf{p}^{(\tau)}_{k})}\quad(\tau>0)\\ \mathbf{c}^{(0)}&=\mathbf{y}\end{aligned}$ (2)

The initial confidence $\mathbf{c}^{(0)}$ is set to the one-hot vector $\mathbf{y}$ of the distant label $y$, which retains the full information of distant supervision. The calculated label confidence is then normalized to keep it within a fixed range. $\alpha\in(0,1]$ is a self-directed factor that determines how much information about the preceding label prediction is memorized. The larger $\alpha$ is, the more slowly the model corrects the noisy labels. When $\alpha=1$, the model keeps using the distant labels and never learns a corrected latent label.
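As a minimal illustration of Eqs. (1) and (2), the following NumPy sketch applies the update to a single toy example. The function names, the three-class setting and the fixed prediction vector are illustrative assumptions, not part of our implementation; in training, $\mathbf{p}^{(\tau)}$ changes at every step.

```python
import numpy as np


def update_confidence(c_prev, p, alpha=0.5):
    """One step of Eq. (2): mix the previous confidence with the prediction, then normalize."""
    mixed = alpha * np.asarray(c_prev, dtype=float) + (1.0 - alpha) * np.asarray(p, dtype=float)
    return mixed / mixed.sum()


def latent_label(c):
    """Eq. (1): index of the largest confidence score (index 0 is NA)."""
    return int(np.argmax(c))


# Toy setting: NA plus two positive relations; the distant label is NA.
y_onehot = np.array([1.0, 0.0, 0.0])   # c^(0)
p_fixed = np.array([0.1, 0.8, 0.1])    # hypothetical model prediction, held fixed for simplicity

c = y_onehot
labels = []
for _ in range(3):
    c = update_confidence(c, p_fixed, alpha=0.5)
    labels.append(latent_label(c))

print(labels)  # latent label after each of the three steps
```

In this toy run the latent label stays at NA after the first step and switches to relation 1 from the second step on, while calling `update_confidence` with `alpha=1.0` leaves the distant label unchanged.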

As a whole, self-directed learning retains the memory of latent-label prediction during training. Because the confidence $\mathbf{c}^{(\tau)}$ is replaced by the normalized value of $\alpha*\mathbf{c}^{(\tau-1)}+(1-\alpha)*\mathbf{p}^{(\tau)}$ at each iteration, the preceding latent label $\hat{y}^{(\tau-1)}$ affects the prediction of the current latent label $\hat{y}^{(\tau)}$, which carries what has been learned from easy patterns over to harder patterns. More specifically, for easy patterns the high relation probability quickly confirms or overturns the initial distant label, so their latent labels are obtained early and remain stable during training. For hard patterns, the low prediction probabilities make the labels relatively difficult to modify in early training iterations. This simple iterative update ensures that our model gradually uses the predicted latent labels of easy patterns to serve the later latent-label prediction of hard patterns, which makes the self-directed learning process more reliable.
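One way to see why easy patterns are corrected before hard ones is to unroll Eq. (2). Assuming that $\mathbf{p}^{(\tau)}$ is a softmax output (its entries sum to one) and that $\mathbf{c}^{(0)}=\mathbf{y}$ is one-hot, the normalizer in Eq. (2) equals 1 at every step, and the update reduces to an exponential moving average:

```latex
\mathbf{c}^{(\tau)} = \alpha\,\mathbf{c}^{(\tau-1)} + (1-\alpha)\,\mathbf{p}^{(\tau)}
\quad\Longrightarrow\quad
\mathbf{c}^{(\tau)} = \alpha^{\tau}\,\mathbf{y} + (1-\alpha)\sum_{s=1}^{\tau}\alpha^{\tau-s}\,\mathbf{p}^{(s)}
```

As a simplified illustration, if the prediction were held fixed at $\mathbf{p}$, this would give $\mathbf{c}^{(\tau)}=\alpha^{\tau}\mathbf{y}+(1-\alpha^{\tau})\mathbf{p}$. The distant label $y$ is then replaced by class $j$ once $(1-\alpha^{\tau})\mathbf{p}_{j}>\alpha^{\tau}+(1-\alpha^{\tau})\mathbf{p}_{y}$, which a larger $\mathbf{p}_{j}$ satisfies at a smaller $\tau$.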

Further, the self-directed function $H(\cdot)$ is flexible and can be replaced by other computing methods. If external prior knowledge is available as partial supervision, we expect that more complex neural networks such as an RNN could be used to model the latent-label memory. To the best of our knowledge, this is the first self-directed confidence learning method for relation extraction, and it differs clearly from prior work. Experimentally, the self-directed learning process proves robust and corrects noisy labels effectively.

3.3 Discriminative learning

The goal of training a neural network is to maximize the probability of the correct class, which is generally achieved by minimizing the cross-entropy loss. Given an entity pair, the cross-entropy loss $l_{ce}(\mathbf{z},\mathbf{p})$ aims to minimize the following loss function:

$\displaystyle l_{ce}(\mathbf{z},\mathbf{p})=-\sum_{j=0}^{K}(\mathbf{z}_{j}\log% \mathbf{p}_{j})$ (3)

where $\mathbf{p}$ is the predicted relation probability of neural network, and $\mathbf{z}$ is the one-hot vector of the desired output, i.e., distant label $\mathbf{y}$ or latent label $\mathbf{\hat{y}}$ .

However, the cross-entropy loss penalizes the classification error of each class equally, which is not appropriate for learning discriminative features between relational and non-relational instances. In our case, the loss caused by confusing a positive label with the negative label deserves more attention, because negative instances account for a large proportion of our corpus. That is to say, the loss for positive labels is shared among positive instances, and the positive labels are not treated as entirely distinct classes. Thus, we employ a discriminative loss [13] by adding a new regularization term $l_{d}(\mathbf{z},\mathbf{p})$:

$\displaystyle l_{d}(\mathbf{z},\mathbf{p})=-\left[\mathbf{z}_{0}\log\mathbf{p}% _{0}+\sum_{j=1}^{K}(\mathbf{z}_{j}\log(1-\mathbf{p}_{0}))\right]$ (4)

where $\mathbf{p}_{0}$ is the probability of NA, and $(1-\mathbf{p}_{0})$ is the total probability of all positive labels. The regularization term can be regarded as the objective function of a binary classification problem (NA versus positive).

Let $\mathscr{T}$ be the training-step threshold after which our model is considered to predict relation patterns stably. In the early training steps $\tau<\mathscr{T}$, the distant label $\mathbf{y}$ is used as the ground truth, and the loss is given by:

$\displaystyle l\left(\mathbf{y},\mathbf{p}^{(\tau)}\right)=l_{ce}\left(\mathbf% {y},\mathbf{p}^{(\tau)}\right)+\beta*l_{d}\left(\mathbf{y},\mathbf{p}^{(\tau)}\right)$ (5)

where $\beta$ is the weight of the discriminative term. Once our model predicts stably, i.e., when $\tau\geqslant\mathscr{T}$, we use the latent label $\mathbf{\hat{y}}^{(\tau)}$ instead of the distant label, and the loss becomes:

$\displaystyle l\left(\mathbf{\hat{y}}^{(\tau)},\mathbf{p}^{(\tau)}\right)=l_{% ce}\left(\mathbf{\hat{y}}^{(\tau)},\mathbf{p}^{(\tau)}\right)+\beta*l_{d}\left% (\mathbf{\hat{y}}^{(\tau)},\mathbf{p}^{(\tau)}\right)$ (6)

The experimental results show that using discriminative loss can help to explore more discriminative features than cross-entropy loss function. It makes the latent-label learning more effective and increases model discriminative power.
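The effect of the discriminative term can be checked on a toy example. The NumPy sketch below implements Eqs. (3) to (6) for a single entity pair; the function names, the `eps` constant added for numerical safety and the toy probability vectors are illustrative assumptions.

```python
import numpy as np


def cross_entropy(z, p, eps=1e-12):
    """Eq. (3): -sum_j z_j log p_j for a one-hot target z."""
    return float(-np.sum(z * np.log(p + eps)))


def discriminative_term(z, p, eps=1e-12):
    """Eq. (4): binary NA-versus-positive loss; index 0 is NA."""
    is_positive = z[1:].sum()  # 1 if the target is any positive relation, else 0
    return float(-(z[0] * np.log(p[0] + eps) + is_positive * np.log(1.0 - p[0] + eps)))


def total_loss(z, p, beta=1.0):
    """Eq. (5)/(6): cross-entropy plus beta times the discriminative term."""
    return cross_entropy(z, p) + beta * discriminative_term(z, p)


z = np.array([0.0, 1.0, 0.0])             # target: positive relation 1
p_wrong_pos = np.array([0.1, 0.2, 0.7])   # confused with another positive relation
p_as_na = np.array([0.7, 0.2, 0.1])       # positive instance mistaken for NA

loss_wrong_pos = total_loss(z, p_wrong_pos)
loss_as_na = total_loss(z, p_as_na)
print(loss_wrong_pos < loss_as_na)
```

Both predictions assign the same probability to the target class, so their cross-entropy is identical; only the discriminative term separates them, penalizing the confusion with NA more heavily than the confusion between two positive relations.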

We train the model end to end, as shown in Algorithm 1. All model parameters are updated with the Adam optimizer [30].

Algorithm 1. Latent-label Denoising Algorithm
Input: Training set $\mathscr{D}$, training threshold $\mathscr{T}$
Output: Model parameters $\theta$
for epoch $\tau=1$ to $N$ do
    if $\tau<\mathscr{T}$ then
        Train the representation learning module with distant label $\mathbf{y}$ by Eq. (5), predict the current relation probability $\mathbf{p}^{(\tau)}$
    else
        if $\tau=\mathscr{T}$ then
            Initialize label confidence $\mathbf{c}^{(\mathscr{T})}=\mathbf{y}$
        end if
        Compute label confidence $\mathbf{c}^{(\tau)}=H\left(\mathbf{c}^{(\tau-1)},\mathbf{p}^{(\tau)}\right)$
        Compute latent label $\hat{y}^{(\tau)}={\mathrm{arg}\,{\max}}\left\{\mathbf{c}^{(\tau)}_{0},\mathbf{c}^{(\tau)}_{1},\ldots,\mathbf{c}^{(\tau)}_{K}\right\}$
        Train the representation learning module with latent label $\hat{y}^{(\tau)}$ by Eq. (6), predict the current relation probability $\mathbf{p}^{(\tau)}$
    end if
    Update $\theta$ by Adam optimizer
end for

3.4 Representation learning

As a key component of relation extraction, representation learning provides the basis for automatically extracting features from the relation mentions of entity pairs. The representation learning procedure is shown in Fig. 2 and consists of three parts: 1) a word and position encoder, which encodes the word and entity position information into basic representation units; 2) a sentence encoder, which uses different neural networks to learn a sentence representation; and 3) a relation-mention encoder, which combines the representations of multiple sentences through an attention mechanism to form the relation-mention representation.

3.4.1 Word and position encoder

In order to capture semantic contextual information of a word, pre-trained word embeddings [31, 32] are used to map a word into a real-valued vector. Besides, we indicate entity information with the widely used position embeddings [33, 34], which encode the relative distance between a word and each of the two entities as a vector respectively. Then the word embedding of the $i$-th word in the sentence and its corresponding two position embeddings are concatenated as a vector $\mathbf{w}_{i}$. Let the sentence length be $T$; the sentence can then be represented as a sequence $\{\mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{T}\}$.
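The concatenation described above can be sketched as follows. The random embedding tables stand in for pre-trained vectors, and the vocabulary size, the clipping range `max_dist` and the function name are illustrative assumptions; the embedding sizes follow Section 4.2.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, word_dim = 1000, 50   # 50-dimensional word vectors
max_dist, pos_dim = 60, 5         # 5-dimensional position vectors; max_dist is an assumed clipping range

word_emb = rng.normal(size=(vocab_size, word_dim))          # stand-in for pre-trained embeddings
pos_emb_e1 = rng.normal(size=(2 * max_dist + 1, pos_dim))   # one position table per entity
pos_emb_e2 = rng.normal(size=(2 * max_dist + 1, pos_dim))


def encode_sentence(token_ids, e1_idx, e2_idx):
    """Return a (T, word_dim + 2 * pos_dim) matrix whose i-th row is w_i."""
    token_ids = np.asarray(token_ids)
    positions = np.arange(len(token_ids))
    # Relative distance to each entity, clipped and shifted to a valid table index.
    d1 = np.clip(positions - e1_idx, -max_dist, max_dist) + max_dist
    d2 = np.clip(positions - e2_idx, -max_dist, max_dist) + max_dist
    return np.concatenate([word_emb[token_ids], pos_emb_e1[d1], pos_emb_e2[d2]], axis=1)


W = encode_sentence([12, 7, 256, 3, 99, 41], e1_idx=0, e2_idx=4)
print(W.shape)  # (6, 60)
```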

3.4.2 Sentence encoder

Many kinds of neural networks can be used to encode sentences into fixed-size vector representations, but there is no clear conclusion as to which performs best for our task. To verify the generality of our proposed approach, we compare three different neural networks for learning the sentence representations: the standard Convolutional Neural Network (CNN) [33], the Piecewise CNN (PCNN) [8] and the Bidirectional Long Short-Term Memory (BiLSTM) [35].

For the standard CNN, we use convolution filters with a tanh activation function to obtain the feature maps, and the output of the max-pooling layer is regarded as the sentence representation. In addition to CNN, we also experiment with the widely used PCNN architecture [8]. To capture structural and other latent information, each feature map of the convolution layer is divided into three pieces by the positions of the two entities. A piecewise max-pooling layer then selects the maximum value of each piece, instead of applying the traditional max-pooling operation over the whole sentence. The sentence representation is the concatenation of the three pieces.
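The difference between the two pooling schemes can be sketched with NumPy. The feature maps below are random stand-ins for the convolution output, and the exact segment boundaries (each entity token closes its segment) are an assumption made for illustration.

```python
import numpy as np


def max_pool(feature_maps):
    """Standard CNN pooling: max over all positions, (T, F) -> (F,)."""
    return feature_maps.max(axis=0)


def piecewise_max_pool(feature_maps, e1_idx, e2_idx):
    """PCNN pooling: max over each of three segments split at the entities, (T, F) -> (3F,)."""
    first, second = sorted((e1_idx, e2_idx))
    # Assumed boundaries: each entity token closes its segment.
    pieces = [
        feature_maps[: first + 1],
        feature_maps[first + 1 : second + 1],
        feature_maps[second + 1 :],
    ]
    # An empty segment (e.g. an entity at the sentence end) contributes zeros.
    pooled = [
        piece.max(axis=0) if len(piece) else np.zeros(feature_maps.shape[1])
        for piece in pieces
    ]
    return np.concatenate(pooled)


rng = np.random.default_rng(0)
feature_maps = np.tanh(rng.normal(size=(10, 230)))   # stand-in for tanh-activated conv output
cnn_repr = max_pool(feature_maps)
pcnn_repr = piecewise_max_pool(feature_maps, e1_idx=2, e2_idx=6)
print(cnn_repr.shape, pcnn_repr.shape)  # (230,) (690,)
```

With 230 feature maps, the PCNN sentence vector is three times as long as the CNN one, and taking the maximum over its three pieces recovers the standard max-pooling result.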

Basically, an LSTM uses several gates to control how the memory cell of the current word passes information to, or forgets information for, the next word in a sentence. Further, BiLSTM [35] considers both the past and future contexts of a word by combining a forward LSTM and a backward LSTM. For a sentence of $T$ words, we obtain a set of $T$ hidden representations $\mathbf{H}=\{\mathbf{h}_{1},\mathbf{h}_{2},\ldots,\mathbf{h}_{T}\}$ through BiLSTM. For $t\in\{1,2,\ldots,T\}$, the hidden representation $\mathbf{h}_{t}$ of the $t$-th word is computed by concatenating the left and right hidden state vectors:

$\displaystyle\begin{aligned}\overrightarrow{\mathbf{h}_{t}}&=\overrightarrow{\textit{LSTM}}(\mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{T})\\ \overleftarrow{\mathbf{h}_{t}}&=\overleftarrow{\textit{LSTM}}(\mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{T})\\ \mathbf{h}_{t}&=[\overrightarrow{\mathbf{h}_{t}},\overleftarrow{\mathbf{h}_{t}}]\end{aligned}$ (7)

To generate a fixed-size feature vector that does not depend on the varying sentence length $T$, a max-pooling mechanism [33] is employed to select the maximum value over the $T$ positions of $\mathbf{H}$. Finally, we use this feature vector as the sentence representation.
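The concatenation in Eq. (7) and the pooling step can be illustrated without a trained network. In the sketch below, random arrays stand in for the per-token outputs of the forward and backward LSTMs, and treating 230 as the per-direction hidden size is an assumption.

```python
import numpy as np


def bilstm_sentence_repr(h_forward, h_backward):
    """Concatenate per-token forward/backward states (Eq. (7)) and max-pool over positions."""
    H = np.concatenate([h_forward, h_backward], axis=1)   # (T, 2 * hidden)
    return H.max(axis=0)                                   # (2 * hidden,)


rng = np.random.default_rng(0)
hidden = 230   # assumed to be the per-direction size

# Random stand-ins for LSTM outputs of a 6-word and a 15-word sentence.
short = bilstm_sentence_repr(rng.normal(size=(6, hidden)), rng.normal(size=(6, hidden)))
long = bilstm_sentence_repr(rng.normal(size=(15, hidden)), rng.normal(size=(15, hidden)))
print(short.shape, long.shape)  # (460,) (460,)
```

Both sentences map to vectors of the same size, which is what allows sentences of different lengths to be combined in the next layer.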

3.4.3 Relation-mention encoder

In this layer, to reduce the impact of noisy sentences within the relation mention of an entity pair, a selective attention mechanism [9] is applied to assign higher weights to more valid sentences. More specifically, we compute the weighted average vector $\mathbf{m}$ of all sentence representations for an entity pair, where the weight $\alpha_{i}$ of the $i$-th sentence is calculated as below:

$\displaystyle\begin{aligned}\mathbf{m}&=\sum_{i}\alpha_{i}\mathbf{s}_{i}\\ \alpha_{i}&=\frac{\exp(\mathbf{s}_{i}\mathbf{Ar})}{\sum_{k}\exp(\mathbf{s}_{k}\mathbf{Ar})}\end{aligned}$ (8)

where $\mathbf{s}_{i}$ is the representation of the $i$-th sentence, and $\mathbf{A}$ and $\mathbf{r}$ are the weight matrix and vector, respectively. Then we use a softmax function to model the predicted probability $\mathbf{p}_{j}$ of the $j$-th label as follows:

$\displaystyle\mathbf{p}_{j}=\frac{\exp(\mathbf{Wm}_{j}+\mathbf{b})}{\sum_{k}% \exp(\mathbf{Wm}_{k}+\mathbf{b})}$ (9)

where $\mathbf{W}$ and $\mathbf{b}$ are trainable parameters.
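Eqs. (8) and (9) can be sketched for one entity pair as follows. All parameters are random stand-ins, and the diagonal form of $\mathbf{A}$, the single query vector $\mathbf{r}$ and the dimensions are illustrative assumptions rather than a description of our implementation.

```python
import numpy as np


def softmax(x):
    x = x - x.max()   # shift for numerical stability
    e = np.exp(x)
    return e / e.sum()


def relation_mention_probs(S, A, r, W, b):
    """S: (n, d) sentence vectors of one entity pair. Returns (attention weights, label probabilities)."""
    scores = S @ A @ r        # s_i A r for every sentence, shape (n,)
    attn = softmax(scores)    # Eq. (8): attention weights
    m = attn @ S              # Eq. (8): weighted average, shape (d,)
    p = softmax(W @ m + b)    # Eq. (9): distribution over the K + 1 labels
    return attn, p


rng = np.random.default_rng(0)
n, d, num_labels = 4, 690, 53              # 4 sentences; 52 relations plus NA
S = rng.normal(size=(n, d))
A = np.diag(rng.normal(size=d))            # diagonal A is an assumption
r = rng.normal(size=d)
W = rng.normal(size=(num_labels, d)) * 0.01
b = np.zeros(num_labels)

attn, p = relation_mention_probs(S, A, r, W, b)
print(attn.shape, p.shape)  # (4,) (53,)
```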
4. Experiments

In this section, we evaluate our proposed approach and compare it with state-of-the-art distantly supervised RE methods. We also analyze the effects of the three learning modules in detail. Finally, we study some typical cases of the latent-label denoising method.

4.1 Dataset and evaluation

We evaluate our method on a popular benchmark dataset developed by Riedel et al. [14], which is widely used for distantly supervised RE task. The dataset was generated by aligning relational triples of Freebase with the sentences in New York Times (NYT) corpus, which contains 52 relation classes and a NA class. There are 522,611 sentences, 281,270 entity pairs and 18,252 relational triples in the training set, and 172,448 sentences, 96,678 entity pairs and 1,950 relational triples in the test set. Following previous work [9, 36, 6, 24], we employ the held-out evaluation with aggregate precision-recall curves and top N Precision (P@N) results.
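For readers unfamiliar with the held-out protocol, the two metrics can be computed as in the sketch below. The toy scores and correctness flags are made up for illustration, and the function names are hypothetical; a prediction counts as correct when the predicted fact appears among the held-out KB triples.

```python
import numpy as np


def precision_recall_curve(scores, is_correct, num_gold):
    """Rank predicted facts by score; return precision and recall at every cut-off."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.cumsum(np.asarray(is_correct, dtype=float)[order])
    ranks = np.arange(1, len(order) + 1)
    return hits / ranks, hits / num_gold


def precision_at_n(scores, is_correct, n):
    """P@N: fraction of the n highest-scoring predictions that match a held-out fact."""
    order = np.argsort(-np.asarray(scores, dtype=float))[:n]
    return float(np.mean(np.asarray(is_correct, dtype=float)[order]))


# Toy predictions: confidence score and whether the fact is in the held-out triples.
scores = [0.95, 0.90, 0.80, 0.60, 0.40]
is_correct = [1, 1, 0, 1, 0]
precision, recall = precision_recall_curve(scores, is_correct, num_gold=4)
p_at_3 = precision_at_n(scores, is_correct, n=3)
print(precision, recall, p_at_3)
```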

4.2 Parameter settings

We adopt cross-validation to determine the model parameters. We use the 50-dimensional pre-trained word embeddings released by Lin et al. [9] and 5-dimensional position embeddings. The window size and the number of feature maps for CNN and PCNN are set to 3 and 230, respectively. For BiLSTM, the number of hidden units is set to 230. The self-directed factor $\alpha$ of the confidence learning module and the weight $\beta$ of the discriminative learning module are set to 0.5 and 1, respectively. Dropout [37] is used to avoid overfitting; it is applied to the output of the max-pooling layer for CNN and PCNN, and to the non-recurrent connections for BiLSTM [38]. We set the dropout rate to 0.5 and the $l_{2}$ constraint to 0.0001. In addition, we set the learning rate of the Adam optimizer [30] to 0.001 and the batch size to 50. We switch to latent labels after 5 epochs. In the test phase, the distant labels are still used as the reference labels for evaluation.
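For convenience, the settings above can be gathered in a single configuration object. The sketch below is only a bookkeeping aid; the class and field names are hypothetical, while the values are those reported in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HyperParams:
    word_dim: int = 50            # pre-trained word embedding size
    pos_dim: int = 5              # position embedding size
    window_size: int = 3          # CNN / PCNN convolution window
    num_filters: int = 230        # CNN / PCNN feature maps
    lstm_hidden: int = 230        # BiLSTM hidden units
    alpha: float = 0.5            # self-directed factor in Eq. (2)
    beta: float = 1.0             # weight of the discriminative term in Eqs. (5) and (6)
    dropout: float = 0.5
    l2: float = 1e-4
    learning_rate: float = 1e-3   # Adam
    batch_size: int = 50
    latent_label_start_epoch: int = 5   # distant labels are used for the first 5 epochs

    @property
    def token_dim(self) -> int:
        """Size of one token vector: word embedding plus two position embeddings."""
        return self.word_dim + 2 * self.pos_dim


config = HyperParams()
print(config.token_dim)  # 60
```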

4.3 Results and discussion

In order to evaluate our proposed method, we compare it with a variety of methods. Mintz [5] is a traditional distantly supervised RE method, MultiR [15] introduces multi-instance learning, and MIMLRE [17] presents multi-instance multi-label learning. All of the above methods are based on feature engineering, and their results are taken from the original papers. PCNN [9] is regarded as the baseline of neural network based systems. PCNN $+$ soft-label [6] is a label denoising method used for comparison, and its results are reproduced following the original settings.

4.3.1 Comparison with previous methods

Figure 4 presents the precision-recall curves of our proposed PCNN $+$ latent-label method and the previous methods. When the recall is less than 0.5, our model achieves notably higher precision than all compared methods. Moreover, our model remains relatively stable over the whole range of recall, whereas the PCNN $+$ soft-label [6] model is unstable when the recall is less than 0.05 or greater than 0.4. Overall, our model clearly outperforms the state-of-the-art RE systems.

Table 1
Precisions of various methods for different recalls

Method	$R=0.1$	$R=0.2$	$R=0.3$	$R=0.4$	$R=0.5$	AP
Mintz [5]	0.399	0.286	0.168	–	–	0.106
MultiR [15]	0.609	0.364	–	–	–	0.126
MIMLRE [17]	0.607	0.338	–	–	–	0.120
PCNN	0.733	0.6	0.489	0.410	0.317	0.345
$+$ soft-label [6]	0.777	0.682	0.553	0.39	0.225	0.331
$+$ latent-label	0.859	0.757	0.607	0.444	0.333	0.374
CNN	0.684	0.582	0.495	0.405	0.328	0.341
$+$ latent-label	0.799	0.734	0.65	0.507	0.348	0.354
BiLSTM	0.699	0.587	0.502	0.415	0.334	0.354
$+$ latent-label	0.813	0.728	0.618	0.501	0.321	0.371

Table 2

Top N precision (P@N) for relation extraction in the entity pairs with different number of sentences

Settings	One				Two				All
P@N	100	200	300	Mean	100	200	300	Mean	100	200	300	Mean
CNN	0.77	0.715	0.65	0.712	0.76	0.715	0.667	0.714	0.77	0.74	0.7	0.737
$+$ latent-label	0.81	0.785	0.757	0.784	0.84	0.815	0.777	0.811	0.86	0.83	0.807	0.832
PCNN	0.76	0.68	0.627	0.689	0.79	0.735	0.663	0.729	0.79	0.735	0.68	0.735
$+$ soft-label [6]	0.81	0.74	0.667	0.739	0.87	0.78	0.713	0.788	0.85	0.82	0.743	0.804
$+$ latent-label	0.86	0.825	0.717	0.801	0.91	0.835	0.773	0.839	0.89	0.86	0.817	0.856
BiLSTM	0.79	0.69	0.633	0.704	0.81	0.725	0.663	0.733	0.78	0.735	0.7	0.738
$+$ latent-label	0.84	0.795	0.72	0.785	0.86	0.825	0.78	0.822	0.87	0.845	0.813	0.843

Figure 4.

Performance comparison of various methods with precision-recall curves.

Table 1 shows the precision of various models at recalls $R=$ 0.1/0.2/0.3/0.4/0.5, together with the Average Precision (AP), which corresponds to the area under the precision-recall curve. Adding latent-label raises the AP of the CNN, PCNN and BiLSTM based models (from 0.341 to 0.354, from 0.345 to 0.374 and from 0.354 to 0.371, respectively), and PCNN $+$ latent-label also exceeds the soft-label method [6]. In particular, combining the CNN baseline with latent-label yields higher precision than the PCNN and BiLSTM baselines at recall $R=$ 0.3/0.4/0.5. The improvement for BiLSTM verifies that the proposed label denoising method generalizes to RNN based models. The AP of the PCNN $+$ soft-label method [6] is low because its performance drops sharply when the recall is greater than 0.4, which also indicates its instability. On the whole, PCNN $+$ latent-label achieves the best AP among all models.
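
The two quantities reported in Table 1 can be illustrated with a short sketch. It assumes the common step-wise definition of average precision (the mean of the precision values at the ranks of correct predictions, divided by the number of held-out facts); the exact integration used to produce Table 1 is not specified here, so this definition is an assumption. The function names and the toy ranking are hypothetical.

```python
from typing import List, Optional


def average_precision(hits: List[bool], num_facts: int) -> float:
    """Step-wise area under the precision-recall curve.

    hits[k] is True if the (k+1)-th ranked prediction matches a held-out fact.
    """
    correct = 0
    total = 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            correct += 1
            total += correct / rank  # precision at this rank
    return total / num_facts


def precision_at_recall(hits: List[bool], num_facts: int, recall: float) -> Optional[float]:
    """Precision at the first rank where the target recall is reached, else None."""
    correct = 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            correct += 1
        if correct / num_facts >= recall:
            return correct / rank
    return None  # the ranking never reaches this recall level


# Toy ranking (hypothetical): 5 predictions evaluated against 4 held-out facts.
toy_hits = [True, False, True, True, False]
toy_ap = average_precision(toy_hits, num_facts=4)
toy_p_at_half = precision_at_recall(toy_hits, num_facts=4, recall=0.5)
toy_p_at_full = precision_at_recall(toy_hits, num_facts=4, recall=1.0)
```

In this sketch a ranking that never reaches a given recall yields `None`; this may correspond to the "–" entries of Table 1, although that reading is an assumption.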

Following previous work [9, 6], we evaluate top-N precision (P@N) under the One, Two and All settings. In these settings, we randomly choose one, two or all sentences, respectively, for each test entity pair that has more than one sentence. Table 2 shows the top 100/200/300 results and their mean values for each setting. Combining latent-label effectively improves the baselines of all three neural models, and the PCNN-based denoising model obtains the best mean P@N in every setting.
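
The One/Two/All protocol and the P@N measure can be sketched as follows. This is an illustration under stated assumptions: the helper names `select_sentences` and `precision_at_n`, the fixed random seed and the toy bags are hypothetical and are not part of the original evaluation code.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (head entity, tail entity)


def select_sentences(
    bags: Dict[Pair, List[str]], setting: str, seed: int = 0
) -> Dict[Pair, List[str]]:
    """Keep entity pairs with more than one sentence and sample per the setting."""
    rng = random.Random(seed)
    k = {"one": 1, "two": 2, "all": None}[setting]
    selected = {}
    for pair, sentences in bags.items():
        if len(sentences) < 2:
            continue  # only pairs with more than one sentence are evaluated
        selected[pair] = list(sentences) if k is None else rng.sample(sentences, k)
    return selected


def precision_at_n(hits: List[bool], n: int) -> float:
    """Fraction of correct predictions among the n highest-ranked ones."""
    return sum(hits[:n]) / n


# Toy bags (hypothetical sentences).
toy_bags = {
    ("e1", "e2"): ["s1", "s2", "s3"],
    ("e3", "e4"): ["s4"],  # dropped: only one sentence
}
one = select_sentences(toy_bags, "one")
two = select_sentences(toy_bags, "two")
everything = select_sentences(toy_bags, "all")
```

Given a ranking of the predictions made on the selected sentences, `precision_at_n(hits, 100)` would correspond to the P@100 column of Table 2.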

Figure 5.

Precision-recall curves for different learning modules.

4.3.2 Effects of different learning modules

Figure 5 presents the precision-recall curves for our learning modules: Confidence Learning (CL), Discriminative Learning (DL) and Representation Learning (RL). The RL module is based on the CNN, PCNN or BiLSTM model. The latent-label mechanism, which combines the CL and DL modules, clearly improves all RL modules. Specifically, adding only the CL module to PCNN gives higher precision than the PCNN $+$ soft-label model when the recall is greater than 0.3. However, we expect that PCNN $+$ CL would perform better with a more fine-grained choice of the self-directed factor $\alpha$, because the static confidence of the PCNN $+$ soft-label model is carefully selected and different weights are set for the NA and positive classes to model the discriminative features. For each RL module, adding either the CL or the DL module enhances the performance of the baseline to some extent. In the following, we discuss the effects of the CL and DL modules in detail.

The left part of Fig. 6 depicts the precision-recall curves of the PCNN model at different training epochs, from which we can verify the impact of the CL module. Since latent labels are used after 5 epochs, we analyze the later 5 epochs, $e=$ 5/6/7/8/9. At epoch $e=$ 6, the latent labels are in fact still identical to the distant labels. For $e=$ 7/8/9, our model corrects 3,067/4,850/6,053 distant labels, respectively. The left part of Fig. 6 shows that the model performs better when $e>$ 6, which demonstrates the effectiveness of the CL module. As training proceeds further, the performance may become worse because the influence of distant supervision gradually decreases and extra noise may be introduced by special hard patterns. These results indicate that label corrections become uncontrollable when the distant labels are ignored too much. At the same time, exploring the confidence space allows us to obtain the best model performance.
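
The number of corrected labels at a given epoch is simply the number of instances whose latent label differs from the distant label. The sketch below counts these differences and breaks them down by type; the function name `correction_types`, the category names and the toy labels are hypothetical and serve only to illustrate the bookkeeping.

```python
from collections import Counter
from typing import List


def correction_types(distant: List[str], latent: List[str], na: str = "NA") -> Counter:
    """Count instances whose latent label differs from the distant label, by type."""
    if len(distant) != len(latent):
        raise ValueError("label lists must have the same length")
    counts: Counter = Counter()
    for d, l in zip(distant, latent):
        if d == l:
            continue  # label unchanged, not a correction
        if l == na:
            counts["positive_to_NA"] += 1        # suspected false positive removed
        elif d == na:
            counts["NA_to_positive"] += 1        # suspected false negative recovered
        else:
            counts["positive_to_positive"] += 1  # relation replaced by another relation
    return counts


# Toy labels (hypothetical instances).
toy_distant = ["location_contains", "place_lived", "NA", "place_lived", "company"]
toy_latent = ["NA", "place_of_death", "company", "place_of_birth", "company"]
toy_counts = correction_types(toy_distant, toy_latent)
toy_total = sum(toy_counts.values())
```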

Figure 6.

Precision-recall curves of different training epochs for PCNN and PCNN $+$ DL model.

If we add the DL module, as in the right part of Fig. 6, the model corrects 2,710/5,876/7,683 distant labels for the same epochs. We find that the epoch with the fewest corrections of distant labels performs best. As the situation is similar for the other neural networks, the DL module appears to accelerate the process of label correction. As training proceeds, adding the DL module improves the performance considerably. The self-directed denoising process ensures that the latent labels of relatively easy relational patterns are exploited to gradually correct the labels of hard patterns; a case study is given in Section 4.3.3.

To verify whether DL alleviates the misclassification between negative and positive classes, Table 3 reports the number of wrong predictions (false positives/false negatives) among the top N extracted relation mentions. For the top 100/500/1,000/2,000 ranked relation mentions, the numbers of false positives and false negatives are both reduced when the DL module is added to the PCNN model. Here we ignore wrong predictions between positive classes, because discriminative learning concerns the distinction between NA and positive classes. The number of wrong predictions decreases further when DL is added to the PCNN $+$ CL model, which demonstrates its ability to discriminate between NA and positive labels.

Table 3

The number of wrong predictions (false positives/false negatives) for extracting top N relation mentions

Top N	100	500	1,000	2,000
PCNN	25/1,875	173/1,623	478/1,428	1,188/1,138
$+$ DL	10/1,860	147/1,597	428/1,378	1,183/1,133
$+$ CL	17/1,867	153/1,603	425/1,377	1,174/1,139
$+$ CL $+$ DL	9/1,860	112/1,566	390/1,347	1,145/1,112
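
The counting procedure behind Table 3 is not spelled out in detail, so the following sketch shows one plausible reading and should be taken as an assumption: a false positive is a top-N prediction for an entity pair whose distant label is NA, a false negative is a gold fact not recovered within the top N, and confusions between two positive classes are ignored. The function name and the single-label `gold` mapping are hypothetical simplifications. Under this reading, when no top-N prediction confuses two positive classes, N − FP + FN equals the number of gold facts; the PCNN row is consistent with this (for example, 100 − 25 + 1,875 = 1,950, the number of test triples).

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (head entity, tail entity)


def count_na_errors(
    ranked: List[Tuple[Pair, str]], gold: Dict[Pair, str], n: int
) -> Tuple[int, int]:
    """Return (false positives, false negatives) for the top-n ranked predictions.

    ranked: (entity pair, predicted positive relation), sorted by descending confidence.
    gold:   distant label of every entity pair that has a non-NA label.
    """
    false_pos = 0
    true_pos = 0
    for pair, relation in ranked[:n]:
        if pair not in gold:
            false_pos += 1  # predicted a relation for a pair labeled NA
        elif gold[pair] == relation:
            true_pos += 1
        # otherwise: confusion between two positive classes, ignored here
    false_neg = len(gold) - true_pos  # gold facts not recovered in the top n
    return false_pos, false_neg


# Toy data (hypothetical).
toy_gold = {("a", "b"): "r1", ("c", "d"): "r2", ("e", "f"): "r1"}
toy_ranked = [
    (("a", "b"), "r1"),  # true positive
    (("x", "y"), "r2"),  # false positive (pair is NA)
    (("c", "d"), "r1"),  # positive-positive confusion, ignored
    (("e", "f"), "r1"),  # true positive
]
top3 = count_na_errors(toy_ranked, toy_gold, 3)
top4 = count_na_errors(toy_ranked, toy_gold, 4)
```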

4.3.3 Case study

We manually examine the latent-label corrections for 200 random relation mentions of our PCNN $+$ latent-label model. The correction accuracy in the first 3 training epochs is 86%, 84.5% and 81%, respectively, which shows that our approach achieves label denoising with high precision. In addition, Table 4 presents some typical cases of latent-label corrections.

Table 4
Case study

Type	Distant label	Latent label	Relation mention
Right corrections	location_contains	NA	Iverson has indicated that a trade to Minnesota, where he would be with forward Kevin Garnett, would be attractive to …
	place_lived	place_of_death	Shohei Imamura, one of the most significant filmmakers of Japan’s postwar generation, …, died yesterday in Tokyo.
	NA	company	Climate change is a six-and-a-half-gigaton problem, Tom Arnold, chief executive of TerraPass, said.
	place_lived	place_lived $\rightarrow$ place_of_birth $\rightarrow$ NA	…, it may have seemed a comedown after working together for years at OMA, Rem Koolhaas’s big firm in Rotterdam.
Wrong corrections	company	NA	Finally, Rolling Stone had sent Annie Leibovitz to photograph us.
	NA	location_contains	A variation of the Schengen approach in North America deserves a serious look.

First, we focus on the right label corrections in the upper part of Table 4. Some false positives can be corrected by latent labels, such as the distant labels location_contains and place_lived. In the first case, the KB contains a wrongly labeled triple location_contains (Iverson, Minnesota) because Iverson and Minnesota are both regarded as locations, although they have no direct relation. For the entity pair (Shohei Imamura, Tokyo), the sentence fails to express the place_lived relation, and the latent label corrects the relation to place_of_death according to the relation pattern “A …died …in B”. Latent labels can also correct false negatives. For example, although the relation between Tom Arnold and TerraPass is missing from the KB, we can discover their latent relation company based on the patterns mentioned earlier. The noisy labels of these examples are corrected immediately, within the first few training epochs, and we treat their relation patterns as easy patterns.

For the entity pair (Rem Koolhaas, Rotterdam), the sentence clearly does not express the relation place_lived. In this case, the latent label is corrected gradually owing to the self-directed learning mechanism. At first, the latent label remains the same as the distant label. The relation is then corrected to place_of_birth in the next training epoch, and is finally identified as NA. The relation patterns of such examples, whose latent labels must be learned gradually, are relatively special and are regarded as hard patterns. This case shows that the proposed self-directed learning is effective for latent-label denoising.

As shown in the lower part of Table 4, the few wrong corrections mainly come from the strong pattern-consistency assumption. In practice, the contextual patterns of relation mentions are very diverse, so recognizing special patterns and distinguishing similar patterns is a big challenge. For instance, because the relation mention of the entity pair (Rolling Stone, Annie Leibovitz) has a different pattern from the common patterns of company in Fig. 1, our model fails to identify the relation. In the case of the entity pair (Schengen, North America), our model matches the mention to the pattern “A in B” and regards both entities as locations, and thus wrongly corrects the relation to location_contains. Fortunately, wrong corrections are relatively few compared with right corrections. On the whole, our approach remains effective and robust.

5. Conclusions

In this paper, we propose a latent-label denoising method for distantly supervised relation extraction with self-directed confidence learning. This mechanism exploits the latent labels obtained for relatively easy patterns to predict the later latent labels of hard patterns, and also performs dynamic exploration of the confidence space. In addition, we use a discriminative loss to reduce the misclassification between NA and positive labels, which greatly benefits the heuristic latent-label prediction. Experimentally, we verify the generality of the proposed label denoising method on PCNN, CNN and BiLSTM based models. The results and the case study demonstrate that our approach outperforms the state-of-the-art systems on the popular evaluation dataset and corrects noisy labels effectively.

In the future, we plan to utilize prior knowledge from external knowledge sources or manual annotations to measure the label confidence of each relation mention. Moreover, we will explore latent multi-label learning.

Acknowledgments

This work is supported by National Natural Science Foundation of China under Grant No. 61421061, No. 61602048, No. 61601046 and No. 61520106007, Project of Hainan Passenger Behavior Intelligence Analysis Platform and Precise Service Mining Prediction under Grant No. ZDKJ 201808, and BUPT-SICE Excellent Graduate Students Innovation Funds, 2016.

References

1. Dong, Gabrilovich, Heitz, Horn, Lao, Murphy, Strohmann, Sun and Zhang, Knowledge vault: A web-scale approach to probabilistic knowledge fusion, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2014, pp. 601–610.

2. Luan, Ostendorf and Hajishirzi, Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction, 2018, 3219–3232.

3. Fader, Zettlemoyer and Etzioni, Open question answering over curated and extracted knowledge bases, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2014, pp. 1156–1165.

4. Reddy, Feng, Huang and Zhao, Question answering on freebase via relation extraction and textual evidence, in: Proceedings of ACL, 2016, pp. 2326–2336.

5. Mintz, Bills, Snow and Jurafsky, Distant supervision for relation extraction without labeled data, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, Association for Computational Linguistics, 2009, pp. 1003–1011.

6. Liu, Wang, Chang and Sui, A soft-label method for noise-tolerant distantly supervised relation extraction, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1791–1796.

7. Sun, Zhang and Ji, Factors impacting the label denoising of neural relation extraction, in: Proceedings of the 12th International Conference on Algorithmic Aspects in Information and Management (AAIM), Springer, 2018, pp. 12–23.

8. Zeng, Liu, Chen and Zhao, Distant supervision for relation extraction via piecewise convolutional neural networks, in: EMNLP, 2015, pp. 1753–1762.

9. Lin, Shen, Liu, Luan and Sun, Neural relation extraction with selective attention over instances, in: ACL, 2016.

10. Liu and Zhao, Distant supervision for relation extraction with sentence-level attention and entity descriptions, in: AAAI, 2017, pp. 3060–3066.

11. Kumar M.P., Packer and Koller, Self-paced learning for latent variable models, in: Advances in Neural Information Processing Systems, 2010, pp. 1189–1197.

12. Jiang, Meng, Zhao, Shan and Hauptmann A.G., Self-paced curriculum learning, in: AAAI, Vol. 2, 2015, p. 6.

13. Shen, Wang, Bai and Zhang, Deepcontour: A deep convolutional feature learned by positive-sharing loss for contour detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3982–3991.

14. Riedel, Yao and McCallum, Modeling relations and their mentions without labeled text, Machine learning and knowledge discovery in databases, 2010, 148–163.

15. Hoffmann, Zhang, Ling, Zettlemoyer and Weld D.S., Knowledge-based weak supervision for information extraction of overlapping relations, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, Association for Computational Linguistics, 2011, pp. 541–550.

16. Ritter, Zettlemoyer, Etzioni et al., Modeling missing data in distant supervision for information extraction, Transactions of the Association for Computational Linguistics 1 (2013), 367–378.

17. Surdeanu, Tibshirani, Nallapati and Manning C.D., Multi-instance multi-label learning for relation extraction, in: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Association for Computational Linguistics, 2012, pp. 455–465.

18. Qin and Wang W.Y., Dsgan: Generative adversarial training for distant supervision relation extraction, arXiv preprint arXiv:1805.09929, 2018.

19. Zeng, Liu and Zhao, Large scaled relation extraction with reinforcement learning, in: AAAI, Vol. 2, 2018, p. 3.

20. Feng, Huang, Zhao, Yang and Zhu, Reinforcement learning for relation classification from noisy data, in: Proceedings of AAAI, 2018.

21. Qin and Wang W.Y., Robust distant supervision relation extraction via deep reinforcement learning, arXiv preprint arXiv:1805.09927, 2018.

22. Jat, Khandelwal and Talukdar, Improving distantly supervised relation extraction using word and entity based attention, arXiv preprint arXiv:1804.06987, 2018.

23. Vashishth, Joshi, Prayaga S.S., Bhattacharyya and Talukdar, Reside: Improving distantly-supervised neural relation extraction using side information, in: EMNLP, 2018, pp. 1257–1266.

24. Luo, Feng, Wang, Zhu, Huang, Yan and Zhao, Learning with noise: enhance distantly supervised relation extraction with dynamic transition matrix, arXiv preprint arXiv:1705.03995, 2017.

25. Bengio, Louradour, Collobert and Weston, Curriculum learning, in: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 41–48.

26. Kukar, Kononenko et al., Cost-sensitive learning with neural networks, in: ECAI, 1998, pp. 445–449.

27. Khan S.H., Bennamoun, Sohel and Togneri, Cost sensitive learning of deep feature representations from imbalanced data, arXiv preprint arXiv:1508.03422, 2015.

28. Lin T.-Y., Goyal, Girshick and Dollár, Focal loss for dense object detection, arXiv preprint arXiv:1708.02002, 2017.

29. Weston, Bordes, Yakhnenko and Usunier, Connecting language and knowledge bases with embedding models for relation extraction, arXiv preprint arXiv:1307.7973, 2013.

30. Kingma and Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014.

31. Mikolov, Yih and Zweig, Linguistic regularities in continuous space word representations, in: HLT-NAACL, Vol. 13, 2013, pp. 746–751.

32. Pennington, Socher and Manning C.D., Glove: Global vectors for word representation, in: EMNLP, Vol. 14, 2014, pp. 1532–1543.

33. Collobert, Weston, Bottou, Karlen, Kavukcuoglu and Kuksa, Natural language processing (almost) from scratch, Journal of Machine Learning Research 12(Aug) (2011), 2493–2537.

34. Zeng, Liu, Lai, Zhou, Zhao et al., Relation classification via convolutional deep neural network, in: COLING, 2014, pp. 2335–2344.

35. Graves, Mohamed and Hinton, Speech recognition with deep recurrent neural networks, in: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, IEEE, 2013, pp. 6645–6649.

36. Jiang, Wang and Wang, Relation extraction with multi-instance multi-label convolutional neural networks, in: COLING, 2016, pp. 1471–1480.

37. Srivastava, Hinton G.E., Krizhevsky, Sutskever and Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15(1) (2014), 1929–1958.

38. Zaremba, Sutskever and Vinyals, Recurrent neural network regularization, arXiv preprint arXiv:1409.2329, 2014.
