Abstract
Distant Supervision is an approach that allows automatic labeling of instances, and it has been used in Relation Extraction. The main challenge of this task is handling instances with noisy labels (e.g., when two entities in a sentence are automatically labeled with an invalid relation). The approaches reported in the literature address this problem by employing noise-tolerant classifiers. However, introducing a noise reduction stage before the classification step can increase the macro precision values. This paper proposes an approach based on Adversarial Autoencoders for obtaining a new representation that allows noise reduction in Distant Supervision. The representation obtained using Adversarial Autoencoders reduces the intra-cluster distance compared with pre-trained embeddings and classic Autoencoders. Experiments demonstrated that, with the same classifier, the noise-reduced datasets yield macro precision values similar to those obtained on the original dataset while using fewer instances. For example, in one of the noise-reduced datasets, the macro precision improved by approximately 2.32% using 77% of the original instances. This suggests the validity of using Adversarial Autoencoders to obtain well-suited representations for noise reduction. Also, the proposed approach maintains the macro precision values of the original dataset while reducing the total number of instances needed for classification.
Introduction
Relation Extraction (RE) is concerned with “detecting and classifying predefined relationships between entities identified in-text” [15]. In RE, a text sentence is analyzed to retrieve two named entities of interest and a specific association between them. Often, there is a collection of associations of interest, with a rich set of entities participating in them. In the RE classification task, the classes are the different association labels, and classification consists of assigning the most likely label expressing the relation between the entities. Variants of RE may start from the raw text, from pre-extracted entities, or from a combination of both, but the one starting from pre-extracted entities is the purest form of the RE problem.
RE has been addressed in different ways [17], including Distant Supervision (DS) [13]. According to [17], “Distant supervision combines the benefits of semisupervised and unsupervised relation extraction approaches and leverages a knowledge base as a source of training data”. The main idea of DS is to label a dataset automatically by leveraging existing knowledge, generally in the form of knowledge bases [13]. This automatic labeling requires a heuristic or assumption to annotate the instances during dataset construction. There are two main assumptions in the literature. The first one was proposed by Mintz et al. [13] (we will call it the Mintz assumption), who assumed that “if two entities participate in a relation, any sentence that contains those two entities might express that relation”. Riedel et al. [16] relaxed this assumption (we will call it the Riedel assumption), assuming instead that “if two entities participate in a relation, at least one sentence that mentions these two entities might express that relation”. Both assumptions are inadequate in many cases, such as when no sentence expresses the relation, which introduces false positives (noise in the labels) in the annotation of the dataset. Often, a pair of entities in a sentence does not imply a relationship, or may express several relations concomitantly depending on the context, as depicted in Fig. 1.

This example shows a pair of entities whose mentions do not all express the same relation. Considering the founders relation, the first mention will be correctly labeled, while the second will not. Example reproduced from [21].
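The labeling heuristic behind the Mintz assumption can be sketched in a few lines. The snippet below is a minimal illustration and not the pipeline used to build any real dataset: the knowledge base `KB`, the toy sentences, and the function name `label_mintz` are hypothetical, and entity matching is reduced to a plain substring search.

```python
from typing import Dict, List, Tuple

# Hypothetical toy knowledge base: (entity 1, entity 2) -> relation.
KB: Dict[Tuple[str, str], str] = {
    ("Steve Jobs", "Apple"): "founders",
}


def label_mintz(
    sentences: List[str], kb: Dict[Tuple[str, str], str]
) -> List[Tuple[str, str, str, str]]:
    """Label every sentence that mentions both entities of a KB pair
    with the KB relation (Mintz-style assumption)."""
    labeled = []
    for sentence in sentences:
        for (e1, e2), relation in kb.items():
            # Assumption: an entity "occurs" if its name is a substring.
            if e1 in sentence and e2 in sentence:
                labeled.append((sentence, e1, e2, relation))
    return labeled


# Invented sentences, for illustration only.
sentences = [
    "Steve Jobs was a co-founder of Apple.",
    "Steve Jobs died the day before Apple presented a new phone.",
    "Apple released a new laptop.",
]

labeled = label_mintz(sentences, KB)
for sentence, e1, e2, relation in labeled:
    print(f"{relation}({e1}, {e2}) <- {sentence}")
```

Both sentences that mention the two entities receive the founders label, although only the first one expresses that relation. The second one is exactly the kind of noisy instance that a cleaning step or a noise-tolerant classifier has to handle.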
When the outcome of annotating a dataset by DS is later fed to a classification problem, noise in the labels might have a detrimental effect. Whether the noise in the labels arises from the failure of the assumption made by the DS labeling or from any other process is circumstantial. The origin of the noise is irrelevant to the classifier, but the classifier has to deal with it nonetheless. In general, a method is robust if it can operate (e.g., find the correct solution) in the presence of noise and/or outliers. Still, it is worth noting that robustness is never universal, and all robust methods have a critical limit of noise that they can tolerate before failing. Regardless, several works reported in the literature achieve a certain tolerance to noise by combining the Riedel assumption with Deep Neural Networks [9, 20–22]. However, perhaps the most obvious way to improve the performance of the classifiers is to apply cleaning methods in a previous step. Such methods alleviate the presence of noise in the class labels [4], with the additional benefit that they can always be combined with robust classifiers to achieve a good classification. Besides, defining a separate task for explicitly cleaning the dataset can yield cleaned datasets that are useful for purposes other than classification.
The main contribution of this paper is a new data representation for noise reduction in DS using Adversarial Autoencoders. The proposed data representation allows obtaining datasets with less noise, which should improve the macro precision. As a direct consequence, once an instance is classified with a relation, it is more likely to be correct. To validate this hypothesis, we use the noise-tolerant classifier BGWA (BiGRU-based word attention model) [9] to measure the macro precision on the new datasets obtained.
Nowadays, there are many noise-tolerant methods in DS [9, 20–22]. One of the earliest approaches based on Deep Neural Networks (DNN) was the Piecewise Convolutional Neural Network (PCNN) proposed by Zeng et al. [21]. This network builds bags of instances from the entity pairs, and a bag is considered correct if at least one of the sub-networks labels it positively (Multi-instance Learning). In other DNN-based approaches [9, 22], different attention mechanisms have been incorporated to deal with noise. Examples are sentence-level attention [10], word-level attention to dynamically highlight important parts of the sentence [22], attention over words to identify key phrases (BGWA) [9], and intra-bag and inter-bag attention [20].
In addition, noisy labels are also frequently dealt with using data cleaning methods [4]. In principle, they may be evident to a potential subsequent classification exercise, but when classification follows an assessment by wrapping, it often guides the cleaning exercise. Depending on how conservative they are, data cleaning methods can eliminate too few or too many instances, thus reducing the performance of the potential subsequent classifiers [12]. Brodley and Friedl [2] argue that it is preferable to eliminate several correctly labeled instances than to keep instances with noisy labels. However, this is only possible or convenient when acquiring instances is cheap and instances are abundant, which might not always be the case. Moreover, cleaning or filtering labels with current methods does not guarantee the total elimination of noise. Complete elimination of noise is certainly achievable, even if silly: it suffices to remove the annotation from every instance. However, such an extreme approach has no practical application for obvious reasons. Therefore, in practice, a compromise between confidence intervals and the false-positive acceptance rate ought to be sought. According to [14], Autoencoders (AE) can be used when there are noisy instances.
Sentence embeddings
Sentence Embeddings (SE), like Word Embeddings (WE), are real-valued vectors that encode the semantic meaning, in this case, of the complete sentence, distributed over k dimensions [3]. As with WE, there are pre-trained models [3].
Pre-trained SE proposed by [3]: Two models are presented that are optimized for texts longer than a single word, such as sentences or paragraphs. Given a text, these models return its corresponding vector. One model is based on Transformers and offers greater accuracy at the cost of greater resource consumption and a more complex model [19]. The other model is based on Deep Averaging Networks and has lower accuracy but higher efficiency [8]. The models were trained using sources from Wikipedia, web news, question-and-answer pages, and discussion forums. In this work, we will refer to these SE as TRANSF and DAN, respectively.
Pre-trained SE proposed by [1]: An architecture is proposed to learn joint multilingual sentence representations for 93 languages. This architecture uses a language-agnostic BiLSTM encoder to build the SE, coupled with an auxiliary decoder and trained on parallel corpora. The authors named this SE LASER.
Autoencoders
Autoencoders (see Fig. 2a) are models that project the input into the output [5]. Perhaps the most classical AE is Principal Component Analysis (PCA). An AE transforms the input data $X$ to a latent space $Z$ using a function $f$ (the encoder), and returns from $Z$ using a decoder function $g$. If $f$ is invertible ($\exists f^{-1} : g = f^{-1}$), the recovery will be errorless. Otherwise, the goal is to reconstruct the input while minimizing the reconstruction error $L = \|X - g(f(X))\|^2$, or some variant of $L$ (e.g., regularized, generalized projections, etc.).
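As a small worked illustration of the reconstruction error L, the sketch below contrasts an invertible encoder, whose decoder is its exact inverse so that L is zero, with a lossy encoder that drops a coordinate. The functions `f_inv`, `g_inv`, `f_lossy`, and `g_lossy` are toy assumptions chosen for clarity, not learned networks.

```python
from typing import Callable, List

Vector = List[float]


def reconstruction_error(
    x: Vector,
    f: Callable[[Vector], Vector],
    g: Callable[[Vector], Vector],
) -> float:
    """Squared Euclidean norm of x - g(f(x))."""
    x_hat = g(f(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


# Invertible case: f scales by 2 and g = f^{-1} scales by 1/2.
def f_inv(x: Vector) -> Vector:
    return [2.0 * v for v in x]


def g_inv(z: Vector) -> Vector:
    return [v / 2.0 for v in z]


# Lossy case: f keeps only the first coordinate (a 1-dimensional latent
# space) and g pads the missing coordinate with zero.
def f_lossy(x: Vector) -> Vector:
    return x[:1]


def g_lossy(z: Vector) -> Vector:
    return [z[0], 0.0]


x = [3.0, 4.0]
err_inv = reconstruction_error(x, f_inv, g_inv)        # 0.0
err_lossy = reconstruction_error(x, f_lossy, g_lossy)  # 16.0
print(err_inv, err_lossy)
```

In the lossy case the error equals the squared value of the discarded coordinate, which is the quantity that training an AE tries to keep small over the whole dataset.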
Adversarial AE (AAE) (see Fig. 2b) are a particular kind of AE coupled with a Generative Adversarial Network (GAN) [6]. An AAE is trained to fulfill two objectives: (1) minimizing the error L in the reconstruction of the input, while (2) fitting the vectors of the latent space Z to a previously known distribution [11]. AAE can be used for semi-supervised classification, unsupervised clustering, dimensionality reduction, and data visualization [11]. Furthermore, adjusting the representation to a known distribution allows us to detect instances that lie far from this distribution and to consider them noisy.

General architecture of the AE and AAE (reproduced and adapted from https://deepnotes.io/deep-clustering respectively).
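The idea of treating instances that lie far from the rest of the latent distribution as noisy can be sketched as follows. This is only an illustration under stated assumptions: the latent vectors are invented, distance is measured as the cosine distance to the cluster centroid, and the threshold value and the helper names (`cosine_distance`, `centroid`, `flag_noisy`) are hypothetical. It is not the cleaning function defined in this work.

```python
import math
from typing import List

Vector = List[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def flag_noisy(latents: List[Vector], threshold: float) -> List[bool]:
    """Flag vectors whose cosine distance to the centroid exceeds the
    threshold (hypothetical criterion, for illustration only)."""
    c = centroid(latents)
    return [cosine_distance(z, c) > threshold for z in latents]


# Invented latent vectors for one relation; the last one points away
# from the others and plays the role of a noisy instance.
latents = [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0], [-1.0, 0.5]]
flags = flag_noisy(latents, threshold=0.5)
print(flags)
```

With these toy values only the last vector is flagged. In practice the threshold controls how conservative the cleaning is, which is the compromise between removing too few and too many instances discussed earlier.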
Let:
A set of sentences
A set of labels (relations)
A set of observations
A partition over
The problem of classification in RE is given a sentence
A partition over the training set
We define an encoder for each
We define a cleaning function for each

Overview of the methodology proposed in this work.
The functions $\mathrm{encoder}_j$ used for obtaining the new representations are:
f_laser, f_ae_laser, and f_aae_laser: f_laser uses only the LASER embeddings, while f_ae_laser and f_aae_laser use them as input for the AE (Fig. 4a) and the AAE (Fig. 4b), respectively.
f_dan, f_ae_dan, and f_aae_dan: f_dan uses only the DAN embeddings, while f_ae_dan and f_aae_dan use them as input for the AE (Fig. 4a) and the AAE (Fig. 4b), respectively.
f_transf, f_ae_transf, and f_aae_transf: f_transf uses only the TRANSF embeddings, while f_ae_transf and f_aae_transf use them as input for the AE (Fig. 4a) and the AAE (Fig. 4b), respectively.

Architectures used in this work.
The AE and AAE architectures are composed of two dense layers (1000 units and a ReLU-like activation function), both in the encoder $f_\theta$ and the decoder $g_\theta$ (see Fig. 4a and 4b). The input is a vector representing all the sentences in the text. This vector is obtained using an available sentence representation such as the LASER [1], DAN [3], or TRANSF [3] pre-trained embeddings. Our proposal for $\mathrm{encoder}_j$ is the AAE, under the assumption that an observation $(s_i, r_j)$ whose relation $r_j$ is noisy will not fit the distribution of the rest of the observations and will remain far away from it. The discriminator input is one-third of each
We obtained the following sets:
base_laser, ae_laser, and aae_laser: We obtain these sets using f_laser, f_ae_laser, and f_aae_laser, respectively, as the $\mathrm{encoder}_j$ function.
base_dan, ae_dan, and aae_dan: We obtain these sets using f_dan, f_ae_dan, and f_aae_dan, respectively, as the $\mathrm{encoder}_j$ function.
base_transf, ae_transf, and aae_transf: We obtain these sets using f_transf, f_ae_transf, and f_aae_transf, respectively, as the $\mathrm{encoder}_j$ function.
In this section, we present the experiments conducted to demonstrate the validity of the proposed representation.
Assessment of the model representation capability
We performed an evaluation of the intra-cluster distances over each

Intra-cluster distances over each
Figure 6 compares the representations obtained with f_laser (Fig. 6a), f_ae_laser (Fig. 6b), and f_aae_laser (Fig. 6c) on a subset of instances. The visualization is obtained using PCA, keeping 3 principal components. This subset has a total of 4,000 randomly chosen instances, of which 2,000 were taken from the /people/person/nationality relation and the other 2,000 from the remaining relations. It can be observed that the representations obtained with f_aae_laser (Fig. 6c) tend to form 2 clusters, while the representations obtained with f_laser (Fig. 6a) and f_ae_laser (Fig. 6b) are concentrated in the same region of the embedded manifold.

PCA projections of a subset of instances with the representations generated using the functions f_laser, f_ae_laser, and f_aae_laser.
To determine the convenience of using the noise reduction approach proposed in this research, the BGWA method was used as the evaluation classifier considering the macro precision measure as evaluation metric (Table 1). We take as a baseline the results of the BGWA on the original NYT2010 (
Macro precision values after 5 executions of BGWA on each dataset (Cosine distance)
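Macro precision, the metric reported here, is the unweighted mean of the per-class precision values, so every relation counts equally regardless of its frequency. The sketch below computes it in plain Python. The labels and predictions are invented, and assigning a precision of 0 to a class that is never predicted is one common convention assumed here, not necessarily the one used in the experiments.

```python
from typing import List


def macro_precision(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class precision over all observed classes."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions = []
    for label in labels:
        # True labels of the instances that were predicted as `label`.
        predicted_as_label = [t for t, p in zip(y_true, y_pred) if p == label]
        if not predicted_as_label:
            # Assumed convention: a class that is never predicted scores 0.
            precisions.append(0.0)
            continue
        true_positives = sum(1 for t in predicted_as_label if t == label)
        precisions.append(true_positives / len(predicted_as_label))
    return sum(precisions) / len(precisions)


# Invented example with three relation labels.
y_true = ["nationality", "nationality", "founders", "NA"]
y_pred = ["nationality", "founders", "founders", "NA"]

# Per-class precision: NA = 1/1, founders = 1/2, nationality = 1/1.
score = macro_precision(y_true, y_pred)
print(round(score, 4))
```

Because each class contributes equally, removing noisy instances that cause false positives in a rare relation can move this metric noticeably even when the overall number of errors changes little.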
A one-way ANOVA test was applied to the macro precision values to determine whether significant differences exist. In this case, significant differences were found (ANOVA: $F(9, 40) = 5.68$, $p < 4.75 \times 10^{-5}$) with an effect size of $\eta^2 = 0.561$. In the pairwise post-hoc comparisons using t-tests with the Holm correction [7], we only found significant differences between base_dan and aae_laser. The results obtained on the
The function $\mathrm{cleaning}_j$ considers fewer instances as noise over the representations obtained with f_aae_laser. Besides, BGWA obtains higher macro precision values over aae_laser with respect to
From Table 2 it can be seen that the Cosine distance performs better than the Euclidean distance: the macro precision obtained using the Euclidean distance does not outperform the macro precision of the baseline in any dataset. The same happens with the other distance measures evaluated.
Macro precision values after 5 executions of BGWA on each dataset (Euclidean distance)
The representations obtained with AAE-based functions allow grouping instances according to their relations in more compact groups. This allows reducing potentially noisy instances in the original dataset.
The proposed noise reduction approach obtains macro precision values similar to those of the original dataset while considering fewer instances, using the BGWA classifier as the evaluator. We only found significant differences between base_dan and aae_laser. In this last dataset, the macro precision obtained over the original dataset was improved using 77% of the instances. This fulfilled the objective proposed in this work. The obtained results verify the importance of using the proposed data representation for noise cleaning before classifying relations in the DS task. Furthermore, they suggest the usefulness of AAE for obtaining representations for noise reduction without significantly lowering the macro precision values.
We are currently working on the AAE architecture to obtain a dataset that improves on the measures obtained with the original dataset.
Acknowledgments
The present work was supported by CONACyT/México (scholarship 937210 and grant CB-2015-01-257383). Additionally, the authors thank CONACYT for the computer resources provided through the INAOE Supercomputing Laboratory’s Deep Learning Platform for Language Technologies. Finally, we would like to thank Dr. Miguel Á. Álvarez-Carmona from CICESE-UT3 for his comments and suggestions.
Footnotes
Other distances were evaluated, but the best results were obtained using the cosine distance.
Mechanical Turk, MTurk, is a human annotation service provided by Amazon.
