Abstract
To extract structured knowledge from unstructured text sources we need to understand the semantic relationships between entities. State-of-the-art relation extraction techniques take advantage of the abundance of data on the web. However, in domains with sparse data such as social networks which have limited occurrences of entities and relationship patterns, bootstrapping techniques and pattern detection methods are inefficient and inaccurate. In this paper, we introduce REDS, a Relation Extraction approach based on Distant Supervision. REDS extracts the named entities from text documents and assigns a fingerprint to each potential relationship among the named entities. Then, it queries a knowledge repository for similar matches to each fingerprint. An assessor uses the query results and the data statistics to measure the validity of the relationships corresponding to the queried fingerprints, and labels each potential relationship with the predicted type. In addition to handling the relation extraction in presence of data sparsity, REDS uses an information retrieval framework that makes it scalable and capable of dealing with noisy data. We implement and test REDS on a non-English historical archive consisting of unstructured notarial acts and structured civil registers; By means of manual evaluations REDS achieves precision of 0.90.
Introduction
Named Entity Recognition (NER) and relation extraction are among the main sub-problems of information retrieval [14]. NER locates a word or a phrase that refers to a particular named entity within a text [36], and relation extraction focuses on recognizing relations among the named entities in unstructured text [3]. Both sub-problems have been studied extensively in literature [36, 3, 48] where supervised and unsupervised techniques have been used to tackle these problems. Among these techniques, supervised learning is capable of extracting both the named entities and relations from unstructured data [30, 61, 54]. However, the need for manually labeled data renders its application inefficient in real-world settings. An alternative approach is to benefit from the abundance of data on the web. Data redundancy in the web allows for accurate statistical estimates [2, 46]; for instance, the state-of-the-art open information extraction tools such as KnowItAll [27] and Open IE [5, 26] are designed to collect huge amounts of data from the web. They use bootstrapping techniques and/or data statistics to precisely extract information. For instance, as shown by Ravichandran and Hovey [46] by searching the web for “Mozort 1756” or “Newton 1642” a large amount of patterns indicating the year of birth of a person are extracted without the need to a training set. Another example is how KnowItAll [27] searches the web for specific phrases like “… such as …” or “… including …” for extracting new classes of objects.
In contrast to data sources such as Wikipedia and Freebase which contain vast amount of data with high redundancy, sparse datasets have limited occurrences of entities and textual patterns. In absence of manually labeled training sets and data redundancy, distant supervision uses additional knowledge repositories to perform information extraction. This approach is very promising in dealing with sparse data [34], though there exists challenges in dealing with uncertainties. For instance, a single entity might be referred to with different names due to spelling errors and name variations, or multiple entities might be referred with identical names. Therefore, it is mandatory to use entity resolution techniques [17, 18] and in particular identity resolution approaches [57, 8, 6] prior to extraction of relations.
In this paper, we introduce the notion of relationship fingerprint; a relationship fingerprint, or in short a fingerprint, is a noise-tolerant condensed representation of a relationship between two or more individuals; each fingerprint maps the important attributes of the individuals involved in the relationship to a unique tuple. Fingerprints can be indexed and searched in an efficient way. REDS uses a combination of orthographic and dictionary lookup features to extract named entities from the unstructured data. It constructs the candidate relationships from the tuples of named entities existing in each single unstructured document. Fingerprints corresponding to the candidate relationships are generated, and the existing knowledge repository is queried for each fingerprint. If the knowledge repository contains any Levenshtein neighbor of the fingerprint the relation between the named entities is revealed. Besides, the statistics of data are used to assess the confidence level of the discovered relation and predict its type. As an additional step, by using the extracted relations as a training set, we learn a classifier to extract relations from the portions of unstructured data for which no knowledge repository exists. In short, the contributions of this paper can be summarized as follows: 1) Augmentation of relation extraction procedure with an identity resolution method; 2) Scalability of relation extraction for large datasets due to using an inverted index; 3) High precision of relation extraction due to using relationship fingerprints, and 4) no need to any initial training set and being language independent.
The rest of this paper is organized as follows: In Section 2, we introduce the existing work on the identity resolution and relation extraction approaches. Section 3 defines, formally, the problem we are addressing and also introduces the data setup used for examples and experiments. Then, the architecture of REDS approach and its components are described in Section 4 by means of various examples. In Section 5, we report on the experiments and evaluations of implementing REDS on a large corpus; Section 6 discusses the advantages and limitations of this work in practice. Section 7 concludes.
Related work
This work is mainly based on two concepts of identity resolution and relation extraction. We, first introduce related work on identity resolution and discuss the scalability problem in many of the existing work. Then we review some of the existing work on relation extraction, and position our work with respect to the established methods.
Identity resolution
Dealing with the identity resolution problem is the first step in mining and analysis of various social networks. Due to appearence of multiple Online Social Networks (OSNs) in recent two decades, integrating these networks has become an interesting but challenging research topic. Applications of the identity resolution in OSNs include targeted advertisements [10, 53], building expertise networks [39, 40] and refining the individuals’ contact lists [6]. In addition to the OSNs which provide networked data, the genealogical data sets also provide information about a large group of social agents and their interactions in the course of the centuries [49]. Prior to analysis of historical archives, the implementation of identity resolution is a mandatory task, as social agents are mentioned in different documents in absence of any unique identification number. The name variations, errors and missing data are among the challenges in dealing with such datasets.
The first intuitive approach in identifying a group of so-called references which refer to the same entity is to compare the reference attributes. In a social network setting, such attributes might include the given name and family name, gender, dates and locations. Vosecky et al. [56], and Motoyama and Varghese [35] studied such biographical attributes for searching and matching individuals across multiple OSNs. Köpcke et al. [31] gave an extensive comparison of some of the non-learning and learning based approaches which mainly use the biographical attributes. Efremova et al. [23] studied a baseline approach based on the biographical attributes to show that in presence of name variations, missing data and errors, the precision of matchings is very low. In their work, they witnessed different entities having very similar and in many cases identical bibliographic attributes. Li et al. [32] considered historical profiles from different unreliable resources and proposed a source-aware temporal matching algorithm for the identity resolution task. Balaji et al. [4], first, discussed the currently used blocking methods in data cleaning phase of entity resolution task and then, proposed a new ensemble blocking method to combine the results of different blocking methods. Kong et al. [15] proposed an unsupervised approach, called EMAN, to match entities across two or more heterogeneous data sources using locality sensitive hashing.
To improve the accuracy of identity resolution, researchers have exploited the information on social relationships in the network. Considering the neighbor nodes in a social network can significantly improve the precision of the matchings. For instance, Bartunov et al. [6] combined the usage of profile attributes and social relationships to identify all references to an entity. Buccafurry et al. [13] also used the common neighbors of two references to predict location of the me links that connect two references to the same entity. Rahmani et al. [43] used random-walk agents to count the number of paths in the network between two similar references which pass similar neighbor nodes. Existence of such paths corresponds to a high probability that two references refer to the same entity. Bhattacharya and Getoor [7] proposed a relational clustering algorithm to use both the reference attributes and the connections among them.
Using the network relationships increases precision of the identity resolution algorithms; however can cause computational issues. For instance, Bartunov et al. [6] explained that they can apply their identity resolution algorithm in small subgraphs, and face a major challenge to scale it up to large social graphs. This type of limitation is studied in detail by Christen and Gayler [18]. They propose the use of inverted indexing techniques, as a common technique in information retrieval, for real-time, fast and scalable identification of entities. They show that inverted index approaches are up-to one hundred times faster than some of the traditional entity matching techniques.
In this work, similar to the work by Christen and Gayler [18] we address the identity resolution problem in the information retrieval framework. However, instead of matching the references we aim at matching the relationships between them. By choosing the relationships as the targets of the identity matching problem, we achieve a low computationally complexity and still preserve a high accuracy.
Relation extraction
A vast amount of information available to us comes in form of the unstructured data, which can not be efficiently linked to the other sources of information in its original form [58, 33]. Therefore, most of the identity resolution approaches are not applicable in multi source applications which have both structured and unstructured data coexisting. It is important to develop techniques to convert the unstructured data to structured form by extracting the named entities and existing relations. For this, the unstructured data, which is in form of text, needs to be segmented into meaningful chunks of strings [16], and in turn the chunks which correspond to named entities should be identified (i.e., NER [36, 55, 45, 51]). In the second step, the relations between these entities should be extracted (i.e., the topic of relation extraction [60, 29, 19]).
Using supervised learning for extracting information from a text imposes many restrictions such as the human effort needed to prepare the training set, and being domain specific. In order to solve these restrictions for text segmentation, Agichtein and Ganti [1] proposed the use of existing structured data in the data warehouse to learn an automatic segmentation system called CRAM. A very similar approach was used by Borkar et al. [11] in developing a tool called DATAMOLD which was able to automatically segment text. Unsupervised NER has also been well researched as in [28, 37, 21], where important information of the text can be extracted in a recursive manner.
However, solving the relation extraction in the unstructured data, which is the main focus of this paper, is more challenging than the text segmentation and NER. We divide the existing work on relation extraction, which is proposed to overcome the restrictions of supervised relation extraction, into three groups:
In
In
While both
The work proposed in this paper contributes to
Preliminaries
In this section, we first formally define the problem at hand, and then we introduce the datasets that are used in different experiments throughout this paper.
Problem definition
Consider an unstructured dataset
In this paper the term relation refers to the way in which two or more references are connected (e.g., marriage relation, or co-occurrence relation) and is different from the concepts of relations in relational databases.
Additionally, there exists a knowledge repository
We assume that
In order to solve this problem, we propose a novel method for interconnecting the two datasets
Brief description of notations used in our formal problem definition
The genealogical dataset used in this paper consists of two different corpora: 1) the corpus of civil registers in form of structured data, and 2) the corpus of notarial acts in form of unstructured data. These corpora are provided by the Brabant Historical Information Center (BHIC).
Features of each certificate type in corpus of civil registers
Features of each certificate type in corpus of civil registers
This corpus has been very dynamic in the sense that BHIC is continuously digitizing the handwritten archives by using the help of many volunteers. For instance the number of digitized documents has increased from 1,600,000 documents in 2012 to more than 2,000,000 documents in 2015. This amount is increasing everyday.
Figure 1 shows the number of digitized certificates per year for the most recent version of data in 2015. As can be seen in this figure, the amount of birth certificates has been less than number of marriage and death certificates. One of the reasons is the policies of BHIC and the fact that the regions for which marriage and death certificates are turned into structured data spans a larger area than the ones corresponding to birth certificates.
Number of civil registers per year of issue.
We use a Relational database model [20] to integrate and persist the three discussed certificate types considering Entity, Referential and Domain integrity constraints [25]. We choose the Relational database model since this model is widely used, easy to apply and is highly maintainable. [25] discusses in details the Relational database model and its advantages over other databases models.
This dataset contains 226,751 records related the the period 1458 to 1900. The record contain written narration of facts drawn up by civil-law notaries, notary public or an alderman (Schepen in Dutch) authenticated by signature and official seal. Each record contains the main text of narration, place and date of issue and some other details.
Figure 2 shows the number of notarial acts per year. 86,000 notarial acts (i.e., 37% of all notarial acts) are issued after 1800 and overlap with the year of issue in civil registers.
Number of notarial acts per year of issue.
The REDS components used for extracting relations from the unstructured data.
Figure 3 illustrates the inputs, outputs and internal components, C1 to C8, of the proposed relation extraction approach. As its input, REDS has access to a corpus of unstructured data, on the left side of Fig. 3 and a knowledge repository, on the right side of Fig. 3. For an arbitrary text document chosen form the unstructured data corpus, references are extracted in C1 and potential relations are constructed in C2, followed by fingerprint generation in C3. Then, in C4, the knowledge repository is queried for each fingerprint. The result of this query can be used to detect the relations in text after being assessed by the assessor component C5, and also in combination with surface features of the unstructured data that are extracted in C6, it can be fed into a classifier for training purposes handled in C7. In C8 the classifier can be used to extract relations by using surface features of the text.
Next, we formally define each of these components C1 to C8 and use examples from the MiSS2
Mining Social Structures from Genealogical Data, part of the CATCH program funded by the Netherlands Organization for Scientific Research (NWO).
We first describe how to build an inverted index for the knowledge repository
A person full name might contain prefixes before the given name or family name, middle names and other additional components. Each of these components are subject to change in presence of uncertainties and name variations. Therefore we use the notion of condensed representation of a reference that eliminates the unnecessary information of a full name.
.
Let
Depending on the quality of data and the domain of use, the condensed representation of a reference in Definition 1 can be further customized. For instance, each given name or family name can be standardized using a list of standard names. However, such customizations are beyond the scope of this paper; the interested reader can refer to the work of [9, 22].
For each condensed representation, we store all the relation ids that the corresponding references appear in. Also, as we need to keep one additional information for each relation, the type of relation is stored too; this will help us with extracting the type of relations at the query time.
Therefore, to build the inverted index, for each relation from the knowledge repository we do the following. For each reference in the relation, we compute the condensed name of the reference. Then, we build the inverted index for the relation by adding the condensed name to the index
Example: Table 3 shows an example of three civil registers. The knowledge repository and inverted index corresponding to these civil registers are shown in Table 4, in which the “husband-wife” relations are listed and indexed in
Three civil certificates each containing “husband-wife” and “parent-child” relations. Two family names “Cornelesen” and “Cornelessen” are variations of the same family name. Some extra information such as place and date are not shown
Three civil certificates each containing “husband-wife” and “parent-child” relations. Two family names “Cornelesen” and “Cornelessen” are variations of the same family name. Some extra information such as place and date are not shown
A knowledge repository consisting of 7 “husband-wife” relations (in short ‘HW’) shown in dictionary format. In total 11 keys are extracted for the inverted index
Consider a text document
Example: Let
The translation of this Dutch text in English would be: Johana Gijsbers widow of Cornelius Cornelesen living in Erp has sold to Hendrik Geert van der Steen a land in Erp in the Melvert sections A. 89 and 90.
“Johana Gijsbers weduwe Cornelius Cornelesen wonende te Erp heeft verkocht aan Hendrik Geert van der Steen land te Erp in de Melvert sectie A. 89 en 90.”
This text can be broken up into words by using punctuations and whitespaces into 27 simple tokens, as shown below.
Johana_(t1) Gijsbers_(t2) weduwe_(t3) Cornelius_(t4) Cornelesen_(t5) wonende_(t6) te_(t7) Erp_(t8) heeft_(t9) verkocht_(t10) aan_(t11) Hendrik_(t12) Geert_(t13) van_(t14) der_(t15) Steen_(t16) land_(t17) te_(t18) Erp_(t19) in_(t20) de_(t21) Melvert_(t22) sectie_(t23) A._(t24) 89_(t25) en_(t26) 90_(t27)
We introduce the part of sentence tags (GFN) for Given and Family Names, (FNP) for Family Name Prefixes, (LOC) for Locations, (LOP) for Location Prefixes and (UNK) for words with Unknown part of sentence. A Person Reference chucker (PR-Chunker) can detect the person references in the text by searching for the chunks that correspond to a person references, by mainly following the sequence of (GFN) and (FNP) tags. As a full name contains at least two words as the given and family names, no singular (GFN) tag can be accepted as a PR-chunk. Therefore, a full name is considered as a sequence of (GFN) tags or two of such sequences connected via one or more (FNP) tags.
Therefore
Johana_(GFN) Gijsbers_(GFN) weduwe_(UNK) Cornelius_(GFN) Cornelesen_(GFN) wonende _(UNK) te_(LOP) Erp_(LOC) heeft_(UNK) verkocht_(UNK) aan_(UNK) Hendrik_(GFN) Geert_(GFN) van_(FNP) der_(FNP) Steen_(GFN) land_(LOP) te_(LOP) Erp_(LOC) in_ (UNK) de_(FNP) Melvert_(UNK) sectie_(UNK) A._(UNK) 89_(UNK) en_(UNK) 90_(UNK)
For each subset of references that are in consecutive order of each other a potential relation rel can be considered. Depending on the application at hand, relations can consist of two, three or more references in consecutive order.
Example: In analysis of MiSS dataset the focus is very often on “husband-wife” relations that consist of two references. We assume that every two consecutive person references, in the unstructured data, potentially have relations with each other. Therefore, for each unstructured document
The PR-Chunks in this example are
Generating fingerprints (C3)
Using Definition 1, we define the fingerprint of a relation rel as following.
.
Let rel be a relation over a set of references
is defined as the set of condensed representation of each reference
Example: Using Definitions 1 and 2, REDS generates the following two fingerprints
REDS needs to answer two query types: 1) the exact matching queries: These queries contain some strings, and we want to find all the relations which contain references with condensed name equal to one of those strings. 2) the Levenshtein matching queries. These queries contain some strings and an integer a distance threshold
.
Let
where
The Assessor component uses the statistics from data to measure the confidence level of the relation and also predicts its type by using the indexed knowledge repository
.
Let
then we define
Let
We use the support of fingerprint elements and the set of common levenshtein neighbors in Definition 4 to compute the confidence level and predict the type of a relation. These two will be defined as following.
.
Let
Let
where the
We define the threshold level
Algorithm 5 describes how REDS approach can be implemented to extract relations in an unstructured dataset
[!ht] LeftleftThisthisUpup Each
Example: Having an inverted index built for the knowledge repository, we can resolve the stream of queries that are generated to predict the relations in the unstructured data. For each fingerprint
Extracting surface features (C6)
For generating the features for the training set we use the surface features of each document
.
Let rel be a candidate relation defined over the
returns the
Example: We use the extracted relations by REDS to build a training set for classification of new potential relations as either positive (i.e., a “husband-wife” relation) or negative (i.e., no relation extracted). First, every relation fingerprint
Training the classifier (C7) and classifying relations (C8)
REDS assesses the validity of potential relations in the unstructured data
Using Definition 6 and the outputs of Algorithm 5 we can apply a proper machine learning method on the training instances to train classifier DT. DT can then predict the type of a relation based on the surface features of text. Once DT is trained, it can be applied on the text documents to extract relation with no need to distant supervision.
Example: Among the off-the-shelf machine learning methods, we apply the Decision Tree classifier since the patterns discovered by this method are easily interpretable by the domain experts [44]. The decision tree is built top-down from the very top node by using ID3 algorithm [41, 42].4
The C4.5 algorithm, an extension of the ID3 algorithm, can also be used for this purpose. However, for this application we don’t expect any significant change in the outcome.
We use all the 10,000 Positive instances and then we select, randomly, 10,000 of the Negative instances to build a training set. We extract the surface features for each relation instance, and train a decision tree classier for predicting new relations. Figure 4 shows the decision tree with termination criterion of 12 leaves (12 leaves are chosen to keep the decision tree visualizable and interpretable. In practice a much larger decision tree can be built).
Every path from root to leaf represents a single pattern. This tree generates 12 patterns for labeling a relation as Positive (leading to green leaf) for a husband-wife relation or Negative (leading to red leaf) otherwise. For example, left most path indicates that if
A Decision tree built based on the ID3 algorithm to minimize the entropy. This tree generates 12 patterns for labeling a relation as Positive for a “husband-wife” relation or Negative otherwise. The discovered patterns show that in addition to the contents of 
Algorithm 4 shows how DT can be used to find the relations for which due to absence of evidence in
[ht] LeftleftThisthisUpup Each
In this section, we report the results of implementing REDS on the dataset introduced in Subsection 3.2.
Effectiveness of fingerprints
First, we study the effectiveness of using proposed fingerprints by looking at data statistics. Consider a real person as entity
However, as no ground truth exists for this dataset, we can just make an estimation of number of times
The CCDF,
We choose 1000 “husband-wife” relations randomly from the knowledge repository at hand (i.e., the relations in the civil registers). For each relation the fingerprint consisting of a pair of condensed names is extracted. Then, for each condensed name which appears in the fingerprint, we compute its support. We compute the supports for four different Levenshtein distance thresholds
The CCDF for the support of single and paired condensed names. In (a) by increasing the Levenshtein distance threshold 
To further study the changes in maximum support value, for the Levenshtein distance threshold
The changes of maximum support of single and paired condensed names for knowledge repositories of different sizes. By increasing the size of knowledge repository the number of Levenshtein neighbors of single condensed names increases while the maximum number of paired condensed names in the same Levenshtein neighborhood are very robust to changes in size of the knowledge repository.
Following the NER approach discussed previously, we extract, on average 4.3 PR-chunks, from 20,000 notarial acts. For more than 14,000 of these notarial acts at least one candidate relation is generated. In total 83,000 relations are extracted, among which about 12,000 receive positive confidence level and are labeled as Positive instances and 71,000 are labeled as Negative. Figure 7 shows the distribution of scores for these 83,000 extracted reference pairs.
Distribution of confidence scores described based on Definition 5. The reference pairs with zero score are labeled as Negative (i.e., left side of the dashed line) and the reference pairs with non-zero score are labeled as Positive (i.e., right side of the dashed line). The number of Negative instances is almost 6 times larger than the number of Positive instances.
Effect of confidence level threshold on precision of extracted relations and overall number of extracted relations. By increasing the threshold of confidence level 
In order to evaluate the discovered relations by REDS, experts in historical information are asked to check the extracted household relations from notarial acts and decide whether the pair of references have a “husband-wife” relation with each other or not. Manual evaluation of the extracted relations is a very time-consuming process for the domain experts. Clearly it is not possible for the domain experts to evaluate all the 80,000 discovered relations, in an acceptable time. Thus we randomly select 1,000 positive instances, while the confidence level threshold
Evaluation of classifier relations
Next, we evaluate the decision tree DT. First, we apply DT to the
Measuring the accuracy of REDS and DT for extracted relations from the unstructured text. #Ins., #TP, #FP, #TN and #FN are the number of evaluated, true positive, false positive, true negative and false negative instances, respectively
Measuring the accuracy of REDS and DT for extracted relations from the unstructured text. #Ins., #TP, #FP, #TN and #FN are the number of evaluated, true positive, false positive, true negative and false negative instances, respectively
The analysis provided in the previous section, showed that REDS is a precise approach which extracts 90% of the relations correctly and has the potential to generate a training set for supervised learning of a classifier. The classifier can extract new relations with precision of 0.75 and recall of 0.79. In Section 5.3, we showed that in our application the confidence level acts as a binary assessor, such that for non-negative levels we can accept the validity of the relation. We consider this as a result of data sparsity; the entities that are mentioned in the text documents are referred to for a handful of times. Therefore, co-occurrence of two references in at least one relation of the knowledge repository is an indication of a high probability for the validity of the potential relation which includes the same two references. We see this behavior of REDS as an advantage that allows Algorithm 5 to provide a fine-grained filtering of results such that there is no need to the ranking of results.
In Section 5.4, we also saw about 25% False Positives as the result of applying the trained decision tree on a test set. Here we mention two main reasons for prediction of invalid relations: First, NER module can make mistakes in forming the PR-Chunks, and in turn the queries on fingerprints return invalid results. Second and more importantly, due to the dynamics of the data in course of centuries, the text grammar, word collections and topics of the documents change; the patterns that the decision tree is trained to detect do not appear in the text documents in the test set, thus the classifier is not capable of predicting correct relations. This is another proof for the importance of using distant supervision in detecting relations.
The proposed REDS approach doesn’t make any assumption on quality of the knowledge repository. Although, a deduplicated and cleaned knowledge repository might improve the accuracy of REDS, the use of relation fingerprints provides a fast dedupliation of the knowledge repository on the fly. Thus, building an inverted index for the knowledge repository is the only preprocessing requirement of REDS.
Conclusions
In this paper, we addressed the problem of extracting relations from unstructured data. We discussed how the existing relation extraction approaches had limitations such as dependency on manually built training sets, being language dependent or requiring the abundance of data. Considering these limitations, we used a knowledge repository to provide distant supervision to a relation extraction engine. Additionally, unstructured datasets usually suffer from uncertainties such as spelling errors, name variations etc. To resolve these uncertainties, first, we introduced relationship fingerprint as a noise-tolerant condensed representation of a relationship between two or more individuals and then, we used Levenshtein distance to match the fingerprints in the external repository. By using an inverted index in the knowledge repository the proposed method in this paper, called REDS, is very fast and scalable. The proposed approach was implemented on a case study of genealogy, in which a collection of structured civil registers were used as the knowledge repository. The relations in civil registers provided distant supervision to the relation extraction in the unstructured notarial acts. In order to evaluate the precision of the proposed approach, the domain experts, manually, evaluated the outcomes and confirmed the high precision of this technique (precision
Footnotes
Acknowledgments
This research has been supported under the NWO CATCH program in the MISS project (project no. 640.005.003). The authors are grateful to the BHIC center for the support in data gathering, evaluations and direction.
