Abstract
Cross-document person profiling aims to identify and link person entities across Web pages and to extract their relevant structured information. In this paper, we focus on the core task of the person profiling problem, namely attribute extraction. Existing attribute extraction approaches face several challenges, two of the most important being (i) syntactic and structural variation, and (ii) cross-sentence and cross-document information extraction. To alleviate these deficiencies and improve on the performance of existing methods, we propose a semantic attribute extraction approach that relies on probabilistic reasoning. Our approach produces structured, meaningful profiles in which the resulting textual facts are linked to their possible actual meaning in a distant ontology. We evaluate our approach on standard profile extraction datasets. Experimental results demonstrate that our approach achieves better results than several baselines and state-of-the-art counterparts. The results indicate that our approach is a promising solution to the problem of person profiling.
Introduction
Today, people can easily publish data on the Web and on various social media. This results in a vast amount of valuable data, a significant part of which consists of unstructured, free-text documents written in various natural languages. These data can include information about persons, locations, organizations, governments, facilities, vehicles and many other entities. The huge volume of data on the Web and social media brings several challenges, such as being lost in hyperspace and information overload [13]. One potential solution to the chaos of the Web is to transform unstructured data into a structured format, which helps to provide machine-readable knowledge bases and ultimately improves application interoperability, data integration, sharing, reuse and availability. Entity profiling (EP), as a subtask of information extraction (IE), Web mining, and natural language processing (NLP), offers a promising way to map free-structured Web text documents into structured and machine-readable knowledge bases.
In this article, as an element of EP research, we investigate a specific variant of the general EP problem, namely cross-document person profiling on the Web. In cross-document person profiling, we are given a set of Web pages
Many research efforts have been made to solve the problem of Web person AE, but the existing approaches suffer from several fundamental problems. The first is the lack of appropriate text processing tools specific to Web data. Many of the available text processing tools were developed for homogeneous corpora, particularly news corpora, and adapting them to Web data is not a trivial task. The second problem is that many existing AE approaches focus on homogeneous unstructured-style news corpora [44] or on Web pages that share the same style (e.g., Wikipedia articles or individual homepages) [3,61,66,67,74], and adapting them to general Web data is challenging. The third is the problem of structural and syntactic variation, i.e., presenting the same meaning with different surface linguistic expressions in different structures. Web pages use an unbounded vocabulary, structure, and composition style to present approximately similar content [1]. This makes it hard for any Web AE system to cover all variations in writing patterns and structure. The fourth is implicit information scattered within and across sentences and even documents. Most existing AE systems fail to extract this sort of information because they focus on information that exists only within a sentence. These AE approaches use different constraints and filtering techniques to improve performance; however, these constraints are executed in several different stages and cannot be applied collectively. The fifth and perhaps most important problem is the low performance of existing Web person AE methods. These challenges show that the existing solutions are not sufficient for Web data. These observations prompted us to develop a semantic AE approach that alleviates the deficiencies related to structural and syntactic variation and to cross-sentence and cross-document IE. Our approach improves the performance of AE compared to its counterparts.
To summarize, our contributions in this article are as follows:
For Web person AE, we propose a semantic approach that is effective for rich profile IE from various kinds of Web pages, which come from various domains, may contain a lot of noise, and are written in different styles and structures. Our approach is able to extract explicit and implicit structured information within and across sentences in a collective manner by utilizing Markov logic networks (MLNs) [58]. Our AE approach links the resulting textual surface facts to their possible actual meaning in a distant ontology. In this way, the majority of the information contained in a person profile is meaningful and can be readily translated into other languages, which is very helpful for multi-lingual text processing.
In the AE stage, most previous AE work used pre-compiled, manually collected seed keywords and verbs to recognize the class of attributes. We propose an automatic supervised method to extract the seed keywords and verbs by utilizing the PMI weighting schema [25] and a co-reference resolution system. This is efficient in practice and alleviates the need for manual engineering effort.
We perform an extensive evaluation of our proposed approach on real, standard datasets. We show that our method outperforms baseline approaches and state-of-the-art counterparts.
The rest of this paper is organized as follows. Section 2 is devoted to the literature review and presents an overview of related work on attribute extraction. Section 3 describes the working principle of our approach. In Section 4, the proposed algorithm is evaluated on benchmark datasets and the results are compared to baseline methods and state-of-the-art approaches. Finally, Section 5 draws conclusions and discusses future work.
Related work
An important task in cross-document entity profiling is AE. In the following, we review the most significant research work for AE, present their limitations and compare our approach with them. This short discussion highlights the need for developing new and more robust AE approaches.
The task of AE is to distil valid filler values for a set of pre-defined attributes of a given entity from text documents. This task is also known as slot filling [66] or relation extraction [10]. In recent years, many attempts have been made to solve the AE problem. Early AE systems were based on lexico-syntactic rule-based methods and were domain dependent. For example, Li et al. [44] used a multi-level rule-based approach, which relies on various linguistic IE patterns to extract relations from English corpora. A similar approach is presented in [33] for AE in Farsi texts. However, such approaches to IE are limited by the availability of domain knowledge, the difficulty of designing rules for different types of text, and less accurate results in noisy settings. Moreover, these approaches focus on extracting attributes only within a sentence, which is not sufficient for Web text because some relations appear across sentences. To achieve robustness in noisy settings and to extract arbitrary facts, later systems use probabilistic [43] and statistical methods [63,76]. However, these approaches ignore deep semantic analysis of text and do not fully exploit the semantic information it contains. Other work used machine learning methods for Web AE. Supervised learning methods achieved high performance in AE, but they need large amounts of hand-labelled training data to be effective [10]. Due to the lack of large, high-quality annotated training data and the low performance of supervised methods for extracting arbitrary relations from large-scale corpora such as the Web, recent AE systems used semi-supervised learning methods [10,64], bootstrapping methods [70], self-supervision methods [69], distant supervision methods [51], and unsupervised clustering methods [31]. However, each of these methods has its own challenges. For example, bootstrapping methods suffer from the semantic drift problem, and distant supervision methods suffer from noisy training data.
The output of unsupervised clustering methods often does not resemble ontological relations, and the resulting relations are difficult to map into a domain-specific ontology.
In contrast to traditional ontology-based AE, Open IE [12], an emerging trend in IE, aims to extract possible relations in a text without a pre-specified attribute vocabulary and with no domain-specific knowledge engineering effort. The main issue in Open IE is that the extractions are purely surface text and do not resemble domain-specific ontological relations [62]. In recent years, a few studies [60–62,71] have adapted Open IE extractions to pre-specified ontological relations. Soderland et al. [62] propose a two-step approach for adapting Open IE extractions to a domain ontology. In the first step, the Open IE tuples are annotated by a domain concept recognizer; in the second, a number of relation-mapping rules are learned with a cover learning algorithm to map the tuples to domain relations. Since machine learning approaches to learning high-precision mapping rules need more training data, and such a volume of data is not available, the authors in [60] chose to create domain mapping rules manually rather than adopting machine learning approaches. In
Some other work relies on shallow semantic analysis of text [34,37,65]. For example, Surdeanu et al. [65] proposed a rule-based approach that contains a number of mapping rules to map semantic role (SRL) frames to domain-specific attributes. However, previous work relying on shallow semantic analysis has not fully addressed challenging linguistic phenomena such as synonymy and polysemy, and these approaches detect relations that exist only within a sentence, which is not sufficient for Web data and may lead to reduced performance.
More recently, some work has integrated syntactic dependencies and semantic information derived from distant knowledge bases to address challenges such as synonymy and polysemy. For example, Moro and Navigli [52] combined syntactic dependencies and distributional semantic information to extract ontological relations. However, the resulting relations are still bound to surface text and lack actual semantic content. Bovi et al. [29] developed
Similar to [29] and [32], our approach draws on the idea of enriching dependency graphs with semantic information derived from a distant knowledge base. However, their approach has two limitations: (i) it extracts only relations that exist within a sentence containing a verb, which is not sufficient for Web text because some relations occur in clauses that have no verb predicate, and (ii) it cannot discover implicit relations, which are implied by the text. Our approach can extract cross-sentence verb-based relations and also infer implicit relations in a discourse document through probabilistic inference about relations. In addition to verb-based relations, our approach is able to identify noun-based relations accurately.
Comparison of our approach and closely related AE methods
V: extracts verb-based attribute values from verb phrases in unstructured text fragments? N: extracts noun-based attribute values from noun phrases in unstructured text fragments? W: extracts web-based attribute values from structured text fragments?
Joint inference and MLNs have become popular recently because they make it possible for features and constraints to be shared among tasks. For example, Chen et al. [22] used MLNs to infer implicit family relations between persons. In other work, Yu and Lam [74] used MLNs to extract relations between entities in Wikipedia articles. Our AE features for extracting verb-based attributes are closely related to those of Yu and Lam [74]. However, they applied MLNs to extract relations from raw text, whereas we apply MLN formulae to SRL frames and dependency graphs. Furthermore, they have not fully addressed the challenges of synonymy and polysemy. These methods generate relations in textual surface form that are not linked to distant knowledge base entries.
After this short review, it can be seen that extracting semantic, structured information in natural language Web text is still a challenging, unsolved problem. This highlights the need for developing new and efficient AE approaches.
Table 1 shows a comparison of our AE approach and closely related AE approaches. In summary, our work extends previous AE research in a number of ways. Our approach produces semantic profiles by making rich use of distant knowledge bases and by deeply analysing the text syntactically and semantically. Our verb-based AE alleviates the problem of syntactic writing variation that exists in syntax-based AE methods. In contrast to previous work in relation extraction, which mainly focused on extracting atomic facts that exist only within sentences of individual documents, we focus on cross-document person profiling, which attempts to extract explicit and implicit relations within and across sentences and documents.
Figure 1 shows the state diagram of our proposed approach to the problem of cross-document person profiling. As shown in Fig. 1, we decompose the problem into four subtasks:
In the following, we describe these components in more detail.

The state diagram of our cross-document person profiling system.

An example of unstructured-style and structured-style data expression format in a sample Web page.
In this article, we focus on the textual part of Web pages, because the majority of the information about entities on the Web is expressed in natural language text. The Web pages need to be pre-processed and prepared according to the format expected by the system, in which the content of a Web page is transformed into a set of structured and unstructured text fragments. We decompose pre-processing into five main steps: (i) HTML tag removal, (ii) named entity tagging, (iii) co-reference resolution, (iv) sentence splitting, and (v) fragment type identification. First, for each Web page, Jsoup1
Jsoup: Java HTML Parser, [
The content of a Web page is often expressed in a mixture of different representations: unstructured format and structured format [21,50]. Unstructured text follows prescribed writing standards and is prepared for a fairly broad audience [21,50]. Unstructured writing needs to be clear and unambiguous. Longer and complete sentences are likely to be more prevalent in unstructured text. Complete sentences usually contain a subject, an object and one or more verbs. In contrast, structured text has few constraints on writing format, mixes various representations, is prepared quickly and is intended for a narrow audience [50]. In Fig. 2, we show an excerpt of structured and unstructured fragments. The structured and unstructured expression formats each require different information extraction methods. Identifying the expression type of a fragment helps us to overcome the problem of structure variation and to choose proper attribute extraction methods (Section 3.3) for extracting entity-centric information according to the data representation format. Therefore, in fragment type identification, we classify text into structured and unstructured fragments. We adapt and modify the method proposed by Chen et al. [21] to segment text. There are two stages in fragment type identification: sentence type identification and sentence grouping. In sentence type identification, Chen et al. [21] used a rule-based method to classify lines in a Web page as structured or unstructured according to the percentage of tokens that begin with a capital letter. We extend their approach by using support vector machines (SVMs) [27], a supervised machine learning algorithm, to classify each sentence in the text into one of two classes, structured or unstructured. The main features for classification are the percentage of capitalized tokens and the length of the sentence.
The selection of these features comes from the fact that a structured sentence is mainly short and contains capitalized tokens. To perform sentence classification, we first manually select 100 sentences from the WePS-2 training data [7] and then annotate them in terms of the classification features. Out of these, 50 are structured examples and 50 are unstructured ones. We then use LibSVM toolbox3
The pre-processing tools may produce errors, which propagate to later stages. However, improving the pre-processing components is beyond the scope of this paper. The remainder of the processing described in the following uses this pre-processed text.
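The two sentence-level features used for fragment type identification can be illustrated with a minimal Python sketch. The function names, the whitespace tokenization and the threshold values are assumptions made for illustration only; in particular, the fixed threshold rule below merely stands in for the trained SVM and is closer in spirit to a rule-based classifier.

```python
def sentence_features(sentence):
    """Return (fraction of capitalized tokens, number of tokens) for one sentence.

    Tokenization by whitespace is an assumption of this sketch.
    """
    tokens = sentence.split()
    if not tokens:
        return 0.0, 0
    capitalized = sum(1 for token in tokens if token[0].isupper())
    return capitalized / len(tokens), len(tokens)


def classify_fragment(sentence, cap_threshold=0.5, max_len=8):
    """Hypothetical threshold rule standing in for the trained SVM classifier.

    cap_threshold and max_len are illustrative values, not taken from the paper.
    """
    cap_ratio, length = sentence_features(sentence)
    if cap_ratio >= cap_threshold and length <= max_len:
        return "structured"
    return "unstructured"


print(classify_fragment("Born: Yonkers, NY"))
print(classify_fragment("He was born in Yonkers and later moved to California"))
```

In a learned setting, the pair returned by `sentence_features` would be the feature vector passed to the classifier, and the decision boundary would be estimated from the annotated sentences rather than fixed by hand.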
The semantic analysis component takes the pre-processed sentences as input, analyses each sentence semantically and syntactically, and produces semantically augmented syntactic dependencies and SRL frames. Semantic analysis consists of four stages: (i) word sense disambiguation, (ii) semantic role labelling, (iii) dependency parsing and (iv) semantic enrichment. In the following, we describe these steps in more detail.
Word sense disambiguation
In the word sense disambiguation (WSD) stage, we disambiguate each surface text word and entity mention to identify which of its senses is used in the text. In this way, we alleviate the polysemy and synonymy problems from which many previous attribute extraction works suffer. Our WSD system takes as input the target entity mention e and the context S(e) around it, and then utilizes Babelfy [53], a state-of-the-art entity linking and word sense disambiguation system, to obtain a sense mapping from surface text words and entity mentions to word senses and semantic ontological named entities. We define S(e) as follows:

A sample fragment of text with four example sentences, three entities and their co-referent mentions.
We map the word’s synset offset generated by Babelfy to possible corresponding WordNet [35] sense and DBPedia4
In semantic role labelling, we assign a “who did what to whom, when, where, why and how” structure to each sentence. We refer to this structure as a semantic role (SRL) frame. We use SENNA [26] to extract SRL frames from the sentences of a text. Note that any semantic role labelling system can be used for SRL frame extraction. We chose SENNA because it is competitive with state-of-the-art SRL systems and is open source. Let

An excerpt of semantic analysis for a sentence; (a) sample sentence, (b) disambiguated word senses; (c) SRL frame Frm in PropBank style extracted by SENNA system; (d) SRL frame Frm in VerbNet style; (e) dependency graph
To cope with the problem of predicate sense dependency of semantic roles, a conversion is needed to map PropBank-style roles into VerbNet domain-independent thematic roles such as agent, patient, theme, etc. To fulfil this task, we employ the PropBank and VerbNet mapping provided by the SemLink project5
The token mapping takes as input the lexeme p and its corresponding mapping vector L, and decides which
We use the Stanford dependency parser6
Semantic enrichment stage augments each SRL frame Frm and syntactic dependency graph
Attribute extraction
The attribute extraction (AE) component takes as input the person names in question and their relevant documents, as processed by the semantic analysis stage. The person names and their relevant documents are provided in the evaluation datasets. The AE component then extracts the attributes of the persons and forms their discourse profiles. Formally, we define the discourse profile of an entity e, P(e), as follows:
Vocabulary of attributes in question
To handle this problem, we use unstructured-style AE and structured-style AE methods. These two methods are complementary and have little overlap in the resulting extractions. We run the unstructured-style AE module and the structured-style AE module in parallel and report the union of their extractions as the final result. If the two methods produce different outputs for a target attribute, we use only the unstructured-style AE result. In the following, we describe these AE methods in more detail.
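The combination policy described above can be sketched as follows. The dictionary-of-sets representation and the function name are assumptions of this sketch, and "different outputs" is interpreted here as both modules returning non-empty values for the same attribute.

```python
def merge_extractions(unstructured, structured):
    """Combine the outputs of the two AE modules for one person.

    Both arguments map an attribute name to a set of filler values
    (a hypothetical representation). When both modules return values for
    the same attribute, the unstructured-style result takes precedence.
    """
    merged = {}
    for attr in unstructured.keys() | structured.keys():
        from_unstructured = unstructured.get(attr, set())
        from_structured = structured.get(attr, set())
        merged[attr] = set(from_unstructured) if from_unstructured else set(from_structured)
    return merged


profile = merge_extractions(
    {"birthplace": {"Yonkers"}},
    {"birthplace": {"New York"}, "occupation": {"professor"}},
)
print(profile)
```

Because the two modules are reported to overlap only rarely, the precedence rule should affect few attributes in practice, and the result is otherwise the plain union.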
Let
Verb-based attribute extraction. Verb-based AE takes as input the SRL frames of each document that has been processed by semantic analysis, and extracts discourse filler values for the attributes in question. Our key idea in using SRL frames for AE is twofold: (i) semantically labelled arguments in SRL frames largely correspond to the filler values of the attributes in question, and (ii) mapping SRL frames to domain-specific attributes is more straightforward than extracting attributes from raw text. To extract filler values, we use several AE formulae/features and employ Markov logic networks (MLNs) [58] to model interwoven constraints and AE formulae that map the SRL frames to possible attributes. MLNs are joint models that combine the merits of first-order logic (FOL) and probability. Our choice of MLNs is guided by the following facts: (i) MLNs are able to model uncertainty efficiently using unified probabilistic graphical models; (ii) MLNs model the IE task in a collective manner, i.e., they are capable of performing within-sentence and cross-sentence IE, which differs from other AE methods that predict relations independently without considering the dependencies between entities across sentences; and (iii) MLNs incorporate first-order logic to express and combine various sources of knowledge.
Formally, an MLN L is a set of pairs (F, w), where F is a formula in FOL and w is a real number indicating the weight attached to formula F. In our work, each FOL formula F is composed of a set of AE features and constraints that capture the contextual information of the focus entity and extract filler values for given attributes. We represent the AE features and constraints using four types of symbols: constants, variables, functions, and predicates. Constants represent the objects in discourse documents, e.g., person names: “Amenda Lentz”, “Alexander Macomb”. Variables (e.g., x, y) range over the ground objects. Predicates represent attributes of objects or relations among objects within discourse documents (e.g., Father(x, y), Synonym(x, y)). We represent each SRL frame Frm as several predicates of the form SFrm(dID, pred, srl, arg), where dID is the document id, pred is the verb predicate in SRL frame Frm, srl is the semantic role label, and arg is the argument value for role srl in the frame Frm. Similarly, we represent attribute classes as predicates of the form Attr(dID, e, v), where Attr represents an attribute class, e is a person entity, and v is the filler value of Attr for e. In the following, we discuss the features that we use for verb-based AE.
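As an illustration of the SFrm(dID, pred, srl, arg) representation, the following minimal sketch turns one SRL frame into ground atoms. The `SFrm` tuple type, the `frame_to_atoms` helper and the role dictionary are hypothetical; the example frame is built from the sample sentence about Daniel Jurafsky.

```python
from typing import NamedTuple


class SFrm(NamedTuple):
    """One ground atom SFrm(dID, pred, srl, arg)."""
    doc_id: int
    pred: str
    srl: str
    arg: str


def frame_to_atoms(doc_id, pred, roles):
    """Expand one SRL frame into one SFrm atom per (role, argument) pair.

    `roles` maps a semantic role label to its argument text
    (an assumed input format for this sketch).
    """
    return [SFrm(doc_id, pred, role, arg) for role, arg in roles.items()]


atoms = frame_to_atoms(1, "born", {"agent": "Daniel Jurafsky", "AM-Loc": "Yonkers, NY"})
for atom in atoms:
    print(atom)
```

Each atom can then be serialized as an evidence fact for the MLN engine; the exact evidence file syntax depends on the engine and is not shown here.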
The main challenge here is the identification of seed trigger verbs and keywords for each of the attribute classes. The seed trigger words can simply be constructed manually, as in previous work [11,21,54,60–62,68]. These approaches use pre-compiled lists of attribute-specific keywords that were collected manually from different information resources such as Wikipedia, the DBpedia and FreeBase ontologies, and tables found on the Web. For example, for the attribute “occupation”, a list of occupations is collected from DBpedia to form the keyword set “occupation”. However, creating such lists requires human supervision and expertise and is labour intensive. To solve this issue, we propose a supervised approach for learning seed facts from training data. Our approach is based on the world knowledge of co-occurring words.
Our seed learning approach mainly consists of two steps: (i) seed extraction, and (ii) seed pruning. Algorithm 1 shows the pseudo code of our seed fact extraction approach. Let
To select the best-matching seeds, we score the candidate seeds and select the high-scoring ones. To do this, we assign a weight to each candidate seed. We compute
Let
Algorithm 4 shows the pseudo code of seed pruning stage. The remaining words are selected as trigger seeds for a given attribute class
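The idea of weighting candidate seeds by their association with an attribute class can be illustrated with the standard pointwise mutual information, PMI(w, a) = log(P(w, a) / (P(w) P(a))). The sketch below is an assumption about how such a score could be estimated from sentence-level co-occurrence counts; the exact weighting and pruning used in our algorithms are not reproduced here, and the input format and function name are hypothetical.

```python
import math
from collections import Counter


def pmi_scores(examples):
    """Score (word, attribute) pairs with standard PMI.

    `examples` is a list of (attribute_label, set_of_words) pairs, one per
    training sentence (an assumed format). Probabilities are estimated as
    relative frequencies over sentences.
    """
    n = len(examples)
    word_count = Counter()
    attr_count = Counter()
    joint_count = Counter()
    for attr, words in examples:
        attr_count[attr] += 1
        for word in set(words):
            word_count[word] += 1
            joint_count[(word, attr)] += 1
    return {
        (word, attr): math.log((count / n) / ((word_count[word] / n) * (attr_count[attr] / n)))
        for (word, attr), count in joint_count.items()
    }


examples = [
    ("birthplace", {"born", "in"}),
    ("birthplace", {"born"}),
    ("occupation", {"works", "in"}),
    ("occupation", {"works"}),
]
scores = pmi_scores(examples)
print(scores[("born", "birthplace")], scores[("in", "birthplace")])
```

In this toy input, "born" occurs only with "birthplace" and receives a positive score, whereas "in" is spread evenly over both classes and receives a score of zero, so it would be a natural candidate for pruning.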
Since we did not have the ground truth of seed facts in the source text, we omitted the evaluation of seed extraction and instead focused on measuring the quality of the identified and extracted seeds. We approached this task by randomly selecting a sample of 100 seeds from each of the keyword and event verb sets extracted from the WePS-2 training dataset [7] and the Wikipedia training dataset [28]. We manually examined the sentences corresponding to each sampled seed to identify whether the extracted seed is correct or incorrect. We then counted the correct and incorrect seeds in the sampled data and computed performance scores. Table 3 shows the performance obtained by our seed extraction method on the benchmark training datasets. In the experiments, we conducted evaluations using three criteria: precision (P), recall (R), and F1 score. For more detail about these metrics, see e.g. [57]. Although the performance is not ideal, we find that the resulting seeds are sufficient in practice. In future work, we plan to enrich seed verbs and keywords using extra Web resources and various ontologies to improve the performance.
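For reference, the three evaluation criteria can be computed from counts of true positives, false positives and false negatives as in the following minimal sketch. The function name is hypothetical and the counts in the usage line are made-up illustrative numbers, not results from our experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts.

    Each score is defined as 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Illustrative counts only.
print(precision_recall_f1(tp=80, fp=20, fn=20))
```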
Performances of our seed extraction method on benchmark training datasets
Given attributes’ class recognizers, an SRL frame SFrm is considered as a potential candidate for an attribute
w: Verb(v) ∧ Synonym(v, “born”) ∧ SFrm(dID, v, “agent”, e) ∧ SFrm(dID, v, “AM-Loc”, g) ⇒ Birthplace(dID, e, g)
This formula looks at the arguments of an SRL frame and decides which argument corresponds to the “birth place” of the focus person e. In the above formula, the predicate Synonym(v, “born”) first generates all WordNet synonyms of the verb “born” and from synonymous vector
w: Verb(v) ∧ Synonym(v, “born”) ∧ SFrm(dID, v, “agent”, e) ∧ SFrm(dID, v, “AM-Loc”, g) ∧ Location(dID, g) ∧ Person(dID, e) ⇒ Birthplace(dID, e, g)
The predicate Person(dID, e) ensures that the agent of “birth place” is a person in document dID. Similarly, the predicate Location(dID, g) ensures that the argument g is a location named entity in document dID. We used the Stanford named entity tagger to mark common entities, including persons, locations and organizations, as indicators of possible values for attributes. To our knowledge, there is currently no tool that can extract some special named entities (e.g. occupation, award, major and degree), which is a major issue for named entity recognition in Web data. To mark such special named entities, we use our seed extraction strategy as explained in the section above.
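To make the structure of the typed birth-place formula concrete, the sketch below checks its body against a set of ground atoms and emits the matching Birthplace facts. This is a deterministic simplification: it ignores the weight w and the probabilistic inference carried out by the MLN engine. The tuple layout, the function name and the small synonym set (a stand-in for the WordNet lookup) are assumptions of the sketch.

```python
# Hypothetical stand-in for the WordNet synonym lookup of the verb "born".
BORN_SYNONYMS = {"born", "bear"}


def birthplaces(sfrm_atoms, persons, locations):
    """Apply the typed birth-place rule as a hard (unweighted) rule.

    sfrm_atoms: set of (doc_id, verb, role, arg) tuples.
    persons, locations: sets of (doc_id, mention) tuples from NER.
    Returns a set of (doc_id, person, place) tuples.
    """
    results = set()
    for doc_id, verb, role, agent in sfrm_atoms:
        if verb not in BORN_SYNONYMS or role != "agent":
            continue
        if (doc_id, agent) not in persons:
            continue
        # Look for a location argument of the same verb in the same document.
        for doc_id2, verb2, role2, place in sfrm_atoms:
            same_frame = (doc_id2, verb2, role2) == (doc_id, verb, "AM-Loc")
            if same_frame and (doc_id, place) in locations:
                results.add((doc_id, agent, place))
    return results


atoms = {
    (1, "born", "agent", "Daniel Jurafsky"),
    (1, "born", "AM-Loc", "Yonkers, NY"),
    (1, "born", "AM-TMP", "1962"),
}
print(birthplaces(atoms, {(1, "Daniel Jurafsky")}, {(1, "Yonkers, NY")}))
```

The Person and Location checks play the role of the type constraints in the formula: without them, any argument labelled with a locative role would be accepted as a birth place.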
w: Verb(v) ∧ Synonym(v, “marry”) ∧ SFrm(dID, v, “agent”, e) ∧ SFrm(dID, v, “patient”, y) ∧ MotherOf(dID, y, z) ∧ Person(dID, e) ∧ Person(dID, y) ∧ Person(dID, z) ∧ Co-occur(dID, e, y, z) ⇒ Relatives(dID, e, z)
When applying inference rules, we exploit co-reference information to make the extractions more precise. To infer a filler value for an attribute of the focus entity e, we use the SRL frames of the sentences appearing in neighbourhood of entity e, S(e) as defined in Eq. (1). In Eq. (1), we set
w: Verb(v) ∧ Synonym(v, “win”) ∧ SFrm(dID, v, “agent”, e) ∧ SFrm(dID, v, “theme”, y) ∧ Person(dID, e) ∧ AwardIndicator(dID, y) ∧ IsNounPhrase(dID, y) ⇒ Award(dID, e, y)
w: Verb(v) ∧ Synonym(v, “born”) ∧ SFrm(dID, v, “AM-NEG”, “not”) ∧ SFrm(dID, v, “agent”, e) ∧ SFrm(dID, v, “AM-Loc”, g) ⇒ null
w: Verb(v) ∧ Synonym(v, “marry”) ∧ SFrm(dID, v, “agent”, e) ∧ SFrm(dID, v, “patient”, y) ∧ Person(dID, e) ∧ Person(dID, y) ⇒ !Birthplace(dID, e, y) ∧ !Degree(dID, e, y)
This formula states that the relation between two persons cannot be “birth place” or “degree”. In this way, argument coherence is enforced.
Noun-based attribute extraction. The filler values of attributes that are contained in noun-based constructions cannot be extracted by verb-based AE. For example, in the sentence given in Fig. 4(a), verb-based AE cannot conclude that the phrase “professor” is a filler value of the attribute “occupation” for the person “Daniel Jurafsky”, because this attribute is not expressed by the verb “born”. This phenomenon occurs in a variety of domains, as in “The New York mayor, Bloomberg signed the contract”. To extract noun-based attributes, we define a series of MLN AE rules, which exploit the semantic boosted syntactic dependencies and the lexical information in the form of named entities and keywords. The overall strategy of our noun-based AE is inspired by previous work [61], but there are significant differences between that approach and ours. The limitations of that approach are that (i) attribute filler values that are multi-word expressions, covering more than one word in the dependency graph, cannot be correctly extracted, and (ii) it may produce irrelevant extractions because it ignores co-reference information between nominal mentions containing attributes of interest and target entity mentions. We extend that approach in five ways: (i) extracting both single-word and multi-word attribute values from the semantic boosted dependency graph; specifically, borrowing the idea from the work of Bovi et al. [29], we couple syntactic dependencies with fully disambiguated entity mentions and word senses to solve the problem of multi-word attribute extraction; (ii) applying the noun-based AE method to the semantically enriched syntactic dependency graph instead of the pure syntactic dependency graph; (iii) considering the co-reference information between the nominal mentions that include attributes of interest and target entity mentions, a constraint that discards erroneous and irrelevant extractions; (iv) proposing an automatic method to extract trigger keywords that are indicators of the given attributes; and (v) modelling noun-based AE rules in the form of MLN rules, which enables us to model all of the attribute-specific constraints in a collective manner instead of designing separate filters.
The input to the noun-based AE algorithm is the query entity e, a set of semantic boosted dependency graphs generated for sentences related to entity e, and a keyword set K extracted automatically by our seed extraction system. We first represent each dependency graph as a set of predicates of the form DepRel(dID, rel, arg1, arg2), where DepRel is a dependency relation, dID is the document id, and rel is the dependency arc between the two arguments arg1 and arg2. For example, by setting dID to 1, we represent the dependency graph in Fig. 4(f) as follows:
DepRel (1, “nsubjpass”, “born”, “Daniel Jurafsky”)
DepRel (1, “appos”, “Daniel Jurafsky”, “professor”)
DepRel (1, “prep_of”, “professor”, “Stanford”)
DepRel (1, “prep_in”, “born”, “1962”)
DepRel (1, “prep_in”, “born”, “Yonkers, NY”)
Our algorithm then iterates on the corresponding predicates of a dependency graph, and looks for a co-occurrence of a keyword
w: DepRel(dID, “amod”, e, k) ∧ Co-refer(dID, e, k) ∧ Person(dID, e) ∧ NationalityIndicator(k) ⇒ Nationality(dID, e, k)
w: DepRel(dID, “appos”, e, m) ∧ DepRel(dID, “amod”, m, k) ∧ Co-refer(dID, e, k) ∧ Person(dID, e) ∧ NationalityIndicator(k) ∧ IsNounPhrase(m) ⇒ Nationality(dID, e, k)
w: DepRel(dID, “nsubjpass”, e, m) ∧ DepRel(dID, “amod”, m, k) ∧ Co-refer(dID, e, k) ∧ Person(dID, e) ∧ NationalityIndicator(k) ∧ IsNounPhrase(m) ⇒ Nationality(dID, e, k)
where Co-refer(dID, e, k) means that the entity e and keyword k are co-referent in the document dID, NationalityIndicator(k) implies that the keyword k is an indicator for the attribute “nationality”, and the predicate Nationality(dID, e, k) means that the keyword k is a candidate value for the “nationality” of entity e. The first rule covers cases like “American professor, Daniel Jurafsky was born in Yonkers, NY”. The second rule works for cases like “Daniel Jurafsky, the American professor was born in Yonkers, NY”. This rule concludes that there is a path between the entity “Daniel Jurafsky” and the keyword “American”. This path contains the dependency arcs “appos” and “amod”. It is a valid path and meets the constraints, so the keyword “American” is a valid filler value for the attribute “nationality”. The third rule works for cases like “Daniel Jurafsky is the American professor”. Using this type of rule, we remedy the weakness of verb-based AE in extracting hypernymy relations: verb-based AE cannot extract attribute values when the main verb in a sentence is one of the auxiliary verbs (light verbs) that have zero arguments in semantic role labelling annotation.
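The second nationality rule, which follows an “appos” arc and then an “amod” arc, can be sketched as a path check over DepRel atoms. As with any hard-rule rendering, the sketch ignores the weight w and the joint inference of the MLN; the tuple layout and the function name are assumptions made for illustration.

```python
def nationality_via_apposition(dep_atoms, corefer, persons, indicators, noun_phrases):
    """Apply the appos + amod nationality rule as a hard (unweighted) rule.

    dep_atoms: set of (doc_id, rel, arg1, arg2) tuples, i.e. DepRel atoms.
    corefer: set of (doc_id, entity, keyword) tuples, i.e. Co-refer atoms.
    persons: set of (doc_id, entity) tuples.
    indicators: set of nationality keywords (NationalityIndicator).
    noun_phrases: set of strings for which IsNounPhrase holds.
    Returns a set of (doc_id, entity, keyword) tuples.
    """
    results = set()
    for doc_id, rel, entity, middle in dep_atoms:
        if rel != "appos" or (doc_id, entity) not in persons:
            continue
        if middle not in noun_phrases:
            continue
        # Follow the "amod" arc from the appositive noun to the keyword.
        for doc_id2, rel2, head, keyword in dep_atoms:
            on_path = (doc_id2, rel2, head) == (doc_id, "amod", middle)
            if on_path and keyword in indicators and (doc_id, entity, keyword) in corefer:
                results.add((doc_id, entity, keyword))
    return results


dep = {
    (1, "appos", "Daniel Jurafsky", "professor"),
    (1, "amod", "professor", "American"),
}
print(nationality_via_apposition(
    dep,
    corefer={(1, "Daniel Jurafsky", "American")},
    persons={(1, "Daniel Jurafsky")},
    indicators={"American"},
    noun_phrases={"professor"},
))
```

The other two rules differ only in the first arc ("amod" directly on the entity, or "nsubjpass" instead of "appos"), so they could be covered by parameterizing the first relation label.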
Since dependency arcs in dependency graph
We use Tuffy [55] to implement the MLN formulas. Our choice of Tuffy is guided by the facts that (i) it is efficient for inference, (ii) it is competitive with state of the art implementations of MLNs in both quality and speed, and (iii) it scales to large data. We manually designed around 68 fact extraction and inference rules. We used the Diagonal Newton discriminative learning algorithm [47] to learn the optimal weights of the MLN rules. To do so, we give a dump of training data (50 randomly sampled pre-processed documents of the WePS-2 training set) and the manually designed MLN rules to the learning algorithm. The algorithm then computes the weights of the MLN rules by maximizing the likelihood of the training data. To speed up the learning process, we make a simplifying independence assumption for all rules so that we can learn the rule weights individually. The weight of an MLN rule indicates how strongly the rule is supported by the training data. Intuitively, the system gives an MLN rule a high weight if it covers many facts.
Note that, to reduce complexity, in the attribute extraction step we apply the MLN rules to SRL frames extracted from a single discourse document at a time.
When the instances of attributes are expressed in structured-style format, the formal-style AE approaches cannot be applied directly, because structured-style text does not follow a standard writing format. To extract the attributes from structured-style fragments, we use Web-specific patterns. The overall strategy of our Web-specific patterns is similar to the AE methods of [11,21]. However, those methods obtain attribute-specific keywords manually. In contrast, we use an automatic keyword extraction method as described in Section 3.3.1. Each Web-specific pattern uses attribute-specific gazetteers and regular expressions to mark potential values for attributes in structured-style fragments. Let
For each attribute
Once we have extracted values for personal attributes, we decide which values are valid and which ones should be ignored. We use a set of validation rules to verify the correctness of attribute values. These rules are attribute-specific constraints defined for each attribute according to its type. For example, for the “date of birth” attribute we defined date validation rules, for the “phone” and “fax” attributes suitable phone and fax validation rules, and so on. A candidate value is considered a valid filler for a given attribute if it satisfies all constraints specified for that attribute; otherwise it is discarded. For example, a correct value for the “website” attribute must contain the original name of the person under consideration, any variation of the given name identified by the co-reference module, or the domain name from the “email” attribute. We formulated the attribute-specific constraints as regular expressions and attribute-specific rules. Note that for the “date of birth” attribute, we first normalize the date values using the Stanford SUTime library [19] and then validate them. Since we resolve polysemy and synonymy at the WSD stage (Section 3.2.1), we do not need a separate normalization step for abbreviations and location names, whereas most previous work suffers from the problems of polysemy and synonymy. After validating the attribute values, we create a discourse profile for each person by assembling his/her related attributes while eliminating redundant information.
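A minimal sketch of such validation rules is given below. The regular expressions, length limits and the name-token heuristic are illustrative assumptions, not the constraints actually used in our system.

```python
import re

# Hypothetical patterns; real validation rules may be stricter.
EMAIL_RE = re.compile(r"^[\w.+-]+@([\w-]+\.)+[A-Za-z]{2,}$")
PHONE_RE = re.compile(r"^\+?[\d\s().-]{7,20}$")


def valid_email(value: str) -> bool:
    return EMAIL_RE.match(value) is not None


def valid_phone(value: str) -> bool:
    """Accept phone/fax strings with 7 to 15 digits and common separators."""
    digits = re.sub(r"\D", "", value)
    return PHONE_RE.match(value) is not None and 7 <= len(digits) <= 15


def valid_website(value: str, name_variants, email=None) -> bool:
    """A website must mention a name variant or the e-mail's domain name."""
    url = value.lower()
    # Assumption: a name matches if any of its tokens (length > 2) occurs in the URL.
    tokens = [t for name in name_variants for t in name.lower().split() if len(t) > 2]
    if any(t in url for t in tokens):
        return True
    if email and "@" in email:
        return email.split("@", 1)[1].lower() in url
    return False


VALIDATORS = {"email": valid_email, "phone": valid_phone, "fax": valid_phone}


def is_valid(attribute: str, value: str) -> bool:
    """Attributes without a registered rule are accepted unchanged in this sketch."""
    check = VALIDATORS.get(attribute)
    return check(value) if check else True
```

A candidate value that fails its validator would simply be dropped before the discourse profile is assembled.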
Name disambiguation
The individual profiles derived from multiple sources exhibit different attributes of an entity and do not entirely overlap, so these profiles can complement each other. To integrate the profiles extracted from multiple documents and create corpus-level profiles of persons, we need to solve the problem of name ambiguity across documents and establish an exact one-to-one correspondence between persons and their profiles at the corpus level. We formulated person name disambiguation as a clustering problem. Let
We combine two important types of information about persons as clustering features. The first is the personal attributes stored in the persons’ profiles. The second is the social links between persons. The main idea behind our approach is that the attributes of a person can complement his/her social links, and vice versa. In other words, if one source of information is missing or noisy, the other can make up for it. Social links provide a better interpretation of the information contained in discourse profiles. Similarly, the personal attributes give meaning to social links between persons, as they identify the roles the persons play with respect to each other (mentor of somebody, brother of somebody). However, we found that the attributes stored in discourse profiles are not sufficient for robust name disambiguation. To alleviate this issue, we propose a profile enrichment method that enriches local discourse profiles with additional global semantic information extracted from distant knowledge bases.
Algorithm 5 shows the pseudo-code of our person name disambiguation approach. Our approach first extracts the social relationships of person entities and creates a social relationship graph. Our social link extraction approach follows closeness centrality theory [15]. We assume that a social relationship between entities in the real world is reflected by their closeness in the text of the documents in which they are mentioned, and that two entities are socially linked if they frequently co-occur in a corpus. In the profile enrichment step, the system takes the discourse profile as input and enriches the local discourse profiles with rich global features retrieved from a distant knowledge base by considering co-occurring entities and their surrounding context. The graph creation component takes as input the social relationship graph and the enriched discourse profiles associated with each ambiguous person name and maps them into an undirected graph G. Graph clustering takes as input the graph G and a pre-defined similarity measure (in our research, neighbourhood random walk distance) and groups graph nodes into a set of disjoint clusters
Social link extraction
The social link extraction phase takes as input the target ambiguous name and its related documents and builds a social graph for that name. Our social link extraction approach follows closeness centrality theory [15]. We assume that a social relationship between entities in the real world is reflected by their closeness in the text of the documents in which they are mentioned, and that two entities are socially linked if they frequently co-occur in a corpus. To identify the entities linked with a target entity e, we extract the co-occurring entities in the neighbourhood of entity e. To do this, we first identify the context window S(e) as defined in Eq. (1). Let
Profile enrichment
The sparse data contained in discourse profiles may not be sufficient to resolve ambiguities, and the robustness of the system degrades when the AE component produces low-quality output. For these reasons, we propose to enrich the persons’ profiles with global attributes extracted from a distant knowledge base. Profile enrichment attempts to alleviate the problem of data sparseness and improve the robustness of AE. It includes two steps: (i) entity linking, and (ii) attribute extraction. In the entity linking step, for an entity mention e, we determine its identity in the text in order to identify the best matching entity in the distant knowledge base. In the attribute extraction step, we retrieve the global attributes of the target person e from the distant knowledge base. These attributes go beyond the local discourse profiles.
Entity linking phase takes as input the target entity mention e and the context S(e) (Eq. 1) around it. It identifies the entity mentions in neighbourhood of entity e and forms a list of co-occurring entity mentions
We primarily rely on Babelfy itself to identify the correct identity of the target entity e. Babelfy may produce noisy output, because in some cases it cannot infer the correct identity of entities. Therefore, rather than depending solely on the output of Babelfy to decide whether the retrieved external entity t best matches the target entity e, we rank the candidate entity t by a similarity measure and prune candidates with low confidence. In the similarity computation, we first compare the type tags of the entities e and t. If the type tags of e and t differ, we ignore the external entity t; otherwise we compare the attributes of entity t with the local attributes of the target entity e. For this purpose, we compute the normalized similarity between entities t and e based on their attributes:
In general, each attribute class

An example of mapping the sample Web page given in Fig. 2 to attribute-relationship graph; (a) discourse profile extracted by our profiling system and enriched with external global attributes; (b) attribute-relationship graph. Note that the weights of attribute edges and structure edges are not given here.
For the attribute date of birth, we first normalize the date values using the Stanford SUTime library [20] and then compute the similarity. Borrowing the idea presented in [23], to compare date of birth values we first convert dates into a number of days. We calculate the number of days relative to the fixed date 01-01-2016. Let
The results of our experiments on the given datasets show that the normalized Levenshtein metric is appropriate for the attributes affiliation, award, degree, nationality, occupation, and school; the Cosine similarity metric for the attributes relatives, phone, fax, e-mail, and website; the Jaccard index for the attribute mentor; and the Dice coefficient for the attribute birth place. For some attributes, two or more similarity measures gave nearly the same results. For the attribute other name, one can use either Dice’s coefficient or the Cosine similarity measure. For the attribute major, the choice of similarity measure makes no difference; in our implementation, we used the normalized Levenshtein metric. Note that for the attribute date of birth, we only use the Spd measure.
The graph creation component takes as input the social relationship graph and the enriched discourse profiles associated with each ambiguous person name and maps them into an undirected graph G. Graph G is an undirected weighted graph summarizing all of the information about entities contained in the given Web text documents and the given distant knowledge base. The remaining components of the name disambiguation system work with this rich graph instead of the Web text documents. This enables us to use efficient graph mining algorithms for the name disambiguation task. Figure 5 shows an example of mapping people’s attributes and social links extracted from the Web page given in Fig. 2 into a graph. Figure 5(a) shows the discourse profiles extracted by the profile extraction system. In Fig. 5(a), the local attributes are shown in black and the external attributes obtained by profile enrichment are shown in blue. The profile information and social links are mapped to the graph shown in Fig. 5(b). In Fig. 5(b), the structure node corresponding to the target person “Dan Jurafsky” is shown as a filled rectangle, the structure nodes for other persons as rectangles, the structure nodes corresponding to organizations as triangles, attribute nodes as round ellipses, structure edges as solid lines, and attribute edges as dotted lines.
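The per-attribute choice of string similarity described above can be sketched as a dispatch table. The tokenization (lower-cased whitespace tokens), the choice of Dice for other name, and the function names are assumptions made for illustration; the date of birth measure is left out.

```python
import math
from collections import Counter


def levenshtein_sim(a: str, b: str) -> float:
    """Normalized Levenshtein similarity: 1 - edit_distance / max_length."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


def _tokens(s: str):
    return s.lower().split()


def jaccard_sim(a: str, b: str) -> float:
    x, y = set(_tokens(a)), set(_tokens(b))
    return len(x & y) / len(x | y) if (x | y) else 1.0


def dice_sim(a: str, b: str) -> float:
    x, y = set(_tokens(a)), set(_tokens(b))
    return 2 * len(x & y) / (len(x) + len(y)) if (x or y) else 1.0


def cosine_sim(a: str, b: str) -> float:
    x, y = Counter(_tokens(a)), Counter(_tokens(b))
    dot = sum(x[t] * y[t] for t in x)
    norm = math.sqrt(sum(v * v for v in x.values())) * math.sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0


METRIC_FOR = {
    **dict.fromkeys(["affiliation", "award", "degree", "nationality",
                     "occupation", "school", "major"], levenshtein_sim),
    **dict.fromkeys(["relatives", "phone", "fax", "e-mail", "website"], cosine_sim),
    "mentor": jaccard_sim,
    "birth place": dice_sim,
    "other name": dice_sim,  # Cosine similarity would serve equally well here
}


def attribute_similarity(attribute: str, a: str, b: str) -> float:
    return METRIC_FOR[attribute](a, b)
```

Keeping the metric choice in a single table makes it easy to swap a measure for one attribute without touching the rest of the comparison code.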
Graph clustering
Our procedure for clustering takes as input the attribute-relationship graph G, and uses the recently proposed BIC-Means [40], an efficient graph clustering algorithm to partition the graph into a set of disjoint clusters
Once we have clustered the set of discourse profiles for a certain person e, the next step is to aggregate the attributes found within the cluster belonging to that person and form his/her corpus-level profile. In other words, the distributed information about a given person is integrated to form a unified, enriched corpus-level profile. In the integration step, we design inference rules to extract further implicit information. An example of such a rule is given by the following formula:
Relatives(cID, x, y) ∧ Relatives(cID, y, z) ∧ Person(cID, x) ∧ Person(cID, y) ∧ Person(cID, z) ⇒ Relatives(cID, x, z)
where cID indicates the cluster id. This rule infers that there is a relative relation between persons x and z. We designed 19 such rules to infer implicit information contained in discourse profiles.
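As a rough illustration, one application of the transitivity rule on relatives can be written as follows. The rule is treated here as a hard constraint, the person names are hypothetical, and the guard against self-relations is an added assumption that is not part of the formula.

```python
def infer_relatives(relatives, persons):
    """Apply the relatives transitivity rule once within each cluster.

    relatives: set of (cluster_id, x, y) facts, i.e. Relatives(cID, x, y)
    persons:   set of (cluster_id, name) facts, i.e. Person(cID, name)
    Returns only the newly inferred Relatives facts.
    """
    inferred = set()
    for c1, x, y in relatives:
        for c2, y2, z in relatives:
            if c1 != c2 or y != y2:
                continue
            if x == z:  # assumption: skip trivial self-relations
                continue
            if all((c1, p) in persons for p in (x, y, z)):
                fact = (c1, x, z)
                if fact not in relatives:
                    inferred.add(fact)
    return inferred
```

Calling the function repeatedly until it returns an empty set would compute the full closure under this rule.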
Experiments and results
To evaluate our cross-document person profiling system comprehensively, we conduct experiments at two levels: the component level and the system level.
Component-level evaluation
At the component-level evaluation, we evaluate our approaches in three phases: evaluation of pre-processing, evaluation of attribute extraction, and evaluation of name disambiguation.
Experiments of the pre-processing components
As mentioned before, the pre-processing tools may produce errors that degrade the performance of subsequent components. For our implementation, we adopted the Stanford named entity recognizer. It achieves an F1 score of 92.29% on person names, 88.51% on locations and 81.72% on organizations on the CoNLL-2003 evaluation set [36]. We employed the Stanford co-reference resolution system, which achieves a 54.62% average F1 score on the CoNLL 2011 shared task dev dataset.
To evaluate the AE system, we used two benchmark datasets: WePS-2 test dataset
In the remainder of this paper, WAE is short for structured-style Web-specific patterns, VAE for verb-based AE, and NAE for noun-based AE. Table 4 shows the macro-averaged performance scores
The macro-averaged score is obtained by computing the performance score for every test set (person entity) and then averaging over all test sets.
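A minimal sketch of macro-averaging follows, assuming that precision and recall have already been computed per person entity and that the macro F1 is the mean of the per-entity F1 scores; the function names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_average(per_entity):
    """per_entity maps a person entity to its (precision, recall) pair.

    Each score is computed per entity first and then averaged over entities,
    so every entity has the same influence regardless of how many attribute
    values it has.
    """
    n = len(per_entity)
    precision = sum(p for p, _ in per_entity.values()) / n
    recall = sum(r for _, r in per_entity.values()) / n
    f1 = sum(f1_score(p, r) for p, r in per_entity.values()) / n
    return precision, recall, f1
```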
Performances of our AE methods on the WePS-2 test dataset
Performances of our AE methods on the Wikipedia test dataset
Table 6 shows the detailed performance for the individual attributes obtained by our approaches on the WePS-2 test dataset. Table 7 presents the performance on the Wikipedia test dataset. Tables 6 and 7 clearly show the contribution of each AE method to each particular attribute class. In the tables, the term “Same” means that the performance of the target AE method is the same as that of the previous one.
Performances of attributes obtained by our AE methods on the WePS-2 test dataset in terms of F1 score
We use Stanford co-reference system to extract the attribute of “Other name”.
Performances of attributes obtained by our AE methods on the Wikipedia test dataset in terms of F1 score
For the WePS-2 test dataset (Table 6), we find that the WAE method achieves good performance for the attributes “email”, “phone”, “fax”, and “website”. This is because the instances of these attributes are often expressed with fixed, easily predictable patterns in structured-style fragments. Similarly, the VAE method achieves good performance for the attributes “birth place”, “date of birth”, and “occupation”, because their instances follow a specific format and occur in sentences with a verb predictive of the target attribute. The performance scores of some attributes such as “nationality”, “affiliation”, “relatives”, “occupation”, and “degree” improve after incorporating the NAE method. This is consistent with the notion that the majority of instances of these attributes come from noun-based constructions. Note that integrating different AE methods improves the performance of only some specific attributes, because each AE method is appropriate for only some types of attributes. Thus, in the final combination, incorporating some AE methods for some attributes does not significantly affect the performance. This is evident from Table 6, where the attributes “email”, “phone”, “fax”, and “website” achieve good performance using the WAE method alone, and their scores remain unchanged even after incorporating the VAE and NAE methods.
Nonetheless, the final hybrid AE complements the performance of some attributes such as “nationality”, “degree”, and “occupation”, and further improves the overall performance. We observe that for the attributes “major”, “mentor”, “affiliation”, “relatives”, “school”, and “nationality” in the WePS-2 dataset, none of the methods achieves good performance. This shows that more robust methods are required for these sorts of attributes. For the Wikipedia test dataset (Table 7), which mainly consists of formal-style text, incorporating the VAE and NAE methods significantly improves the performance scores. This shows that our formal-style AE methods perform fairly well on formal-style fragments.
We compare our AE approach with five state of the art approaches achieving the best published results for benchmark datasets. The counterparts include UvA_2 [11], CASIANED [38], Chen et al. method [21], Yu and Lam method [74], and
Performances of our AE systems and state of the art methods on the WePS-2 test dataset
Performances of our AE systems and the best prior method on the Wikipedia test dataset
Table 9 compares our approach on the Wikipedia test dataset against the method of Yu and Lam [74]. As shown in Table 9, our approach achieves an F1 score of 74.52%, an improvement of about 1.35 percentage points in F1. This shows that our integrated approach is also effective for the extraction of attributes from homogeneous Web corpora.
To compare our NAE method with
Performances of our NAE method and IMPLIE system on sampled sentences from WePS-2 test and Wikipedia test dataset
The performance of AE system with considering semantic enrichment (AE+) and without considering semantic enrichment (AE−) on WePS-2 test dataset, in terms of F1 score
In summary, the counterpart approaches cannot fully extract the attribute values contained in noun-based and verb-based constructions. Furthermore, these methods have problems with multi-word attribute values. In this paper, we addressed this problem by adopting semantic analysis of the text. The results indicate that combining verb-based AE, noun-based AE, and Web-specific patterns with deep semantic analysis of the text is effective in increasing the performance of the AE system.
According to the results given in Tables 4–10, the results look promising but are not ideal. This shows that Web person AE is far from being solved and that more effort is needed in this respect. Our manual investigation of incorrect extractions shows that the performance of AE can be raised by the following work:
Creating more robust AE methods: overall, existing AE methods report low F1 scores for some of the attributes in question. This indicates that AE in Web documents is still a big challenge. Obviously, the more precise the AE methods are, the higher the performance scores are. Therefore, investing more effort in the development of more robust AE methods will improve the system performance. Our AE approach provides a flexible framework into which other, more robust AE modules can be incorporated to improve the result.
The performance of AE system with considering semantic enrichment (AE+) and without considering semantic enrichment (AE−) on Wikipedia test dataset in terms of F1 score
Improving the performance of pre-processing components: our manual investigation reveals that almost half of the incorrect extractions were caused by errors in the pre-processing and semantic analysis stages, and not by the AE method itself. Errors in the pre-processing and semantic analysis stages propagate to the AE step and cause wrong extractions. For example, errors of the named entity recognition system prevent our AE system from identifying the boundaries of attribute values accurately. The low performance of the pre-processing stages is a bottleneck for effective AE. However, improving pre-processing and semantic analysis is orthogonal to our problem and therefore out of the scope of this paper. Nonetheless, to alleviate the errors of the semantic analysis stage, we enrich the analyzed text with semantic information extracted from a distant ontology. Table 11 shows the effect of semantic enrichment on the WePS-2 test dataset. Table 12 shows the results for the Wikipedia test dataset. In the tables,
Enriching discourse profiles: in this paper, we focused on the extraction of attributes only from the given Web pages in the input corpus. Since Web pages are often noisy and irregular and contain incomplete information about entities, the AE system cannot extract rich profile attributes. One promising way to improve the result and alleviate the issue of data sparseness is to enrich the local discourse profile with semantic information inferred from distant knowledge bases. We use this possibility in name disambiguation, where it leads to more robust name resolution. However, we do not consider the distant attributes in the evaluation of the AE system, because we aim to answer the question “how much information can be extracted if the AE system uses only the information contained in the input corpus?”
We used two standard datasets as benchmarks to evaluate our name disambiguation system: the WePS-1 test dataset [5] and the WePS-2 test dataset [7]. These datasets provide a real corpus that can test a disambiguation system on personal names with varying ambiguity and in different domains. Both the WePS-1 and WePS-2 test datasets consist of 30 Web page collections, each corresponding to one ambiguous person name. Each Web page collection consists of the N top ranked Web pages (100 in the WePS-1 test dataset and 150 in WePS-2). We conducted evaluations using two types of scoring measures: the B-cubed scoring measure and the purity-based scoring measure. We use three B-cubed measures including B-cubed precision (
Performances of our name disambiguation approaches on WePS-1 test datasets
Table 13 shows the results obtained by our name disambiguation methods on the WePS-1 test dataset. Table 14 shows the results for the WePS-2 test dataset. To indicate the effect of each clustering feature, in Tables 13 and 14 we begin with the feature of social links and then add the features of local discourse profile attributes and external attributes one by one. The results clearly show the effect of profile enrichment and of integrating attributes with social relationships. In Tables 13 and 14, the performance increases consistently as more clustering features are incorporated. The final feature model (social links + local attributes + external attributes) achieves the best performance. The results suggest that the majority of the information needed for name disambiguation is given in the Web pages being processed. However, incorporating the external attributes improves performance further.
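For readers unfamiliar with the B-cubed measures used in this evaluation, the following sketch computes the standard B-cubed precision, recall and their harmonic mean for non-overlapping clusters. It is an illustration of the textbook definition, not the official WePS scorer, and the function and variable names are assumptions.

```python
def b_cubed(system, gold):
    """B-cubed precision, recall and F for non-overlapping clusterings.

    system, gold: dicts mapping each document id to a cluster label.
    For every document, precision is the fraction of documents in its system
    cluster that share its gold cluster; recall is the fraction of documents
    in its gold cluster that share its system cluster. Both are averaged
    over all documents.
    """
    docs = list(gold)
    precision = recall = 0.0
    for d in docs:
        same_system = {o for o in docs if system[o] == system[d]}
        same_gold = {o for o in docs if gold[o] == gold[d]}
        correct = len(same_system & same_gold)
        precision += correct / len(same_system)
        recall += correct / len(same_gold)
    precision /= len(docs)
    recall /= len(docs)
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

Putting all pages of a name into a single cluster yields perfect recall but low precision, which is why both measures are reported together.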
Performances of our name disambiguation approaches on WePS-2 test datasets
Comparison of results obtained by baselines and our method on WePS-1 test dataset
Comparison of results obtained by baselines and our method on WePS-2 test dataset
Comparison of results obtained by state-of-the-art methods and our method on WePS-1 test and WePS-2 test dataset
Table 15 shows the best performance obtained by the baselines and our method on the WePS-1 test dataset. Table 16 shows the results for the WePS-2 test dataset. As shown in Tables 15 and 16, our method clearly outperforms the baseline methods for both datasets in terms of both
Table 17 summarizes the average performance obtained by our proposed method and four state of the art methods on the benchmark datasets. We compared the results obtained by our method with those reported by Han and Zhao [39] and Chen et al. [27] on the WePS-1 and WePS-2 test datasets, and by Dutta and Weikum [30] and Yerva et al. [73] on the WePS-2 test dataset. Note that the comparison is not exact, because the mentioned algorithms were implemented and tested with different settings on machines with different processing characteristics. In Table 17, B-cubed scores are missing for Han and Zhao [39] because they did not report them. As shown in Table 17, our approach performs well on the datasets, exceeding or matching the best performance obtained by the state of the art methods in terms of
At the system level, we evaluate the performance of the end-to-end aggregated EP system. We use the standard WePS-3 test dataset [4] as our benchmark. In the experiments, we conducted evaluations in terms of precision, recall and
Performance of state of the art and our EP approach on WePS-3 dataset
In this paper, we developed a cross-document person profiling system that extracts the persons in question and their relevant information from Web pages. We mainly focused on the core task of the profiling system, namely attribute extraction. For attribute extraction, we use different information extraction methods according to the Web text expression format and the attribute type. Our approach relies on deep semantic analysis of the text and utilizes syntactic parsing and semantic role labelling for attribute extraction. Our attribute extraction approach can extract both explicit and implicit attributes within and across sentences. The resulting profile information is meaningful and linked to existing knowledge bases, which facilitates multi-lingual information processing. Experiments show that our proposed methods for the profiling subtasks achieve respectable results. There are several interesting directions for future work. One of the most promising is to use semantic-based machine learning techniques to automatically learn MLN rules for attribute extraction with a minimum of training and human knowledge engineering effort. Second, since the final results of the entity profiling system depend on the performance of four subtasks (pre-processing, semantic analysis, attribute extraction, and name disambiguation), improving the performance of these pre-requisites can eventually improve the overall quality of the system. Third, we plan to develop a multi-lingual entity profiling system that can extract all entity-related information from multi-lingual text documents. Finally, we plan to generalize our approach to profile generic entity types, where entities are not limited to people and the entity’s profile schema is not specified in advance. We expect that the generalized system can work with minimal training data for a new domain.
