A hybrid approach to domain-independent taxonomy learning

Abstract

Creating domain ontologies is usually performed by teams of knowledge engineers and domain experts, and is considered to be a time-consuming and difficult task. As a result, scientists have started to develop automatic approaches to ontology learning and population. In this research, we focus on a central subtask of ontology learning, namely hypernym detection, in which the system has to detect hierarchical semantic relationships, i.e. hypernym–hyponym relationships, between domain-specific terms, resulting in a domain-specific taxonomy.

We propose in this paper a hybrid approach to automatic taxonomy learning, which combines a data-driven and a knowledge-based component. The data-driven component is composed of a lexico-syntactic pattern-based module, a morpho-syntactic analyzer and a distributional model, whereas the knowledge-based component extracts structured semantic information from the Linked Open Data cloud (DBpedia) and WordNet. The proposed methodology has been applied to three different knowledge domains: viz. food, equipment and science. A thorough quantitative and qualitative evaluation has shown promising results for all considered test domains. In addition, the results show a clear contribution of all different modules to the automatic taxonomy learning task. Although there is still room for improvement for all different modules, our approach outperforms state-of-the-art systems that participated in the SemEval “Taxonomy Extraction Evaluation” task when it comes to comparing the automatically constructed taxonomy against a manually verified gold standard taxonomy. As all modules are run automatically, the system provides a flexible and domain-independent approach to automatic taxonomy learning and could be an important step in solving the knowledge acquisition bottleneck in ontology learning.

Keywords

Taxonomy construction, taxonomy learning, hypernym detection

1. Introduction

Ontologies have proven indispensable for enriching text with semantic information or “meaning”, as they have been successfully applied to various natural language processing (NLP) tasks such as word sense disambiguation (e.g. to distinguish between stone (mass of hard consolidated mineral matter) and the Rolling Stones (music band)) (Alexopoulou et al., 2009) or coreference resolution (e.g. to link UN with United Nations) (Prokofyev et al., 2015), as well as to different language technology applications such as efficient information retrieval (Liang et al., 2006) or advanced online question answering services (by expanding on the queries) (Ray et al., 2009).

From a business perspective as well, ontologies and user-specific taxonomies appear to be very useful (Azevedo et al., 2015). Companies often wish to build their own mono- or multilingual enterprise semantic resources containing all relevant sector- and company-specific terminology, which allows them to standardize the company’s language use and speeds up translation processes. In addition, these resources help shorten the learning curve for new employees, who can teach themselves a new domain, and improve the effectiveness of definitions and explanations in technical writing (e.g. by grouping products in product types or hierarchically structured families) (Wright and Budin, 2001). Whereas automatic terminology extraction from domain-specific data is a well-researched NLP task (Bernhard, 2006; Zhang et al., 2012; Lefever et al., 2009), organizing the resulting terminology into a hierarchically structured taxonomy remains a very challenging task.

Despite the recognized value of ontologies, globalization and rapid technological evolution have made it virtually impossible to manually create and manage ontologies for a large variety of scientific and technological (sub)domains (Cross and Bathijaa, 2010). Because manual ontology creation is such a cumbersome and expensive task, researchers have started to investigate how terminological and semantically structured resources such as ontologies or taxonomies can be automatically constructed from text (Biemann, 2005).

Biemann (2005) defines ontologies as “specifications of shared conceptualizations of a domain of interest” that build upon a hierarchical backbone, a definition that goes back to Gruber (1993). Sowa (2000) further states that a formal ontology is “specified by a collection of names for concept and relation types organized in a partial ordering by the type-subtype relation”. Ontologies can then be further classified according to the way their subtypes are distinguished from their supertypes:

Axiomatized ontologies are conceptualizations whose categories are distinguished by axioms and definitions stated in formal language. Examples of axiomatized ontologies include the specification of a database schema in OWL.

Prototype-based ontologies distinguish subtypes by a comparison with a typical member or prototype for each subtype rather than by axioms and definitions in logic.

Terminological ontologies are also specified by subtype-supertype relationships but describe concepts by concept labels or synonyms rather than prototypical instances. In addition, their categories do not need to be fully specified by axioms and definitions.

A well-known example of a terminological ontology is WordNet (Fellbaum, 1998), whose categories are specified by relations such as hypernymy or part-whole relations, which determine the relative positions of the concepts with respect to one another without completely defining them. The advantage of using terminological ontologies is that they enable a connection between the formal representation of the domain and the language used to refer to domain concepts within text (Velardi et al., 2013). As stated by Peters (2013), terminological and formal ontologies maintain a rather uneasy relationship; both models partially overlap and complement each other, which results in linguistic confusion. This viewpoint is shared by Grabar et al. (2012), who believe that the increasing interest in ontologies has made the distinction between ontologies and other semantic resources such as taxonomies and terminologies somewhat fuzzy.

As a result, a lot of ongoing work is now focusing on representing lexical knowledge within formal ontologies. Examples are OpenMinTeD (Peters, 2016), the Open Mining Infrastructure for Text and Data, which aims at enabling interoperability between linguistic and ontological knowledge by adhering to existing Linked Data standards (e.g. populated RDF or OWL models), and the Lemon framework (McCrae et al., 2012), a model for sharing lexical information on the semantic web. In contrast with this work, the aim of our research is to hierarchically organize a domain-specific term list in a fully automated way, without specifying the categories of the taxonomy by axioms and definitions.

To avoid confusion, we briefly review the most important notions and their definitions as adopted in this research. Terms are considered as lexical units or “the words that are assigned to concepts used in the special languages that occur in subject-field or domain-related texts” (Wright, 1997). Terminologies are then defined as sets of terms that represent the system of concepts for a specific domain, whereas ontologies describe a system of concepts and its associated properties for a specific area, built upon formal specifications and constraints. Taxonomies, finally, are collections of terms that are arranged hierarchically. In the proposed research, these hierarchical relations cover both the ontological IS-A relations (e.g. a “dog” is an “animal”) and the Instance-of relations (e.g. “China” is an instance of the general concept “country”). As there is only one type of relationship between these terms, namely the hypernym–hyponym relation, and we do not specify the categories of our taxonomy by axioms and definitions, our notion of taxonomy is most closely related to the definition of terminological ontologies as specified by Sowa (2000). An important difference, however, is that we build hierarchies of terms, which in turn denote concepts, whereas Sowa considers terminological ontologies as hierarchies of types. A limitation of our term-driven approach is that we do not resolve cases of polysemy, where the same term denotes different concepts (e.g. bank as a “financial institution” or “sloping land”). Given the “one sense per discourse” theory (Gale et al., 1992), however, we believe that terms occurring in the same context or specialized domain are nearly always used with the same sense. As a result, the impact of polysemy is expected to be limited, but not negligible, for domain-specific taxonomies.

This paper presents a multi-modular domain-independent taxonomy learning system to detect hypernym–hyponym relations between terms. To this end, we build upon prior research (Lefever, 2015) and combine a previously developed pattern-based module, morpho-syntactic analyzer and WordNet module with a newly developed DBpedia module and neural networks module. The contribution of this research is twofold. Firstly, the implementation of different approaches to automatic taxonomy learning allows us to perform a thorough analysis and comparison of the different methodologies. In addition, the evaluation of all modules against the same gold standard for different test domains provides solid insights into their weaknesses and strengths. Secondly, by combining the separate modules in a hybrid system, we can overcome shortcomings of individual modules and obtain very promising results when comparing the hybrid system with other state-of-the-art systems for different test domains.

The remainder of the paper is organized as follows: in Section 2 we give an overview of related research. Section 3 explains in great detail the different modules of our taxonomy learning system, while Section 4 describes the experimental setup and results on different data sets from the SemEval-2015 “Taxonomy Extraction Evaluation” competition. These results consist of both an evaluation of our system against a gold standard taxonomy for three test domains (viz. food, equipment and science) and a comparison of its performance with state-of-the-art taxonomy learning systems. In Section 5, we summarize our main findings and suggest directions for future work.

2. Related research

The task of automatic taxonomy learning consists in finding hypernym–hyponym relations between terms of a given domain of interest. Different approaches have been proposed to automatically detect hierarchical relationships between (domain-specific) terms.

Hearst’s pattern-based approach (Hearst, 1992) deploys a list of lexico-syntactic patterns able to identify hypernym pairs in text. An example of these manually defined Hearst patterns is “NP {, NP} ∗ {,} or other NP”, as in “Bruises, wounds, broken bones or other injuries”, which results in three hypernym pairs, being (injury, bruise), (injury, wound) and (injury, broken bone).

The lexico-syntactic approach has been applied and further extended for English (Pantel and Ravichandran, 2004) and various other languages such as Romanian (Mititelu, 2008) and French (Malaisé et al., 2004). In addition, Hearst’s method has also been applied to technical texts. Oakes (2005) implemented lexico-syntactic patterns to automatically detect hypernym relations in a pharmaceutical corpus. While some researchers have defined these lexico-syntactic patterns manually (Kozareva et al., 2008), statistical and machine learning techniques have also been deployed to automatically extract and extend the list of patterns and to train hypernym classifiers (Ritter et al., 2009). Kozareva and Hovy (2010), for instance, have proposed a weakly supervised bootstrapping algorithm that starts from a seed hypernym–hyponym pair in a doubly anchored hyponym pattern to learn new hyponym and hypernym terms. Their method looks for the seed hypernym and hyponym term in conjunction with another noun, which is then identified as an additional hyponym for the seed hypernym. For example, given the input pair (bruises, injuries), the method looks for occurrences of patterns like injuries like bruises and other NP, where the nouns instantiating NP can be considered as additional hyponyms of injuries. In contrast to Hearst (1992), who also proposes a simple algorithm to learn new patterns by bootstrapping from patterns found by hand, or by bootstrapping from an existing lexicon or knowledge base, Kozareva and Hovy (2010) introduce recursive patterns that use only one seed to harvest the arguments and supertypes of a wide variety of relations with little supervision.

The pattern-based approach has two well-known weaknesses. Because of the strict syntactic (and sometimes lexicalized) constraints imposed by the predefined patterns, the coverage of this approach is assumed to be rather low. In addition, some patterns tend to overgenerate and introduce noise in the output. We have tried to address this issue by only implementing patterns with high precision in our pattern-based module.

Other researchers have applied a distributional approach to find hypernym pairs in text (Caraballo, 1999; Van der Plas and Bouma, 2005; Lenci and Benotto, 2012; Fu et al., 2014). Distributional approaches start from the assumption that semantically related words tend to occur in similar contexts (Firth, 1957; Harris, 1968), be it semantically or syntactically similar contexts. These methods approach the taxonomy learning task as a clustering task, where semantically similar words are clustered together. The hierarchical structure of the clustering can then be used to express the hypernym–hyponym relation between terms. An extension of this approach is the distributional inclusion hypothesis (Weeds and Weir, 2003), which has been the inspiration to use directional (or asymmetric) similarity measures to detect hypernym pairs (Lenci and Benotto, 2012). As the hyponym term is typically a semantically narrower term than the hypernym, a significant number of salient distributional features of the hyponym term is included in the context vector of the hypernym term. Santus et al. (2014) further elaborate on the idea that hypernyms are semantically more general than hyponyms and consequently occur in less informative contexts. To identify the hypernym in the hypernym–hyponym pairs, they measure the semantic generality of the terms by the entropy of their statistically most prominent contexts.
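The intuition behind directional similarity measures can be illustrated with a minimal sketch. It computes a precision-style inclusion score in the spirit of the distributional inclusion hypothesis: the share of the narrower term's feature weight that falls on contexts also observed for the broader term. The function name, the toy context vectors and their weights are hypothetical and serve only as an illustration; they are not taken from any of the cited systems.

```python
def inclusion_score(narrow, broad):
    """Share of the feature weight of `narrow` that lies on features
    also present in `broad` (both are dicts: context feature -> weight)."""
    total = sum(narrow.values())
    if total == 0:
        return 0.0
    shared = sum(weight for feature, weight in narrow.items() if feature in broad)
    return shared / total


# Toy context vectors (hypothetical weights, e.g. association scores).
dog = {"bark": 2.0, "eat": 1.0, "tail": 1.0}
animal = {"eat": 3.0, "tail": 1.0, "live": 2.0, "bark": 0.5}

dog_in_animal = inclusion_score(dog, animal)   # all contexts of "dog" are covered
animal_in_dog = inclusion_score(animal, dog)   # "live" is not a context of "dog"
```

Because the score is asymmetric, a high value in one direction combined with a lower value in the other suggests that the first term is the narrower one, which is the property exploited for hypernym detection.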

More recently, the potential of word embedding spaces to predict hypernyms has also been investigated. Word embeddings are word representations computed using neural networks, resulting in word vectors that represent the distribution of the contexts in which the target word appears. Fu et al. (2014) construct semantic hierarchies based on word embeddings, which can be used to measure the semantic relationship between words. They report F-scores of 73.74% for a manually labeled set of hypernym–hyponym relations. Rei and Briscoe (2014) evaluate how well different vector space models and similarity measures perform on the task of hyponym generation. They conclude that simple window-based vectors perform just as well as the ones trained with neural networks, but that dependency-based vectors outperform all other vector types.

Lately, focus has shifted towards supervised distributional methods, where candidate hypernym–hyponym pairs $(x, y)$ are represented by a combination of their embedded vectors $(\vec{x}, \vec{y})$ and a classifier is trained to predict hypernymy relations. Baroni et al. (2012) represent the two terms as the concatenation of their latent dimension vectors $(\vec{x} + \vec{y})$ , whereas Roller et al. (2014) propose a logistic regression model trained on difference vectors $(\vec{y} - \vec{x})$ . The difference between two terms on a given dimension is then also supposed to capture the degree of distributional inclusion on that dimension.
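The two pair representations mentioned above differ in what the classifier gets to see. The following minimal sketch, using plain Python lists and hypothetical toy vectors, shows that concatenation keeps both vectors side by side (yielding 2d features), whereas the difference representation keeps one value per dimension (d features). The function names are illustrative and do not correspond to the implementations of the cited papers.

```python
def concatenation_features(x, y):
    """Represent the pair (x, y) by concatenating both vectors (length 2d)."""
    if len(x) != len(y):
        raise ValueError("both vectors must have the same dimensionality")
    return list(x) + list(y)


def difference_features(x, y):
    """Represent the pair (x, y) by the element-wise difference y - x (length d)."""
    if len(x) != len(y):
        raise ValueError("both vectors must have the same dimensionality")
    return [y_i - x_i for x_i, y_i in zip(x, y)]


# Hypothetical 3-dimensional embeddings of the two candidate terms.
x = [1.0, 2.0, 0.5]
y = [3.0, 2.5, 0.5]

concatenated = concatenation_features(x, y)
difference = difference_features(x, y)
```

Either feature vector can then be passed, together with a label indicating whether the pair is a hypernym pair, to a standard supervised classifier.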

As distributional methods are able to find implicit hypernym relations in text, they obtain a higher coverage for hypernym detection than pattern-based approaches. On the other hand, they suffer from lower precision scores, since they often have difficulty determining the exact nature of the semantic relationship (synonymy, part-whole, hypernymy, antonymy, etc.) between the terms appearing in the same cluster of distributional space. The more recent supervised approaches, however, claim to be able to distinguish between the different types of semantic relations (Roller et al., 2014). Moreover, Shwartz et al. (2016) propose a hybrid system, where dependency paths (i.e. an extension of Hearst’s patterns) are encoded using a recurrent neural network, and show that the combination of lexico-syntactic paths and distributional information obtains state-of-the-art results on the task of hypernym detection.

The morphological structure of terms has also been used to extract hypernym–hyponym pairs from compound terms (Tjong Kim Sang et al., 2011). These morpho-syntactic approaches have proven very fruitful for technical texts, where a large number of the domain-specific terms are compounds (Lefever et al., 2014). The morpho-syntactic approach is based on the head-modifier principle (Sparck Jones, 1979), which states that the linear arrangement of the compound parts expresses the kind of information being conveyed, the head referring to the more general semantic category, while the modifiers restrict the sense of the compound term. This idea can be easily transferred to the hypernym relation, where the full compound is considered the hyponym and the head term of the compound the hypernym (e.g. grapefruit is a kind of fruit and apple cake is a kind of cake). Hippisley et al. (2005) applied this principle to automatically extracted terms from English and Chinese text. They revealed its use for information retrieval and the automatic induction of semantic lexicons, where seed words act as representatives of a semantic class and the head-modifier principle is used to suggest hypernym–hyponym relations between the seed word and the other elements in compound constructions (e.g. if bomb belongs to the weapons class, car bomb can be automatically added to the weapons class as well). For this research, we have further developed these insights in order to use morphological information to detect hypernym–hyponym pairs in domain-specific compounds, both for single and multi-word terms.

In addition to the emergence of data-driven approaches to taxonomy construction, other approaches use heuristics to extract hypernym relations from structured (collaborative) resources such as Wikipedia. Ponzetto and Strube (2011) use the Wikipedia categorization system as a semantic network and present methods for generating a large scale taxonomy by automatically assigning hypernym labels to the relations between categories. To this end, they use methods based on connectivity in the network and lexico-syntactic patterns to label the relations between categories. Navigli and Velardi (2010) use word class lattices, or directed acyclic graphs, to develop a pattern generalization algorithm trained on a manually annotated training set, which is able to extract definitions and hypernyms from web documents. The approach was further developed by Faralli and Navigli (2013) to train the algorithm on automatically extracted training corpora from Wikipedia.

The Linked Open Data cloud also offers new possibilities for domain-specific applications, as it contains millions of concepts from over one hundred structured data sets (Meij et al., 2011). As such, Linked Open Data can also be exploited to build domain ontologies, as has been done by Dastgheib et al. (2013) in the area of biomedicine. Chiarcos et al. (2013) use linked data principles to publish and interlink two linguistic resources, namely WordNet and the annotated MASC corpus, openly on the web. Modeling these different types of linguistic resources using RDF and OWL makes it possible to represent them in a uniform way and improves their interoperability. Hellmann et al. (2013) developed the NIF (NLP Interchange Format), an RDF/OWL-based format aiming to improve interoperability between NLP tools, language resources and annotations. The NIF specification has known implementations for 30 different NLP tools, amongst which DBpedia Spotlight (Mendes et al., 2011), which will be used in this research. The DBpedia community project (Lehmann et al., 2014) extracts structured, multilingual knowledge from Wikipedia and makes it available via Semantic Web and Linked Data standards. This way, DBpedia enables large-scale knowledge extraction from crowd-sourced content repositories, which makes it very useful to support NLP tasks such as entity disambiguation or question answering (Mendes et al., 2012). In addition, a number of specialized data sets have been created for specific NLP tasks (Lexicalization dataset) and a large number of applications and tools have been built around DBpedia, which also provides links to more than 30 external data sets, among others GeoNames (http://www.geonames.org/), OpenCyc (http://datahub.io/nl/dataset/opencyc), US Census (https://www.census.gov/econ/currentdata/datasets/), WordNet (http://datahub.io/dataset/w3c-wordnet) and YAGO2 (http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/). Moreover, the DBpedia project has resulted in a community-curated DBpedia ontology (maintained and extended by the community in the DBpedia Mappings Wiki), which consists of 320 classes which form a subsumption hierarchy and are described by 1,650 different properties.

The great interest in automatic taxonomy learning is also reflected in the set-up and success of the Taxonomy Extraction Evaluation Task (Bordea et al., 2015), which has been organized in the framework of SemEval-2015 (http://alt.qcri.org/semeval2015/), an international evaluation of computational semantic analysis systems. The task is concerned with automatically finding relations between pairs of terms and organizing them in a hierarchical structure. This way, the task assumes that a list of domain-specific terms is already available in order to focus on the relation detection between terms. The resulting taxonomies are evaluated through comparison with gold standard relations collected from WordNet (Fellbaum, 1998) and other existing taxonomies and classification schemes. In addition, expert evaluation has been performed by pooling a subset of the relations submitted by the participants. The great value of this SemEval “Taxonomy Extraction Evaluation” task is that it not only provides domain-specific data sets to work with, but that it also offers gold standard taxonomies and evaluation metrics, which make it possible to benchmark different taxonomy learning systems on the same data sets.

3. System description

We present a hybrid system combining data-driven and knowledge-based approaches to detect hypernym relations between domain-specific terms. The system combines four main components: a lexico-syntactic pattern-based approach, a morpho-syntactic analyzer, a distributional model and a module retrieving hypernym relations from structured semantic resources, namely WordNet and DBpedia. Each module takes as input a domain-specific term list and outputs a list of hypernym–hyponym pairs from this list. The domain-specific term lists were provided in the framework of the SemEval-2015 “Taxonomy Extraction Evaluation” competition, and will be discussed in more detail in Section 4.1.

3.1. Pattern-based approach

The first module that automatically detects hypernym relations is a lexico-syntactic pattern-based approach, based on the work of Hearst (1992). These patterns are implemented as a list of regular expressions containing lexicalized expressions (e.g. like), as well as isolated Part-of-Speech tags (e.g. noun) and chunk tags, which group different Part-of-Speech sequences (e.g. noun phrase (NP) = determiner + adjective + noun, adjective + noun, noun + noun, etc.). An example of these manually defined patterns is “NP {, NP} ∗ {,} or/and other NP” (curly brackets indicate optional parts of the pattern), as in “green beans, carrots, peas, onions and other vegetables”, which results in four hypernym–hyponym pairs, namely (vegetables, green beans), (vegetables, carrots), (vegetables, peas) and (vegetables, onions).

3.1.1. Domain specific corpus

As the pattern-based module is a purely data-driven module, which applies lexico-syntactic patterns that are indicative of hypernym relations to linguistically preprocessed text, we first needed to compile corpora for all considered domains. Corpora for three different domains (science, equipment, food) were compiled by means of the BootCaT toolkit (Baroni and Bernardini, 2004), which can be used to build a specialized web-based corpus starting from a list of seed terms. Whereas the BootCaT tool only requires a small set of seed terms, and iteratively extracts new seed terms from the retrieved pages, we decided to use all items of our entire domain-specific term list as seed terms. This way, the corpus compilation process was guided to look for pages containing our domain-specific terms. We ran BootCaT allowing 10 queries per seed term and did not use the option to construct multi-word terms by combining the seed terms. As a post-processing step, sentences containing (1) only URL links or (2) no domain-specific term were removed. Table 1 gives an overview of the number of seed terms and resulting number of tokens (words, punctuation marks, symbols, etc.) in the three domain-specific corpora (before and after post-processing).

Table 1
Number of seed terms and tokens in the domain-specific web corpora

Domain	#Seed terms	#Tokens original corpus	#Tokens cleaned corpus
Food	1555	15,958,495	11,888,522
Equipment	612	5,848,346	5,631,582
Science	452	28,990,236	26,992,396
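The corpus post-processing step described in Section 3.1.1 (removing sentences that contain only URL links or no domain-specific term) can be sketched as follows. This is a minimal illustration under stated assumptions: the URL regular expression, the function name `keep_sentence` and the matching of full-form terms on word boundaries are our own choices for the example and are not taken from the original implementation.

```python
import re

# Assumption: a URL is anything starting with http(s):// or www. up to the next space.
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def keep_sentence(sentence, terms):
    """Return True if the sentence should stay in the cleaned corpus."""
    without_urls = URL_RE.sub(" ", sentence)
    # (1) Nothing but URLs (and possibly punctuation) is left: drop the sentence.
    if not re.search(r"\w", without_urls):
        return False
    # (2) Keep the sentence only if at least one domain term occurs in it.
    text = without_urls.lower()
    return any(
        re.search(r"\b" + re.escape(term.lower()) + r"\b", text) for term in terms
    )


food_terms = {"olive oil", "carrot"}  # hypothetical excerpt of a term list
sentences = [
    "http://example.com/recipes www.example.org",
    "Olive oil is often used in salads.",
    "The weather was nice yesterday.",
]
cleaned = [s for s in sentences if keep_sentence(s, food_terms)]
```

Note that this sketch only matches full forms; matching lemmas as well would require running the filter on lemmatized text.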

3.1.2. Linguistic preprocessing

As our pattern-based approach takes as input a linguistically enriched text, we first performed a number of linguistic preprocessing steps on the original web-based corpus. The following preprocessing tasks were performed by means of the LeTs Preprocess Toolkit (Van de Kauter et al., 2013):

Tokenization: splitting all sentences into tokens (words, punctuation marks, numbers, symbols, etc.);

Part-of-Speech Tagging: assigning each token in the sentence its correct grammatical category (e.g. adjectives, verbs, nouns, etc.);

Lemmatization: reducing each full form token to its lemma (as it is stored in a dictionary);

Chunking: regrouping words into syntactically related parts (e.g. noun phrases, verb phrases, etc.).

3.1.3. Lexico-syntactic pattern matching

The resulting linguistically preprocessed corpus is the input for our lexico-syntactic pattern-based module. The example below shows a preprocessed sentence matching the pattern:

{other} ∗ NP such as NP {, NP} ∗ {(and|or) NP} ∗

where the first column contains the word form, the second column the lemma of the word, the third column the Part-of-Speech tag (NN(S) is a noun, JJ an adjective, IN a preposition and CC a conjunction) and the last column the chunk information (e.g. NP is a noun phrase, AP is an adverbial phrase, O marks a token outside any chunk, such as a punctuation mark, and the prefix I- indicates that the element forms one unit with the preceding phrase):

Spices	spice	NNS	NP
such	such	JJ	AP
as	as	IN	I-AP
lemon	lemon	NN	NP
,	,	,	O
ginger	ginger	NN	NP
,	,	,	O
white	white	JJ	NP
pepper	pepper	NN	I-NP
,	,	,	O
salt	salt	NN	NP
,	,	,	O
cardamon	cardamon	NN	NP
and	and	CC	O
nutmeg	nutmeg	NN	NP

resulting in the following hypernym pairs: (spices, lemon), (spices, ginger), (spices, white pepper), (spices, salt), (spices, cardamon) and (spices, nutmeg). The program is designed in such a way that the user can generate hypernym pairs containing full forms (e.g. (spices, ginger)) or lemmas (e.g. (spice, ginger)). For the experiments in this paper, we generated the lemmatized form of the terms. As can be seen in the example, the hard-coded lexicalized parts of the patterns (“such as” in this case) do not appear in the extracted hypernym–hyponym pairs.
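To make the matching step concrete, the sketch below applies a simplified version of the “NP such as NP {, NP} ∗ {(and|or) NP}” pattern to the preprocessed example sentence. It is a minimal illustration and not the regular-expression implementation used in the system: the function names are hypothetical, and the matching is done by walking over chunked units rather than by compiling the patterns of Table 2.

```python
# Rows: (word form, lemma, Part-of-Speech tag, chunk tag), as in the example above.
SENTENCE = [
    ("Spices", "spice", "NNS", "NP"), ("such", "such", "JJ", "AP"),
    ("as", "as", "IN", "I-AP"), ("lemon", "lemon", "NN", "NP"),
    (",", ",", ",", "O"), ("ginger", "ginger", "NN", "NP"),
    (",", ",", ",", "O"), ("white", "white", "JJ", "NP"),
    ("pepper", "pepper", "NN", "I-NP"), (",", ",", ",", "O"),
    ("salt", "salt", "NN", "NP"), (",", ",", ",", "O"),
    ("cardamon", "cardamon", "NN", "NP"), ("and", "and", "CC", "O"),
    ("nutmeg", "nutmeg", "NN", "NP"),
]
SEPARATORS = {",", "and", "or"}


def group_chunks(rows):
    """Merge tokens into units: ("NP", lemmatized phrase) or ("TOK", lemma)."""
    units = []
    for _word, lemma, _pos, chunk in rows:
        if chunk == "I-NP" and units and units[-1][0] == "NP":
            units[-1] = ("NP", units[-1][1] + " " + lemma)  # continue the NP
        elif chunk == "NP":
            units.append(("NP", lemma))  # start a new NP
        else:
            units.append(("TOK", lemma))
    return units


def such_as_pairs(rows):
    """Extract lemmatized (hypernym, hyponym) pairs for 'NP such as NP, NP and NP'."""
    units = group_chunks(rows)
    pairs = []
    for i in range(len(units) - 3):
        if units[i][0] != "NP" or units[i + 1 : i + 3] != [("TOK", "such"), ("TOK", "as")]:
            continue
        hypernym, expect_np = units[i][1], True
        for kind, text in units[i + 3 :]:
            if kind == "NP" and expect_np:
                pairs.append((hypernym, text))
                expect_np = False
            elif kind == "TOK" and text in SEPARATORS:
                expect_np = True
            else:
                break  # the enumeration has ended
    return pairs


pairs = such_as_pairs(SENTENCE)
```

As in the system output, the lexicalized part of the pattern (“such as”) is consumed by the match and does not appear in the extracted pairs, and the pairs are built from lemmas rather than full forms.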

We optimized the pattern-based model presented by Lefever et al. (2014) in different ways. As the original list of lexico-syntactic patterns was designed for clean company-specific text, a major drop in precision was noticed when applying the approach to our noisy web corpus. Therefore we decided to only consider patterns that proved to obtain high precision in previous research (Lefever et al., 2014). Table 2 lists the lexico-syntactic patterns that were implemented for this research, together with examples of matched sentences in the corpus and the resulting hypernym–hyponym pairs.

Table 2
List of lexico-syntactic patterns that were used for hypernym detection and resulting examples from the domain corpora

Pattern	Examples	Hypernym pairs
NP such as NP {, NP} ∗ {(and|or) NP}	appetizers such as octopus, sardine, calamari, fried zucchini	(appetizer, octopus) (appetizer, sardine) (appetizer, calamari) (appetizer, fried zucchini)
such NP as NP {, NP} ∗ {(and|or) NP}	such food as hamburgers and hot dogs	(food, hamburger) (food, hot dog)
NP {, NP} ∗ {,} (and|or) (sometimes|many|in|any|another|some) other NP	Rubber or some other tacky material	(tacky material, rubber)
NP , (includ(e|ing)|mainly|mostly|particularly|namely|etc.) NP , NP ∗ (and|or) NP	other diseases include pink disease, bacterial heart rot, anthracnose	(disease, pink disease) (disease, bacterial heart rot) (disease, anthracnose)
NP, in particular {,} NP {, NP} ∗, {(and|or) NP}	Italian sparkling wine, in particular Asti and Prosecco	(Italian sparkling wine, Asti) (Italian sparkling wine, Prosecco)
NP{,} (apart from|except (for)) NP {, NP} ∗ {(and|or) NP}	dry seasoning except for salt; dishes apart from other beef stews	(dry seasoning, salt) (dish, beef stew)
NP{,} (other than) NP {, NP} ∗ {(and|or) NP}	hats other than baseball caps; country other than China	(hat, baseball cap) (country, China)
NP , like (other) NP	a theme park like Disneyland; root vegetables like carrots	(theme park, Disneyland) (root vegetable, carrot)
NP , NP ∗ (and|or) (similar|equal) NP	IPod and similar devices; a hurricane, earthquake, flood or similar disaster	(device, Ipod) (disaster, hurricane) (disaster, earthquake) (disaster, flood)
NP, a(n) ((form|sort|kind) of) NP	coke, a form of carbon	(carbon, coke)
NP{,} called NP	a substance called Creatine Phosphate	(substance, Creatine Phosphate)
NP, another NP	celery another food	(food, celery)
NP is (a|an) {((form|sort|kind|type) of)} NP	England is an island; harsha, a kind of griddle cake	(island, England) (griddle cake, harsha)

The efficiency of the pattern-based module was further improved by only considering noun phrases containing a maximum of 6 consecutive nouns and by ignoring Named Entities. This appeared to be necessary as the web-based corpus contains a lot of lists and enumerations, causing problems for the recursive way the regular expressions are built. Manual analysis of the output showed that patterns containing Named Entities did not perform well on the web corpus either. For the other hypernym modules (e.g. word embeddings module), however, we did consider Named Entities for the extraction of the hypernym–hyponym pairs. Precision, on the other hand, was improved by ignoring pairs containing terms appearing both as hypernym and hyponym (e.g. (hand truck, truck) and (truck, hand truck)). Finally, the output of the pattern-based module was filtered by only considering pairs where both terms (either lemma or full form) occurred in the term list of the considered domain.
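The extraction and filtering steps can be illustrated with a minimal Python sketch for a single pattern ("NP such as NP"). It is an illustration under simplifying assumptions, not the implementation used in the experiments: noun phrases are assumed to be chunked beforehand and marked with square brackets, lemmatization is approximated by stripping a final -s, the restrictions on noun phrase length and Named Entities are left out, and the names `normalize` and `extract_such_as` as well as the toy term list are hypothetical.

```python
import re

# Assumption: noun phrases were already chunked and are marked with brackets.
NP = r"\[[^\]]+\]"
SUCH_AS = re.compile(rf"({NP}) such as ({NP}(?:(?:, |,? (?:and|or) ){NP})*)")


def normalize(term, term_list):
    """Return the term-list entry for a full form or a naive -s lemma, else None."""
    term = term.lower()
    if term in term_list:
        return term
    if term.endswith("s") and term[:-1] in term_list:
        return term[:-1]
    return None


def extract_such_as(chunked_text, term_list):
    """Extract (hypernym, hyponym) pairs licensed by the 'such as' pattern."""
    pairs = set()
    for match in SUCH_AS.finditer(chunked_text):
        hypernym = normalize(match.group(1)[1:-1], term_list)
        for np in re.findall(r"\[([^\]]+)\]", match.group(2)):
            hyponym = normalize(np, term_list)
            # Keep a pair only if both terms occur in the domain term list.
            if hypernym and hyponym and hypernym != hyponym:
                pairs.add((hypernym, hyponym))
    # Ignore pairs whose terms appear both as hypernym and as hyponym.
    return {(a, b) for (a, b) in pairs if (b, a) not in pairs}


terms = {"appetizer", "octopus", "sardine", "calamari", "fried zucchini"}
text = "[appetizers] such as [octopus], [sardine], [calamari], [fried zucchini]"
pairs = extract_such_as(text, terms)
```

On the first example of the pattern table, this sketch yields the four pairs that have appetizer as hypernym.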

3.2. Morpho-syntactic analyzer

Our second hypernym detection module applies a morpho-syntactic approach where the morphological structure of compound terms is used to extract a hypernym-hyponym relation from this term. As already mentioned in Section 2, this approach is inspired by the head-modifier principle (Sparck Jones, 1979). We can indeed observe that the head of the compound refers to a more general semantic category, whereas the modifying part narrows the meaning of the compound term. Following this reasoning, the complete compound term can be considered as a hyponym of the head term (or hypernym).

Rules were implemented for three different syntactic hypernym–hyponym relations in compounds:

Single-word terms: If term T0 is a suffix string of term T1, T0 is considered to be a hypernym of T1. The rule for single-word terms is very productive, resulting among others in the following examples (preceded by the test domain, e.g. food):

food: (torte, sachertorte)

food: (fruit, dragonfruit)

equipment: (pin, candlepin)

science: (linguistics, psycholinguistics)

Multiword terms: If term T0 is the head term of term T1, T0 is considered to be a hypernym of T1. It is important to mention that we also allow multiple possible hypernyms in case different terms occur as suffixes of the compound term, e.g. phu quoc fish sauce is the hyponym of both sauce and fish sauce.⁸

⁸ quoc fish sauce is not a valid hypernym of phu quoc fish sauce. This is resolved by our system by only considering terms included in the domain-specific term list as valid hypernym/hyponym terms.

As the head of a nominal phrase appears at the right edge of a multiword NP in English, the last constituent of the NP is regarded as the head of the compound, and thus as the hypernym of the complete term, as shown in the following examples:

food: (sauce, béarnaise sauce)

equipment: (microscope, scanning hall probe microscope)

science: (physics, quantum physics)

Complex prepositional phrases: If term T0 is the first part of a term T1 containing a noun phrase + preposition + noun phrase, T0 is considered to be a hypernym of T1. In the case of a prepositional compound phrase, the head is situated at the left edge of the compound term. Examples of such hypernym pairs are the following:

food: (soup, soup all’imperatrice)

science: (immunology, immunology of infectious disease)

science: (sociology, sociology of culture)

In addition, restrictions were added to these general rules in order to improve the precision of the morpho-syntactic module. In order to prevent noise caused by very short suffix terms occurring in the term list, we set a threshold of a minimum of three characters for the detection of valid hypernyms. An example of an invalid hypernym filtered out this way is tu, which could otherwise be detected as a hypernym of pesarattu, as both terms occurred in the food term list.

Manual inspection of the system output made it clear that food terms (e.g. names of dishes) are often loan words from other languages. Therefore, we added a list of foreign adjectival affixes (e.g. the French affix al/ale) that should not be considered as a hypernym of the compound term. This way, we prevent for instance ale from being detected as the hypernym of chicken provencale or café royale.
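The three rules and the two restrictions can be summarized in a minimal Python sketch. It is an illustration under stated assumptions rather than the implementation used in the experiments: the preposition set and the affix list are small illustrative subsets, and the names `is_valid` and `morpho_hypernyms` are hypothetical.

```python
MIN_LENGTH = 3                               # minimum hypernym length in characters
FOREIGN_AFFIXES = {"al", "ale"}              # illustrative subset of the affix list
PREPOSITIONS = {"of", "in", "for", "with"}   # assumed set, not taken from the paper


def is_valid(candidate, term, term_list):
    """Apply the term-list, length and foreign-affix restrictions."""
    return (
        candidate != term
        and candidate in term_list
        and len(candidate) >= MIN_LENGTH
        and candidate not in FOREIGN_AFFIXES
    )


def morpho_hypernyms(term, term_list):
    """Return the hypernyms of `term` proposed by the three rules."""
    tokens = term.split()
    candidates = set()
    prep_positions = [i for i, tok in enumerate(tokens) if tok in PREPOSITIONS]
    if prep_positions and prep_positions[0] > 0:
        # Prepositional phrase: the head is the NP left of the preposition.
        candidates.add(" ".join(tokens[: prep_positions[0]]))
    elif len(tokens) > 1:
        # Multiword term: every right-edge token sequence is a candidate head.
        for start in range(1, len(tokens)):
            candidates.add(" ".join(tokens[start:]))
    else:
        # Single-word term: any term-list entry that is a suffix string.
        candidates.update(t for t in term_list if term.endswith(t))
    return {c for c in candidates if is_valid(c, term, term_list)}


terms = {
    "sauce", "fish sauce", "phu quoc fish sauce", "fruit", "dragonfruit",
    "tu", "pesarattu", "sociology", "sociology of culture", "ale", "royale",
}
fish_sauce_hypernyms = morpho_hypernyms("phu quoc fish sauce", terms)
```

With this toy term list, phu quoc fish sauce receives the hypernyms sauce and fish sauce, whereas tu and ale are rejected by the length threshold and the affix list respectively.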

3.3. Word embeddings module

Our third module is based on the distributional approach, which represents all words in a corpus through the contexts in which they have been observed. As a result, each word is represented as a vector in a high-dimensional space, in which each dimension is a context word and the coordinates of the vector reflect the association degree of the term with this context word. To build our distributional model, we used the word2vec algorithm of Mikolov et al. (2013), which trains a shallow neural network to learn the word vector representations. The following steps were taken to generate hypernym pairs based on distributional semantic information:

Construct the distributional model. We built a large distributional model incorporating word vectors trained on part of the Google News dataset, containing about 100 billion words.⁹ To this end, we used the word2vec Skip-gram model, which uses a word to predict its surrounding context, and 300 dimensions. This resulted in a distributional model containing 3 million words and phrases, which places distributionally similar words close to each other in our 300-dimensional space. This way, relative meanings of words have been translated into measurable distances in the distributional space.

⁹ https://code.google.com/p/word2vec/.

Calculate the semantic similarity between terms. We used the cosine similarity to calculate the semantic similarity between the word embedding vectors of all words occurring in our domain-specific term lists. The cosine similarity is defined as the cosine of the angle between two word vectors: a 90 degree angle yields a score of 0 (no similarity), while a 0 degree angle yields the maximum score of 1 (total similarity). Examples of high similarity are (tapenade, aioli) and (molecular genetics, molecular biology), which both have a cosine similarity score of 0.75, while examples of low similarity are (religion, electrical engineering) and (guacamole, kiwi), which have a score of only 0.10.

Generate lists of hypernym pairs. The distributional approach essentially finds high similarities between words that occur in similar contexts. These words, however, could be synonyms, hyponyms, hypernyms, named entities or even antonyms. Therefore, this module is presumed to obtain high recall but fairly low precision, and should ideally be backed up by other modules to generate correct hypernym–hyponym pairs. In order to gain insight into the performance of the module with respect to the similarity score, we implemented three different versions of the word embeddings module, with similarity thresholds of 0.50, 0.60 and 0.70 respectively. Consequently, only term pairs having a similarity score above the considered threshold were used to generate the resulting hypernym list.
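The similarity computation and thresholding can be sketched in a few lines of Python. The sketch makes several assumptions: the three-dimensional vectors are invented toy values and not taken from the Google News model, the function names are hypothetical, and the candidate pairs are returned unordered, since the similarity score by itself does not indicate which term is the hypernym.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidate_pairs(vectors, threshold):
    """Return unordered term pairs whose similarity exceeds the threshold."""
    return {
        (a, b)
        for a, b in combinations(sorted(vectors), 2)
        if cosine_similarity(vectors[a], vectors[b]) > threshold
    }


# Toy vectors (assumption): real word2vec vectors have 300 dimensions.
vectors = {
    "tapenade": [1.0, 0.2, 0.0],
    "aioli": [0.9, 0.4, 0.1],
    "kiwi": [0.0, 0.1, 1.0],
}
candidates = candidate_pairs(vectors, threshold=0.50)
```

In this toy space only the pair (aioli, tapenade) exceeds the threshold; both pairs involving kiwi are discarded.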

3.4. Structured semantic resources

The fourth module we implemented is a knowledge-based module, which extracts hypernym relations from (1) structured Linked Open Data resources, being DBpedia more specifically, and (2) WordNet.

3.4.1. DBpedia

In order to extract hierarchical information from DBpedia, we used DBpedia Spotlight (Mendes et al., 2011), a system that automatically annotates text documents with DBpedia URLs. The whole process of hierarchical relation extraction is performed in the RapidMiner Linked Open Data Extension (Paulheim et al., 2014). Figure 1 shows a screenshot of the RapidMiner process, which consists of the following steps:

Import Data: domain-specific term list in csv format (“Read CSV”).

Select the Linked Open Data extension (“DBpedia Spotlight Linker”).

Select the SPARQL DBpedia connection to extract Hierarchy Relations (“Specific Relation Generator”).

Write the output to a CSV file, which contains (1) the term of interest, (2) retrieved Wikipedia URLs containing the respective concept and (3) detected hypernyms of the term in the DBpedia Hierarchy (“Write CSV”).

Fig. 1.

RapidMiner.

In a final step, we selected from the resulting CSV file the terms and their associated hypernym terms in case both occurred in the domain-specific term list. As we noticed that the DBpedia terms and hypernyms often contained plural forms, we also performed the lookup in the domain-specific term list after stripping off the plural morpheme -s. Examples of hypernym pairs retrieved from the DBpedia ontology are listed in Example 4:

equipment: (equipment, headgear)

equipment: (telescope, schmidt camera)

food: (vegetable, artichoke)

food: (soup, borscht)

science: (natural science, aeronautics)

science: (medical science, cardiology)

3.4.2. WordNet

The second hierarchically structured lexical resource we exploited is WordNet (Fellbaum, 1998). WordNet is a lexical database where content words (nouns, verbs, adjectives and adverbs) are grouped into synsets, which are sets of synonyms that express a specific concept. Synsets are interlinked by means of, amongst others, hierarchical semantic relations (hypernyms–hyponyms). The WordNet module looks up these synsets in WordNet for all domain-specific terms and retrieves all hypernyms appearing in the full hierarchical path of the synsets. Hypernym pairs containing identical terms were removed. Examples of hypernym–hyponym pairs retrieved from WordNet are listed in Example 5:

science: (science, semantics)

science: (linguistics, semantics)

science: (computer science, artificial intelligence)

science: (science, computational linguistics)

equipment: (scientific instrument, refracting telescope)

equipment: (equipment, batting helmet)

food: (vegetable, artichoke)

food: (vegetable, beetroot)
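The retrieval of all hypernyms along the hierarchical path can be sketched as follows. The sketch does not query WordNet itself: the dictionary `DIRECT_HYPERNYMS` is a toy stand-in with illustrative entries, the function names are hypothetical, and the restriction of hypernyms to the domain term list is an assumption carried over from the other modules.

```python
# Toy stand-in for WordNet: each concept maps to its direct hypernyms.
# The entries are illustrative only and were not extracted from WordNet.
DIRECT_HYPERNYMS = {
    "semantics": ["linguistics"],
    "linguistics": ["science"],
    "science": ["discipline"],
    "discipline": [],
}


def all_hypernyms(concept, graph):
    """Collect every concept on the hypernym paths above `concept`."""
    seen = set()
    stack = list(graph.get(concept, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def hierarchy_pairs(term_list, graph):
    """Return (hypernym, hyponym) pairs where both terms are in the term list."""
    pairs = set()
    for term in term_list:
        for hypernym in all_hypernyms(term, graph):
            # Pairs containing identical terms are removed.
            if hypernym in term_list and hypernym != term:
                pairs.add((hypernym, term))
    return pairs


terms = {"science", "linguistics", "semantics"}
pairs = hierarchy_pairs(terms, DIRECT_HYPERNYMS)
```

Because the full path is followed, semantics is paired with its direct hypernym linguistics as well as with the more distant hypernym science.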

3.5. Combined system

For the combined system, all hierarchical relations resulting from the different hypernym detection modules are aggregated into a single hypernym–hyponym pairs list. As opposed to Shwartz et al. (2016), who combine different hypernym detection approaches by integrating lexico-syntactic patterns into the distributional system itself, we rather combine the results of the different approaches in an aggregation step. By doing so, error percolation between the different hypernym detection modules is avoided.

We implemented three different versions of our hybrid taxonomy learning system:

Combined system, which straightforwardly takes into account the union of all hierarchical relations generated by the different modules.

Strict voting system, which only considers the hierarchical relations generated by at least two different hypernym detection modules.

Relaxed voting system, considering the hierarchical relations generated by at least two different hypernym detection modules together with all relations resulting from the morpho-syntactic analyzer, which has been shown to be a very reliable hypernym detection module in previous research (Lefever, 2015).

In a post-processing step, only one hypernym–hyponym pair was kept in the case of identical pairs (e.g. (handtruck, truck) and (handtruck, truck)) and the following term pairs were removed from the final hypernym–hyponym list:

terms appearing in reflexive hypernym relations (e.g. (truck, truck)),

terms appearing in symmetric hypernym relations (e.g. (hand truck, truck) and (truck, hand truck)).
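The three aggregation strategies and the post-processing step can be sketched as follows. This is a minimal illustration, assuming that every module delivers a set of (hypernym, hyponym) pairs; the function names, the module labels and the toy module outputs are hypothetical.

```python
from collections import Counter


def postprocess(pairs):
    """Remove reflexive pairs and pairs that occur in both directions."""
    pairs = set(pairs)  # a set keeps only one copy of identical pairs
    return {(a, b) for (a, b) in pairs if a != b and (b, a) not in pairs}


def combine(module_outputs, mode="combined", trusted="morpho"):
    """Aggregate module outputs; mode is 'combined', 'strict' or 'relaxed'."""
    votes = Counter(pair for pairs in module_outputs.values() for pair in set(pairs))
    if mode == "combined":
        selected = set(votes)                               # union of all modules
    else:
        selected = {pair for pair, n in votes.items() if n >= 2}
        if mode == "relaxed":
            # Always add the output of the morpho-syntactic analyzer.
            selected |= set(module_outputs.get(trusted, ()))
    return postprocess(selected)


outputs = {
    "patterns": {("food", "celery"), ("truck", "hand truck")},
    "morpho": {("truck", "hand truck"), ("sauce", "fish sauce")},
    "wordnet": {("hand truck", "truck"), ("food", "celery"), ("truck", "truck")},
}
combined = combine(outputs, "combined")
strict = combine(outputs, "strict")
relaxed = combine(outputs, "relaxed")
```

In this toy example the union contains (truck, hand truck) in both directions, so the combined variant drops both pairs, whereas the voting variants never select the reversed pair and therefore keep (truck, hand truck).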

4. Experiments

This section describes the experimental set-up and the results from a detailed evaluation of our taxonomy learning system. A detailed analysis is provided for all individual modules as well as for the different flavors of the combined system. In addition, we benchmark our results by comparing them with state-of-the-art systems that took part in the SemEval-2015 “Taxonomy Extraction Evaluation” competition (Bordea et al., 2015).

4.1. Data sets

All experiments were carried out on the data sets provided within the framework of the SemEval-2015 “Taxonomy Extraction Evaluation” competition (Bordea et al., 2015). We participated in three domains, namely food, equipment and science. For all domains, two different types of data sets were provided. The first type of data sets was extracted from WordNet (Fellbaum, 1998), a structured lexical database containing more general vocabulary, whereas the other data sets were composed from more technical domain taxonomies:

The equipment domain: excerpt of the Material Handling Equipment taxonomy,¹⁰ combined with IS-A relations from WiBi (the Wikipedia Bitaxonomy Project (Flati et al., 2014)).

The science domain: The Taxonomy Of Fields And Their Subfields,¹¹ combined with IS-A relations from WiBi (the Wikipedia Bitaxonomy Project).

The food domain: excerpt of the Google product taxonomy,¹² combined with IS-A relations from WiBi (the Wikipedia Bitaxonomy Project).

¹⁰ http://www.ise.ncsu.edu/kay/mhetax/index.htm.

¹¹ http://sites.nationalacademies.org/PGA/Resdoc/PGA_044522.

¹² http://www.google.com/basepages/producttype/taxonomy.en-US.txt.

As our system contains a knowledge-based module retrieving semantic information from WordNet, we do not show results for the WordNet data sets because this would give a misleading picture of the actual system performance. Table 3 gives an overview of the number of domain-specific terms and hierarchical relations for all different data sets.

Table 3

Number of domain-specific terms and hierarchical relations for all different data sets

Domain	#Terms	#Hierarchical relations
Food	1556	1587
Equipment	613	615
Science	453	465

The use of these data sets was twofold: they were used (1) to create the domain-specific term list, which is the input for the taxonomy learning system, and (2) to create the gold standard taxonomy, which was used to evaluate the system taxonomy for that particular term list.

4.2. Evaluation metric

As stated by Velardi et al. (2013), ontology evaluation is not a trivial task – even for humans – as there is always more than one valid solution to model the domain of interest. To evaluate our domain taxonomies, we applied automatic evaluation of the taxonomy considering all hypernym relations output by the system against the respective gold standard taxonomy. We are well aware of the fact that this evaluation is incomplete, as hypernym relations between terms produced by the system that are not in the gold standard taxonomy can be either wrong or correct.

To evaluate the performance of the separate modules as well as the different flavors of combined systems, we calculated Precision, Recall and their harmonic mean, the F-score, which can be used to measure the ability to reproduce hypernymy relations between term pairs. Precision reflects the number of system pairs in common with the gold standard taxonomy divided by the number of system pairs, whereas recall represents the number of system pairs in common with the gold standard taxonomy divided by the number of gold standard pairs.

Let $S$ be the set of hypernym relations output by the system, and $GS$ be the set of hypernym relations contained in the gold standard taxonomy, then

$$\begin{array}{l} Precision = \frac{|S \cap GS|}{|S|}, \\ Recall = \frac{|S \cap GS|}{|GS|}, \\ F\text{-}score = 2 \times \frac{Precision \times Recall}{Precision + Recall} . \end{array}$$
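The three measures can be computed directly from two sets of pairs, as in the following minimal sketch. The function name `evaluate` and the toy system and gold standard pairs are hypothetical and serve only to illustrate the arithmetic.

```python
def evaluate(system_pairs, gold_pairs):
    """Return (precision, recall, f_score) of system pairs against the gold standard."""
    system, gold = set(system_pairs), set(gold_pairs)
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


gold = {
    ("food", "celery"),
    ("sauce", "fish sauce"),
    ("soup", "borscht"),
    ("vegetable", "artichoke"),
}
system = {("food", "celery"), ("sauce", "fish sauce"), ("herb", "artichoke")}
precision, recall, f_score = evaluate(system, gold)
```

The same arithmetic applies to the counts reported in the result tables: for the morpho-syntactic module in the equipment domain (Table 6), 189 correct relations out of 239 found and 615 gold standard relations give a precision of 189/239 ≈ 0.791, a recall of 189/615 ≈ 0.307 and an F-score of approximately 0.443.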

4.3. Results

The results section gives a detailed overview of the performance of the separate modules (Section 4.3.1) and of the combined taxonomy learning systems (Section 4.3.2). This section concludes with a comparison of the system results with the performance of other state-of-the-art systems (Section 4.3.3).

Table 4
Detailed performance of the separate modules for the food domain (total of 1587 gold standard relations)

	Relations found	Correct relations	Precision	Recall	F-score
Patterns	491	97	0.198	0.061	0.093
Morpho-synt	465	280	0.602	0.176	0.273
WordNet	974	214	0.220	0.135	0.167
Rapid Miner	305	44	0.144	0.028	0.047
Word2Vec_50	24,105	188	0.008	0.118	0.015
Word2Vec_60	5,294	56	0.011	0.035	0.016
Word2Vec_70	328	8	0.024	0.005	0.008

Table 5

Detailed performance of the separate modules for the science domain (total of 465 gold standard relations)

	Relations found	Correct relations	Precision	Recall	F-score
Patterns	72	22	0.306	0.047	0.082
Morpho-synt	198	128	0.646	0.275	0.386
WordNet	242	68	0.281	0.146	0.192
Rapid Miner	196	48	0.245	0.103	0.145
Word2Vec_50	519	38	0.073	0.082	0.077
Word2Vec_60	110	13	0.118	0.028	0.045
Word2Vec_70	11	3	0.273	0.006	0.013

Table 6

Detailed performance of the separate modules for the equipment domain (total of 615 gold standard relations)

	Relations found	Correct relations	Precision	Recall	F-score
Patterns	14	3	0.214	0.005	0.010
Morpho-synt	239	189	0.791	0.307	0.443
WordNet	42	14	0.333	0.023	0.043
Rapid Miner	63	21	0.333	0.034	0.062
Word2Vec_50	70	11	0.157	0.018	0.032
Word2Vec_60	23	8	0.348	0.013	0.025
Word2Vec_70	2	1	0.500	0.002	0.003

4.3.1. Detailed analysis of the different modules

This section presents an extensive analysis of all hypernym detection modules: the pattern-based approach (referred to as Patterns in the results), the morpho-syntactic analyzer (Morpho-synt), the two knowledge-based modules (WordNet and Rapid Miner), and three flavors of the word embeddings module, using a similarity threshold of 0.50 (Word2Vec_50), 0.60 (Word2Vec_60) and 0.70 (Word2Vec_70). Table 4 lists the precision, recall and F-scores per module for the food domain, while Tables 5 and 6 list the scores per module for the science and equipment domains respectively. In addition, we also provide the number of generated hypernym relations and correct (viz. belonging to the gold standard) hypernym relations per module.

A number of observations can be made based on the presented results. When it comes to recall, we see that the morpho-syntactic approach performs consistently well, whereas the pattern-based module yields only a few hypernym pairs, especially for the equipment domain. This can be explained by the very strict constraints imposed by the lexico-syntactic patterns. The recall of this module could be improved by compiling larger domain-specific corpora for the pattern-based approach, as the current corpora are rather small. We could also perform more focussed querying, by only retrieving highly informative web pages, such as Wikipedia pages, as we noticed that the retrieved web pages often contain a lot of noise. The knowledge-based and distributional (especially the Word2Vec_50 flavor) modules show a more varied picture, with reasonable recall figures for the food and science domains, but more modest scores for the equipment domain. This can be explained by the fact that the equipment data set contains a lot of very specialized vocabulary (e.g. kugelrohr, uppsala southern schmidt telescope, claas axion, allis-chalmers d series, hook gauge evaporimeter, etc.) which is contained neither in the structured lexical resources nor in the news corpus that was used to train the distributional model.

A qualitative error analysis revealed that there is still room for improvement for all different modules. The morpho-syntactic analyzer achieves good recall results, but the downside is that the module outputs a considerable number of invalid hypernym pairs as well. Examples of wrong output are listed in Example 6:

(sour soup, hot and sour soup)

(apple, pineapple)

(cream, ice cream)

(cake, david eyre’s pancake)

(rice, soup all’imperatrice)

A number of improvements could be implemented to increase the precision of this module. Although a suffix string of the final noun often results in a valid hypernym of that word (e.g. (fruit, dragonfruit)), we noticed that this rule often overgenerates for multiword terms (e.g. (rice, soup all’imperatrice)). Another restriction should be added to prevent hypernyms from being generated from multiword terms containing conjunctions, as is the case for (sour soup, hot and sour soup). In these cases, only the last noun of the compound can be considered as a valid hypernym of the full compound term.

Even for the knowledge-based module, which extracts information from manually verified and collaborative knowledge resources, we detected some invalid results,¹³ both for the hierarchical relations resulting from the WordNet lookup:

(apple juice, pineapple juice)

(cheese, macaroni and cheese)

(hand truck, truck)

(physics, phonetics)

(food, alcohol)

(communication, music)

as well as from the DBpedia lookup:

(game, baseball bat)

(game, baseball equipment)

(game, baseball glove)

(herb, artichoke)

¹³ Although one might argue about the correctness of pairs such as (communication, music), these are not valid hypernym–hyponym pairs given the respective domain of interest, being science in this case.

As expected, the distributional model generates a lot of invalid hypernym pairs, as terms with high distributional similarity scores are not always characterized by hypernym relations but also by other types of semantic relationships (e.g. synonymy, antonymy). Therefore, the output of the distributional model should preferably be used as additional evidence to validate hypernym relations also generated by other approaches. When considering the different flavors of the word embeddings module, each applying different similarity thresholds, our experimental results show the best precision–recall balance for the flavor incorporating the 0.50 similarity threshold.

4.3.2. Result of the combined systems

In a next step, we measured the performance of the three flavors of hybrid taxonomy learning systems:

Combined gives an overview of the performance scores of the system straightforwardly combining the output of all different modules, with similarity thresholds of 0.50, 0.60 and 0.70 for the word embeddings (word2vec) module.

Relaxed_Voting lists the scores for the system aggregating the output of the Morpho-syntactic analyzer and other hypernym pairs that are generated by at least two different modules.

Strict_Voting shows the results for a very rigid combined system, which only considers hypernym pairs that occur in at least two different modules’ output.

Table 7 shows all hybrid system results for the equipment domain (on a total of 615 Gold Standard relations), whereas Table 8 lists the results for the food domain (total of 1587 Gold Standard relations). Finally, Table 9 gives an overview of the results obtained for the science domain (total of 465 Gold Standard relations).

Table 7
Performance of the various combined systems for the Equipment domain

	Relations found	Correct relations	Precision	Recall	F-score
Combined system
Combined_50	401	219	0.546	0.356	0.431
Combined_60	356	217	0.610	0.353	0.447
Combined_70	338	213	0.630	0.346	0.447
Relaxed voting system
Relaxed_Voting_50	244	191	0.783	0.311	0.448
Relaxed_Voting_60	242	190	0.785	0.309	0.443
Relaxed_Voting_70	240	189	0.788	0.307	0.442
Strict voting system
Strict_Voting_50	27	19	0.704	0.031	0.060
Strict_Voting_60	23	17	0.739	0.028	0.053
Strict_Voting_70	18	13	0.722	0.021	0.041

Table 8

Performance of the various combined systems for the Food domain

	Relations found	Correct relations	Precision	Recall	F-score
Combined system
Combined_50	25,818	620	0.024	0.391	0.045
Combined_60	7,084	522	0.074	0.329	0.120
Combined_70	2152	492	0.229	0.310	0.263
Relaxed voting system
Relaxed_Voting_50	871	394	0.452	0.248	0.321
Relaxed_Voting_60	751	359	0.478	0.226	0.307
Relaxed_Voting_70	724	351	0.485	0.221	0.304
Strict voting system
Strict_Voting_50	536	192	0.358	0.121	0.181
Strict_Voting_60	391	140	0.358	0.088	0.142
Strict_Voting_70	352	128	0.364	0.081	0.132

As shown by the experimental results, the Relaxed voting system obtains the best precision scores, with moderate losses on the recall side. As a result, the latter system also achieves the best overall F-scores. Although one could expect stricter constraints to result in better precision, the overall good performance of the morpho-syntactic analyzer highly impacts the results of the Relaxed voting system. The scores also confirm the word embeddings module applying the 0.50 similarity threshold to be the best flavor of the distributional module for integration in the hybrid taxonomy learning system.

Another important insight is that the aggregation of different approaches contributes to the correct detection of hierarchical semantic relations between terms. If we compare the results of the best individual module with the best hybrid system, being the relaxed voting system integrating the word2vec model applying a similarity threshold of 0.50, we can observe that the combined system consistently outperforms the best individual module with F-scores of 0.448 versus 0.443 for the equipment domain, 0.321 versus 0.273 for the food domain and 0.408 versus 0.386 for the science domain.

Table 9

Performance of the various combined systems for the Science domain

	Relations found	Correct relations	Precision	Recall	F-score
Combined system
Combined_50	1083	227	0.210	0.488	0.293
Combined_60	695	211	0.304	0.454	0.364
Combined_70	610	206	0.338	0.443	0.383
Relaxed voting system
Relaxed_Voting_50	265	149	0.562	0.320	0.408
Relaxed_Voting_60	248	145	0.585	0.312	0.407
Relaxed_Voting_70	241	143	0.593	0.308	0.405
Strict voting system
Strict_Voting_50	123	60	0.488	0.129	0.204
Strict_Voting_60	97	49	0.505	0.105	0.174
Strict_Voting_70	89	47	0.528	0.101	0.170

To further analyse the contribution of the different modules in a combined framework, we generated some additional statistics per module. Table 10 lists the number of correct hypernym relations that were only detected by one specific module, whereas Table 11 shows the overlap between the output of the different modules. To measure the overlap, a label “found” (detected hypernym relation) or “not found” (missed hypernym relation) was assigned per entry of the GS for each module. The percentage of overlap was then calculated by dividing the number of shared labels by the total number of GS relations. As the results of the combined systems confirmed the Word2Vec module applying the 0.50 similarity threshold to be the best distributional model, we focused on this flavor for the analysis.

Table 10

Number of GS hypernym relations only detected by one particular module

Domain	Patterns	Morpho-synt	WordNet	Rapid Miner	Word2Vec_50
Food	29	224	93	9	1
Equipment	1	176	8	13	0
Science	5	95	30	25	0

Table 11

Overlap in hypernym relations between the different system modules, measured on the total number of GS relations

	Patterns	Morpho-synt	WordNet	Rapid Miner	Word2Vec_50
Patterns	N/A	72%	88%	87%	95%
Morpho-synt	72%	N/A	68%	68%	73%
WordNet	88%	68%	N/A	81%	86%
Rapid Miner	87%	68%	81%	N/A	89%
Word2Vec_50	95%	73%	86%	89%	N/A
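The statistics in Tables 10 and 11 can be derived from the module outputs as sketched below. The sketch follows the description of the overlap measure (shared "found"/"not found" labels divided by the number of GS relations); the function names and the toy data are hypothetical.

```python
def overlap(found_a, found_b, gold_pairs):
    """Share of GS relations that receive the same label from both modules."""
    gold = set(gold_pairs)
    shared = sum((pair in found_a) == (pair in found_b) for pair in gold)
    return shared / len(gold)


def unique_hits(module_outputs, gold_pairs, module):
    """GS relations detected by `module` and by no other module."""
    others = set().union(
        *(set(pairs) for name, pairs in module_outputs.items() if name != module)
    )
    return (set(module_outputs[module]) & set(gold_pairs)) - others


gold = {
    ("food", "celery"),
    ("sauce", "fish sauce"),
    ("soup", "borscht"),
    ("vegetable", "artichoke"),
}
outputs = {
    "patterns": {("food", "celery")},
    "morpho": {("food", "celery"), ("sauce", "fish sauce")},
}
score = overlap(outputs["patterns"], outputs["morpho"], gold)
only_morpho = unique_hits(outputs, gold, "morpho")
```

Note that relations missed by both modules also count as shared labels, which explains why two low-recall modules can still show a high overlap: in the toy example three of the four labels agree, giving an overlap of 0.75.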

Table 10 further motivates the setup of the Relaxed voting system. As the morpho-syntactic analyzer outputs a high number of correct hypernym relations that are not detected by any of the other modules, it is an obvious choice to always include the output of this module in the combined system output. In addition, Table 11 shows the least overlap between the output of the morpho-syntactic analyzer and the other modules. It is also interesting to notice that both structured knowledge bases, being WordNet and DBpedia, contain different hypernym relations for all considered test domains. In contrast, the distributional module extracts hardly any correct hypernym relation that is not detected by one of the other modules (only one, in the food domain).

To conclude, depending on the envisaged application of the automatically constructed taxonomy, one might consider using the Combined or the Relaxed voting system. In the case where the taxonomy will be manually verified, it is more interesting to start from the combined taxonomy, which has a higher recall. If the taxonomy will be applied as such, it is preferable to opt for a system with higher precision, which is the Relaxed voting system with the word2vec model applying a similarity threshold of 0.70 in this case.

4.3.3. Comparison with state-of-the-art systems

Table 12
Performance of all systems participating in the SemEval-2015 Taxonomy Extraction task (TExEval)

System	Precision	Recall	F-score
Food
INRIASAC	0.1884	0.5179	0.2763
ntnu	0.0700	0.0541	0.0611
QASSIT	0.0666	0.0655	0.0660
TALN-UPF	0.0363	0.0359	0.0361
USAAR-WLV	0.1589	0.2696	0.2000
LT3	0.4524	0.2483	0.3206
Equipment
INRIASAC	0.2611	0.4959	0.3421
ntnu	0.0161	0.0065	0.0092
QASSIT	0.2459	0.2455	0.2457
TALN-UPF	0.1458	0.1577	0.1515
USAAR-WLV	0.4142	0.3691	0.3903
LT3	0.7828	0.3106	0.4477
Science
INRIASAC	0.1795	0.4494	0.2565
ntnu	0.0544	0.0451	0.0493
QASSIT	0.2035	0.2236	0.2131
TALN-UPF	0.0733	0.2559	0.1139
USAAR-WLV	0.1817	0.3720	0.2441
LT3	0.5623	0.3204	0.4082

As a last step, we also compared our results with all systems participating in the SemEval “Taxonomy Extraction Evaluation” task (Bordea et al., 2015). As can be noticed in Table 12, our new system (LT3) obtains state-of-the-art performance and ranks first when considering precision and F-score for all three test domains. The INRIASAC system (Grefenstette, 2015), which also uses morpho-syntactic information and co-occurrence statistics, obtains higher recall but suffers from low precision scores. For the official competition, the organizers also performed a manual evaluation of the system output in order to measure the precision of the hypernym pairs that were not present in the gold standard taxonomy. There as well, the LT3 system, which aggregated a pattern-based, morpho-syntactic and WordNet module as presented in (Lefever, 2015), achieved the highest precision for the novel hypernym pairs not present in the gold standard, with an average precision of 0.60.

5. Conclusions and future work

We presented a taxonomy learning system combining four components: a lexico-syntactic pattern-based approach, a morpho-syntactic analyzer, a word embeddings module and a knowledge-based module retrieving hierarchical relations from the Linked Open Data cloud (DBpedia in this case) and WordNet. A comparison with state-of-the-art systems that participated in the SemEval “Taxonomy Extraction Evaluation” competition shows very competitive results for our system in the quantitative evaluation against a gold standard taxonomy. As our system starts from the domain-specific corpora at hand, this data-driven approach could be a valid solution to the knowledge acquisition bottleneck for automatic hypernym detection in specialized domains and low-resourced languages, which lack large lexico-semantic resources.

With respect to the individual modules, the experimental results reveal a very good performance of the morpho-syntactic analyzer, both for recall and precision. To detect hypernym relations between terms that do not share morphological features, the WordNet and DBpedia modules and, to a lesser extent, the pattern-based and distributional modules were successfully applied. The latter two modules, which are both trained on web corpora, complement each other to some degree: the pattern-based module operates under very strict constraints, resulting in low recall, while the word embeddings module over-generates, resulting in more modest precision. Experiments with the similarity threshold for the word embeddings module showed that a threshold of 0.50 gives the best overall F-score.
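The role of the similarity threshold can be illustrated with a minimal sketch. It assumes that candidate pairs are scored with cosine similarity between word vectors and kept when the score reaches 0.50; the similarity measure, the function names and the toy two-dimensional vectors are assumptions made for illustration and are not taken from the system itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_candidates(candidates, vectors, threshold=0.50):
    """Keep candidate term pairs whose vectors are at least `threshold` similar."""
    return [
        (a, b)
        for a, b in candidates
        if a in vectors and b in vectors
        and cosine(vectors[a], vectors[b]) >= threshold
    ]


# Toy vectors (hypothetical); real embeddings have hundreds of dimensions.
vectors = {"basil": [1.0, 0.0], "herb": [0.8, 0.6], "spoon": [0.0, 1.0]}
candidates = [("basil", "herb"), ("basil", "spoon")]
print(filter_candidates(candidates, vectors))  # [('basil', 'herb')]
```

Raising the threshold trades recall for precision: fewer pairs survive, but those that do are more similar.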

For our final hybrid taxonomy learning system, we experimented with different ways of aggregating the various hypernym detection modules. The system simply combining all different modules’ output achieved the highest recall figures. The best precision and overall F-scores were obtained, however, by a relaxed voting system combining the output of the morpho-syntactic analyzer and hypernym relations generated by at least two different modules.
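The relaxed voting scheme described above can be sketched as follows: every pair proposed by the morpho-syntactic analyzer is accepted, and any other pair is accepted only when at least two modules propose it. The module keys, the function name `relaxed_vote` and the example pairs are hypothetical; this is a sketch of the aggregation idea, not the system's actual code.

```python
from collections import Counter


def relaxed_vote(module_outputs, trusted="morpho", min_votes=2):
    """Union of the trusted module's pairs and pairs backed by >= min_votes modules.

    module_outputs maps a module name to a set of (hyponym, hypernym) pairs.
    """
    votes = Counter()
    for pairs in module_outputs.values():
        votes.update(set(pairs))  # each module votes at most once per pair
    agreed = {pair for pair, n in votes.items() if n >= min_votes}
    return set(module_outputs.get(trusted, ())) | agreed


# Toy module outputs (hypothetical).
outputs = {
    "morpho": {("beef stew", "stew")},
    "patterns": {("basil", "herb")},
    "wordnet": {("basil", "herb"), ("stew", "dish")},
    "embeddings": {("basil", "plant")},
}
print(sorted(relaxed_vote(outputs)))
# [('basil', 'herb'), ('beef stew', 'stew')]
```

Setting `min_votes=1` reduces this to the plain union of all modules, which corresponds to the high-recall configuration.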

A qualitative analysis indicated that there is certainly room for improvement. In future research, we would like to improve the recall of the taxonomy learning system by crawling larger dedicated web corpora for the different test domains and by adding further structured lexical resources. In addition, we will experiment with other distributional techniques and explore the use of multilingual information for automatic taxonomy learning.

Another line of future work consists of combining TExSIS (Macken et al., 2013), our in-house term extraction system, with the presented relation detection system. TExSIS is a hybrid system combining linguistic and statistical information to automatically extract a list of domain-specific terms from a text collection without using external knowledge resources. In a first step, candidate terms are generated based on a list of predefined Part-of-Speech sequences such as “noun noun” (e.g. beef [noun] stew [noun]), “adjective noun” (e.g. white [adjective] pepper [noun]) or “named entity” (e.g. IPod). In a second step, statistical filters are applied to check the domain-specificity (termhood) of terms as well as the degree of cohesiveness inside multi-word terms (unithood). In contrast with the current research, where the different modules take as input a predefined domain-specific term list, the system would then (1) start from a list of automatically extracted terms and named entities and (2) automatically identify semantic relationships between these terms and named entities. Taking the automatically extracted terms as the input for the hypernym relation finder would make it possible to generate full-fledged taxonomies from scratch for any given domain- and user-specific text collection.
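The first, linguistic step of such a term extractor (generating candidates from predefined Part-of-Speech sequences) can be sketched as below. The tag names, the pattern list and the function `candidate_terms` are assumptions chosen for illustration; they do not reproduce the TExSIS implementation, and the statistical termhood and unithood filters of the second step are left out.

```python
# Hypothetical PoS patterns, mirroring the "noun noun" and "adjective noun" examples.
PATTERNS = [("NOUN", "NOUN"), ("ADJ", "NOUN")]


def candidate_terms(tagged_tokens, patterns=PATTERNS):
    """Return token sequences whose PoS tags match one of the patterns."""
    found = []
    for i in range(len(tagged_tokens)):
        for pattern in patterns:
            window = tagged_tokens[i:i + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                found.append(" ".join(token for token, _ in window))
    return found


# A pre-tagged toy sentence; in practice the tags would come from a PoS tagger.
sentence = [
    ("add", "VERB"), ("white", "ADJ"), ("pepper", "NOUN"),
    ("to", "ADP"), ("the", "DET"), ("beef", "NOUN"), ("stew", "NOUN"),
]
print(candidate_terms(sentence))  # ['white pepper', 'beef stew']
```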

References

Alexopoulou, D., Andreopoulos, B., Dietze, H., Doms, A., Gandon, F., Hakenberg, J., Khelif, K., Schroeder, M. & Wächter, T. (2009). Biomedical word sense disambiguation with ontologies and metadata: Automation meets accuracy. BMC Bioinformatics, 10(1), 1–15. doi:10.1186/1471-2105-10-1.

Azevedo, C., Iacob, M.E., Almeida, J., van Sinderen, M., Ferreira Pires, L. & Guizzardi, G. (2015). Modeling resources and capabilities in enterprise architecture: A well-founded ontology-based proposal for ArchiMate. Information Systems, 235–262. doi:10.1016/j.is.2015.04.008.

Baroni, M., Bernardi, R., Do, N. & Shan, C. (2012). Entailment above the word level in distributional semantics. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France (pp. 23–32).

Baroni, M. & Bernardini, S. (2004). BootCaT: Bootstrapping corpora and terms from the web. In Proceedings of LREC 2004 (pp. 1313–1316).

Bernhard, D. (2006). Multilingual term extraction from domain-specific corpora using morphological structure. In Proceedings of EACL, The Association for Computer Linguistics (pp. 171–174).

Biemann, C. (2005). Ontology learning from text: A survey of methods. LDV Forum, 20(2), 75–93.

Bordea, G., Buitelaar, P., Faralli, S. & Navigli, R. (2015). Semeval-2015 task 17: Taxonomy Extraction Evaluation (TExEval). In Proceedings of the 9th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Denver, Colorado (pp. 902–910).

Caraballo, S. (1999). Automatic acquisition of a hypernym-labeled noun hierarchy from text. In Proceedings of ACL-99, Baltimore, MD (pp. 120–126).

Chiarcos, C., McCrae, J., Cimiano, P. & Fellbaum, C. (2013). Towards open data for linguistics: Lexical Linked Data. New Trends of Research in Ontologies and Lexical Resources, 7–25. doi:10.1007/978-3-642-31782-8_2.

Cross, V. & Bathijaa, V. (2010). Automatic ontology creation using adaptation. Artificial Intelligence for Engineering Design, Analysis and Manufacturing, 24(Special Issue 01), 127–141. doi:10.1017/S0890060409000183.

Dastgheib, S., Mesbah, A. & Kochut, K. (2013). mOntage: Building Domain Ontologies from Linked Open Data. In 2013 IEEE Seventh International Conference on Semantic Computing (pp. 70–77). doi:10.1109/ICSC.2013.21.

Faralli, S. & Navigli, R. (2013). A Java framework for multilingual definition and hypernym extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria (pp. 103–108).

Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. MIT Press.

Firth, J.R. (1957). A synopsis of linguistic theory 1930–1955. In F.R. Palmer (Ed.), Studies in Linguistic Analysis. Oxford: Philological Society. (Reprinted in F.R. Palmer (Ed.), Selected Papers of J.R. Firth 1952–1959, London: Longman, 1–32.)

Flati, T., Vannella, D., Pasini, T. & Navigli, R. (2014). Two is bigger (and better) than one: The Wikipedia bitaxonomy project. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, Maryland, USA (pp. 22–27).

Fu, R., Guo, J., Qin, B., Che, W., Wang, H. & Liu, T. (2014). Learning semantic hierarchies via word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, Maryland, USA (pp. 1199–1209).

Gale, W.A., Church, K. & Yarowsky, D. (1992). One sense per discourse. In Proceedings of the DARPA Speech and Natural Language Workshop, New York, USA (pp. 233–237). doi:10.3115/1075527.1075579.

Grabar, N., Hamon, T. & Bodenreider, O. (2012). Ontologies and terminologies: Continuum or dichotomy? Applied Ontology, 7, 375–386.

Grefenstette, G. (2015). INRIASAC: Simple hypernym extraction methods. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado, USA (pp. 911–914). doi:10.18653/v1/S15-2152.

Gruber, T.R. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2), 199–220. doi:10.1006/knac.1993.1008.

Harris, Z.S. (1968). Mathematical Structures of Language. New York: Interscience Publishers John Wiley & Sons.

Hearst, M. (1992). Automatic acquisition of hyponyms from large text corpora. In Proceedings of the International Conference on Computational Linguistics (pp. 539–545).

Hellmann, S., Lehmann, J., Auer, S. & Brümmer, M. (2013). Integrating NLP using linked data. In The Semantic Web – ISWC 2013: 12th International Semantic Web Conference (pp. 98–113). Berlin Heidelberg: Springer. doi:10.1007/978-3-642-41338-4_7.

Hippisley, A., Cheng, D. & Ahmad, K. (2005). The head-modifier principle and multilingual term extraction. Natural Language Engineering, 11(2), 129–157. doi:10.1017/S1351324904003535.

Kozareva, Z. & Hovy, E. (2010). Learning arguments and supertypes of semantic relations using recursive patterns. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), Uppsala, Sweden (pp. 1482–1491).

Kozareva, Z., Riloff, E. & Hovy, E. (2008). Semantic class learning from the web with hyponym pattern linkage graphs. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL), Columbus, Ohio, USA (pp. 1048–1056).

Lefever, E. (2015). LT3: A multi-modular approach to automatic taxonomy construction. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado, USA (pp. 943–947).

Lefever, E., Macken, L. & Hoste, V. (2009). Language-independent bilingual terminology extraction from a multilingual parallel corpus. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Athens, Greece (pp. 496–504).

Lefever, E., Van de Kauter, M. & Hoste, V. (2014). HypoTerm: Detection of hypernym relations between domain-specific terms in Dutch and English. Terminology, 20(2), 250–278. doi:10.1075/term.20.2.06lef.

Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P., Hellmann, S., Morsey, M., van Kleef, P., Auer, S. & Bizer, C. (2014). DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal, 6(2).

Lenci, A. & Benotto, G. (2012). Identifying hypernyms in distributional semantic spaces. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), Montréal, Canada (pp. 75–79).

Liang, J., Nguyen, T., Koperski, K. & Marchisio, G. (2006). Ontology-based natural language query processing for the biological domain. In Proceedings of the HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology (pp. 9–16). New York, New York: Association for Computational Linguistics. doi:10.3115/1654415.1654418.

Macken, L., Lefever, E. & Hoste, V. (2013). TExSIS: Bilingual terminology extraction from parallel corpora using chunk-based alignment. Terminology, 19(1), 1–30. doi:10.1075/term.19.1.01mac.

Malaisé, V., Zweigenbaum, P. & Bachimont, B. (2004). Repérage et exploitation d’énoncés définitoires en corpus pour l’aide à la construction d’ontologie. In Blache (Ed.), Proceedings of TALN 2004 (Traitement Automatique des Langues Naturelles), Fès, Maroc (pp. 269–278).

McCrae, J., Aguado-de Cea, G., Buitelaar, P., Cimiano, P., Declerck, T., Gómez-Pérez, A., Gracia, J., Hollink, L., Montiel-Ponsoda, E., Spohr, D. & Wunner, T. (2012). Interchanging lexical resources on the Semantic Web. Language Resources and Evaluation, 46(6), 701–709. doi:10.1007/s10579-012-9182-3.

Meij, E., Bron, M., Hollink, L., Huurnink, B. & de Rijke, M. (2011). Mapping queries to the Linking Open Data cloud: A case study using DBpedia. Web Semantics: Science, Services and Agents on the World Wide Web, 9(4), 418–433.

Mendes, P., Jakob, M. & Bizer, C. (2012). DBpedia for NLP – A multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey (pp. 1813–1817).

Mendes, P.N., Jakob, M., García-Silva, A. & Bizer, C. (2011). DBpedia spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, I-Semantics ’11 (pp. 1–8). New York, NY, USA: ACM. doi:10.1145/2063518.2063519.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (Vol. 26, pp. 3111–3119). Curran Associates, Inc.

Mititelu, V. (2008). Hyponymy patterns. Semi-automatic extraction, evaluation and inter-lingual comparison. In Text, Speech and Dialogue. Lecture Notes in Computer Science (Vol. 5246, pp. 37–44). doi:10.1007/978-3-540-87391-4_7.

Navigli, R. & Velardi, P. (2010). Learning word-class lattices for definition and hypernym extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden (pp. 1318–1327).

Oakes, M. (2005). Using Hearst’s rules for the automatic acquisition of hyponyms for mining a pharmaceutical corpus. In Proceedings of the Workshop Text Mining Research (pp. 63–67).

Pantel, P. & Ravichandran, D. (2004). Automatically labeling semantic classes. In Proceedings of HLT/NAACL-04, Boston, MA (pp. 321–328).

Paulheim, H., Ristoski, P., Mitichkin, E. & Bizer, C. (2014). Data Mining with Background Knowledge from the Web. RapidMiner World.

Peters, W. (2013). Establishing interoperability between linguistic and terminological ontologies. In New Trends of Research in Ontologies and Lexical Resources (pp. 27–42). Berlin Heidelberg: Springer. doi:10.1007/978-3-642-31782-8_3.

Peters, W. (2016). Tackling resource interoperability: Principles, strategies and models. In Proceedings of the LREC 2016 Workshop “Cross-Platform Text Mining and Natural Language Processing Interoperability”, Portorož, Slovenia (pp. 34–37).

Ponzetto, S. & Strube, M. (2011). Taxonomy induction based on a collaboratively built knowledge repository. Artificial Intelligence, 175, 1737–1756. doi:10.1016/j.artint.2011.01.003.

Prokofyev, R., Tonon, A., Luggen, M., Vouilloz, L., Difallah, D.E. & Cudré-Mauroux, P. (2015). SANAPHOR: Ontology-based coreference resolution. In The Semantic Web – ISWC 2015: 14th International Semantic Web Conference, Bethlehem, PA, USA (pp. 458–473).

Ray, S., Singh, S., Joshi, B.P., Tiwary, U.S., Siddiqui, T., Radhakrishna, M. & Tiwari, M.D. (2009). Exploring multiple ontologies and WordNet framework to expand query for question answering system. In Proceedings of the First International Conference on Intelligent Human Computer Interaction (IHCI 2009) (pp. 296–305). New Delhi: Springer. doi:10.1007/978-81-8489-203-1_29.

Rei, M. & Briscoe, T. (2014). Looking for hyponyms in vector space. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning (pp. 68–77).

Ritter, A., Soderland, S. & Etzioni, O. (2009). What is this, anyway: Automatic hypernym discovery. In Proceedings of Association for Advancement of Artificial Intelligence Spring Symposium on Learning by Reading and Learning to Read (pp. 88–93).

Roller, S., Erk, K. & Boleda, G. (2014). Inclusive yet selective: Supervised distributional hypernymy detection. In Proceedings of COLING 2014, The 25th International Conference on Computational Linguistics: Technical Papers, Dublin, Ireland (pp. 1025–1036).

Santus, E., Lenci, A., Lu, Q. & Schulte Im Walde, S. (2014). Chasing hypernyms in vector spaces with entropy. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden (pp. 38–42).

Shwartz, V., Goldberg, Y. & Dagan, I. (2016). Improving hypernymy detection with an integrated path-based and distributional method. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany.

Sowa, J.F. (2000). Knowledge Representation: Logical, Philosophical and Computational Foundations. Pacific Grove, CA, USA: Brooks/Cole Publishing Co.

Sparck Jones, K. (1979). Experiments in relevance weighting of search terms. Information Processing and Management, 15, 133–144. doi:10.1016/0306-4573(79)90060-8.

Tjong Kim Sang, E., Hofmann, K. & de Rijke, M. (2011). Extraction of hypernymy information from text. In Interactive Multi-Modal Question-Answering. Theory and Applications of Natural Language Processing (pp. 223–245). Berlin Heidelberg: Springer.

Van de Kauter, M., Coorman, G., Lefever, E., Desmet, B., Macken, L. & Hoste, V. (2013). LeTs Preprocess: The Multilingual LT3 Linguistic Preprocessing Toolkit. Computational Linguistics in the Netherlands Journal.

Van der Plas, L. & Bouma, G. (2005). Automatic acquisition of lexico-semantic knowledge for question answering. In Proceedings of the IJCNLP Workshop on Ontologies and Lexical Resources, Jeju Island, Korea.

Velardi, P., Faralli, S. & Navigli, R. (2013). OntoLearn reloaded: A graph-based algorithm for taxonomy induction. Computational Linguistics, 39(3), 665–707. doi:10.1162/COLI_a_00146.

Weeds, J. & Weir, D. (2003). A general framework for distributional similarity. In Proceedings of EMNLP-03, Sapporo, Japan (pp. 81–88).

Wright, S.E. (1997). Term selection: The initial phase of terminology management. In Handbook of Terminology Management (pp. 13–23). John Benjamins. doi:10.1075/z.htm1.04wri.

Wright, S.E. & Budin, G. (2001). Handbook of Terminology Management, Volume 2: Application-Oriented Terminology Management. John Benjamins Publishing Company.

Zhang, C., Niu, Z., Jiang, P. & Fu, H. (2012). Domain-specific term extraction from free texts. In 9th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2012) (pp. 1290–1293). IEEE. doi:10.1109/FSKD.2012.6234350.