Abstract
Domain-specific terminologies play a central role in many language technology solutions. Substantial manual effort is still involved in the creation of such resources, and many of them are published in proprietary formats that cannot be easily reused in other applications. Automatic term extraction tools help alleviate this cumbersome task. However, their results are usually in the form of plain lists of terms or as unstructured data with limited linguistic information. Initiatives such as the Linguistic Linked Open Data cloud (LLOD) foster the publication of language resources in open structured formats, specifically RDF, and their linking to other resources on the Web of Data. In order to leverage the wealth of linguistic data in the LLOD and speed up the creation of linked terminological resources, we propose TermitUp, a service that generates enriched domain specific terminologies directly from corpora, and publishes them in open and structured formats. TermitUp is composed of five modules performing terminology extraction, terminology post-processing, terminology enrichment, term relation validation and RDF publication. As part of the pipeline implemented by this service, existing resources in the LLOD are linked with the resulting terminologies, contributing in this way to the population of the LLOD cloud. TermitUp has been used in the framework of European projects tackling different fields, such as the legal domain, with promising results. Different alternatives on how to model enriched terminologies are considered and good practices illustrated with examples are proposed.
Introduction
International institutions have become major producers of multilingual terminology databases, understood as resources that account for the specialised words used in a particular field in multiple languages. Since its foundation, the European Union has maintained initiatives to cater for the collection, maintenance and creation of terminologies, thesauri or vocabularies, to cover their internal communication needs and to support translators. Some of the best known resources are available from TermCoord1
The creation and curation of such vocabularies has not only supported translators, documentalists and legal drafters at EU institutions, but has also become a reference for translators and language professionals outside the EU. Nowadays, curated language resources have proven to be more relevant than ever in light of natural language processing (NLP) tasks that rely on sound linguistic data. For example, query expansion using WordNet,5
Initiatives such as the Linguistic Linked Open Data cloud8

Motivating scenario for the development of TermitUp.
In addition, with the surge in technology solutions for the legal domain, in what is called LegalTech or RegTech, such challenges have become even bigger, since resources of this sort tend to be scarce, private to companies, published in unstructured formats, or no longer available (e.g. the legal multilingual WordNets built in the LOIS project [58], the LexALP term bank on spatial planning and sustainable development [37], or the European legal taxonomy syllabus on consumer protection law [1]). Of those resources that have open licenses, such as EuroVoc, most have a wider scope and do not exhaustively cover a specific area of law, or, on the contrary, may only cover a particular sub-area of law (such as the International Labour Organisation Thesaurus9
With the aim of palliating the need for multilingual terminological resources of a specific domain or project, leveraging resources already available in the LLOD, we have devised a method to automatically cover the whole life cycle of the terminology creation process. Our contribution, TermitUp, puts together pieces of language technology previously isolated, and improves them to build a pipeline that, taking as input a domain specific corpus in one language, generates as output a multilingual terminology semantically enriched with data from the LLOD and published in open formats. The specific subprocesses of the method proposed include terminology extraction, terminology postprocessing, terminology enrichment, relation validation and RDF publication.
The rest of the paper is structured as follows: Section 2 presents relevant previous work; Section 3 sets out the linguistic foundations supporting the development of TermitUp; Section 4 lists the application requirements; Section 5 describes every component of the TermitUp architecture; Section 6 presents its current and potential impact; Section 7 contains the discussions that have arisen throughout the development; and Section 8 summarises the conclusions and future work.
This section covers previous work related to the different processes mobilised in our system, namely, automatic terminology extraction, modern terminology management tools and semi-automatic terminology enrichment approaches (2.1). We also review existing language resources in RDF and the modelling approaches they follow (2.2).
Terminology-related technology
There is a wide variety of ready-to-use terminology extraction tools, both proprietary (such as SDL MultiTerm Extract,12
More comprehensive terminology management tools integrate monolingual and multilingual term extraction as a starting point feature, and offer additional functionalities to enrich the extracted terms. For example, in Tilde’s Terminology platform18
With regard to semi-automatic terminology enrichment, we also find several approaches in the literature. In [18], the enrichment consists of adding terms to a source thesaurus by exploiting parallel corpora. In [30], WordNet is used to establish hierarchical relations between the source terms. Oliveira and Gomes [47] propose a method to automatically enrich a Portuguese thesaurus with synonyms extracted from dictionaries. Some efforts have also been devoted to refining the “related to” relation that is common in thesauri into more specific semantic relations, as in [61]. In the reviewed works, the scope of the proposed solutions has been limited to one aspect of the terminological resource (synonyms), one external resource (WordNet), or one specific language or language pair. As a result, these solutions cannot always be easily extrapolated to other domains or languages.
Concerning existing language resources published in RDF, general domain resources are the most valuable assets in the LLOD cloud. WordNet,24
Apart from the general resources mentioned above, the LLOD cloud also gathers some domain specific resources. One of the most important contributions of this kind is the RDF dump of IATE, an effort described in [17]. The complete resource is available through a Search API, but not structured in RDF. There have also been efforts to automatically enrich these data [4] with machine translated definitions. IATE offers translations, synonyms and definitions for terms in various domains, but it lacks relations amongst terms.
Some type of term relations are, however, present in EuroVoc,25
In summary, to ease the creation of terminological resources, we can make use of state-of-the-art terminology extraction tools, although only a few of them provide additional linguistic or semantic data to further describe the terms. To relieve this situation, there have been some approaches pursuing automatic terminology enrichment, yet they target one specific type of information, and most of them involve manual effort. In this paper, we present TermitUp, an automatic approach to generate, from corpora, Multilingual Semantically Enriched Legal Terminologies in Semantic Web formats. With TermitUp, terms are automatically enriched with translations, term variants or synonyms, definitions, examples of use, information about frequency and hierarchical relations, and are linked with other resources in the LLOD cloud.
The pipeline implemented by TermitUp is in line with the terminographical methods proposed by well-established Terminology theories for the compilation of terminological resources (communicative theory of terminology [16], socioterminology [26], sociocognitive theory of terminology [57] or frame-based theory [20]). In the most common scenario, the starting point in a terminological work is a corpus of specialised texts. The more care taken in constructing the corpus, the better. According to Barrière [6], texts should be domain relevant and contain knowledge-rich contexts (a notion defined by Meyer as “sentences that are of interest to terminologists because they contain important terms and knowledge patterns”, i.e., expressions of semantic relationships [42]). In our approach, corpus construction is a manual task assigned to users, who may be interested not so much in the knowledge-rich value of texts as in the relevance of the documents for a certain endeavour.
The next step consists in identifying terminological units in those documents. These can correspond to different parts of speech (noun, verb, adjective, adverb), and participate in multi-word expressions or phraseological units. Deciding if a lexical unit has a terminological status is not devoid of difficulties. To assist terminologists in this step, several authors propose guidelines in the form of criteria that lexical units have to satisfy to be considered terms [16,35]. The meaning of a unit is to be discovered in text and constructed through relations to other terminological units. This allows terminologists to derive the conceptual structure underlying those designations, which enables translators or any other language professionals (documentalists, technical writers, subject specialists, etc.) to understand an area of knowledge. Such a structure can take the form of an ontology, as suggested in [20], and is the approach taken by the so-called terminological knowledge bases, as dubbed in [43], in which a knowledge base component is enriched with a linguistic (terminological) component. Some well-known examples of terminological knowledge bases in different areas are GENOMA-KB [14], OncoTerm [21] or EcoLexicon [22].
These theories also propound that terms are to be analysed as used in real communication by experts in the domain, and that this may result in identifying various forms of designations (synonyms or term variants). Variants are to be accounted for in terminological resources, as well as the causes for that variation [15]. Depending on the purpose of the resource at hand, additional linguistic descriptions are also common in terminological resources, namely, source of the term, morphosyntactic information, definition, references to other terms (which can be of different nature, e.g. synonyms, hyponyms, antonyms), usage contexts (that show how the term behaves in real texts), usage notes, or phraseology. Terms are usually assigned to a domain, and all sources from which information has been obtained are referenced, together with other metadata (author, date, reliability degree, etc.).
When considering the multilingual perspective, best practices in terminology work recommend that equivalents in other languages are also collected from domain-specific corpus in the languages of interest, as well as the rest of linguistic descriptions [16]. An exact equivalence relation is assumed when terms in multiple languages are related to a source term, although language and culture differences may be captured in the form of notes. However, previous works on multilingual terminological knowledge bases in the legal domain show how important it is to define culture-specific knowledge as intermediate representations associated with a common shared ontology [31].
Finally, we briefly refer to the theoretical studies (and practical applications) made by terminologists about terminological or conceptual relationships between terms. Conscious of the importance of accounting for such relationships in termbanks, terminographers have characterised them, studied them in particular domains, and created methods for identifying them in corpora. The most important relations in this regard are the so-called hierarchical relationships (hyperonymy-hyponymy and meronymy). However, several non-hierarchical relationships have been intensively studied in some particular domains (cause-effect, entity-function), and others have also been considered for inclusion in terminological resources (antonymy, synonymy, derivative relationships, co-occurrents and collocations). For an overview we refer the interested reader to [36].
Requirements
The development of the first version of TermitUp was guided by a set of requirements derived from the study of existing language technologies, specifically those that deal with terminology, and the observation of their results, as well as from numerous discussions between the linguists, computer scientists and researchers involved in this project.
R1. Enrichment When confronted with domain specific data, there is a need for identifying the specific terms used in texts, as well as their meaning. Plain lists of terms tend not to suffice if they are to be used for annotation, classification or disambiguation and other complex NLP tasks. Definitions, morpho-syntactic information, term variants and explicit relations amongst terms can significantly contribute to improving performance of subsequent text processing tasks.
R2. Multilingualism As already mentioned, international institutions have catered for the creation of multilingual terminologies or thesauri to meet their needs. However, these do not necessarily cover the needs of a company or project in terms of languages, or the purposes of the system being developed. This results in the need for systems that assist in the creation of ad-hoc terminologies for certain language combinations. There have been some initial attempts at developing terminology extractors that work on multilingual corpora, but results are still preliminary.
R3. Disambiguation Although traditional approaches to terminology and language planning have held that the terms in a domain are unambiguous, unique and semantically precise, corpus-based terminology studies have demonstrated that term variation or synonymy is also common in domain-specific areas, and that texts may also vary in the degree of specificity. Additionally, external language resources (see Requirement 4) may contain different senses of a term, since they are usually of a general character rather than domain specific. This translates into the need for a disambiguation step when matching corpus-extracted terms with external ones.
R4. Reusability and standardisation Knowledge reuse is the cornerstone of Linked Open Data [7] and the main goal of TermitUp. To meet this objective, this service extracts knowledge from existing resources in the LLOD cloud and publishes the resulting terminologies in a structured and open-licensed manner, agreed by the community, so they can be freely reused.
R5. Data provenance When working with texts from a specific domain, it is of utmost importance to guarantee the univocity of the terms managed. Therefore, knowing the source from which each term has been extracted is equally essential, since by knowing these sources, the final user of the terminology has the freedom to choose which term to use depending on the confidence level of such sources. Taking into account that we are dealing with terminologies enriched with heterogeneous external resources, we must maintain traceability not only of the terms themselves, but of each piece of information associated with them: synonyms, translations, definitions, usage examples, etc.
R6. Open source and easy access Following the philosophy of Linked Open Data, we highlight open source as one of the requirements for this service. All the code used will be openly exposed in a GitHub repository to allow collaboration between users and developers. In addition, TermitUp will be published as a web service for easy integration with other processes.
Throughout this paper, we describe TermitUp’s functionalities and explain how their specific features comply with each of the requirements mentioned above.

TermitUp architecture.
With the aim of satisfying the requirements spelled out in the previous section, we present TermitUp, a service to generate domain specific terminologies directly from corpus, enriched with disambiguated terminological data from existing language resources in the LLOD cloud. This section describes the five interdependent modules that compose the TermitUp architecture (Fig. 2).
Module 1: Terminology extraction
This module obtains a list of the most representative terms from a given corpus. After analysing and testing several open source automatic term extraction (ATE) tools, and also proprietary software, as mentioned in Section 2, we chose to implement the TBXTools service30
Originally, TBXTools is intended to process English texts but we fine-tuned the tool to work with Spanish texts (a need arose from our use cases, Requirement 2). We added lists of Spanish stop words and a set of exclusion regular expressions to avoid noisy constructions, which can be consulted in the repository.32
Despite these improvements, when we manually reviewed the raw automatically extracted terms we noticed recurrent patterns in Spanish that did not correspond to any multi-word term. To address this, we relied on some works that have studied the most common structure of terms in English and Spanish, specifically in the legal domain [2,3,16].
Traditionally, nouns were considered the main parts of speech to be included in terminological resources [34], since their main purpose was to label concepts. However, linguistic approaches to terminology argue that terms can belong to different parts of speech (nouns, verbs, adjectives, and sometimes adverbs), often with closely related meanings (for instance, the verb to contract and the noun contract) [35].
With the objective of separating common term patterns from invalid structures, we designed a post-processing stage in which a terminology filtering algorithm relies on part-of-speech annotations to remove structures that do not correspond to valid terms in Spanish. In this regard, a set of 42 linguistic patterns was compiled to detect what we call non-terminological structures. Examples of such patterns can be found in Table 1.
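The filtering step described above can be sketched as a lookup of each candidate's part-of-speech sequence against a set of rejected patterns. This is only a minimal illustration: the four patterns, the tag names and the function names below are hypothetical, the input is assumed to be already POS-tagged, and the 42 patterns actually compiled for TermitUp are not reproduced here.

```python
from typing import List, Tuple

# One candidate term as a list of (token, POS tag) pairs.
TaggedTerm = List[Tuple[str, str]]

# Hypothetical non-terminological POS patterns, for illustration only.
NON_TERM_PATTERNS = {
    ("DET", "NOUN"),   # e.g. "el contrato": leading determiner
    ("NOUN", "ADP"),   # e.g. "contrato de": dangling preposition
    ("ADP", "NOUN"),   # e.g. "de trabajo": leading preposition
    ("ADV", "ADJ"),    # e.g. "muy importante": no nominal head
}


def is_non_terminological(term: TaggedTerm) -> bool:
    """True if the POS sequence of the candidate matches a rejected pattern."""
    tags = tuple(tag for _, tag in term)
    return tags in NON_TERM_PATTERNS


def filter_terms(candidates: List[TaggedTerm]) -> List[str]:
    """Keep only the candidates whose POS sequence is not rejected."""
    return [
        " ".join(token for token, _ in term)
        for term in candidates
        if not is_non_terminological(term)
    ]


candidates = [
    [("contrato", "NOUN"), ("de", "ADP"), ("trabajo", "NOUN")],
    [("el", "DET"), ("contrato", "NOUN")],
    [("contrato", "NOUN"), ("de", "ADP")],
]
kept = filter_terms(candidates)
```

With these assumed patterns, only "contrato de trabajo" survives the filter; the other two candidates match a rejected POS sequence.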
Additionally, we also implemented Añotador33
Examples of Spanish non-terminological patterns and temporal expressions, and their approximate translation into English for the sake of understanding
The next step in this approach is to take full advantage of the information in the LLOD relative to the previously filtered terms. Since most of the available resources have a wider scope, either covering several legal areas or general encyclopedic knowledge, a disambiguation process becomes necessary. To this end, we implemented an available word sense disambiguation (WSD) algorithm34
At this point, we introduce the concept of sense indicator, which refers to any word in the surroundings of a term that can be used to disambiguate its meaning.
The algorithm receives as input a source sense indicator and several candidate target sense indicators from the queried external resources. In TermitUp, the source sense indicator is built from the term t and its surrounding context (up to 100 tokens) from the input corpus
At first, we assumed that good target sense indicators could be definitions, since definitions contain other relevant words or terms in the domain. For instance: a training contract is a particular type of employment contract drawn up between an employer, a training organisation and an apprentice. However, we observed that not all the accessed resources contained definitions, so we decided to take every other possible piece of information that could indicate the sense of a term: broader terms, term variants (synonyms) and domain descriptors (see Fig. 3). We intentionally avoided using narrower and related terms since often they included terms from neighbouring domains that misled the algorithm. For instance, for the term promoter, in the sense of a person who supports the development of a company, we get as narrower term DNA promoter, as part of the DNA that starts transcription.
Table 2 shows an example of the five contexts for the term hearing obtained from the input corpus, three sense indicators built with domain descriptors from the queried resource and the resulting weights, returned by the WSD implementation. From these weights, the highest refers to the sense that is closest to our domain of interest. From the terms that refer to the sense in question, we can therefore establish a link and enrich our terminology with all the related information available in the queried resources, namely, definitions, translations, synonyms, broader, narrower and related terms. Through this approach, we satisfy Requirement 1: Enrichment; Requirement 2: Multilingualism; and Requirement 3: Disambiguation.
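The selection step described above (score every candidate sense indicator against the corpus contexts, then keep the highest-weighted sense) can be sketched as follows. The similarity used here is a plain Jaccard token overlap, which is a stand-in chosen for self-containment and not the WSD algorithm used by TermitUp; the function names and the example data are likewise assumptions.

```python
import re
from typing import Dict, List, Set


def tokens(text: str) -> Set[str]:
    """Lower-cased word tokens of a text."""
    return set(re.findall(r"\w+", text.lower()))


def overlap_score(source: str, target: str) -> float:
    """Jaccard overlap between two texts (stand-in similarity measure)."""
    a, b = tokens(source), tokens(target)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def pick_sense(contexts: List[str], candidates: Dict[str, str]) -> str:
    """Return the id of the candidate sense indicator that best matches
    the source contexts, using the mean score over all contexts."""
    if not contexts or not candidates:
        raise ValueError("at least one context and one candidate are required")

    def mean_score(indicator: str) -> float:
        return sum(overlap_score(c, indicator) for c in contexts) / len(contexts)

    return max(candidates, key=lambda sense_id: mean_score(candidates[sense_id]))


# Hypothetical contexts for the term "hearing" and two candidate senses
# described by domain descriptors.
contexts = [
    "the court scheduled a hearing before the judge",
    "the hearing of the case took place in court",
]
candidates = {
    "law": "law court judge proceedings",
    "health": "health medicine ear perception",
}
best = pick_sense(contexts, candidates)
```

In this toy example the legal sense is selected, since only its descriptors share tokens with the contexts. Once a sense is selected, the information attached to it in the queried resource can be linked to the extracted term.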
Table 3 lists the LLOD language resources exploited and the type of data retrieved from each of them.

Representation of the word sense disambiguation workflow.
WSD example for the term hearing, with five different contexts representing the sense of the term, and three candidate sense indicators from the queried knowledge base (IATE in this case). The results show that
List of resources exploited in the legal use case of TermitUp, and the type of information extracted from each of them. All of them are modelled in SKOS and accessed through SPARQL endpoints, except for IATE, whose RDF version is limited and outdated, and its JSON API offers more complete and up-to-date data
Some of the resources accessed were originally created and curated by experts. Others, however, were the result of collaborative efforts by users with different levels of expertise. This is why some of the data contained in these resources are not always correct, as is the case with synonyms and hierarchical relations obtained, for instance, from Wikidata.36
This approach is inspired by X-bar theory, which states that the formation of multi-word terms follows a hierarchical structure [16]. The approach suggests a comparison amongst the tokens of terms
Additionally, we have implemented a set of rules based on POS-tagging and stemming to generate relations between word forms belonging to the same word family, also known as derivatives. This allows us to group word forms that belong to the same family and gather them under the same concept. Thus, every time we find two terms that follow the patterns noun-noun, noun-adj, noun-verb, adj-adj, noun-verb that share the same stem, we generate a related relation.
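The rule described above can be sketched as a pairwise check on POS pattern and shared stem. The stemmer below is a deliberately naive English suffix stripper written only for this illustration (TermitUp's actual stemmer and tag set are not specified here), and the allowed patterns encode only the distinct pairs listed above.

```python
from itertools import combinations
from typing import List, Tuple

# Toy suffix list, checked in order (longer suffixes first where they overlap).
SUFFIXES = ("ments", "ment", "ing", "ers", "er", "ed", "s")

# Unordered POS pairs that may yield a "related" relation, stored as
# alphabetically sorted tuples: noun-noun, noun-adj, noun-verb, adj-adj.
ALLOWED = {("NOUN", "NOUN"), ("ADJ", "NOUN"), ("NOUN", "VERB"), ("ADJ", "ADJ")}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def related_pairs(terms: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return pairs of distinct word forms that share a stem and whose POS
    tags form one of the allowed patterns."""
    pairs = []
    for (w1, p1), (w2, p2) in combinations(terms, 2):
        pattern = tuple(sorted((p1, p2)))
        if w1 != w2 and pattern in ALLOWED and toy_stem(w1) == toy_stem(w2):
            pairs.append((w1, w2))
    return pairs


terms = [
    ("employ", "VERB"),
    ("employment", "NOUN"),
    ("employer", "NOUN"),
    ("hearing", "NOUN"),
]
relations = related_pairs(terms)
```

Here the three forms built on "employ" are related to one another, while "hearing" stays unrelated because its stem differs. A production setting would replace `toy_stem` with a proper language-specific stemmer.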

Relation validation process.
The publication in RDF of the resulting terminological data does not constitute a module of the API itself, but is part of the enrichment module (Module 3), which directly returns a list of files in JSON-LD format for each of the terms processed. Users can choose the vocabulary used to represent such files: either SKOS or Ontolex. We consider this choice a fundamental piece of the application, because depending on the future application of the terminologies, one model will be more suitable than the other. For example, if the user wants to use this terminology with a tool designed to specifically manage taxonomies, such as PoolParty or VocBench, it is necessary to represent the terminology with SKOS. If, on the contrary, the user intends to enrich the terms with morphological information, then the Ontolex model38

Example of term modelled with SKOS.
Once the user has chosen their preferred RDF vocabulary, the publication module (Module 5) enables the publishing of the results in a Virtuoso Query SPARQL Editor39

Example of term modelled with ontolex.
The combination of the exploitation of LLOD resources and the publication of results in JSON-LD in Module 3 with the publication service represented by Module 5 satisfies Requirement 4: Reusability and Standardisation.
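To make the two serialisation options more concrete, the following sketch builds a JSON-LD document for one term, first as a skos:Concept and then as an ontolex:LexicalEntry whose sense refers to that concept. It is not the output format of TermitUp: the base IRI, the function names and the example term are assumptions, and only widely used SKOS and Ontolex terms are employed.

```python
import json
from typing import Iterable, Optional

SKOS = "http://www.w3.org/2004/02/skos/core#"
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
BASE = "http://example.org/terms/"  # hypothetical base IRI


def to_skos(term_id: str, pref_label: str, lang: str,
            alt_labels: Iterable[str] = (),
            definition: Optional[str] = None) -> dict:
    """Model a term as a skos:Concept with language-tagged labels."""
    doc = {
        "@context": {"skos": SKOS},
        "@id": BASE + term_id,
        "@type": "skos:Concept",
        "skos:prefLabel": {"@value": pref_label, "@language": lang},
    }
    alt = [{"@value": a, "@language": lang} for a in alt_labels]
    if alt:
        doc["skos:altLabel"] = alt
    if definition:
        doc["skos:definition"] = {"@value": definition, "@language": lang}
    return doc


def to_ontolex(term_id: str, written_rep: str, lang: str) -> dict:
    """Model a term as an ontolex:LexicalEntry whose sense points to the concept."""
    entry = BASE + term_id + "-" + lang
    return {
        "@context": {"ontolex": ONTOLEX},
        "@id": entry,
        "@type": "ontolex:LexicalEntry",
        "ontolex:canonicalForm": {
            "@id": entry + "#form",
            "@type": "ontolex:Form",
            "ontolex:writtenRep": {"@value": written_rep, "@language": lang},
        },
        "ontolex:sense": {
            "@id": entry + "#sense",
            "@type": "ontolex:LexicalSense",
            "ontolex:reference": {"@id": BASE + term_id},
        },
    }


skos_doc = to_skos("employment-contract", "employment contract", "en",
                   alt_labels=["contract of employment"])
ontolex_doc = to_ontolex("employment-contract", "employment contract", "en")
serialised = json.dumps(skos_doc, indent=2, ensure_ascii=False)
```

The SKOS variant keeps labels as plain literals on the concept, which suits taxonomy tools, whereas the Ontolex variant introduces separate resources for the entry, its form and its sense, to which morphological information could later be attached.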
TermitUp has been developed in the framework of the H2020 Prêt-à-LLOD40
The main use case of TermitUp has been in the framework of the H2020 Lynx project,41
To evaluate TermitUp’s enrichment we have compared this labor law terminology with a gold standard generated from the same corpus (see Table 4). In this gold standard, terms have been manually extracted, semi-automatically enriched and manually reviewed by two Spanish terminology experts. Afterwards, an expert in knowledge management from an international law firm has reviewed and validated the quality of the work. In the context of the project supported by Grupo CPOnet,44
Comparison of the enrichment numbers of the semi-automatically generated gold standard and the Labor Law terminology automatically generated with TermitUp. We are comparing five types of enrichment and the approximate generation time
But the impact of TermitUp goes beyond these domain-specific applications. Its use as a streamlined component in composite workflows suggests a wider range of applications. TermitUp might be used to create user-specific terminologies, contribute to the linguistic analysis of a community, or create more precise vector models, with new features corresponding to the links discovered by TermitUp. In its latest application within the SmarTerp project,45
TermitUp is available in a public GitHub repository,46
The main limitation found during the development of this service is related to the publication of enriched terminologies in RDF, i.e., to Requirement 5. The objective of this requirement is to maintain the traceability of the data, since the provenance of the information is an essential indicator of its quality. Thus, TermitUp endeavours to store all sources of the collected data.
In the following, we analyse the different types of data collected by the service and the representation possibilities that SKOS and Ontolex offer:
Terms, synonyms and translations: In SKOS, they are treated as literals, represented with the properties skos:prefLabel and skos:altLabel, which do not allow any additional information to be attached to them. SKOS-XL,49
Context: the context of a term is treated as an example of how it is used within a text. Therefore, the most suitable property to represent it in SKOS is skos:example (subproperty of skos:note50
Term note: this is a key element of traditional terminology cards that provides additional information, such as usage recommendations and domain data. Some of the modern language resources do not use term notes anymore, but others still keep them, thus, we consider them valuable pieces of knowledge for language professionals that need to be preserved. In SKOS, term notes can be modeled with skos:note and in Ontolex with ontolex:usage, both object properties pointing to literals. This implies that if we collect term notes from different language resources, we would not be able to model their provenance.
Definitions: the same occurs with definitions, since SKOS vocabulary applies skos:definition, that is also a subproperty of skos:note, therefore an object property that points to a literal. Ontolex does not propose any class for definitions either, and also employs skos:definition. We therefore have the same issue to model its provenance.
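For labels, the SKOS-XL extension mentioned in the list above turns each label into a resource of its own, so that a source can be attached to it. The sketch below illustrates that pattern in JSON-LD; it is not necessarily the modelling adopted in TermitUp, and the use of dct:source, the IRIs and the function name are assumptions made for the example.

```python
import json

SKOS = "http://www.w3.org/2004/02/skos/core#"
SKOSXL = "http://www.w3.org/2008/05/skos-xl#"
DCT = "http://purl.org/dc/terms/"


def xl_label(concept_iri: str, literal: str, lang: str,
             source_iri: str, preferred: bool = False) -> dict:
    """Attach a reified SKOS-XL label, carrying its own source, to a concept."""
    prop = "skosxl:prefLabel" if preferred else "skosxl:altLabel"
    return {
        "@context": {"skos": SKOS, "skosxl": SKOSXL, "dct": DCT},
        "@id": concept_iri,
        "@type": "skos:Concept",
        prop: {
            "@type": "skosxl:Label",
            "skosxl:literalForm": {"@value": literal, "@language": lang},
            # Assumed provenance property; other vocabularies could be used.
            "dct:source": {"@id": source_iri},
        },
    }


# Hypothetical IRIs for a concept and for the resource the synonym came from.
doc = xl_label(
    "http://example.org/terms/employment-contract",
    "contract of employment",
    "en",
    "http://example.org/sources/resource-a",
)
serialised = json.dumps(doc, indent=2)
```

Because the label is now a node rather than a literal, each synonym or translation can point to the resource it was retrieved from, at the cost of a more verbose graph.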
Besides the difficulties stated above, we face another modelling decision, since we find different types of sources at different levels. That is, the language resources with which the terms are enriched (e.g. IATE) can be understood as intermediate sources, that could be represented with the schema:provider property.51
Another discussion that arose during the modelling stage was whether skos:definition (and related documentation properties) should be attached to the skos:Concept or to the ontolex:LexicalSense. The SKOS specification remains vague on this point, and both approaches are at least syntactically sound, since neither skos:definition nor its superproperty skos:note declares an rdfs:domain. This freedom suggests a flexible use which might be suitable to capture some subtleties.
First, when terminological data is transformed from different sources, definitions sometimes seem attached to concepts (e.g. data imported from Wikidata qualifies concepts) and sometimes to lexical senses (e.g. data imported from WordNet). We suggest applying skos:definition in a flexible manner, with its subject being either a skos:Concept or an ontolex:LexicalSense, at the modeller’s discretion.
Second, this loose specification brings about the opportunity to distinguish reference and sense, in Fregean terms. In his famous essay Über Sinn und Bedeutung (1892), Gottlob Frege told apart the reference and the sense of expressions [25]. In this writing, Frege uses the example of Venus: both “the morning star” and “the evening star” refer to the same object, Venus, but the thought they express is rather different. The sense is a mode of presentation, illuminating only a single aspect of the referent. We wonder whether computers can capture these nuances. We can certainly make such an effort, reserving the objective information about Venus for its skos:Concept (e.g. radius ≈ 6,052 km), but assigning the different subjective perceptions to the different components of the synset. Perhaps we want to attribute to the ontolex:LexicalSense “Venus” a relatively neutral subjective value related to celestial bodies, and give the ontolex:LexicalSense “morning star” a hotter affective valence, possibly related to a more poetic context. These definitions and affective valences will necessarily be stereotypes, not reflecting subjective values (which are different for each mind), but intersubjective ones, namely, reflecting common perceptions and images (we refer the reader to [13] for more information about emotional words).
We wonder whether personalized lemonizations will ever be possible, describing the linguistic realities of specific individuals, perhaps inferred from personal big data such as personal email inboxes or the like. But this endeavour is well beyond the scope of this paper; we only stress the opportunity of attributing skos:definition (and other triples) to skos:Concept or ontolex:LexicalSense in the most beneficial manner; in this sense, the ontoterminology theory may be a good point of discussion [51].
We have therefore gathered such ongoing discussions on modelling issues in a proposal for good practices to model terminological resources, published as a Terminology draft in the wiki of the Ontology-lexicon Community Group in the W3C,54
Automating the generation of language resources (specifically, terminological resources) remains a challenging, unresolved task. Automatic terminology extraction and terminology management tools provide a good starting point and excellent assistance both for terminology experts and language professionals, but substantial manual effort is still required.
This contribution intends to lighten such manual efforts, firstly by automating the post-processing step that terminologists usually need to perform over automatically extracted terms, and secondly, by exploiting the wealth of linguistic and terminological knowledge available in the Linguistic Linked Open Data cloud. The fact that such resources are published according to Semantic Web standards and open licences contributes to their simple and immediate integration in language technology solutions. However, the majority of them are too general, and do not contain domain-specific terms nor rich linguistic descriptions.
TermitUp helps covering those gaps by extracting and post-processing terms from domain specific corpora, and enriching them with translations, synonyms, definitions, usage notes and terminological relations. Consequently, this application establishes links to the resources exploited, contributing to the population of the LLOD with domain expert knowledge. Additionally, the tool offers a module that helps validate the terminological relations retrieved, that sometimes may be imprecise. Finally, the tool structures the resulting enriched terminologies, either following SKOS or Ontolex model; and stores them in a Virtuoso SPARQL Editor so that they can be freely accessed.
If we make an overall comparison with the terminology-related technology presented in Section 2.1, we can see that TermitUp tackles some issues that those tools do not address, which makes TermitUp not a competitor but a complementary application:
Tilde’s Terminology platform extracts terms from corpora and is able to look for translations in other resources. However, it does not enrich terms with definitions, synonyms, usage contexts or relations, and it returns unstructured data.
SketchEngine is a tool specialised in corpus management. It is also very well known for its terminology extraction algorithm. Although it gives information about term co-occurrences and contextual information, the tool performs neither terminology enrichment nor semantic representation.
PoolParty is a powerful thesaurus management tool that allows creating hierarchical relations amongst terms, representing resources in SKOS and linking them to existing ones such as DBpedia. Still, all the work needs to be manually performed through a user interface. In this case, TermitUp could be used to speed up the terminology generation process and PoolParty would enable the manual revision by experts.
Saffron was originally a tool for taxonomy extraction. Recent improvements to the tool include terminology extraction, linking to DBpedia and knowledge graph generation. Saffron's features are similar to those of TermitUp; it is, however, intended to work over scientific publications, and its added value is not terminology enrichment as in TermitUp, but is “author and content” oriented.
VocBench is a tool for collaborative management of ontologies and thesauri. It does not generate terminological assets, but helps curate them. Like PoolParty, VocBench seems a complementary tool to manage the terminologies resulting from the TermitUp workflow.
Furthermore, a remarkable technical benefit of TermitUp is that it is open source, so the community can improve it, contribute to it or adapt it to their specific use cases. Also, as it is based on a REST API architecture, TermitUp can be easily integrated with other state-of-the-art technologies or tools.
On the other hand, throughout the development of the service, we have faced several modelling challenges, specifically those related to the provenance of each type of data. With the current vocabularies to model linguistic linked data, not every piece of linguistic information is represented by a class; notes and definitions, in particular, are not. This means that no additional information can be attached to them, such as the resource from which they have been retrieved. As a consequence, we have proposed to the W3C Ontology-Lexicon Community Group an improvement of the existing models, together with good practices to accurately structure terminological resources built from heterogeneous data sources.
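The provenance limitation described above can be illustrated with a minimal Python sketch. It contrasts a definition stored as a plain literal with a definition stored as a resource of its own, following the "documentation as related resource" pattern from the SKOS Primer (rdf:value plus a Dublin Core source). This is only one possible pattern and is not necessarily the model proposed by the authors; the example.org namespace, the term and the source URI are hypothetical.

```python
# Hypothetical identifiers throughout: the example.org namespace, the term and
# the source URI are illustrative and do not come from TermitUp's output.
EX = "http://example.org/terms/"
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCT = "http://purl.org/dc/terms/"


def literal(text, lang):
    """Represent a language-tagged literal as a tagged tuple."""
    return ("literal", text, lang)


def definition_as_literal(concept, text, lang):
    """Plain pattern: the definition is a literal, so no triple can describe it."""
    return [(concept, SKOS + "definition", literal(text, lang))]


def definition_as_resource(concept, def_node, text, lang, source):
    """Resource pattern: the definition is a node that can carry provenance."""
    return [
        (concept, SKOS + "definition", def_node),
        (def_node, RDF + "value", literal(text, lang)),
        (def_node, DCT + "source", source),
    ]


def sources_of_definitions(triples, concept):
    """Map each definition node of a concept to its recorded source."""
    def_nodes = [
        o for s, p, o in triples
        if s == concept and p == SKOS + "definition" and isinstance(o, str)
    ]
    return {
        s: o for s, p, o in triples
        if s in def_nodes and p == DCT + "source"
    }
```

With the literal pattern, `sources_of_definitions` has nothing to return, because a literal cannot be the subject of a triple; with the resource pattern, the source of each definition can be recovered.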
During this development, we have also noticed that there is room for improvement in the quality of open (language) knowledge bases available in the LLOD – a fact that affects the performance of services relying on them. This is because some of the biggest resources, such as Wikidata and ConceptNet, have been built semi-automatically, and their data have not been curated. By contrast, manually reviewed resources, such as KDictionaries’ RDF version [11], can only be accessed with permission. We therefore continue to pursue the publication of high-quality language data in open formats, such as the complete version of IATE RDF, and encourage data owners to do the same.
Regarding the publication of the results, an immediate step is to resume the work started in the Terminoteca RDF project [12], whose objective is the creation of a repository of multilingual terminologies, that is, to link different terminologies in a single graph so that they can be queried from a single entry point. Since the objective of TermitUp is to generate rich multilingual linked terminologies, the logical next step is to publish them in Terminoteca RDF, which would also allow browsing the terms through a graphical interface.
On the other hand, we observed that traditional terminological resources (such as TERMIUM and IATE) do not make explicit the relations that may exist between terms, which must be inferred by the user from the information contained in definitions or usage notes. Terminological knowledge bases or thesauri, which follow a more conceptual approach, intend to classify concepts in a conceptual structure and include hierarchical relations (broader-narrower term relations), as well as an unspecified type of relation that flags that two terms are somehow related (see “related to” in EuroVoc or Agrovoc). Frame semantics and other lexicon-driven approaches to terminology (see Section 3) agree on the interest of capturing terminological relations, including domain-specific relations that describe how two terms interact with each other in a given area of expertise. The most generic relations include cause-effect and object-function, for instance.
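The thesaurus-style relations mentioned above can be sketched in a few lines of Python. The sketch relies only on two properties of SKOS: skos:broader and skos:narrower are inverses of each other, and skos:related is symmetric. The function name and the example terms are hypothetical, and the sketch does not cover the domain-specific relations (such as cause-effect) discussed in the text.

```python
from collections import defaultdict


def build_relation_index(broader_pairs, related_pairs):
    """Index terms by their relations.

    broader_pairs: iterable of (narrower_term, broader_term) tuples.
    related_pairs: iterable of (term_a, term_b) tuples, treated as symmetric.
    """
    index = defaultdict(
        lambda: {"broader": set(), "narrower": set(), "related": set()}
    )
    for narrow, broad in broader_pairs:
        # Asserting one direction makes the inverse direction available too.
        index[narrow]["broader"].add(broad)
        index[broad]["narrower"].add(narrow)
    for a, b in related_pairs:
        # "Related" only flags an association, so it is stored both ways.
        index[a]["related"].add(b)
        index[b]["related"].add(a)
    return index
```

For instance, indexing the hypothetical pair `("employment contract", "contract")` as a broader relation makes "employment contract" retrievable as a narrower term of "contract" without stating it separately.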
Consequently, the next version of TermitUp is planned to include an additional module that performs automatic extraction of domain-specific relations amongst the terms in the terminology, based on the study of their behaviour in the corpus.
Finally, to challenge the current domain-specific application of the tool, we already have two potential projects in very different domains in which TermitUp will take part: 1) The authors have recently worked jointly with the DFKI research center,55
