Abstract
Since 2004 the European Commission’s Joint Research Centre (JRC) has been analysing the online version of printed media in over twenty languages and has automatically recognised and compiled large amounts of named entities (persons and organisations) and their many name variants. The collected variants not only include standard spellings in various countries, languages and scripts, but also frequently found spelling mistakes or lesser used name forms, all occurring in real-life text (e.g. Benjamin/Binyamin/Bibi/Benyamín/Biniamin/Беньямин/بنيامين Netanyahu/Netanjahu/Nétanyahou/Netahny/Нетаньяху/نتنياهو). This entity name variant data, known as JRC-Names, has been available for public download since 2011. In this article, we report on our efforts to render JRC-Names as Linked Data (LD), using the lexicon model for ontologies (lemon). Besides adhering to Semantic Web standards, this new release goes beyond the initial one in that it includes titles found next to the names, as well as the date ranges in which the titles and the name variants were found. It also establishes links towards existing datasets, such as DBpedia and Talk-Of-Europe. As a multilingual linguistic linked dataset, JRC-Names can help bridge the gap between structured data and natural languages, thus supporting large-scale data integration, e.g. cross-lingual mapping, and web-based content processing, e.g. entity linking. JRC-Names is publicly available through the dataset catalogue of the European Union’s Open Data Portal.
Introduction
Enhanced by Semantic Web technologies, the Linked Data publishing paradigm has become increasingly attractive in recent years [4,31], giving rise to an ever-growing Web of Data.1
As evidenced by the Linked Open Data (LOD) cloud:
A crucial point relates to the natural language interfacing and processing capabilities of the Semantic Web (SW). Indeed, if the Semantic Web is inherently language-independent [29], the question arises of how to mediate between, on the one hand, language-agnostic data representations and, on the other, language-based information needs and content. For that to happen, it is crucial to enrich structured data with linguistic information in several languages, and to enhance the Semantic Web infrastructure with language processing applications [10]. Overcoming the gap between the Web of Data and natural languages presents challenges and opportunities for both Semantic Web and Natural Language Processing (NLP), which stand here in a mutually beneficial relationship.
With regard to the Semantic Web, such developments are key in several respects. First, multilingual linguistic information can support data integration. Given the growing trend towards the publication of non-English data sources and the risk of ‘monolingual islands’ of data that do not interoperate [28], cross-lingual mappings between datasets are necessary. In this context, the lexicalisation of data on a multilingual basis can be of great help [60]. Linguistic knowledge can also ease data access. In particular, it can support the development of ontology-based Question-Answering systems that allow users to interact with data using their own language(s) [37,62]. Finally, even if data can be interlinked and accessed in several languages, the vast majority of content (i.e. the Web of Documents) remains unstructured. In order to facilitate information discovery and to further develop the scope of structured data, content needs to be marked up with semantic metadata. This again relies on the availability of web-based linguistic information and technologies.
With respect to Natural Language Processing, adopting Linked Data principles for the distribution of linguistic resources can bring many benefits, including: resource interoperability, both at a structural and conceptual level; resource integration (via interlinking); and resource maintenance (via a rich ecosystem of technologies allowing, among other things, continuous updating) [15]. Based on such insights, members of the NLP and SW communities – in particular the Open Linguistics Working Group and the W3C Ontology-Lexica Community Group2
Such as the LIDER project (lider-project.eu) and the BPMLOD and LD4LT W3C community groups (w3.org/community/bpmlod|ld4lt).
The task of Entity Linking (EL) is particularly representative of the symbiotic relationship between SW and NLP. It illustrates the evolution of information extraction from a document-centric to a semantic-centric viewpoint [43,50] and is at the core of many knowledge extraction tools for the Semantic Web [18,26]. This task requires aligning textual mentions of entities with a unique identifier in a knowledge base (KB), typically Wikipedia or DBpedia [36]. As in traditional named entity recognition, entities of interest are usually of type person, organisation and geo-political entity, although they can be extended to others. Many EL approaches have been developed [11,19,24,48], all of which acknowledge the lexical gap between KBs and textual content, especially the problem of entity surface form variation. Indeed, alternative spellings, abbreviations, aliases or other types of lexical variation make entity mention spotting and/or candidate selection difficult. When systems are provided with extra surface forms, their performance increases, particularly with noisy texts [12] or specific domains [67]. There is thus a need for lexical information regarding entity names, especially across languages.
In this paper we present the release of a multilingual named entity resource for person and organisation names, namely JRC-Names, as Linked Data. The resource is freely available and comprises hundreds of thousands of entity names and their multilingual variants in over twenty languages, including across scripts. This is a follow-up of a first release [56], from which it differs in that (1) it is rendered as Linked Data using the
The remainder of the paper is organised as follows. In Section 2 we introduce the JRC-Names resource; we briefly explain how it was produced (2.1), account for the quality of the resource (2.2) and specify what is included in the dataset (2.3). Next, we describe its conversion to Linked Data (Section 3) and present its interconnections with other datasets (Section 4). We then give accessibility details (Section 5) and summarise known and potential usages (Section 6); finally, after the discussion of related work (Section 7), we conclude and consider future work (Section 8).
Resource creation: Multilingual NER from the news
JRC-Names is a by-product of the Europe Media Monitor (EMM) family of news analysis applications, which gathers and analyses up to 220,000 news articles per day fully automatically in about 70 different languages from up to 7,000 news sites (status January 2015; [54]). Once gathered, news texts enter a pipeline of different modules which cluster related news, link news clusters over time and across languages, and – for currently twenty-one languages – recognise direct speech quotations and perform named entity recognition (NER) and classification for the entity types person and organisation. Location names are also recognised, through a lookup procedure, and disambiguated via document-based heuristics.
NER is performed using a number of manually curated language-independent rules that make use of language-specific lists of titles and other words/phrases that are typically found next to names. As regards person names, these pattern words can be titles (president), professions or occupations (tennis player, playboy), references to countries, regions, ethnic or religious groups (French, Bavarian, Berber, Muslim), age expressions (57-year-old), verbal phrases (deceased) and more. Such phrases, which we generally refer to as trigger words because they include far more than only titles, can be further modified (former) or occur in combination (57-year-old former British Prime Minister). Trigger word lists are produced in a combination of machine learning and manual collection from online sources. Those found historically next to each name are stored in order to build up a frequency-ranked repository of common titles (and more) for each entity. Organisation name recognition is performed in a similar manner, i.e. it makes use of lists of typical organisation name parts (organisation, club, international, bank, etc.). However, it is relatively weakly developed in EMM and, due to a coarse entity type categorisation, other entity types are included such as Belfast Agreement, Nobel Prize, Red Mosque or World War I. We refer the reader to [53] for further details about the NER system.
Besides NER applied to multilingual news, JRC-Names is also the result of a name variant matching process. The NER tool identifies over 500 new name forms per day and, for each of them, the system must determine whether it refers to a new entity or whether it is a spelling variant of an existing entity name. To this end, a language-independent name matching algorithm is applied, which computes a similarity measure (edit distance) between different name representations. These are obtained after several transformation steps including transliteration, normalisation and vowel removal to create consonant signatures. A newly identified name is merged with an existing one if their overall similarity is above an empirically defined threshold, and kept as a separate entity otherwise. More advanced approaches for name similarity across scripts have been explored in [49].
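The matching steps described above can be illustrated with a minimal sketch. It is not the EMM implementation: the tiny transliteration table, the averaging of the two similarity scores and the threshold value of 0.8 are all assumptions made for illustration, since the actual rules and the empirically defined threshold are not given here.

```python
import unicodedata

# Hypothetical, deliberately tiny Cyrillic-to-Latin table (assumption);
# it only covers the letters needed for the example below.
CYRILLIC_TO_LATIN = {
    "н": "n", "е": "e", "т": "t", "а": "a", "ь": "",
    "я": "ya", "х": "kh", "у": "u",
}

def normalise(name):
    """Lowercase, transliterate, then drop diacritics and non-letters."""
    text = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in name.lower())
    decomposed = unicodedata.normalize("NFKD", text)
    # Combining marks are not alphabetic, so isalpha() removes them too.
    return "".join(ch for ch in decomposed if ch.isalpha())

def consonant_signature(name):
    """Normalised name with the vowels a, e, i, o, u removed."""
    return "".join(ch for ch in normalise(name) if ch not in "aeiou")

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    """Edit distance rescaled to the range [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def is_variant(new_name, known_name, threshold=0.8):
    """Average the similarity of two representations (assumed combination)."""
    full = similarity(normalise(new_name), normalise(known_name))
    skeleton = similarity(consonant_signature(new_name),
                          consonant_signature(known_name))
    return (full + skeleton) / 2 >= threshold
```

Under these assumptions, Nétanyahou and Нетаньяху would both be merged with Netanyahu, whereas Obama would be kept as a separate entity.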
It is important to clarify the concept of language with respect to names and their variants. We avoid talking about certain name variants as being in a certain language. Instead, we prefer to consider that a certain name variant is more frequently found in texts written in a certain language. The same variant may also be found in other languages, but probably with different distributions. For instance, Michail Gorbatschow is the most frequent spelling used in German news when referring to the former Soviet leader Михаил Горбачев, while Mikhaïl Gorbatchev is more frequent in French and this variant is also found in Portuguese texts. This relative frequency information is useful if the purpose is to generate an easy-to-read text in another language (e.g. during Machine Translation).
Finally, let us consider the question of morphological inflection. Like other lexical units, proper names are morphologically inflected in many languages. Inflection mechanisms are numerous and heterogeneous, and they can be very difficult to handle when dealing with many languages. Some of the inflected forms found for the surname of the current US president are Obamával (Hungarian), Obamę (Polish) and Obamas (German). In order to avoid the storage of all inflected forms in the database (inefficient and untidy) while keeping the possibility to capture at least a large part of their occurrences in texts, EMM pre-generates the most common inflections for a subset of known name variants or it uses suffix replacement rules during the NER process. This mechanism makes it possible to recognise the majority of name inflections in text and to return the base form for that name. Hence, morphological inflections of entity names are not meant to be part of JRC-Names. However, several of them were not detected as morphological variants and have erroneously been categorised as variants of known names. This is mostly an aesthetic issue because, from a practical point of view, their presence improves the lookup procedure of names in text.
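The idea of suffix replacement can be sketched with the three inflected forms quoted above. The rule table, the list of known names and the function name are invented for this illustration; the actual EMM rules are not described in this paper.

```python
# Hypothetical suffix replacement rules: (inflected suffix, base-form suffix).
SUFFIX_RULES = {
    "hu": [("ával", "a")],  # Obamával -> Obama
    "pl": [("ę", "a")],     # Obamę    -> Obama
    "de": [("s", "")],      # Obamas   -> Obama
}

# Stand-in for the database of known name forms (assumption).
KNOWN_NAMES = {"Obama"}

def base_form(token, lang):
    """Return the known base form of a possibly inflected name, or None."""
    if token in KNOWN_NAMES:
        return token
    for suffix, replacement in SUFFIX_RULES.get(lang, []):
        if token.endswith(suffix):
            candidate = token[: len(token) - len(suffix)] + replacement
            # Accept the rewrite only if it yields a name we already know.
            if candidate in KNOWN_NAMES:
                return candidate
    return None
```

Checking the rewritten form against the list of known names keeps such rules from turning arbitrary words into names.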
Since 2004, the software has identified about 1.75 million different person names and about 10,000 organisation names. In addition to these ‘canonical’ name forms, the database contains about 390,000 additional lexical variants. The database grows by about 700 name forms (new names or variants of known names) per week.
Resource quality
The JRC’s software recognises entities in annotated gold standard NER corpora with an average Precision of 92.13% and a Recall of 50.33% for the nine languages De, En, Es, Hu, It, Nl, Pt, Ro and Tr. Precision is highest for English (96.83%) and lowest for Portuguese (83.41%). Recall is highest for Hungarian (73.89%) and lowest for Turkish (31.70%). The evaluation values of the real-life NER system are actually better than these figures suggest because of the specific settings of JRC’s system, which are geared towards (a) recognising each name at least once in a whole cluster of related news and (b) grounding each name to a real-life entity.
The result of EMM’s automatic NER and variant merging process is subject to a (light-weight) human moderation process. Manual intervention is carried out daily (at most one hour per day on average), focusing on the most frequently mentioned names and on regular mistakes that affect large numbers of entities. The human moderator also has the possibility to mine – assisted by an automatic tool – name variants from cross-lingual Wikipedia links and to download entity images. This semi-automatic Wikipedia mining increases the number of languages for name variants beyond the ones covered by the NER system. Although extremely valuable, the manual verification mends only a small part of the data; JRC-Names remains the product of an automated process and, as a consequence, contains noise. The main types of errors consist of non-entities (e.g. Red Piano or French Doctor), wrong name extents (e.g. Even Obama) and wrong entity type (e.g. Merlin Biosciences as a person). Additionally, it is possible that different entities have been merged into one and, conversely, that homonyms have the same identifier, as no disambiguation mechanism is in place. In order to keep most mistakes out of the JRC-Names distribution and also to stick to the more useful entities, only those entities whose frequencies are above a threshold are included in JRC-Names, as we shall see in the next section.
Content of the linked dataset
A first version of JRC-Names was released in 2011 in the form of a tab-separated text file, accompanied by a Java library for fast lookup. The named entity resource file corresponds to a subset of EMM’s database, and it has since been available on JRC’s website5
Multilingual Linked Open Data for Enterprises:
Person and organisation entity names. Those entities must have been found in at least five different news clusters (i.e. all mentions in all clustered articles of the same day count only as one).7
As mentioned in Section 2.1, EMM groups related news articles into ‘news clusters’ and deals with each cluster as a meta-document. Frequency counts of named entities are relative to these clusters, and not to single news articles.
Name variants. They must satisfy the threshold of having been found in at least 2 different news clusters.
Trigger words. They correspond to titles and function names that have been found in news articles next to the person mentions (cf. Section 2.1). Trigger words are included if they were found in at least 5 different news clusters.
Time stamps. Each name variant or title is accompanied by two time stamps: the first insertion date into the database (when EMM first found this title), and the last update date. This information is useful to detect changing titles, e.g. when a person is mentioned with different positions.
Frequency information. Each name variant has a news cluster frequency count.
Prior probabilities. Name variants have monolingual and multilingual prior probabilities, which reflect how likely an entity is mentioned with a specific variant in a certain language, or across all languages.8
The prior probability of a specific variant of an entity is calculated by dividing the frequency count of this variant by the sum of the frequency counts of other variants of the same entity in the same or across languages.
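The computation of the prior probabilities can be sketched as follows. The sketch assumes that the denominator covers all variants of the entity, including the one under consideration, so that the priors of an entity sum to one; the frequency counts are invented for the example and do not come from the dataset.

```python
from collections import defaultdict

def prior_probabilities(cluster_freq):
    """Compute monolingual and multilingual priors for one entity.

    cluster_freq maps (variant, language) to a news cluster frequency count.
    """
    total = sum(cluster_freq.values())
    per_language = defaultdict(int)
    for (_variant, lang), freq in cluster_freq.items():
        per_language[lang] += freq
    return {
        (variant, lang): {
            "monolingual": freq / per_language[lang],
            "multilingual": freq / total,
        }
        for (variant, lang), freq in cluster_freq.items()
    }

# Invented counts for a single entity (illustration only).
example = {
    ("Michail Gorbatschow", "de"): 80,
    ("Mikhail Gorbachev", "de"): 20,
    ("Mikhaïl Gorbatchev", "fr"): 60,
    ("Mikhail Gorbachev", "fr"): 40,
}
priors = prior_probabilities(example)
```

With these counts, Michail Gorbatschow receives a monolingual prior of 0.8 in German (80 out of 100) and a multilingual prior of 0.4 (80 out of 200).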
The resource consists of lexical knowledge, i.e. name variants in multiple languages, about individuals, i.e. person and organisation entities. Lemon and other linguistic vocabularies (Section 3.2) were used to render JRC-Names as Linked Data (Sections 3.3 and 3.4).
The lemon model
lemon is a model to represent linguistic information relative to ontologies in RDF. More specifically, it makes it possible to specify the meaning of lexical units and to describe their constructions with respect to the vocabulary of an ontology. In line with the principle of semantics by reference [9,39], lemon maintains a clean separation between the lexical layer, which deals with the morphological and syntactic description of lexical entries (words or phrases), and the ontological layer, responsible for describing the meaning (or resolving the reference) of the lexical entries. The model builds on previous work for representing lexica and combines the strengths of LexInfo [16] and of the Linguistic Information Repository [45], both based on the Lexical Mark-up Framework [25]. The core of the lemon model9
Lexicon, which collects lexical entries and is marked with a language,
Lexical entry, which comprises all syntactic forms of an entry,
Lexical form, which represents the surface realization of a lexical entry, usually in the form of a written representation,
Lexical sense, which represents the usage of a lexical entry as a reference to an ontological entity.
The expression is from [17].
Apart from lemon, which enables the representation of most JRC-Names data, other controlled vocabularies are used: LexInfo and OLiA, which provide linguistic categories and mapping between linguistic schemes, are used to specify linguistic categories and relation properties of name variants [14,16]; lexvo, which provides global IDs for language-related objects, is used to encode language information [20]; and the DBpedia ontology, which organises Wikipedia concepts, is used to encode entity types. As regards meta-data information, the VoID [1] and the DCTerms vocabularies are used. Finally, when no existing vocabulary could answer our needs, we defined our classes and properties in a dedicated vocabulary.12
URLs of all vocabularies are mentioned in Fig. 1.
At the ontological level, JRC-Names entities are encoded as dbo:Person or dbo:Organisation. Each entity has a language-independent ‘base name’, i.e. the variant that was chosen for display purposes inside EMM. The choice was made according to the name being either the most frequently found variant in the news (across languages), or the variant found on Wikipedia, or a frequent Latin script version of a name originally written in another script. This base name is therefore not marked with a language (although it is typically a name form that is frequently found in English text) and is encoded as the skos:prefLabel of the RDF entity.
At the lexical level, entity name variants are encoded as lemon:LexicalEntry, the language of which is specified through lemon and lexvo language properties (ISO 639-1 and ISO 639-3). These lexical entries are also defined as olia:NamedEntity and are further characterised with the lexinfo properNoun part-of-speech.
JRC-Names exhibits a relatively high degree of lexical variation. There are multiple scripts (e.g. Latin vs. Cyrillic Barack Obama – Барак Обама), omission or addition of name parts (Barack Hussein Obama Jr.), inflected forms (Barack Obamát), typos (Barrac Obama), inversion of name parts (Obama Barack) and various other forms (e.g. Barack O’Bama). Because the collection of variants is based on string similarity, formally very different units such as diachronic variants or aliases (Eric Blair, alias George Orwell) do not exist in the resource (or if so, they were manually entered). Variant types, however, are not specified in JRC-Names. As a consequence, even if lemon offers the possibility to represent term variation at the level of surface form, word or sense [46,47], name variants are all lemon:LexicalEntry (i.e. words), although some could be conceived as different lemon:Forms of a variant. Accordingly, name variants of the same language (and of the same entity) are related through lemon:lexicalVariant relations.

Fig. 1. Graphical illustration of JRC-Names data representation, with the example of the entity Jean-Claude Juncker.
The path from name variants to their referent is set via lemon:LexicalSense. As reification of the relation between a word and a concept (here an entity), a lexical sense can support the expression of information which is neither of lexical nor of ontological nature. JRC-Names associates contextual information to entity name variants, that is to say their news cluster frequency and the dates of their first insertion and last update in the database. Based on news cluster frequencies, we additionally compute monolingual and multilingual prior probabilities. This information is rendered as properties of name variant lexical senses. Such properties are circumstantial and do not qualify the linguistic usage but the incidence of the association of a given variant with a specific entity (how many times this name appears with this referent, and when this association was first and last observed). This is the reason why we did not use the lemon:context property, which concentrates on pragmatics or discourse properties such as register or temporal and geographical usage constraints. With regard to proper names, such a context could for example specify the time span usage of Byzantium vs. Constantinople vs. Istanbul, or the register difference between Michael Schumacher and Schumy.
Lexical senses additionally allow the expression of translation relations between name variants in different languages referring to the same entity. Translation relations indeed fall within the domain of lexical sense, as they must be stated between disambiguated names (the English lexical entry London will translate into the French Londres when referring to the city, but into London when referring to the writer). These relations are represented through lexinfo:translation object properties, as there was no need to use a more principled way to do it [30].
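To make the modelling of entities, name variants and lexical senses more concrete, the following sketch prints a small Turtle fragment for one name variant. It only illustrates the pattern described in this section and is not an excerpt of the dataset: the resource IRIs under example.org, the jrc-model namespace IRI, the choice of lemon:canonicalForm for the written form and the frequency value are assumptions, and the other prefix IRIs are given as we understand them (the vocabularies actually used are listed in Fig. 1).

```python
import json
from urllib.parse import quote

# Prefix IRIs are assumptions for this sketch; "ex:" and "jrc-model:" point
# at placeholder namespaces, not at the published ones.
PREFIXES = """\
@prefix lemon: <http://lemon-model.net/lemon#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#> .
@prefix olia: <http://purl.org/olia/olia.owl#> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix jrc-model: <http://example.org/jrc-model#> .
@prefix ex: <http://example.org/jrc-names/> .
"""

def literal(text):
    """Quote a Python string as a Turtle string literal."""
    return json.dumps(text, ensure_ascii=False)

def variant_to_turtle(entity_id, base_name, entity_type, variant, lang, cluster_freq):
    """Serialise one entity, one name variant and the sense linking them."""
    entity = f"ex:entity-{entity_id}"
    entry = f"<http://example.org/jrc-names/entry/{lang}/{quote(variant)}>"
    sense = f"<http://example.org/jrc-names/sense/{lang}/{quote(variant)}>"
    return (
        f"{PREFIXES}\n"
        # Ontological level: the entity and its language-independent base name.
        f"{entity} a dbo:{entity_type} ;\n"
        f"    skos:prefLabel {literal(base_name)} .\n\n"
        # Lexical level: the name variant as a lexical entry.
        f"{entry} a lemon:LexicalEntry, olia:NamedEntity ;\n"
        f"    lemon:language {literal(lang)} ;\n"
        f"    lexinfo:partOfSpeech lexinfo:properNoun ;\n"
        f"    lemon:canonicalForm [ lemon:writtenRep {literal(variant)}@{lang} ] ;\n"
        f"    lemon:sense {sense} .\n\n"
        # The sense links the variant to the entity and carries the frequency.
        f"{sense} a lemon:LexicalSense ;\n"
        f"    lemon:reference {entity} ;\n"
        f"    jrc-model:clusterFreq {cluster_freq} .\n"
    )

# Invented identifier and frequency count (illustration only).
turtle = variant_to_turtle(1, "Jean-Claude Juncker", "Person",
                           "Jean-Claude Juncker", "fr", 42)
```

The frequency is attached to the lexical sense rather than to the lexical entry, which reflects the point made above: it qualifies the association of a variant with an entity, not the variant itself.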
Besides name variants in multiple languages, the dataset also contains person entity ‘titles’. As detailed in Section 2.1, titles correspond to the trigger words that helped recognise entities in texts and they consist of a heterogeneous set of nominal phrases referring to the function or the social status of a person. Titles are lexically defined as lexical entries and as olia:TitleNoun, a morphosyntactic category that describes those items appropriately. They are marked with language, but their part-of-speech remains unspecified. Title lexical units refer through lexical senses to the dbo:PersonFunction class, in a kind of loose lexicalisation of this abstract concept.
As for name variants, frequency and time-stamp information is available. However, since these elements regard the relation between a title and a person entity and not the one between a title and its concept (dbo:PersonFunction), they cannot be stated on titles’ lexical senses. In other words, what is qualified here is not the linguistic relation between a word and its concept, but the factual one of a person entity having, or occurring with, its title(s). In order to correctly encode this information as well as to capture the person/title relation, we introduced a jrc-model:Occurrence class. It represents a specific occurrence of a title lexical sense and establishes the relation with a person entity via the jrc-model:hasTitle property. As expected, instances of jrc-model:Occurrence additionally hold the frequency and time properties relative to a given person/title association.
Let us mention that in a more rigorous setting the occurrence of a title lexical sense (an instantiation of jrc-model:Occurrence) should point not to the person entity (dbo:Person) but to one of its name variants with which the title originally occurred. This information is however not available in the original database, where title expressions are directly associated with person entities.
A graphical representation of JRC-Names entity and lexical knowledge is given in Fig. 1, with the example of the current President of the European Commission Jean-Claude Juncker. As it is not possible to represent all information, only a few items of each type of information are depicted.
Interlinking
JRC-Names introduces links towards two specialised datasets, New York Times and Talk of Europe, and a generic one, DBpedia [36]. The New York Times (NYT) initiated some years ago the Linked Data publication13
DBpedia contains a great number of person entities with many properties in various languages. As briefly mentioned in the introduction, a well-known issue with knowledge bases is entity disambiguation. Although this was not the primary goal of the present work, we developed a light-weight strategy in order to link JRC-Names entities with their correct counterpart in DBpedia. Given a JRC source entity and its variants in all languages, the algorithm first looks for an exact match between the variants and the English rdfs:label of non-ambiguous person and organisation DBpedia entities. Next, if no match is found, ambiguous DBpedia candidates are selected (based on the variant surface forms) and if only one of these candidates is of the same type as the JRC source entity, then the resources are interlinked. Finally, when there is more than one possible candidate (i.e. DBpedia entities having the same type and label as the JRC one), the set of English titles of the JRC entity is considered against a selection of English properties of DBpedia candidates (dbo:office, purl:description and db-prop:title), looking again for an exact match. Overall 95,437 links were created (cf. Table 1), 64,002 thanks to the first alternative, 31,340 thanks to the second and 95 to the third. We manually evaluated the correctness of 100 randomly selected links and obtained a Precision of 91%. Errors are mainly due to EMM mixing different persons, resulting in ambiguous entities that are difficult to link. The linking strategy could be improved in several ways, e.g. by exploiting multilingual features and making a joint use of the different DBpedia chapters.
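The three-step strategy can be summarised in the following sketch, which works on a small in-memory index instead of DBpedia itself. The data structures, the example.org identifiers and the rule that a label is non-ambiguous when it has exactly one candidate are assumptions made for the illustration, as is the requirement that the third step yields a single candidate.

```python
LINKABLE_TYPES = ("Person", "Organisation")

def link_entity(jrc, label_index):
    """Return (candidate URI, step number), or (None, 0) if no link is made.

    jrc: dict with "variants" (list), "type" (str) and "titles" (list).
    label_index: maps an English label to the list of candidates bearing it;
    each candidate is a dict with "uri", "type" and "descriptions" (set).
    """
    hits = [label_index.get(variant, []) for variant in jrc["variants"]]

    # Step 1: exact match with the label of a non-ambiguous entity.
    for matches in hits:
        if len(matches) == 1 and matches[0]["type"] in LINKABLE_TYPES:
            return matches[0]["uri"], 1

    # Step 2: among ambiguous candidates, keep those of the same type.
    ambiguous = {c["uri"]: c for matches in hits if len(matches) > 1
                 for c in matches}
    same_type = [c for c in ambiguous.values() if c["type"] == jrc["type"]]
    if len(same_type) == 1:
        return same_type[0]["uri"], 2

    # Step 3: exact match between JRC titles and candidate descriptions.
    titles = set(jrc["titles"])
    described = [c for c in same_type if titles & c["descriptions"]]
    if len(described) == 1:
        return described[0]["uri"], 3
    return None, 0

# Invented candidates (illustration only, not DBpedia data).
INDEX = {
    "Jean-Claude Juncker": [
        {"uri": "ex:Juncker", "type": "Person", "descriptions": set()}],
    "Mercury": [
        {"uri": "ex:Mercury_place", "type": "Place", "descriptions": set()},
        {"uri": "ex:Mercury_org", "type": "Organisation", "descriptions": set()}],
    "John Smith": [
        {"uri": "ex:Smith_mp", "type": "Person",
         "descriptions": {"member of parliament"}},
        {"uri": "ex:Smith_actor", "type": "Person", "descriptions": {"actor"}}],
}
```

In this toy index, Jean-Claude Juncker is linked at the first step, the organisation Mercury at the second, and a John Smith with the title actor at the third; a John Smith without any title remains unlinked.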
Some interlinks are set at vocabulary level [34]. JRC’s classes and properties being quite specific, only a few links could be set, mostly on NYT’s vocabulary, with loose relationships (rdfs:seeAlso) from jrc-model:clusterFreq, jrc-model:insertionDate and jrc-model:lastUpdate towards New York Times associated_article_count, first_use and latest_use properties respectively. Finally, let us mention that backward links towards the MLODE dataset are set, based on JRC entity IDs.
Statistical profile of JRC-Names RDF dataset
The RDF version of JRC-Names features an overall number of 72.5 million triples. Table 1 gives further details on the statistical profile of the dataset. The majority of entities are persons, with 331,242 resources of this type against 7,391 of type organisation. Those entities are lexicalised through 1.7 million lexical entries, gathered into a total of 171 language-specific lexicons. It is worth recalling here that NER is performed for 21 languages, and that data for other languages is added through Wikipedia mining. Next, there are about 2.4 million monolingual lexical variant relations, and 32 million translation relations. Finally, external connectivity is reasonably good, with a third of the entities being connected to either DBpedia, New York Times, or Talk of Europe.
Resource metadata are expressed using the VoID vocabulary; provided descriptions include general, access and structural metadata.
The JRC-Names linked dataset is served on the web via the EU Open Data Portal with: an RDF dump file,14
JRC-Names has been used for a whole range of tasks. The main usage is probably the improvement of the recall of searches in databases (including audio-visual) and text collections (including the Internet) [2,57] by expanding the initial user query with all name variants. Alternatively, name mentions in the search space can be normalised by replacing variants with a standard form. Search expansion is particularly important across scripts as even approximate matching techniques will not find foreign script variants of the searched name. Hands-on users of JRC-Names have either replaced the whole entity name, e.g. ‘George Bush’, by the set of its variants (‘George Busch’, ‘George Buhs’, ‘Corc Uolker Buş’), or they have split all entities in JRC-Names to produce lists of variants for each name part, e.g. ‘Georgius’, ‘Georges’, ‘Georg’, ‘Džordž’, etc. for the English standard spelling of ‘George’. By doing this, the knowledge contained in the resource can be applied to any names and not only to media VIPs. Another usage of JRC-Names relates to Machine Translation systems, which typically have problems translating proper names [5]. This challenge can be overcome by identifying and removing names before the translation process and by then reinserting the target language equivalent [61]. Also, lists of names in two different scripts are often used to learn transliteration rules, e.g. [49]. Collections of names and their variants have been used to train and/or improve Named Entity Recognition tools [8,21,65] or to disambiguate name mentions [2], but also, more generally, to develop Language Technology tools for lesser-resourced languages [58,69]. The development of higher-level Language Technology tools, such as co-reference resolution [55] and cross-lingual linking of related documents in different languages [52], has also benefited from JRC-Names.
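The two search-related usages, query expansion and mention normalisation, can be sketched as follows. The in-memory variant table reuses the ‘George Bush’ example above; the function names and the boolean OR query syntax are assumptions and do not refer to any particular search engine.

```python
# Variant table built from the example in the text; in practice it would be
# loaded from the JRC-Names resource.
VARIANTS = {
    "George Bush": ["George Busch", "George Buhs", "Corc Uolker Buş"],
}

def expand_query(name):
    """Build a boolean query matching the name or any of its known variants."""
    forms = [name] + VARIANTS.get(name, [])
    return " OR ".join(f'"{form}"' for form in forms)

def normalise_mentions(text):
    """Replace every known variant in the text with its standard form."""
    for standard, variants in VARIANTS.items():
        for variant in variants:
            text = text.replace(variant, standard)
    return text
```

Query expansion leaves the indexed documents untouched, while normalisation rewrites the search space once so that a single standard form can be queried afterwards.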
Furthermore, JRC-Names has been used in higher-level sociological or political studies such as tracking researchers’ mobility on the web [27] or pre-processing text for a subsequent political science study [3]. In principle, JRC-Names can also be useful as a component in Language Technology tools for opinion mining, summarisation, topic detection and tracking, and more.
The LOD version of JRC-Names contains more information and links to other LOD resources. This not only widens the application areas, but most of all it opens the way to a fully-automatic usage of the data. First, the machine-readable version of JRC-Names can be queried by agents18
Examples of queries are available at:
This section summarises previous efforts to compile multilingual lexical information about names, and considers named entity-related data on the LLOD.
Named entities, or proper names when limited to the core categories of person, location and organisation, represent an open word class which evolves endlessly. Dedicated resources or gazetteers are therefore not easy to acquire and require constant updates. In this context, the collaboratively built, semi-structured and multilingual Wikipedia resource appeared as a great relief, and several named entity dictionaries were built out of it [57,59,68]. Prolexbase [38], a manually produced multilingual ontology of proper names built up over many years, recently adopted a semi-automatic enrichment strategy based on Wikipedia [51]. All of these resources are the result of exploiting Wikipedia and, with the exception of [59] which makes use of LMF, they are not interoperable.
Many linguistic resources have been exposed as Linked Data recently. As for entities, they appear mainly in encyclopedic dictionaries and knowledge bases, such as BabelNet [23], DBpedia [36] and YAGO [33], but some are present in lexical resources. In the latter case, resources such as WordNet RDF [42] or lemonUBY [22] do include entity names, but in a rather limited number and with little information about lexical variation. In the former, all entities derive from Wikipedia and are primarily the focus of encyclopedic descriptions. At the lexical level, Wikipedia is strong at providing cross-lingual and cross-script variants, but it contains only a few spelling variants within the same language and it does not contain information on morphological variants. In contrast, JRC-Names is mostly built up by recognising name variants in real-life multilingual texts. A dedicated resource has been compiled as part of DBpedia Spotlight [44], which consists of entity lexicalisations collected over the graph of labels, redirects and disambiguations of the KB. Here again, the range of name variants is limited to Wikipedia data, while JRC-Names provides name occurrences from real-life texts. Overall, the picture that emerges is one of complementarity, where various datasets could provide different types of information about entities.
Conclusion
We have presented the new release of the JRC-Names resource as Linked Data using lemon, a model for representing ontology lexica. This work is the continuation of previous efforts and is in line with the general effort of the European Commission to support multilingualism and language diversity. Compared with the initial release of JRC-Names in 2011, the current one is available as Linked Data and provides more information, namely person titles, occurrence time-stamps and frequency information. With name variants extracted from multilingual news, this resource complements those based on Wikipedia and contributes to the ongoing developments within the SW and NLP communities to support data access in several languages.
Future work could be manifold. At data level, it would be useful to further specify the variant types, to carry out a lemon-based publication of morphological generation rules, and to clean erroneously conflated entities (e.g. using titles). At web level, interlinking with other datasets (lexical, encyclopedic or factual) could be expanded, as well as intralinking among titles.
Acknowledgements
The Europe Media Monitor EMM is a multiannual group effort involving many tasks and we would thus like to thank all past and present EMM team members for their help and dedication. We are also grateful to Valentina Fratto and her team from the EU Publication Office for the fruitful collaboration. Finally, we would also like to thank the members of the LIDER project, whose activities emphasise the need for and greatly support the development of linguistic Linked Data.
