Abstract
This paper describes the publication and linking of (parts of) PAROLE SIMPLE CLIPS (PSC), a large-scale Italian lexicon, to the Semantic Web and the Linked Data cloud using the lemon model. The main challenge of the conversion is discussed, namely the reconciliation between the PSC semantic structure, which contains richly encoded semantic information following the qualia structure of the Generative Lexicon theory, and the lemon view of lexical sense as a reified pairing of a lexical item and a concept in an ontology. The result is two datasets: one consists of a list of lemon lexical entries with their lexical properties, relations and senses; the other consists of a list of OWL individuals representing the referents for the lexical senses. These OWL individuals are linked to each other by a set of semantic relations and mapped onto the SIMPLE OWL ontology of higher-level semantic types.
Introduction
The central aim of the linked data movement is to make it easier to use and to share data distributed at various locations across the web by setting up a standardized way of structuring, describing, and interlinking datasets [2]. In the linked data model, data is formatted according to the Resource Description Framework (RDF).
The language resources and technology (LRT) community is becoming increasingly active within the linked data movement. This is the result of a greater awareness of the opportunities that linked data offers for setting up the kind of general LRT infrastructure variously described in the LRT literature as the “Lexical Web” [3] and as a “Lexical Linked Space” [10]. LRT research has traditionally put great emphasis on the standardisation, linking, and reusability of lexical resources (LRs) and the linked data movement makes it far easier to achieve these core aims.
The increased awareness of the importance of linked data within the LRT community has resulted in a trend towards the conversion of language resources, in particular lexicons, into the RDF format. This has the added benefit that these language resources can also be linked to other kinds of resources on the linked data cloud, such as DBpedia.
By now there has been extensive work carried out on the publication of lexical resources as linked data. Among the most important studies in this area are [9,18] describing the conversion of the Princeton WordNet, and [5] for the multilingual resource EuroWordNet.
This paper describes the conversion of a subset of the lexical entries, namely all of the nouns, from a large-scale, multi-layered Italian lexicon PAROLE SIMPLE CLIPS (PSC) as linked open data using the lemon model. This process included the full conversion of the semantic layer of the lexicon into Web Ontology Language (OWL), as well as the creation of a resource containing all the nouns of PSC lexicon; these two resources were then linked using the
lemon (LExicon Model for ONtologies)2
lemon defines a core module and a set of additional modules that together serve to describe the basic morphological and syntactic data typically associated with the lexical entries in a lexicon. It also allows the addition of semantic data to a given lexical entry through mapping the lexical entry to a concept in an ontology via an intermediate lexical sense object. This entails a clear separation between the linguistic and ontological levels of a lexical resource, which in turn enables the reuse and plugging in of different ontologies to the same lexical resource.
As mentioned above, this paper describes the (partial) conversion of a lexical resource into the RDF format. The lemon framework was adopted for this purpose for a number of reasons. It was, the authors felt, an extremely efficient and easy-to-use model. In addition, it has become one of the popular models available for publishing computational lexicons as linked data; indeed, one might argue that it is almost a de facto standard (an extensive list of lemon users can be found at http://www.lemon-model.net/).
PAROLE SIMPLE CLIPS (PSC) is a multi-layered Italian language lexicon that was developed in successive stages within the framework of three major lexical resource projects, PAROLE, SIMPLE and CLIPS. PAROLE [15] and SIMPLE [12] were consecutive European projects which resulted in the creation of a wide ranging Italian language lexicon PAROLE-SIMPLE (in addition to similar lexicons in 11 other European languages) that was structured into different, interconnected layers; CLIPS4
CLIPS stands for Corpora e Lessici dell’Italiano Parlato e Scritto.
The lexical information in PSC is encoded at different descriptive levels; these are the phonetic, morphological, syntactic and semantic layers. The semantic layer of PSC, SIMPLE, is largely based on Pustejovsky’s Generative Lexicon (GL) theory [1,14].
GL theory is based on the idea that the meaning of each word in a lexicon is structured into components, one of which, the qualia structure, consists of a bundle of four orthogonal dimensions. These dimensions allow for the encoding of four separate aspects of the meaning of a word: the formal, namely that which allows the identification of an entity within a hierarchy; the constitutive, what an entity is made of; the telic, that which specifies the function of an entity; and finally the agentive, that which specifies the origin of an entity. These qualia structures are used within GL theory in order to explain polysemy in natural languages.
The semantic layer of PSC, which we will refer to as SIMPLE, is in fact based on the notion of an extended qualia structure [16]. As the name suggests, this extends the qualia structure found in GL by defining a hierarchy of different constitutive, telic, and agentive relations, as we will see below.
SIMPLE contains a language independent ontology of 153 semantic types common to all of the different language lexicons that were developed as part of the PAROLE SIMPLE projects, as well as
As well as a series of lexical relation relations organized into 5 main classes: and 4 sets of lexical relations
For example, the lexical entry limone (which means lemon in English) has three USems, each subsumed by a different semantic type.
type: Fruit
type: Color
type: Plant
Among these three USems, the PSC semantic framework implements different types of relations, for example qualia relations such as:
is-a USemD2369frutto, produced-by USemD2244limone, object-of-the-activity USemD598mangiare,
and lexical relations such as:
polysemy-plant-fruit USemD2244limone,
polysemy-vegetal-entity-color USem76884limone.
Previous work [17] on the construction of an OWL ontology, SIMPLE-OWL, based on the semantic type ontology that was informally presented in the SIMPLE specifications, began with the extraction of the semantic types (e.g., “Plant”, “Flower”, “Color” etc.). Relations were then induced between these semantic types by generalising relations between USems (e.g., “is-a” and “contains”) and the features associated with them (e.g., “plus_edible”), and adding a number of well-formedness constraints. SIMPLE-OWL was induced from the SIMPLE lexicon using a “bottom-up” strategy. As well as formalizing the typical ontological relations derived from the qualia structure, SIMPLE-OWL also contains lexical relations. The SIMPLE-OWL ontology was the starting point of the work described in this paper.
In this section we explain how the (partial) conversion of PSC into lemon was carried out, paying particular attention to the distinction between the meaning of
As described above, the lemon model requires a lexical sense object to mediate between a lexical entry and the meaning of that entry as provided by a vocabulary item in an ontology.
The main problem faced in this conversion related to the fact that it was not always possible to identify PSC USems with lexical sense objects in lemon.
This becomes evident when one considers how the lemon model is defined and how its various components are described in works such as [4] and the lemon cookbook. Given that one of the definitions of a lexical sense is as a reification of word-meaning pairings, it would seem that explicitly lexical relations, such as those relating to synonymy and hyponymy, are better placed between lexical senses, whereas more conceptual relations, namely those that pertain, or seem to pertain, to the extensions of words, are better placed in an ontology.
Of course there are grey areas,5
Most works on lexical semantics, for example, consider meronymy relations to be lexical relations.
In SIMPLE, however, no such division is made: USems can be linked both by relations which are clearly lexical (polysemy, derivation, …) and by those which relate to the meanings of lexical entries (such as producedby, hasparts, …) rather than to senses qua reified word-meaning pairings.
For this reason the decision was taken to duplicate each USem from SIMPLE both as a lemon lexical sense and as an individual in an ontology; the former was then linked to the latter using the lemon reference relation. The aforementioned ontological individuals were then mapped onto their types in the SIMPLE-OWL ontology.
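The duplication step can be sketched as follows. This is a minimal illustration in Python, not the conversion code actually used: triples are plain tuples, the prefix `psclemon:`, the helper `duplicate_usem`, and the identifier `USem_limone_fruit` are hypothetical names introduced for the example, and `simple:Fruit` assumes that the SIMPLE-OWL class carries the same name as the semantic type. Only `lemon:sense`, `lemon:reference`, `lemon:LexicalSense` and `rdf:type` are taken from the lemon and RDF vocabularies.

```python
def duplicate_usem(entry, sense_name, usem_id, semantic_type):
    """Return the triples that duplicate one USem as a lemon lexical sense
    and as an individual in an ontology.

    All identifiers are illustrative; the published URIs are not reproduced.
    """
    entry_node = f"psclemon:{entry}"       # hypothetical prefix for the lexicon
    sense_node = f"psclemon:{sense_name}"  # the USem seen as a lexical sense
    individual = f"inds:{usem_id}"         # the USem seen as an ontology individual
    return [
        # The lexical entry points to its sense.
        (entry_node, "lemon:sense", sense_node),
        (sense_node, "rdf:type", "lemon:LexicalSense"),
        # The sense is linked to the individual via the lemon reference relation.
        (sense_node, "lemon:reference", individual),
        # The individual is mapped onto its semantic type in SIMPLE-OWL.
        (individual, "rdf:type", f"simple:{semantic_type}"),
    ]


# "USem_limone_fruit" is a placeholder identifier, not an actual PSC USem id.
triples = duplicate_usem("limone", "limone_1", "USem_limone_fruit", "Fruit")
for subject, predicate, obj in triples:
    print(f"{subject} {predicate} {obj} .")
```

Keeping the sense and the individual as two distinct nodes is what lets lexical relations attach to the former and qualia relations to the latter.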
This duplication makes it possible to distinguish properly between the SIMPLE relations: SIMPLE lexical relations are now encoded between lemon lexical senses in a lexicon, whereas SIMPLE qualia relations now relate items in an ontology. In addition, it was decided to use the “is-a” relation among USems to induce the narrower/broader relations among lexical items as defined by the lemon model. We partitioned the final conversion into the following datasets:
With the following sets of relations among items:
Extended qualia relations as defined in SIMPLE-OWL, holding between individuals;
Lexical relations, as defined in SIMPLE-OWL, holding between lexical senses;
Induced narrower/broader relations, as defined by the lemon model, holding between lexical senses.
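The way the three sets of relations listed above are derived from USem-level relations can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' implementation. The function `convert_relations`, the identifier `USem_limone_fruit`, the sense name `limone_2`, and the labels `narrower-than` and `broader-than` are hypothetical. Mapping those two labels onto the narrower/broader properties of lemon, with the direction those properties prescribe, is deliberately left open. The set of lexical relations contains only the two relations mentioned in the limone example and is not exhaustive.

```python
# Lexical relation names taken from the limone example; not an exhaustive list.
LEXICAL_RELATIONS = {"polysemy-plant-fruit", "polysemy-vegetal-entity-color"}


def convert_relations(usem_relations, sense_of):
    """Split USem-level relations into ontology-level and sense-level triples.

    usem_relations: iterable of (source USem, relation name, target USem)
    sense_of: maps a USem identifier to the name of its lemon lexical sense
    """
    ontology, lexicon = [], []
    for source, relation, target in usem_relations:
        if relation in LEXICAL_RELATIONS:
            # Lexical relations hold between lexical senses.
            lexicon.append((sense_of[source], relation, sense_of[target]))
        else:
            # Extended qualia relations hold between ontology individuals.
            ontology.append((source, relation, target))
            if relation == "is-a":
                # "is-a" additionally induces narrower/broader between senses.
                lexicon.append((sense_of[source], "narrower-than", sense_of[target]))
                lexicon.append((sense_of[target], "broader-than", sense_of[source]))
    return ontology, lexicon


usem_relations = [
    ("USem_limone_fruit", "is-a", "USemD2369frutto"),
    ("USem_limone_fruit", "produced-by", "USemD2244limone"),
    ("USem_limone_fruit", "polysemy-plant-fruit", "USemD2244limone"),
]
sense_of = {
    "USem_limone_fruit": "limone_1",  # placeholder USem identifier
    "USemD2369frutto": "frutto_1",
    "USemD2244limone": "limone_2",    # hypothetical sense name
}
ontology, lexicon = convert_relations(usem_relations, sense_of)
```

In this sketch the “is-a” relation is kept in the ontology and also contributes to the lexicon, so no information from the original USem graph is lost by the split.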
A set of examples is given here to clarify the procedure.6
Here and in other examples, we have used the Turtle notation. See
Each lexical sense connects a lexical entry to a corresponding USem in SIMPLE Entry through a
The namespaces inds and simple in the following examples are defined in Section 5.
Then lexical relations are instantiated among lexical senses in the pscLemon resource. In lemon we have:
The last piece of information to be added to the pscLemon resource concerns the narrower/broader relations. Using the “is-a” qualia relation, it is inferred that the sense limone_1 is narrower than the sense frutto_1, which gives:
The SIMPLE Entry resource contains the relations among concepts (USems) and the link between each concept and the general ontological types defined in the SIMPLE-OWL ontology. As stated above, this resource contains only the set of qualia relations.
Figure 1 represents the interrelations among the three resources described above.

Figure 1. Schema of the example.
The whole dataset produced for this paper is hosted at
the indexes of PSC lexical entries and of PSC ontological referents, containing the lists of individual URIs for online access;
a compressed version of the lexicon and ontology for download.
Resources and namespaces
Files, units and triples
To limit the number of files in each folder, a file system structure was created under the namespaces

Example of folder structures.
The solution presented above seems to go a large part of the way towards reconciling the lemon philosophy of separating the lexical and ontological layers of lexical resources with the representation of the multiple dimensions of meaning instantiated by SIMPLE. This differentiates the present solution from other possible solutions in which all SIMPLE semantic relations are encoded directly among lexical sense objects without reference to an external ontology.
In a recently submitted paper [11], a proposal was presented for translating the PSC verbs into RDF. In this proposed model, a verb sense can have an associated syntactic frame as well as a predicative semantic frame. These syntactic and semantic frames are then further specified as regards their arguments using for example the LMF property
