Abstract
Automatically distinguishing compositional from non-compositional expressions is a very challenging problem in NLP, and only a small number of papers in the literature report results on it. Recently, some new approaches to this linguistic task have arisen. One that has caught our attention is based on what its authors call the “lexical domain”. In this paper, we analyze the use of Pointwise Mutual Information (PMI) for constructing thesauri on the fly, which can then be employed instead of dictionaries to determine whether or not a given phraseological unit is compositional. The experimental results show that this co-occurrence measure can be used effectively to determine the compositionality of a given verbal phraseological unit. Moreover, we show that the use of thesauri improves on the results of experiments employing dictionaries, which highlights the value of self-constructed lexical resources that take advantage of the vocabulary of the target dataset itself.
Introduction
Validating phraseological units in raw text is a very challenging task that is attracting increasing interest from the computational linguistics community. Different approaches have been presented in the literature [25], with variable performance. A Phraseological Unit (PU) is basically one type of MultiWord Expression (MWE), which can be classified into one of a wide range of linguistic constructions such as idioms (storm in a teacup, sweep under the rug), fixed phrases (in vitro, by and large, rock’n roll), noun compounds (olive oil, laser printer), compound verbs (take a nap, bring about), etc. Phraseological units have always played an important role in language learning, in particular in the discourse training of native speakers and in the process of learning a foreign language. The global meaning of a phraseological unit usually cannot be obtained from the independent meanings of its components, but only from the unit as a whole [22]; in that case it is considered a non-compositional phraseological unit, which usually has a “figurative” meaning. Three features are associated with these multiword lexical units: 1) their poly-lexical meaning, 2) their degree of fixation, and 3) their idiomaticity or lexical opacity [24].
There is increasing interest in developing automatic computational methods for verifying whether a given verbal phraseological unit (VPU) has a literal or a figurative meaning. This problem is important, since solving it may significantly improve automatic language understanding. It is, of course, a very challenging task, because the input string is exactly the same whether the verification system should label it as a figurative VPU or as a literal VPU.
In this paper, we are particularly interested in studying non-compositionality in Mexican Spanish phraseological units that have a verb as their grammatical centre, i.e., Verbal Phraseological Units (VPUs). Examples of VPUs are “Leer entre líneas” (to read between the lines) and “Pasar el chisme” (to hear (something) through the grapevine).
In particular, the research work presented in [26] introduces a new “domain-based” methodology for validating these linguistic structures. We recognize that the approach is novel and very interesting; however, it relies on the existence of a dictionary for each language in which the test data is written. Finding dictionaries for general domains is perhaps not so problematic, but evaluating specific domains requires narrow-domain dictionaries as well, which we consider will be very difficult to obtain. Thus, in this paper we propose replacing the dictionaries in the approach of [26] with automatically constructed thesauri. Such a lexical resource can be obtained from the test dataset itself or from other raw text, so a general or domain-specific dictionary is no longer necessary.
The rest of this paper is structured as follows. Section 2 presents some papers reporting research results on the non-compositionality of phraseological units. In Section 3 we briefly describe the methodology presented by Priego et al. [26] and the manner in which we modify it to incorporate thesauri constructed on the fly, which is the contribution of this paper. In Section 4 we report and discuss the experiments carried out and the results obtained. Finally, Section 5 gives the findings of this research work, together with directions for further work on this task.
State of the art
MWEs are prevalent in modern text and are increasing in frequency as language develops. The automatic identification of these linguistic structures is a challenging task for the Natural Language Processing (NLP) area, and it has attracted increasing interest from the NLP community. Once these linguistic structures are identified, a validation process is needed to determine whether they are non-compositional (figurative meaning) or compositional (literal meaning).
In the literature, there exist at least three different approaches for identifying non-compositionality in MWEs:
Statistical approaches: These approaches are based on frequencies or co-occurrence metrics over the MWEs. In [6], for example, the authors present an approach based on sequences of statistically related words (n-grams), whereas [7] proposes identifying MWEs by using bigram frequencies. In [5], different statistical metrics are used together with linguistic information for identifying MWEs. Other research works, such as the one presented in [34], employ machine learning techniques based on conditional random fields for this problem; it is important to mention that a supervised (annotated) corpus is needed for that purpose.
Knowledge-based approaches: These approaches usually employ parsers, lexicons, taggers or grammatical rules of the language, i.e., resources constructed using human knowledge. Examples reported in the literature include [13], which identifies MWEs by combining the syntax and morphosyntax of the candidate MWEs. In [21], an approach based on translating the MWE from English to French is presented: the candidate MWE is translated with a translation dictionary, and the translated sentence is then searched for in a corpus of the target language in order to identify a literal translation. If one is found, the candidate MWE is considered to have a literal meaning; otherwise, it is considered to have a figurative meaning. Finally, in [3], the authors base their experiments on knowledge-based tools for syntactic-semantic analysis of the candidate MWE.
Hybrid approaches: Combinations of the two previous approaches are frequently reported in the literature; examples include [9, 31].
Other research works reported in the literature use supervised and unsupervised categorization techniques. In [24], for example, an approach for determining whether or not a phraseological unit is compositional is presented: the authors employ different supervised machine learning methods to construct classification models that decide whether a given phraseological unit from the news genre is compositional. In [15], the authors measure the non-compositionality of verb-noun collocations using lexical functions and WordNet hypernyms. Some unsupervised approaches, such as the one presented in [36], can also be found in the literature; they employ unsupervised learning of a semantic composition function in order to detect the non-compositionality of compound nouns.
In recent years, approaches based on deep learning have also been reported in the literature. These approaches have the advantage that they can easily leverage pretrained word vectors as features [8, 32]. One of the first works using deep learning techniques is presented in [12].
The study of the non-compositionality of phraseological units has attracted increasing interest in the last decade. Some forums promote the study of this linguistic property, such as the ACL-HLT 2011 workshop on Distributional Semantics and Compositionality (DiSCo), with a shared task on assigning a graded compositionality score to phrases in a given corpus [4]. Ten international teams participated in that workshop, covering English and German, and different approaches, statistical and unsupervised among others, were presented.
Other international forums dealing with the automatic identification of MWEs are promoted by the research interest group named MWE-SIGLEX 1 and by PARSEME 2, with editions held in 2017 [20] and 2018 [27].
The PARSEME shared task on automatic identification of verbal multiword expressions (VMWEs) aims at identifying verbal MWEs in running text. Verbal MWEs include idioms (let the cat out of the bag), light verb constructions (make a decision), verb-particle constructions (give up), and inherently reflexive verbs (se suicider in French, “suicidarse” in Spanish). The shared task covers 18 languages: Bulgarian, Czech, German, Greek, Spanish, Farsi, French, Hebrew, Hungarian, Italian, Lithuanian, Maltese, Polish, Brazilian Portuguese, Romanian, Slovene, Swedish and Turkish. For all these languages, two corpora (training and test) are provided to the participants.
In summary, non-compositionality has been approached from different angles, from purely statistical approaches to knowledge-based ones. In this paper, we analyze the use of Pointwise Mutual Information for constructing thesauri on the fly, which can then be employed instead of dictionaries to determine whether or not a given phraseological unit is compositional.
Description of the proposed approach
In this paper, we present an improvement to the methodology proposed by [26], which makes it possible to validate, with a certain degree of precision, whether or not a multiword expression is compositional. That approach employs what is called the “lexical or semantic domain”; therefore, in the following paragraphs we recall this concept, so that the way in which we improve the approach can be better understood.
Compositional vs non-compositional validation based on domains
Verifying the non-compositionality of a given verbal phraseological unit is challenging, since the same set of words may have either a literal or a figurative meaning. Let us consider the phrase “to take the bull by the horns”, which in a figurative context means “to deal with a difficult situation in a very direct way”. In another context, however, the same phrase has a literal meaning, in which a person is literally taking the bull by the horns. In the latter case, the VPU will likely appear in a context whose words share its topic, i.e., its knowledge domain.
In order to illustrate the last statement, we present the two following examples in which we are able to observe one sentence containing a VPU with a figurative meaning (Sentence 1) and another one containing a VPU with a literal meaning (Sentence 2).
In Sentence 1, we can see that the nouns “bull” and “horns” of the VPU are likely to be outside the domain of the word “office”, which appears in the VPU context. In Sentence 2, however, the same two nouns are closer to the context domain, since they are related to “bullring” and perhaps also to the word “Spain”, which is well known for “bullfighting”.
We can observe that literal VPUs are domain dependent, whereas figurative VPUs are likely to be domain independent, which leads us to formulate the following linguistic hypothesis:
It is true that some figurative VPUs can also be used within their own semantic domain; however, this is not always the case, and the number of such occurrences is relatively low in comparison with the occurrences of figurative VPUs in domains different from that of the VPU.
The methodology proposed
In order to validate the aforementioned hypothesis, we first need to identify the semantic domain of both the VPU and its context. A lexical or semantic domain can be constructed by bringing together a number of lexically or semantically similar words.
In [26], a basic general methodology for verifying the non-compositionality of a given VPU is proposed as follows:
This is, of course, a quite generic methodology, since it does not indicate how to calculate Dom(VPU), Dom(VPU_C) and Dist(Dom(VPU), Dom(VPU_C)), or how to identify the threshold ε used for deciding whether the two domains are close enough to be considered similar.
The authors of [26] proposed a novel method for determining the semantic domain of a given short text by using “dictionaries”, which we consider to be too closely tied to the target language and, additionally, very difficult to apply when narrow-domain text is used in the experiments. That is why we propose using thesauri automatically constructed on the fly instead. The next subsection describes the manner in which we construct the thesauri and how they are used for determining the particular domain a given word belongs to.
Thesauri construction by means of pointwise mutual information
Pointwise Mutual Information (PMI) is an information-theoretic co-occurrence measure, discussed in [19], that is used for finding collocations by determining the degree of co-occurrence between two terms. It is computed from the ratio between the number of times that both terms appear together (in the same context and not necessarily in the same order) and the product of the number of times that each term occurs alone. Given two terms X1 and X2, the pointwise mutual information between X1 and X2 can be calculated as follows:
$$\mathrm{PMI}(X_1, X_2) = \log_2 \frac{P(X_1 X_2)}{P(X_1)\, P(X_2)} \qquad (1)$$
The numerator can be modified in order to take into account only bigrams, as presented in [2], where an improvement in clustering short texts in narrow domains was obtained. We have used PMI to obtain a co-occurrence list from the target dataset itself, which we call an automatically created thesaurus. The resulting lexical resource is then employed as a kind of “dictionary” in the methodology proposed by Priego et al. in [26]. This is a considerable improvement, since the experiments no longer rely on the existence of a real dictionary of the target language; instead, we can construct the domain-based thesaurus on the fly, using the same data in which we want to validate the phraseological units.
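The construction of such a thesaurus can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function name `build_pmi_thesaurus`, the choice of a sentence as the co-occurrence context, the probability estimates based on sentence counts, the base-2 logarithm and the toy corpus are all illustrative choices.

```python
import math
from collections import Counter
from itertools import combinations


def build_pmi_thesaurus(sentences, min_pmi=5.0):
    """Map each word to the set of words whose PMI with it exceeds min_pmi.

    Assumption: two words co-occur when they appear in the same sentence,
    regardless of order, and probabilities are estimated as the fraction
    of sentences containing the word (or the pair).
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        unique = sorted(set(tokens))          # count each word once per sentence
        word_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    n = len(sentences)
    thesaurus = {}
    for (w1, w2), joint in pair_counts.items():
        p_joint = joint / n
        p_w1 = word_counts[w1] / n
        p_w2 = word_counts[w2] / n
        pmi = math.log2(p_joint / (p_w1 * p_w2))
        if pmi > min_pmi:                     # keep only strongly associated pairs
            thesaurus.setdefault(w1, set()).add(w2)
            thesaurus.setdefault(w2, set()).add(w1)
    return thesaurus


# Hypothetical toy corpus (already tokenized and lemmatized).
toy_sentences = [
    ["toro", "cuerno", "plaza"],
    ["toro", "cuerno"],
    ["oficina", "jefe"],
    ["oficina", "jefe", "toro"],
]
# A corpus this small cannot reach PMI > 5, so a threshold of 0 is used here.
thes = build_pmi_thesaurus(toy_sentences, min_pmi=0.0)
```

On a real corpus, the `min_pmi` argument would play the role of the PMI filter applied to the thesaurus; the low value above is only needed because of the tiny toy corpus.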
In order to illustrate the automatic thesaurus construction method, Table 1 shows the co-occurrence list for some words related to the word “bull”, obtained from the same test corpus.
Table 1. An example of co-occurrence terms
Let us consider a thesaurus Thes with n word entries (entry_i), each one associated with a number of co-occurrence terms. It is then possible to define a function that, given an input word, returns the thesaurus entries (entry_1, entry_2, ..., entry_k) whose lists of co-occurrence words contain at least one occurrence of the input word; this function can be represented as getEntry(word, Thes).
Based on that function, we can define a new function for determining the “semantic domain” of a single word as Dom(word) = getEntry(word, Thes). We can then extend this definition to a phrase containing more than one word, as shown in Eq. (2).
$$\mathrm{Dom}(VPU) = \bigcup_{word_i \in VPU} \mathrm{getEntry}(word_i, Thes) \qquad (2)$$
The same definition applies to the words appearing in the context of the VPU (VPU_C); thus, the “semantic domain” of the VPU context can be calculated as shown in Eq. (2C).
$$\mathrm{Dom}(VPU_C) = \bigcup_{word_j \in VPU_C} \mathrm{getEntry}(word_j, Thes) \qquad (2C)$$
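The two functions can be illustrated with a short Python sketch. The names `get_entry` and `dom` mirror getEntry and Dom from the text; the toy thesaurus and the example words are hypothetical, and the linear scan over all entries is a simplification chosen for readability.

```python
def get_entry(word, thes):
    """Return the thesaurus entries whose co-occurrence list contains `word`."""
    return {entry for entry, cooccurring in thes.items() if word in cooccurring}


def dom(words, thes):
    """Semantic domain of a group of words: the union of get_entry over them."""
    domain = set()
    for word in words:
        domain |= get_entry(word, thes)
    return domain


# Hypothetical thesaurus: entry -> set of co-occurrence terms.
toy_thes = {
    "toro": {"cuerno", "plaza", "torero"},
    "plaza": {"toro", "torero"},
    "oficina": {"jefe", "trabajo"},
}

entries_for_torero = get_entry("torero", toy_thes)
domain_of_phrase = dom(["cuerno", "jefe"], toy_thes)
```

Here `entries_for_torero` contains "toro" and "plaza", because "torero" appears in both co-occurrence lists, while `domain_of_phrase` merges the entries reached from each word of the phrase.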
In order to obtain these semantic domains (of the VPU and of its context), a thesaurus for the target language is needed. In our case, we have used the target test dataset to construct such a thesaurus on the fly. The terms and their co-occurrence terms have been indexed with standard methods from the information retrieval area: each word in the co-occurrence list of a thesaurus entry is indexed so that the corresponding entry can be found from it. The set of entries whose co-occurrence terms contain a particular word of the VPU defines the lexical or semantic domain of the VPU. Finally, we can calculate the difference between Dom(VPU) and Dom(VPU_C) with a simple intersection metric, as shown in Figure 1.

Figure 1. Measuring distance between two semantic domains.
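The intersection-based comparison can be sketched as follows. This is an illustration only: it assumes, in line with the hypothesis that literal VPUs share the domain of their context, that a VPU is labelled compositional when the number of shared domain terms reaches a threshold. The function names, the default threshold and the example sets are hypothetical.

```python
def shared_terms(dom_vpu, dom_context):
    """Number of terms shared by the VPU domain and the context domain."""
    return len(set(dom_vpu) & set(dom_context))


def is_compositional(dom_vpu, dom_context, threshold=3):
    """Assumed decision rule: enough shared terms means a literal reading."""
    return shared_terms(dom_vpu, dom_context) >= threshold


# Hypothetical domains for a VPU and for two different contexts.
dom_vpu = {"toro", "plaza", "torero", "ganado"}
dom_literal_context = {"toro", "plaza", "torero", "feria"}
dom_figurative_context = {"oficina", "jefe", "toro"}

literal = is_compositional(dom_vpu, dom_literal_context)
figurative = is_compositional(dom_vpu, dom_figurative_context)
```

With these sets, the first context shares three terms with the VPU domain and is labelled compositional, whereas the second shares only one and is labelled non-compositional.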
In this section we present a heuristic evaluation of the proposed approach. We first describe the datasets employed in the evaluation, then enumerate the metrics used, and finally show and discuss the experimental results.
Description of the datasets
We have used the same two datasets employed by [26] in the experiments carried out. They contain a number of news stories (from a Mexican newspaper) that include verbal phraseological units. A description of these datasets follows.
First, the authors extracted all the verbal phraseological units from a dictionary named “Dictionary of Mexicanisms” 3, obtaining a total of 1,219 VPUs, which were stored in a database so that their regular use in the Mexican newspaper domain could later be identified. From that set they selected only the most representative units, namely the 69 VPUs that occur most frequently in the corpus.
On the other hand, by means of information retrieval techniques, they found 3,164 news stories containing at least one occurrence of one of the selected verbal phraseological units. This process considers occurrences of the original VPU in any of its morphological variants; for this purpose, they lemmatized both the VPU and the text of the news story, so that variations of the VPU could be found in the target texts. The news stories were gathered from Mexican newspapers belonging to the Mexican Editorial Organization 4. All the texts compiled are written in Mexican Spanish and contain news stories from the years 2007 to 2013.
By counting the occurrences of Mexican verbal phraseological units in the corpus gathered, they were able to construct a manually annotated corpus, which is employed as the gold standard in this paper. The contexts gathered were manually annotated by 5 human annotators, with an inter-annotator agreement greater than 80%. Each annotator was asked to classify whether a given raw text contained a non-compositional VPU (Class 1) or a compositional VPU (Class 2). The first corpus is described in Table 2. As can be observed, this dataset is highly unbalanced, with about 90% of the samples containing non-compositional phrases.
Table 2. Description of the first manually annotated dataset (unbalanced)
In the experiments carried out, we have additionally used a second corpus constructed by those authors (see Table 3). This one is much more balanced than the previous one, so we are able to observe the behaviour of the proposed approach in two different scenarios.
Table 3. Description of the manually annotated balanced dataset
For comparison purposes, the metrics employed for the evaluation in this paper are exactly the same as those employed in [26]: Precision, Recall, and F-measure.
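For reference, the three metrics can be computed as in the following sketch. It assumes that the non-compositional class (Class 1) is treated as the positive class and that the F-measure is the balanced harmonic mean of precision and recall; neither choice is stated explicitly in the text, and the labels below are made up for illustration.

```python
def precision_recall_f(gold, predicted, positive=1):
    """Precision, recall and balanced F-measure for the `positive` class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure


# Hypothetical gold labels and predictions (1 = non-compositional, 2 = compositional).
gold = [1, 1, 1, 1, 2, 2]
predicted = [1, 1, 1, 2, 1, 2]
p, r, f = precision_recall_f(gold, predicted)
```

In this toy example there are three true positives, one false positive and one false negative, so all three metrics equal 0.75.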
In this section we present and discuss the results obtained with the methodology proposed by [26], comparing the results they reported using “dictionaries” with those we obtain using “thesauri”.
In Figure 2 we observe the results reported by [26] on the unbalanced dataset. Figures 3 and 4, on the other hand, show the results obtained on the same unbalanced dataset when thesauri constructed on the fly with a PMI greater than 5 and 7, respectively, are used. In these figures we can see the three metrics employed for the evaluation: Precision, Recall and F-measure. A comparison of the performance obtained with the three different lexical resources is shown in Figure 5, where we can observe that the thesauri obtain better results than the dictionary.

Figure 2. Results obtained with the unbalanced dataset using a dictionary.

Figure 3. Results obtained with the unbalanced dataset using a thesaurus (PMI > 5).

Figure 4. Results obtained with the unbalanced dataset using a thesaurus (PMI > 7).
The first finding is that thesauri achieve good performance (precision greater than 99%) with a very small number of terms in the intersection between the two “lexical or semantic domains”. In contrast, with dictionaries a greater number of terms in the intersection is needed in order to discriminate between compositional and non-compositional multiword expressions. Moreover, in Figure 5 we can observe that the two constructed thesauri significantly improve performance with respect to the approach using dictionaries. When 3 terms in the intersection of the two domains are required, the F-measure is greater than 0.94 for the thesaurus filtered with PMI > 5 and greater than 0.95 for the thesaurus filtered with PMI > 7, whereas it is lower than 0.93 for the dictionary. The figure also shows that the approach using a thesaurus filtered with PMI > 7 always performs better than the one using a dictionary.

Figure 5. F-measure obtained with the unbalanced dataset using three lexical resources (dictionary, thesaurus with PMI > 5 and thesaurus with PMI > 7).
As the dataset is highly unbalanced, the results may be biased; therefore, we ran the same experiments on a balanced dataset, i.e., one containing a similar number of compositional and non-compositional samples. In Figure 6 we present the results reported by [26] on the balanced dataset, i.e., using a dictionary. Figures 7 and 8 show the results obtained on the same balanced dataset when thesauri constructed on the fly with a PMI greater than 5 and 7, respectively, are employed. In these figures we can see the three metrics used for the evaluation: Precision, Recall and F-measure. A comparison of the performance obtained with the three different lexical resources is shown in Figure 9, where we can observe that the thesauri again obtain better results than the dictionary.

Figure 6. Results obtained with the balanced dataset using a dictionary.

Figure 7. Results obtained with the balanced dataset using a thesaurus (PMI > 5).

Figure 8. Results obtained with the balanced dataset using a thesaurus (PMI > 7).
Again, we can observe that thesauri achieve good performance (precision greater than 99%) with a very small number of terms in the intersection between the two “lexical or semantic domains”. In contrast, with dictionaries a greater number of terms in the intersection is needed in order to discriminate between compositional and non-compositional multiword expressions. Moreover, in Figure 9 we can observe that the two constructed thesauri improve performance with respect to the approach using dictionaries. However, the F-measure is lower than in the previous experiment (using the unbalanced dataset), due to the low recall obtained. The F-measure of the approach using a thesaurus filtered with PMI > 7 is again always greater than that of the approach using a dictionary, but in this case it is about 0.7, which we consider to be more realistic than the results obtained on the unbalanced dataset.

Figure 9. F-measure obtained with the balanced dataset using three lexical resources (dictionary, thesaurus with PMI > 5 and thesaurus with PMI > 7).
The optimum value for the threshold that determines whether or not a multiword expression is compositional seems to be 3 or 4 when using thesauri. This finding may allow the approach to run faster when verifying the non-compositionality of a given verbal phraseological unit in raw text.
In this paper, we have presented an improvement to the approach proposed by [26] for verifying the non-compositionality of a phrase containing a verbal phraseological unit. We have proposed using thesauri constructed on the fly, instead of dictionaries, for calculating the lexical or semantic domain. This improvement allows the target dataset itself to be used for constructing the thesaurus, which can be used effectively when such dictionaries do not exist, in particular when datasets from narrow or restricted domains are used.
The proposed improvement performs better than the approach using a dictionary. Moreover, the level of intersection between the two domains required in the discrimination phase has been reduced to a very small number of terms, which is more consistent with the hypothesis presented by [26].
In summary, we have learned the domain topics by employing statistical word co-occurrence metrics, using the technique proposed by [23]. A lexicon of word co-occurrences is calculated, and the union of the semantic words of the candidate compositional phrase is then compared against the union of the semantic words of its context in order to determine the similarity between these two domains, obtaining very good results.
Footnotes
1. MWE-SIGLEX is a section of SIGLEX, the ACL Special Interest Group on the Lexicon, dedicated to promoting scientific activity on multiword expressions (MWEs).
2. PARSEME (PARSing and Multi-word Expressions) is an interdisciplinary scientific network devoted to the role of multi-word expressions (MWEs) in parsing.
