Abstract
In this work, we report the results of our experiments on the task of distinguishing the semantics of verb-noun collocations in a Spanish corpus. This semantics was represented by four lexical functions of the Meaning-Text Theory. Each lexical function specifies a certain universal semantic concept found in any natural language. Knowledge of collocation and its semantic content is important for natural language processing, as collocation comprises the restrictions on how words can be used together. We experimented with word2vec embeddings and six supervised machine learning methods most commonly used in a wide range of natural language processing tasks. Our objective was to study the ability of word2vec embeddings to represent the context of collocations in a way that could discriminate among lexical functions. A difference from previous work with word embeddings is that we trained word2vec on a lemmatized corpus after stopwords elimination, supposing that such vectors would capture a more accurate semantic characterization. The experiments were performed on a collection of 1,131 Excelsior newspaper issues. As the experimental results showed, word2vec representation of collocations outperformed the classical bag-of-words context representation implemented in a vector space model and fed into the same supervised learning methods.
Introduction
Human language is a highly ambiguous system, thus presenting a big challenge for natural language processing (NLP) by a machine. In spite of ambiguity, human speakers are able to produce and understand text messages in natural language, so NLP as an area of Artificial Intelligence (AI) has been trying to develop formal models to be implemented in machine text generation and comprehension.
Like many other formal models proposed to date, the formal model of lexical functions developed within the Meaning-Text Theory [1] aims at resolving language ambiguity. To this end, it provides a taxonomy of lexical functions which represent basic semantic and syntactic properties of collocations in a non-ambiguous way. Collocations are defined as phrases whose meaning cannot be derived from the meaning of their constituent words, for example, take a step. This collocation does not mean ‘grab a step’, but ‘perform the action of stepping’. As opposed to collocations, the meaning of free word combinations is compositional, e.g., the semantics of take a book is transparent and easily interpreted as the union of the meanings of individual words: ‘grab’ and ‘book’. Thus, collocations are a particular case of word ambiguity. However, this particular case is not a small one, since collocations may comprise up to 70% of text depending on text theme and style.
Most collocations include two words called the base and the collocate. The lexical choice of the collocate depends on the base. In the example take a step, the base is a step and the collocate is take. In order to lexicalize the meaning of ‘perform the action expressed by the base’, a step chooses take but not any of its synonyms, for example, perform, realize, carry out, etc. Such selectional preference of the base is characteristic of collocations.
Lexical functions (LFs) capture selectional preferences of words as mappings from the base, which is the LF argument, to a set of other words, or collocates, which is the LF meaning. For example, we can generally represent take a step as LF(step) = take. However, each lexical function has its own meaning and syntactic pattern; in fact, the specific notation for take a step is Oper1(step) = take. Oper1 is one of about 60 LFs described in [1]. Its name comes from the Latin word operari ‘do, carry out’, and its semantics is ‘perform the action given by the noun’, as in the collocations make a decision, take a walk, do an exercise, give a smile, pay a visit. Although the verbs in these collocations vary, all of them convey the meaning of ‘performing’ the action lexicalized by the respective noun.
Integers in the LF notation capture the predicate-argument structure typically used to express the LF meaning in sentences. In Oper1, 1 means that the action is performed by the agent, the first argument of the verb; therefore, Oper1 represents the pattern ‘the agent performs what is expressed by the noun’. Consider, for example, the president made a decision to continue the discussion. Here, the president is the agent of the action made (a decision).
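The view of an LF as a mapping from a base to its collocates can be illustrated with a small lookup table. The following is only a sketch: the entries are the English examples given above, while the dictionary layout and the helper name collocates are assumptions made for illustration and are not part of the Meaning-Text formalism.

```python
# Lexical functions as a mapping from (LF name, base) to a set of collocates.
# The entries are the English examples from the text; the data layout and the
# helper name `collocates` are hypothetical, chosen only for this sketch.
LEXICAL_FUNCTIONS = {
    ("Oper1", "step"): {"take"},
    ("Oper1", "decision"): {"make"},
    ("Oper1", "walk"): {"take"},
    ("Oper1", "exercise"): {"do"},
    ("Oper1", "smile"): {"give"},
    ("Oper1", "visit"): {"pay"},
}


def collocates(lf, base):
    """Return the collocates that lexicalize `lf` for `base` (empty set if unknown)."""
    return LEXICAL_FUNCTIONS.get((lf, base), set())


print(collocates("Oper1", "step"))  # {'take'}
print(collocates("Oper1", "book"))  # set()
```

In this layout, Oper1(step) = take corresponds to the key ("Oper1", "step"), and a free word combination such as take a book simply has no entry.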
Other common LFs found in collocations, and in particular in verb-noun collocations, are Real1, CausFunc0, and CausFunc1; we explain them in what follows.
Real1, Latin realis ‘real’, conveys the meaning of fulfilling a requirement imposed by the noun or performing an action typical for the noun: drive a bus, follow advice, spread a sail, prove an accusation, succumb to illness, turn back an obstacle.
Oper1 and Real1 are simple LFs: they formalize a single semantic unit. A combination of more than one semantic unit is denoted by a complex lexical function, for example, CausFunc0 and CausFunc1.
Caus, from Latin causare ‘cause’, formalizes the pattern ‘do something so that the event denoted by the noun starts occurring’. Func0, from Latin functionare ‘function’, represents the meaning ‘happen, occur’. Combining these two concepts, we obtain the function CausFunc0 meaning ‘the agent does something so that the event denoted by the noun occurs’. Some examples are bring about the crisis, create a difficulty, establish a system, produce an effect.
Another complex LF, CausFunc1, formalizes the construction ‘the non-agentive participant does something such that the event denoted by the noun occurs’, for example, open a perspective, raise hope, cause damage.
In this work, our objective was to study how word2vec embeddings could affect the performance of supervised learning methods, commonly used in a wide range of NLP applications, on the task of detecting lexical functions in Spanish verb-noun collocations based on a corpus. To the best of our knowledge, there is only one research work, which we mention in section 2, where the authors used word embeddings to cluster collocations, expecting that each cluster would contain a specific lexical function. The scarcity of research on the topic is probably explained by its complexity, since lexical functions encapsulate semantic peculiarities and nuances of collocations on a highly fine-grained level.
The rest of the paper is organized as follows. Section 2 presents related work. In Section 3, we explain the proposed method and describe our dataset and the corpus. In Section 4, we discuss the results, and Section 5 includes conclusions and future work.
Related work
It is not a trivial task to detect fine-grained semantic differences represented by lexical functions automatically, so more research has so far been done on manual methods of annotating collocations with lexical functions and, consequently, on building collocation dictionaries enriched with LFs, or semantic networks and ontologies incorporating LFs. One such project is the French Lexical Network (FLN) [2], based on the LF taxonomy and available for download in XML format; connections between words in the network are tagged with syntagmatic relations corresponding to LFs. Such annotation, absent in WordNet [3], another popular and widely used semantic network, makes FLN a unique repository of lexical semantic data to date. Fonseca, Sadat, and Lareau [4] developed a Java API to retrieve data from FLN, which they used to detect collocations in a textual corpus with an overall precision of 0.763 and to produce a semantic classification of the identified collocations.
Few works have addressed the automatic detection of LFs. Wanner [5] and Wanner, Bohnet, and Giereth [6] experimented with nine LFs in Spanish verb-noun collocations, representing each collocation by a vector whose features were hypernyms of the noun and the verb. In the experiments, an average F1-score of about 0.70 was obtained. The highest results were shown by the ID3 algorithm with an F1-score of 0.76 and by the Nearest Neighbor technique with an F1-score of 0.74.
Tutin [7] took advantage of the database of French collocations described by Polguère [8], syntactic patterns, and finite-state transducers associated with metagraphs for labeling collocations with LFs in corpora. In the experiments, a precision of 0.9 and a recall of 0.862 were achieved.
Enikeeva and Popov [9] applied the Affinity Propagation algorithm to cluster collocations using word embeddings [10] of the collocation constituents. Clusters were expected to contain different semantic classes of collocations, each corresponding to a particular lexical function. A precision of 0.64 was obtained on LFs for verb-noun collocations, which is the collocation type we experimented on in this work.
In our previous work [11], an average F1-measure of 0.74 was achieved on Spanish verb-noun LFs, applying all relevant supervised learning methods implemented in Weka software [12]; collocations were represented as vectors of Spanish WordNet [13] hypernyms of the constituent words. We also studied how LFs can be detected by their context words using supervised learning in [14]. The best F1-score was 0.50 using tf-idf of context words as features and support vector machine with a linear kernel.
It is worth mentioning here that high values of precision, recall, and F1-score on LF detection were achieved by incorporating manually produced, richly annotated lexical, syntactic, and semantic resources. However, the challenge now is to design resource-independent methods based only on corpora, and to this end, a precision of 0.64 as in [9] or an F1-score of 0.50 as in [14] seems quite satisfactory for the initial stage of working on this task.
Finally, we have to note that the numbers mentioned in this section are not comparable: the authors ran their experiments on different datasets and in different languages. In spite of that, we cited the results for the reader to have a basic idea of advances in this complicated NLP task.
Data and method
In our experiments, we used the dataset of Spanish verb-noun collocations labeled manually with lexical functions compiled by us in previous work [11]; Table 1 gives some examples of our data. From this dataset, we chose four lexical functions, namely, Oper1, Real1, CausFunc0, and CausFunc1, due to their relatively high frequency in texts. The meaning of these LFs was explained in the Introduction.
Examples of data used in our experiments
For each of the four functions, we took 60 verb-noun collocations, so our experimental data are perfectly balanced. To these 240 collocations in total, we added 60 verb-noun combinations which are not collocations, but free word combinations, in order to study to what extent these can be distinguished from collocations.
The corpus used in our experiments is a collection of 1,131 issues of the Excelsior newspaper in Spanish published within the period from April 1, 1996 to June 24, 1999.
Our objective was to study the effect of word2vec embeddings on the discriminative ability of classifiers applied to lexical functions in verb-noun collocations.
To build word embeddings, we used the gensim.Word2Vec implementation. Many NLP researchers use word embeddings pre-trained on very large corpora, which saves time and effort. However, such embeddings are generated for tokens, or wordforms, and not for words as concepts; this introduces noise into subsequent semantic similarity measurement.
Let us illustrate this point with an example for Spanish, since we did this work on Spanish material. There are a number of word embeddings built for this language, and one of them is the FastText embeddings generated from the Spanish Billion Word Corpus of almost 1.5 billion words [15, 16]. The embeddings are provided with several built-in functions, one of which is similar word retrieval. The line
print(vectors.most_similar(['adiós'])),
invoking most similar words to adiós (good-bye), produces the following output (here we show only words without the corresponding numerical value of similarity measure for each word, with English translation in parentheses):
despedida (farewell), despedirse (say good-bye, reflexive form), suspiro (sigh), despide ([he/she] says good-bye), despedirme ([I] say good-bye), querida (dear), beso (kiss), llora ([he/she] cries), despedimos ([we] said good-bye), despidieron ([they] said good-bye).
As can be observed, of the ten output words, five are different morphological forms of one and the same verb despedir (say good-bye), so in fact we get six words similar to the input adiós, not ten.
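The count of six can be checked by mapping each retrieved wordform to its lemma and dropping repeats. The sketch below uses the ten neighbours listed above; the wordform-to-lemma table is hand-written for this illustration and is an assumption, as is the helper name unique_lemmas. A real pipeline would rely on a Spanish lemmatizer.

```python
# Neighbours of "adiós" as listed in the text.
neighbours = ["despedida", "despedirse", "suspiro", "despide", "despedirme",
              "querida", "beso", "llora", "despedimos", "despidieron"]

# Hand-written wordform -> lemma table (an assumption for this sketch).
# Forms that are not listed are kept unchanged.
LEMMAS = {
    "despedirse": "despedir", "despide": "despedir", "despedirme": "despedir",
    "despedimos": "despedir", "despidieron": "despedir", "llora": "llorar",
}


def unique_lemmas(words):
    """Map words to lemmas and drop repeats, preserving first-seen order."""
    seen = []
    for word in words:
        lemma = LEMMAS.get(word, word)
        if lemma not in seen:
            seen.append(lemma)
    return seen


distinct = unique_lemmas(neighbours)
print(len(neighbours), "->", len(distinct))  # 10 -> 6
print(distinct)
```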
Similar word search for verbs happens to be even noisier, because verbs have many more morphological forms than nouns. The line
print(vectors.most_similar(['tener'])),
invoking most similar words to the verb tener (have), produces the output:
tenerlo, tenga, tengan, tienen, tenerse, tenerle, tiene, tenerla, contar (tell), estar (be).
In this list, the first eight words are various morphological forms of the input verb tener, so they have the same core semantics. Surely, different grammatical forms of one word are all similar to its lemma or dictionary form (tener in our case), but this is not really what we expect the machine to find when we design methods for similar word retrieval. Rather, we would prefer to see in the output, for example, the verb poseer (possess), a synonym of tener, but what we got are contar (tell) and estar (be), neither of which can really be called similar to tener.
Keeping this in mind, in the pre-processing stage we lemmatized our corpus and eliminated stopwords, mostly articles and prepositions. Then we ran the word2vec algorithm, in its gensim implementation, on our corpus. We first set the vector size to 10 dimensions (size = 10) and took into account 10 context words around the target word (window = 10). As our corpus was not as large as those typically used for word embeddings, we did not want to lose any word, so we set the minimal frequency to 1 (min_count = 1); this is the frequency threshold below which a word is excluded from the vocabulary.
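The pre-processing step can be sketched as follows. This is a minimal illustration and not the pipeline actually used: the stopword list, the lemma table, the two example sentences, and the function name preprocess are all assumptions. The resulting token lists are the kind of input a word2vec trainer expects, and the parameter values are those reported above.

```python
# Toy resources (assumptions): a real setup would use a full Spanish
# lemmatizer and a complete stopword list.
STOPWORDS = {"el", "la", "los", "las", "un", "una", "de", "a", "en", "por"}
LEMMAS = {"tomó": "tomar", "tomaron": "tomar",
          "decisiones": "decisión", "diputados": "diputado"}


def preprocess(sentence):
    """Lowercase, split on whitespace, drop stopwords, map wordforms to lemmas."""
    tokens = sentence.lower().split()
    return [LEMMAS.get(token, token) for token in tokens if token not in STOPWORDS]


corpus = ["El presidente tomó una decisión",
          "Los diputados tomaron las decisiones"]
sentences = [preprocess(s) for s in corpus]
print(sentences)
# [['presidente', 'tomar', 'decisión'], ['diputado', 'tomar', 'decisión']]

# Hyperparameters reported in the text. Note that gensim 4 and later name the
# first parameter `vector_size`; `size` is the name used in earlier versions.
WORD2VEC_PARAMS = {"size": 10, "window": 10, "min_count": 1}
```

After this step, the two different wordforms tomó and tomaron contribute to a single vector for tomar, which is the effect sought by lemmatizing before training.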
From the generated word embeddings, we took the vector for the verb and the vector for the noun in each verb-noun pair, and concatenated these two vectors thus getting the vector representation of a verb-noun pair.
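As a minimal sketch of this representation, the snippet below concatenates two toy vectors. The numerical values are made up, the three-dimensional size is chosen only for readability, and placing the verb vector before the noun vector is an assumption about the order of concatenation.

```python
# Toy 3-dimensional embeddings with made-up values; in the experiments the
# vectors have between 10 and 200 dimensions.
embeddings = {
    "tomar": [0.12, -0.40, 0.33],
    "decisión": [-0.25, 0.08, 0.51],
}


def pair_vector(verb, noun, emb):
    """Concatenate the verb vector and the noun vector (verb first, by assumption)."""
    return emb[verb] + emb[noun]


vec = pair_vector("tomar", "decisión", embeddings)
print(len(vec))  # 6
```

With embeddings of d dimensions, each verb-noun pair is thus represented by a vector of 2d features.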
In the training and classification stage, we input the vectors to six machine learning algorithms, commonly used in various natural language processing applications: support vector machine, multi-layered perceptron, k-nearest neighbors, decision tree, random forest, and Ada Boost. In the experiments, we applied these algorithms in their scikit-learn implementation [17], and split our data evenly into the training set (50%) and the test set (50%).
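An even split of the 300 labeled pairs can be sketched as below. The text states only that the data were split evenly; performing the split per class (stratification), the fixed random seed, and the function name stratified_half_split are assumptions of this sketch. Integers stand in for the pair vectors.

```python
import random
from collections import defaultdict


def stratified_half_split(items, labels, seed=0):
    """Split items 50/50 within each class (stratification is an assumption)."""
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)
    rng = random.Random(seed)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        half = len(group) // 2
        train += [(x, label) for x in group[:half]]
        test += [(x, label) for x in group[half:]]
    return train, test


# Five classes with 60 examples each, as in the dataset described above.
classes = ["Oper1", "Real1", "CausFunc0", "CausFunc1", "FWC"]
labels = [c for c in classes for _ in range(60)]
items = list(range(len(labels)))  # placeholders standing in for pair vectors
train, test = stratified_half_split(items, labels)
print(len(train), len(test))  # 150 150
```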
We experimented with different sizes of word embeddings within the interval [10, 200]. The performance of classifiers was evaluated in terms of accuracy, precision, recall, and F1-score, according to formulas (1), where A is accuracy, P is precision, R is recall, F1 is F1-score, TA is true acceptance, TR is true rejection, FA is false acceptance, and FR is false rejection:
\[
A = \frac{TA + TR}{TA + TR + FA + FR}, \quad P = \frac{TA}{TA + FA}, \quad R = \frac{TA}{TA + FR}, \quad F_1 = \frac{2PR}{P + R}. \tag{1}
\]
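As a minimal sketch, assuming the standard definitions of these measures in terms of TA, TR, FA, and FR, the four scores can be computed from the counts for one class as follows. The function name scores and the example counts are made up for illustration; returning zero when a denominator is empty is a convention adopted here.

```python
def scores(ta, tr, fa, fr):
    """Return (accuracy, precision, recall, F1) from acceptance/rejection counts."""
    accuracy = (ta + tr) / (ta + tr + fa + fr)
    # Guard against empty denominators; returning 0.0 is a convention of this sketch.
    precision = ta / (ta + fa) if ta + fa else 0.0
    recall = ta / (ta + fr) if ta + fr else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Made-up counts for one class in a 150-item test set.
a, p, r, f1 = scores(ta=24, tr=110, fa=10, fr=6)
print(round(a, 3), round(p, 3), round(r, 3), round(f1, 3))
```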
Results
Tables 2–6 present the results in terms of F1-score shown by the six supervised learning algorithms mentioned in Section 3; LinearSVC is the scikit-learn notation for support vector machine with linear kernel. All algorithms were executed with default parameters in the scikit-learn package [17]. The best F1-score in each table is in boldface.
F1-scores on detecting Oper1
F1-scores on detecting Real1
F1-scores on detecting CausFunc0
F1-scores on detecting CausFunc1
F1-scores on detecting FWC
In general, it can be observed that increasing the vector size has a positive overall effect on classification, although not equally for all classifiers.
For Oper1 (see Table 2), the best F1-score of 0.72 was achieved by the multi-layered perceptron (MLP) on word embeddings with 90 dimensions. MLP is a neural network able to learn a non-linear approximation to classification by means of the backpropagation algorithm. By default, the hidden layer size is 100, and the activation function is ReLU. For multi-class classification, which is our case, the raw output passes through the softmax function. MLP differs from a deep neural network in that it has only one hidden layer; however, for our task of discriminating among lexical functions, it works better than the other learning algorithms on this function.
An F1-score of 0.72 is much higher than the F1-score of 0.41 obtained in our previous work by LinearSVC [14], run on the same dataset of verb-noun collocations and on the same Spanish corpus. This shows that word embeddings are a more efficient model for capturing contextual characteristics of lexical functions than the classical bag-of-words scheme implemented on a vector space model.
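The role of the two functions named above can be illustrated with a small sketch: ReLU is applied to hidden-layer values, and softmax turns the raw output scores into class probabilities. The raw scores below are made up, and the code is not the scikit-learn implementation; it only shows the two functions on plain Python lists.

```python
import math


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["Oper1", "Real1", "CausFunc0", "CausFunc1", "FWC"]
raw_output = [2.1, 0.3, -1.0, 0.8, 0.1]  # made-up raw scores for one verb-noun pair
probabilities = softmax(raw_output)
predicted = classes[probabilities.index(max(probabilities))]
print(predicted)  # Oper1
```

The predicted class is the one with the highest probability, which is also the one with the highest raw score, since softmax preserves the ordering of its inputs.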
The best result on Real1 detection presented in Table 3 is an F1-score of 0.79 achieved by Random Forest on word embeddings with 130 dimensions. This is also higher than our previous result shown in [14], where an F1-score of 0.65 was shown by Gaussian Naïve Bayes algorithm.
In the present work, we did not experiment with Naïve Bayes algorithms due to numerical incompatibility: variants such as multinomial Naïve Bayes accept only non-negative feature values, while word embeddings include positive as well as negative real numbers.
Random Forest is an ensemble of individual uncorrelated decision trees, each trained on a subset of the training set, also using different input features. In the output, the predicted value is computed as the mode of all individual trees’ predictions.
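Taking the mode of the individual predictions can be sketched as below; the votes and the function name forest_predict are made up for illustration. This follows the description given above. The scikit-learn implementation is documented as averaging the trees' probabilistic predictions rather than counting votes, so the sketch should be read as an illustration of the idea and not of that library's internals.

```python
from collections import Counter


def forest_predict(tree_predictions):
    """Return the most frequent class among the individual trees' predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Made-up predictions of five trees for one verb-noun pair.
votes = ["Real1", "Oper1", "Real1", "CausFunc0", "Real1"]
print(forest_predict(votes))  # Real1
```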
Table 4 gives F1-scores for detecting CausFunc0; here the highest value is 0.70, shown by Random Forest, as in the case of Real1, on embeddings with 120 dimensions. In the previous work [14], an F1-score of 0.51 was shown by multi-layered perceptron.
Table 5 presents the results on CausFunc1 detection; here the best value is an F1-score of 0.84 achieved by LinearSVC on embeddings with 180 dimensions. In our previous experiments [14], an F1-score of 0.59 was shown by the same LinearSVC classifier. Such a big difference in the values demonstrates the superiority of word embeddings over traditional vector representations of context. In fact, an F1-score of 0.84 is quite close to state-of-the-art results shown by methods dependent on manually produced linguistic resources, see Section 2.
Concerning free word combinations, the best result here is an F1-score of 0.60 achieved by LinearSVC on word embeddings with 180 dimensions, see Table 6. This result is not as impressive as that for CausFunc1, but still quite acceptable. In fact, it is not easy even for human experts to decide whether a given verb-noun combination is a collocation or a free word combination, but this is another research topic, not relevant for the present work.
We also present graphs (see Figs. 1–6) displaying how accuracy, precision, and recall, macro-averaged over all five classes (four lexical functions and free word combinations), change with respect to the vector size for each classifier. It can be observed that the performance of support vector machine with linear kernel (LinearSVC), multi-layered perceptron (MLP), and k-nearest neighbor (kNN) is more stable than the behavior of the remaining methods: Decision Tree, Random Forest, and Ada Boost.
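Macro averaging computes a measure separately for each class and then takes the unweighted mean over classes. A minimal sketch for recall is given below; the gold and predicted labels are made up and cover only three classes for brevity, and the function name macro_recall is hypothetical.

```python
def macro_recall(gold, predicted):
    """Unweighted mean of per-class recall over the classes present in `gold`."""
    recalls = []
    for cls in sorted(set(gold)):
        # Predictions made for the items whose gold label is `cls`.
        relevant = [p for g, p in zip(gold, predicted) if g == cls]
        recalls.append(sum(p == cls for p in relevant) / len(relevant))
    return sum(recalls) / len(recalls)


# Made-up labels for six test items.
gold = ["Oper1", "Oper1", "Real1", "Real1", "FWC", "FWC"]
pred = ["Oper1", "Real1", "Real1", "Real1", "FWC", "Oper1"]
print(round(macro_recall(gold, pred), 3))
```

Because the experimental data are balanced across classes, every class carries the same weight in the average regardless of which averaging scheme is chosen for accuracy-like measures.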

Accuracy shown by LinearSVC, multi-layered perceptron (MLP), and k-nearest neighbor (kNN).

Accuracy shown by Decision Tree, Random Forest, and Ada Boost classifiers.

Precision shown by LinearSVC, multi-layered perceptron (MLP), and k-nearest neighbor (kNN).

Precision shown by Decision Tree, Random Forest, and Ada Boost classifiers.

Recall shown by LinearSVC, multi-layered perceptron (MLP), and k-nearest neighbor (kNN).

Recall shown by Decision Tree, Random Forest, and Ada Boost classifiers.
Conclusions and future work
We described a study of detecting four lexical functions of the Meaning-Text Theory [1] in Spanish verb-noun collocations based only on their context in a corpus, without resorting to manually annotated linguistic resources. The experiments were performed on the lexical functions Oper1, Real1, CausFunc0, and CausFunc1 due to their relatively high frequency in texts. The corpus we used was a collection of 1,131 Excelsior newspaper issues. The context of collocations in the corpus was represented with word2vec embeddings, which were then input into six supervised machine learning classifiers: Support Vector Machine, Multi-layered Perceptron, k-Nearest Neighbors, Decision Tree, Random Forest, and Ada Boost.
The results demonstrated the power of word embeddings to capture contextual characteristics of subtle semantic differences formalized by lexical functions. The best result was an F1-score of 0.84 achieved by Support Vector Machine with linear kernel (LinearSVC) on embeddings with 180 dimensions. This result is very close to state-of-the-art results shown by methods which rely on syntactically and semantically annotated lexical resources.
In future work, we intend to explore a deep learning approach to lexical function detection in corpora. High results shown by deep neural networks in other areas of computer science suggest that they can significantly improve performance on this complex task of natural language processing.
