Automatic generation of a large dictionary with concreteness/abstractness ratings based on a small human dictionary

Abstract

Concrete and abstract words are used in a growing number of psychological and neurophysiological studies. For a few languages, large dictionaries have been created manually, which is a very time-consuming and costly process. To generate large, high-quality dictionaries of concrete/abstract words automatically, one needs to extrapolate the expert assessments obtained on smaller samples. The research question that arises is how small such samples can be while still allowing a good enough extrapolation. In this paper, we present a method for automatically rating the concreteness of words and propose an approach that significantly decreases the amount of expert assessment required. The method has been evaluated on a large test set for English. The quality of the constructed dictionaries is comparable to that of the expert ones. The correlation between predicted and expert ratings is higher than that of the state-of-the-art methods.

Keywords

Concrete words, abstract words, word embeddings, fastText, ELMo, BERT, machine extrapolation

1 Introduction

Large dictionaries with assessments of the semantic and psycholinguistic properties of words are an important support for experimental and quantitative research in the cognitive sciences, psychology, and linguistics [24]. Dictionaries containing tens of thousands of words provide a selection of the most appropriate words for research. Unfortunately, the creation of such dictionaries is very expensive and time consuming as it needs experts.

Expert assessments of the various semantic properties of words are used in a wide range of studies from theoretical psychology to word processing. The most popular are assessments of the degree of valence, arousal, dominance, age-of-acquisition, and concreteness.

In this article, we deal with the concreteness/abstractness property. First, let us look at definitions of concreteness and abstractness. The main approach to defining these concepts is presented in [28]. Concrete concepts are those that are perceived by the senses; examples of concrete words are ‘cat’, ‘chair’, ‘mountain’. Abstract concepts are not perceived by the senses; examples are ‘responsibility’, ‘relationships’, ‘misunderstanding’. Similar interpretations are found in many works. For instance, the following definition is given in [21]: “abstract nouns are those nouns whose denotata are not part of the concrete physical world and cannot be seen or touched”.

For these purposes, dictionaries (databases) are created by collecting word scores from respondents. The psycholinguistic database assembled by [7] became the first large publicly available digital resource of this kind for English. Recently, crowdsourcing platforms such as Amazon’s Mechanical Turk have been actively used to rate large numbers of words. An important example of a dictionary with ratings of concreteness/abstractness (C/A ratings) of words 1 for English is the dictionary created by [4], which contains estimates for almost 40 thousand words and phrases. Many studies use this 40-thousand-word dataset for English; the original paper [4] has been cited almost 1000 times according to Google Scholar. Other works (e.g. [12]) highlight the need for large-scale data collection to support research within psychology.

Apart from English, a dictionary containing tens of thousands of words with expert assessments of concreteness/abstractness exists only for Dutch [3]. In this regard, the problem of automatic generation of C/A dictionaries is open for many natural languages. Moreover, Hollis et al. [12] doubt that human ratings provide a good gold standard and discuss the possibility of replacing manual ratings with unambiguous, empirically accessible measures. An obvious requirement for automatic dictionaries is high correlation with human dictionaries. In [17], the authors discuss possible errors in computer dictionaries and note that the higher the correlation between an automatic dictionary and a human one, the less likely it is that significant errors will appear in the automatic dictionary.

The main idea behind the extrapolation of expert assessments to words without a C/A score is to use the semantics learned from a large corpus and to obtain new estimates based on the semantic similarity of words in a vector space built from that corpus. Thus, to create an automatic dictionary in any language, it is necessary to have a large corpus of texts in that language. On this basis, pre-trained vector representations of words (word embeddings) can be built. The major ideas for leveraging pre-trained word embeddings in building a C/A dictionary can be reduced to the following two approaches.

Using an estimator (such as regressor or classifier, e.g. a neural network). The estimator is trained on a portion of the expert assessments and then tested on the rest of the dictionary.

Using a small sample of “reference” abstract and concrete words (called ‘a core’ or ‘a seed’). A C/A rating for a given word is calculated by comparing its similarity to both parts of the core.

Most researchers take the first approach. It is very convenient in view of the development of the theory and technology of machine learning, especially taking into account the progress in deep learning. An obvious disadvantage of this approach is that it needs a lot of data to train classifiers, therefore initial dictionaries with expert judgements should be large. Only a few languages have dictionaries with thousands of words. Thus, for the vast majority of languages, this approach is not available.

In contrast, the second approach does not need a large training set. The core can be quite small, so that manual creation is quite realistic for any language. However, there is currently neither a general theory for the creation of such core words, nor empirical studies related to selection of the size of a seed. The importance of seed selection, its size and its effect on the resulting dictionary have been studied by [17]: “If an optimal set of seed words would increase the accuracy of the extrapolation methods, it would be good to know this”. In our work, we try to give an answer to this question.

Within the framework of the two mentioned approaches, C/A ratings can be generated for arbitrarily large dictionaries, as long as word embeddings are available for a given language. For example, [32] built a dictionary of 114,501 words; [14] created an even bigger one (350,000 words). However, assessing the quality of the dictionaries obtained in this way remains a common problem. Almost all algorithms are tested on small collections of several thousand words, and it is unclear how they will perform on large collections. Most of the existing resources are either created for the most frequent words or have other limitations on input words. For example, the Toronto Word Pool [11], used in [5], contains only fairly frequent common English words with a maximum of 2 syllables or 8 letters. The dictionary with C/A ratings created in [4] also contains the most common, frequent English words. As more frequent words occur in more contexts in a corpus, building C/A dictionaries for them is easier than for less frequent words. In contrast, estimates of C/A ratings for low-frequency words will (on average) be worse, because language models capture fewer contexts of such words. In [25], this assumption was confirmed on Russian-language material.

In the present article, we analyze a large number of semantic cores (or seeds) on the basis of the large dictionary proposed in [4], which contains human assessments for over 39 thousand words. We study general rules for selecting the size of the semantic C/A core. The C/A ratings for the seed words can be derived by a survey of expert respondents. The main finding of the paper is that the amount of data collected in such a survey (enough to generate a high-quality C/A dictionary) may be tens of words (not hundreds or thousands). In addition, for the first time we apply contextualized word embeddings to extrapolate C/A ratings, and show that contextualized word embeddings (ELMo and BERT) provide a better source for semantic core selection than ‘classical’ word embeddings (e.g. fastText). Finding a small semantic core for this problem is directly related to the fundamental problem of reducing the size of the training set for tuning neural networks.

2 Related works

Most of the work on the extrapolation of human ratings has been carried out on English material using two main digital resources [4, 7]. Among the first works, one can mention [29], where the correlation coefficient (Spearman’s rank coefficient, r_s) between the dictionary and the human ratings was 0.64. Table 1 summarizes the results of studies carried out within the classification-based approach. In some works, instead of building a dictionary with ratings, a binary classifier is built to distinguish between concrete and abstract words. In this case, instead of Spearman’s correlation coefficient (r_s), the accuracy of binary classification is calculated.

Table 1
Summary of studies carried out within the classification-based approach (in chronological order). Asterisk (*) denotes German language

Paper	Corpus	Semantic space	Method	Train size	Split	Performance
[17]	[4]	LSA	kNN, RF	37,058	25% /75%	.796 (r_s)
[10]	[7]	LSA	SVR	3,521	67% /33%	.802 (r_s)
[31]	[7]	vector space	LR	2,450	98% /2%	.76 (acc.)
[12]	[4]	skip-gram	SVR	37,058	50% /50%	.829 (r_s)
[20]	[4]	GloVe	NB, RNN, kNN	2,580	81% /19%	.740 (ρ)
[16]	[4]	fastText	SVR	22,797	67% /33%	.887 (r_s)
[5]	[4]	fastText	SVM	32,783	90% /10%	.900 (r_s)
[6]	[6]*	fastText	SVR	4,182	80% /20%	.861 (r_s)

Here are important notes about the papers from Table 1. The works [17, 20] present a comparison of different methods. Authors of [20] trained a neural network on sentences, and not on separate words (800,000 sentences with 2,580 words annotated as abstract and concrete were used). Aiming at exploiting both embeddings and textual data, authors utilized a bidirectional recurrent neural network (RNN) with one layer of forward and backward LSTM cells. Each cell has width of 128, and is wrapped by a dropout wrapper with keep probability 0.85. The output of the LSTM cells is passed to the attention layer which reduces it to the size of 100. GloVe embeddings with 300 dimensions were used as word representations. Given a set of sentences containing a test concept, its final abstractness score was computed by applying the averaging based on the Naive Bayes classifier.

In [5], several extrapolations are given, including those using additional linguistic information for training classifiers, such as a list of suffixes typical for abstract words. For comparison with the results of other studies, the table shows the results of the algorithm without using this additional information.

Note that in almost all works except [17], the test set is much smaller than the training set (in [12] it is equal to the training set in size). It is not clear how well the algorithms described in these papers apply to rating large dictionaries, because this depends heavily on the generalization ability of the proposed model. Although [17] uses a large test set (almost 30 thousand words), its training set is also very large: more than 9 thousand words. Accordingly, the possibilities of applying the algorithm to languages for which only small dictionaries with human estimates exist seem limited. The state-of-the-art result is r_s = 0.9. In 2014, [4] estimated the correlation between the two sets of human ratings from [7] and [4]; the value of Spearman’s r_s was 0.92. Later, in 2017, [12] suggested that this estimate can be treated as the upper limit for the quality of automatic ratings.

Several papers have used a small core of concrete and abstract words. The results are shown in Table 2. In all these works, a greedy algorithm was used to build the core, and the core size, following [32], was set a priori to 40 words. Table 2 shows the size of the dictionary in which the core was sought as the fraction used for training. Within the framework of this approach, sufficiently high correlation coefficients were obtained. However, it should be noted that, for example, in [6] the set in which the core is sought contains more than three thousand words, while the test set contains slightly more than 800 words. Thus, this approach retains the same drawbacks as the one with classifiers. It should also be noted that in none of the works that use the core was the influence of the core size or of the training/test set size studied.

Table 2

Papers that use core-based approach, i.e. use a small core with concrete and abstract words. Asterisk (*) denotes German language. All works apply ‘greedy forward search’ when searching for a better core

Paper	Corpus	Semantic space	Train size	Split	Performance
[32]	[7]	LSA	4,295	50% /50%	.810 (r_s);.847 (acc.)
[14]	[14]*	word2vec	5,237	90% /10%	.825 (r_s)
[6]	[6]*	fastText	4,182	80% /20%	.849 (r_s)

Let us note several noteworthy results of recent works, in addition to those given in Tables 1 and 2. The work of [23] stands apart: the authors extrapolate ratings in a diachronic manner, building a dictionary for a 200-year interval. In [16, 30], extrapolation is carried out not within one language but between languages. [30] use a multilingual skip-gram model. In this case, the interpolation method is trained on the full set of available data for one language. [30] transfer assessments to 77 languages, but data for all the languages are not provided. When transferring estimates from English to Dutch, the correlation coefficient with the expert estimates for Dutch from [3] turned out to be 0.76.

When starting the algorithm from a small core (as in [23, 32]), the question arises of which words to include in the core. In [23], the core of a fixed size included words that are the most frequent and at the same time, according to expert estimates, extremely concrete or abstract. In [32] the core also has a predetermined size of forty words. It is formed iteratively, starting from an empty set and sequentially adding words that are closest in semantic space to abstract or concrete words from the training set. It is shown that when the core is expanded to 100 words, the correlation coefficient on the test set drops.

[10] have shown that words with higher concreteness ratings were more likely to be categorized as artifacts, foods, animals, people, substances, plants, or body parts. Less concrete words were more likely to be categorized as related to cognition, action, shapes, communication, relations, states, events, time, or motives. In addition, in the same work, the authors made the following statement: “... we hypothesized that more concrete words would be less ambiguous (i.e., less polysemous). This was not the case.”

Automatic ratings are constructed for several other languages: for Chinese [33], for Persian [8], for Russian [26, 27].

3 Data and methods

3.1 Datasets and performance measures

In our experiments we use a dictionary with expert C/A ratings, the BRM dictionary for English from [4]. The BRM dictionary contains 39,954 words and phrases (bigrams). It is widely used in studies of concreteness/abstractness estimation. The dictionary has C/A ratings ranging from 1 (most abstract) to 5 (most concrete) for 37,058 unigrams and 2,896 bigrams. Average C/A rating is 3.04.

The second type of resource used in the experiments is a frequency dictionary. For English we use the dictionary of the 1/3 million most frequent words available online 2 ; for more information on the dataset, see Chapter 14 in [22]. The frequency dictionary overlaps with the BRM dictionary, and we combine them because access to word frequencies is important for the proposed method.

Finally, we employ three pre-trained word embeddings to calculate similarity between words: fastText [2], ELMo [19] and BERT [9] (for BERT-based experiments we use bert-base-uncased model and aggregate outputs of last layers of the encoder). Vector representations for words are derived via the Flair Framework [1]. FastText representations of words use subword n-grams, which is useful for C/A prediction, because concreteness/abstractness of a word can depend not only on the context of the word, but also on subword information.

Using fastText enables application of the method to many languages (those for which fastText representations exist), while pre-trained contextualized word embeddings (based on the BERT encoder) exist for a smaller number of languages. To assess the performance of fastText embeddings, we compare them to ELMo and to BERT-based representations of words. The corresponding datasets and models are freely available. We made this choice due to the availability of the word embeddings in many languages.

To evaluate the performance of C/A ratings estimation we use three common metrics: Spearman’s rank correlation (r_s), Pearson correlation coefficient (ρ) and accuracy of binary classification (Acc). The former two metrics can be evaluated on the results of the regression task (predicting C/A rating of a word), while the accuracy (rate of correct predictions) is measured on the binary classification task.
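As an illustration of the three measures, the following is a minimal pure-Python sketch, not the evaluation code used in the experiments. The function names, the toy ratings, and in particular the two thresholds used to binarize ratings for the accuracy measure are assumptions made for this example; the text above does not specify how the binary labels are derived.

```python
import math

def ranks(values):
    """Return 1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's r_s: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def binary_accuracy(predicted, gold, pred_threshold, gold_threshold):
    """Share of words whose predicted class (concrete vs. abstract) matches the gold class.
    Both thresholds are assumptions of this sketch."""
    hits = sum((p >= pred_threshold) == (g >= gold_threshold)
               for p, g in zip(predicted, gold))
    return hits / len(gold)

# Toy data (invented for illustration): human ratings on the 1-5 scale
# and predicted scores on an arbitrary scale.
gold = [4.9, 1.5, 3.2, 2.1]
pred = [1.30, 0.80, 1.10, 0.95]

r_s = spearman(pred, gold)
rho = pearson(pred, gold)
acc = binary_accuracy(pred, gold, pred_threshold=1.0, gold_threshold=3.0)
```

Because r_s depends only on the ordering of the words, it is insensitive to the scale of the predicted scores, which is convenient when predictions are not on the 1 to 5 scale of the human ratings.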

3.2 Methods for estimating C/A ratings

We follow a general semantic-based approach used in the above works and do not deal with syntactic information, POS-tags, etc.

3.2.1 Semantic core method

Here we briefly describe the method based on a semantic core. The key aspect of the method is selection of a semantic core, that is formed by two sets of the abstract and concrete words (with known C/A ratings). The second step of the method is just calculating similarities between word embeddings in a semantic space. Therefore, a rigorous analysis of semantic core selection and optimization of the core is the main contribution of the paper, while using the semantic space is secondary.

First, we define a semantic core, which consists of two sets of words: seed_A with Z abstract words and seed_C with Z concrete words 3 .

$\mathrm{semantic\ core} = \mathrm{seed}_{A} \cup \mathrm{seed}_{C}$

Next, given a word w, the method retrieves an embedding for w and calculates the cosine similarity between w and the words from both seeds. The result of this step is the average similarity between the word w and each seed. We assume the following: the closer the word w is to the concrete seed (seed_C) and the farther it is from the abstract seed (seed_A), the higher the concreteness rank of w should be. The corresponding concreteness score of the word w can be calculated with the following formulas:

$\mathrm{Rating}(w) = \frac{\mathrm{sim}(w, \mathrm{seed}_{C})}{\mathrm{sim}(w, \mathrm{seed}_{A})}$

where

$\mathrm{sim}(w, \mathrm{seed}_{C}) = \frac{1}{|\mathrm{seed}_{C}|} \sum_{c \in \mathrm{seed}_{C}} \frac{w \cdot c}{\lVert w \rVert \, \lVert c \rVert}$

and $\mathrm{sim}(w, \mathrm{seed}_{A})$ is defined analogously over the abstract seed. The cosine similarity is traditionally used in this research domain; other measures of similarity lead to similar results. The score of the word w depends on the selection of the semantic core. Therefore, selection and optimization of the seeds is an important step of the method that is worth studying.
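The rating formula can be sketched in a few lines of Python. This is a minimal illustration that assumes embeddings are already available as plain lists of floats; the two-dimensional toy vectors and the function names are invented for the example and do not come from fastText, ELMo, or BERT.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_similarity(word_vec, seed_vecs):
    """sim(w, seed): average cosine similarity between w and every seed word."""
    return sum(cosine(word_vec, s) for s in seed_vecs) / len(seed_vecs)

def rating(word_vec, seed_concrete, seed_abstract):
    """Rating(w) = sim(w, seed_C) / sim(w, seed_A)."""
    return mean_similarity(word_vec, seed_concrete) / mean_similarity(word_vec, seed_abstract)

# Toy 2-d "embeddings" (hypothetical values): the first axis stands for
# concrete-like contexts, the second for abstract-like contexts.
seed_c = [[1.0, 0.1], [0.9, 0.2]]
seed_a = [[0.1, 1.0], [0.2, 0.9]]
concrete_like = [0.8, 0.3]
abstract_like = [0.3, 0.8]

rating_concrete = rating(concrete_like, seed_c, seed_a)
rating_abstract = rating(abstract_like, seed_c, seed_a)
```

In this toy setting the word lying nearer to the concrete seed receives a rating above 1 and the other a rating below 1. The sketch does not guard against an average similarity to the abstract seed that is close to zero or negative, in which case the ratio becomes unstable.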

3.2.2 Selection and optimization of the semantic core

Selection of the semantic core is possible from a larger dictionary. In this case, we fix several parameters: the size of the larger dictionary (X) and the size of two of its subsets (Y each). Seeds of size Z are then sampled from the subsets (X > Y > Z). This approach resembles the real-world situation in which experts are able to collect C/A ratings for X words but want to minimize the effort, especially if Z words would be enough to build a high-quality dictionary.

1. We take the sub-dictionary of the X most frequent words (the Base dictionary).

2. We select two subsets (each of size Y): the most concrete and the most abstract words.

3. Then, we randomly choose two seeds (each of size Z) 4 .

4. Using this core, we calculate Rating (w) for each word from the Base dictionary.

5. We calculate Spearman’s correlation (r_s) between the test and the predicted ratings.

6. We search for better X, Y, Z and build a complete dictionary for the derived core (optimization step).

7. The optimal values of X, Y, Z are used for testing on the entire test dictionary (we calculate r_s between the test dictionary and the predictions for all words in the test dictionary).
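The first three steps of the procedure above (building the Base dictionary, selecting the two extreme subsets, and sampling the seeds) can be sketched as follows. This is an illustrative sketch only: the tuple layout `(word, frequency, human_rating)`, the function name `sample_core`, and the synthetic entries are assumptions of the example, and the requirement 2Y ≤ X is added here simply to keep the two subsets disjoint.

```python
import random

def sample_core(entries, x, y, z, rng):
    """Sample one semantic core.

    entries: list of (word, frequency, human_rating) tuples (assumed layout).
    Returns (base, seed_concrete, seed_abstract).
    """
    if not (z <= y and 2 * y <= x):
        raise ValueError("expected Z <= Y and 2*Y <= X")
    # Step 1: Base dictionary = the X most frequent words.
    base = sorted(entries, key=lambda e: e[1], reverse=True)[:x]
    # Step 2: the Y most abstract and the Y most concrete words of the Base dictionary.
    by_rating = sorted(base, key=lambda e: e[2])
    most_abstract = by_rating[:y]
    most_concrete = by_rating[-y:]
    # Step 3: random seeds of Z words each, giving a core of 2Z words.
    seed_abstract = rng.sample(most_abstract, z)
    seed_concrete = rng.sample(most_concrete, z)
    return base, seed_concrete, seed_abstract

# Synthetic dictionary (invented): 20 words with decreasing frequency
# and ratings cycling through 1.0 .. 5.0.
entries = [(f"word{i}", 1000 - i, 1.0 + (i % 5)) for i in range(20)]
base, seed_c, seed_a = sample_core(entries, x=10, y=3, z=2, rng=random.Random(0))
```

Passing an explicit `random.Random` instance makes each sampled core reproducible, which helps when the same (X, Y, Z) setting is evaluated many times during optimization.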

The Step 6 above mentions searching for a better (in terms of r_s) core. In this optimization step, the goal is to find a better semantic core. We apply a brute-force strategy that randomly picks and evaluates seeds.

1. We start with the initial value of X (500 words).

2. Set Y = 50 words.

3. We iterate over all core sizes Z greater than ten words and less than or equal to Y.

4. For each Z, we iterate over all cores of size Z (if the number of all combinations is too big, then we stop the search after 100 random cores) and keep the best core.

5. Increase Y with an increment of 50 up to X/3. Repeat steps 3-4.

6. Increase X with an increment of 500 up to 2,500. Repeat steps 2-5.
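The nested loops of this search can be sketched as below. This is a hedged illustration rather than the code used in the experiments: `grid`, `search`, `evaluate_core`, `z_step` and `tries_per_z` are hypothetical names, the sketch always draws random cores instead of enumerating all combinations when few exist, and the scoring function in the example is a toy stand-in for computing r_s on the Base dictionary.

```python
import random

def grid(x_start=500, x_max=2500, x_step=500, y_step=50, z_min=11, z_step=1):
    """Yield every (X, Y, Z) visited by the nested loops of the search."""
    for x in range(x_start, x_max + 1, x_step):       # X = 500, 1000, ..., 2500
        for y in range(y_step, x // 3 + 1, y_step):   # Y = 50, 100, ..., up to X/3
            for z in range(z_min, y + 1, z_step):     # 10 < Z <= Y
                yield x, y, z

def search(evaluate_core, rng, tries_per_z=100, **grid_kwargs):
    """Return (best_score, (X, Y, Z)) over randomly sampled cores.

    evaluate_core(x, y, z, rng) is assumed to sample one core with the given
    parameters and return its score (e.g. r_s against the human ratings).
    """
    best_score, best_params = float("-inf"), None
    for x, y, z in grid(**grid_kwargs):
        for _ in range(tries_per_z):
            score = evaluate_core(x, y, z, rng)
            if score > best_score:
                best_score, best_params = score, (x, y, z)
    return best_score, best_params

def toy_score(x, y, z, rng):
    """Toy stand-in for r_s (invented): peaks at X = 1000, Y = 100, Z = 30."""
    return -(abs(x - 1000) + abs(y - 100) + abs(z - 30))

best_score, best_params = search(toy_score, random.Random(0), tries_per_z=1)
```

In a real run, `evaluate_core` would sample a core, compute Rating (w) for the Base dictionary and return r_s, so the cost of the search is dominated by the number of grid points multiplied by `tries_per_z`.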

Selection of the initial and maximum values for X in the optimization strategy is justified as follows. Apart from a few large expert C/A dictionaries, the majority of them have between 1,000 and 2,700 words. For example, a dictionary for German includes 1,698 words [34]. The database [15] contains 2,654 words. The database [13] includes 1,000 nouns. For English language the database [11] contains 1,080 words, a dictionary from [18] contains 925 words. This indicates real limitations on the size of the expert C/A dictionaries. Based on these data, we made a decision to study methods of automatic dictionary construction, provided that 500–2,500 words with expert assessments are available.

4 Experiments

The experimental results are summarized in Tables 3–8 and presented in Fig. 1. We will start considering the results obtained with the fastText-based model (Tables 3, 6).

Table 3
Comparison of the best results and different values of X for fastText

X	Y	Z	r_s	Difference from the best
500	50	30	0.710	10%
1000	300	130	0.763	4%
1500	200	50	0.775	3%
2000	400	70	0.790	1%
2500	600	90	0.797	0

Table 4

Comparison of the best results and different values of X for ELMo

X	Y	Z	r_s	Difference from the best
500	150	110	0.800	6.5%
1000	300	30	0.825	3.5%
1500	350	30	0.842	1.5%
2000	350	30	0.848	0.8%
2500	200	30	0.855	0

Table 5

Comparison of the best results and different values of X for BERT

X	Y	Z	r_s	Difference from the best
500	150	110	0.788	5.3%
1000	100	90	0.829	0.4%
1500	150	50	0.825	0.8%
2000	600	210	0.829	0.4%
2500	550	270	0.832	0

Table 6

Five best values of correlation at X = 1000 for fastText

Y	Z	r_s
300	130	0.763
100	50	0.760
150	70	0.760
250	70	0.759
100	70	0.759

Table 7

Five best values of correlation at X = 1000 for ELMo

Y	Z	r_s
300	30	0.825
150	50	0.824
100	70	0.824
100	90	0.823
300	210	0.822

Table 8

Five best values of correlation at X = 1000 for BERT

Y	Z	r_s
100	90	0.829
100	70	0.828
250	170	0.827
250	190	0.826
200	70	0.823

From the Table 3 one can see the following:

The more vocabulary available, the better

The difference between the levels of 2000 words and 2500 words is less than 1%

To obtain a dictionary of an acceptable level of quality with the least effort, it is enough to create an expert dictionary of 1000 words (X = 1000)

The optimal value of Y is on average 20% of X. The optimal value of Z is on average 25-30% of Y.

In Table 6, the average is Y = 180, Z = 78. Recommended values for use in practice are Y = 100, Z = 50. These settings give a correlation coefficient that differs from the best by no more than 5%. For a reasonable size of the expert vocabulary (up to 2500 words), a core with parameters X = 2500, Y = 600, Z = 90 was found, giving the best result, 0.797. Figure 1 plots the correlation coefficient obtained for different values of Z. There is a maximum at Z = 90, and it is clear that increasing Z further does not improve the quality of the C/A dictionary. A similar trend is observed for other combinations of X and Y values.

Fig. 1

Comparison of results obtained for different types of embeddings and for different values of Z with fixed values of X = 2500 and Y = 600.

The recommendation is to limit oneself to creating a 1000-word dictionary, which yields a correlation coefficient only 4% worse than the best. When using the exhaustive search algorithm proposed in the article with limited resources, it is recommended to restrict the search to seeds of size Z = 50 with Y = 100, which gives a result that differs from the best by 5% (Table 6). Similar patterns were found for ELMo (Tables 4, 7) and for BERT (Tables 5, 8).

Comparing the results for these three models, we see that ELMo provides the best overall result (0.855). At the same time, if we initially limit ourselves to a dictionary of 1000 words, the ELMo and BERT results are very close, with the BERT result (0.829) slightly better. The fastText results are noticeably worse. Figure 1 shows the relationship between core size (Z) and r_s for all three models. The three plots are qualitatively the same. The ELMo-based model reaches a local maximum already at a core size of 30, and further growth of the core size does not improve the results. We provide words from typical seeds of reasonable size in Table 9.

Table 9

Example of seeds obtained from fastText semantic space with following parameters: X = 2500, Y = 100, Z = 10; r_s = 0.764

Concrete seed	Abstract seed
‘shoe’, ‘clock’, ‘bracelet’, ‘computer’, ‘bird’, ‘bed’, ‘bean’, ‘pantyhose’, ‘neck’, ‘oven’	‘desire’, ‘moment’, ‘reliability’, ‘opportunity’, ‘choice’, ‘concept’, ‘value’, ‘peace’, ‘sensitivity’, ‘democracy’

5 Conclusion

A number of studies have addressed the problem of extrapolating human C/A scores to words that do not have such a rating.

It is fundamentally important to be able to extend human assessments from a small initial set of words to a much larger one. Almost all the previous work, as well as our estimates, were obtained by dividing the initial data set into parts (e.g. 80-90% for training set and 20-10% for test set). Meanwhile, only for two languages (English and Dutch) large human C/A dictionaries for tens of thousands of words have been built. The problem of building a large C/A dictionary for other languages based on a small set of words with human ratings is an urgent problem.

Even smaller sets of initial data are used in the approach associated with the choice of the core of concrete and abstract words and the subsequent calculation of the distance from the evaluated word to the words of the core. Despite its simplicity, the proposed method is able to calculate C/A scores that strongly correlate with human ratings.

However, to date, no systematic study of issues related to the choice of the core has been undertaken. Using the brute-force method, we evaluated a large number of core selection situations, varying in several parameters, including the size of the source data and the size of the core.

The research is carried out for the main word embeddings models: ELMo, BERT and fastText. FastText has the advantage of having pre-trained vectors for 170 languages, which makes it useful for building machine dictionaries for languages other than the most common. However, if there is a pre-trained ELMo or BERT for the language, then our research shows that it is better to use them.

Compared with other works in which the core-based approach was used, we obtain a result slightly higher than the best previously published one: the correlation coefficient is 0.855 versus 0.849 in [6]. Our required core size is 30, which is also smaller than the core size of 40 used in the best previous result [32]. Earlier articles with a core-based approach used various semantic spaces: LSA, word2vec, fastText.

We found a core of size 30 that (in combination with the ELMo semantic space and the limitation of 2,500 initially specified words) gave the result 0.855 (r_s). At the same time, it was shown that the initial dictionary can be even smaller (1,000 words), which leads to r_s = 0.825, only a small (3.5%) deterioration in comparison with the best result found.

When comparing our results with previous works, one should pay attention to the following. Within the framework of the core-based approach, the best result was obtained in [6] for the German language (0.849). At the same time, a larger set of words was used for training (3,300 words). For the English language, the best result (0.810) was presented in [32], using a dictionary with 2,150 words. In the framework of the classification-based approach, the best result is 0.9 (r_s), but it requires a large training set of almost 30,000 words, which is currently available only for two languages. For these two languages (English and Dutch), however, the problem of constructing a large set with C/A ratings is already solved, and a machine approximation is not needed.

Dependencies of the result on the parameters under consideration have been established, which generally clarifies the structure of the semantic space of words from the standpoint of concreteness/abstractness. With a fixed size of the original set of words with human ratings, the core size has some optimal value, and as the core size increases, the results deteriorate. Earlier, a similar dependence was announced, but not published in works using other semantic models and methods for constructing the core. Thus, it seems that a certain universal pattern has been established, independent of the specific technique used.

In general, the conclusion from this study is a recommendation for those languages where C/A dictionaries have not yet been built, to limit themselves to building a small human dictionary and then apply computational method to automatically extrapolate estimates to a large dictionary. It is an opportunity to quickly and inexpensively get a high-quality C/A dictionary. The study demonstrates (and, for the first time, validates) a possibility to derive large high-quality language resources (C/A ratings) using a small training set. Usage of the simple (or even trivial) methods is definitely an advantage of the approach that enables its application to many low-resource languages.

Our future plans include conducting similar studies for other semantic characteristics of words as well as estimation of C/A ratings for other languages.

Acknowledgments

This paper has been supported by the Kazan Federal University Strategic Academic Leadership Program.

Footnotes

1 We will denote such dictionaries in abbreviated form as ‘C/A dictionaries’.

3 Further, we refer to the two sets as ‘seeds’. Also, we use terms ‘concrete seed’ and ‘abstract seed’ to refer to the corresponding set.

4 In particular, Z words from the concrete subset comprise a concrete seed, while other randomly selected Z words from the abstract subset comprise an abstract seed. After this step we have a semantic core with 2Z words.

References

Akbik

Alan

, Blythe

Duncan

and Vollgraf

Roland

, Contextual string embeddings for sequence labeling. In COLING 2018, 27th International Conference on Computational Linguistics, 2018, pp. 1638–1649.

2. Bojanowski Piotr, Grave Edouard, Joulin Armand and Mikolov Tomas, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017), 135–146.

3. Brysbaert Marc, Stevens Michaël, De Deyne Simon, Voorspoels Wouter and Storms Gert, Norms of age of acquisition and concreteness for 30,000 Dutch words, Acta Psychologica 150 (2014a), 80–84.

4. Brysbaert Marc, Warriner Amy Beth and Kuperman Victor, Concreteness ratings for 40 thousand generally known English word lemmas, Behavior Research Methods 46(3) (2014b), 904–911.

5. Charbonnier Jean and Wartena Christian, Predicting word concreteness and imagery. In Proceedings of the 13th International Conference on Computational Semantics - Long Papers, 2019, pp. 176–187.

6. Charbonnier Jean and Wartena Christian, Predicting the concreteness of German words. In Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS). CEUR-WS. Vol-2624, 2020.

7. Coltheart Max, The MRC psycholinguistic database, The Quarterly Journal of Experimental Psychology Section A 33(4) (1981), 497–505.

8. Dadras Parinaz and Ramezani Majid, Codac: Concreteness degree auto-calculator of Persian words, International Journal of Computer Science and Information Security (IJCSIS) 15(5) (2017).

9. Devlin Jacob, Chang Ming-Wei, Lee Kenton and Toutanova Kristina, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805, 2018.

10. Feng Shi, Cai Zhiqiang, Crossley S and McNamara Danielle, Simulating human ratings on word concreteness. In FLAIRS Conference, 2011.

11. Friendly Michael, Franklin Patricia E, Hoffman David and Rubin David C, The Toronto word pool: Norms for imagery, concreteness, orthographic variables, and grammatical usage for 1,080 words, Behavior Research Methods & Instrumentation 14(4) (1982), 375–399.

12. Hollis Geoff, Westbury Chris and Lefsrud Lianne, Extrapolating human judgments from skip-gram vector representations of word meaning, Quarterly Journal of Experimental Psychology 70(8) (2017), 1603–1619.

13. Kanske Philipp and Kotz Sonja A, Leipzig affective norms for German: A reliability study, Behavior Research Methods 42(4) (2010), 987–991.

14. Köper Maximilian and Schulte im Walde Sabine, Automatically generated affective norms of abstractness, arousal, imageability and valence for 350 000 German lemmas. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 2016, pp. 2595–2598.

15. Lahl Olaf, Göritz Anja S, Pietrowsky Reinhard and Rosenberg Jessica, Using the world-wide web to obtain large-scale word norms: 190,212 ratings on a set of 2,654 German nouns, Behavior Research Methods 41(1) (2009), 13–19.

16. Ljubešić Nikola, Fišer Darja and Peti-Stantić Anita, Predicting concreteness and imageability of words within and across languages via word embeddings, arXiv preprint arXiv:1807.02903, 2018.

17. Mandera Paweł, Keuleers Emmanuel and Brysbaert Marc, How useful are corpus-based methods for extrapolating psycholinguistic variables? Quarterly Journal of Experimental Psychology 68(8) (2015), 1623–1642.

18. Paivio Allan, Yuille John C and Madigan Stephen A, Concreteness, imagery, and meaningfulness values for 925 nouns, Journal of Experimental Psychology 76(1p2) (1968), 1.

19. Peters Matthew E., Neumann Mark, Iyyer Mohit, Gardner Matt, Clark Christopher, Lee Kenton and Zettlemoyer Luke, Deep contextualized word representations. In Proc. of NAACL, 2018.

20. Rabinovich Ella, Sznajder Benjamin, Spector Artem, Shnayderman Ilya, Aharonov Ranit, Konopnicki David and Slonim Noam, Learning concept abstractness using weak supervision, arXiv preprint arXiv:1809.01285, 2018.

21. Schmid Hans-Jörg, English abstract nouns as conceptual shells. De Gruyter Mouton, 2012.

22. Segaran Toby and Hammerbacher Jeff, editors, Beautiful Data: The Stories Behind Elegant Data Solutions. O’Reilly, Beijing, 2009. https://www.safaribooksonline.com/library/view/beautiful-data/9780596801656/

23. Snefjella Bryor, Généreux Michel and Kuperman Victor, Historical evolution of concrete and abstract language revisited, Behavior Research Methods 51(4) (2019), 1693–1705.

24. Solovyev Valery, Concreteness/abstractness concept: State of the art. In Proceedings of the 9th International Conference on Cognitive Sciences. Advances in Cognitive Research, Artificial Intelligence and Neuroinformatics. Advances in Intelligent Systems and Computing 1358 (2021), 275–283. Springer International Publishing.

25. Solovyev Valery, Bochkarev Vladimir and Khristoforov Stanislav, Generation of a dictionary of abstract/concrete words by a multilayer neural network, Journal of Physics: Conference Series, 2020.

26. Solovyev Valery and Ivanov Vladimir, Automated compilation of a corpus-based dictionary and computing concreteness ratings of Russian. In Proceedings of the 2020 Conference on Speech and Computer, SPECOM 2020, 2020, pp. 554–561. Springer, Cham.

27. Solovyev Valery, Ivanov Vladimir and Akhtiamov Rauf, Dictionary of abstract and concrete words of the Russian language: a methodology for creation and application, Research in Applied Linguistics 10 (Proceedings of the 6th International Conference on Applied Linguistics Issues (ALI 2019), July 19-20, 2019, Saint Petersburg, Russia) (2019), 218–230.

28. Spreen Otfried and Schulz Rudolph W, Parameters of abstraction, meaningfulness, and pronunciability for 329 nouns, Journal of Verbal Learning and Verbal Behavior 5(5) (1966), 459–468.

29. Theijssen Daphne, van Halteren Hans, Boves Lou and Oostdijk Nelleke, On the difficulty of making concreteness concrete, Computational Linguistics in the Netherlands Journal 1 (2011), 61–77.

30. Thompson Bill and Lupyan Gary, Automatic estimation of lexical concreteness in 77 languages. In The 40th Annual Conference of the Cognitive Science Society (CogSci 2018), 2018, pp. 1122–1127. Cognitive Science Society.

31. Tsvetkov Yulia, Mukomel Elena and Gershman Anatole, Cross-lingual metaphor detection using common semantic features. In Proceedings of the First Workshop on Metaphor in NLP, 2013, pp. 45–51.

32. Turney Peter, Neuman Yair, Assaf Dan and Cohen Yohai, Literal and metaphorical sense identification through concrete and abstract context. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 2011, pp. 680–690.

33. Wang Xiaomei, Su Chang and Chen Yijiang, A method of abstractness ratings for Chinese concepts. In UK Workshop on Computational Intelligence, 2018, pp. 217–226. Springer.

34. Wippich Werner and Bredenkamp Jürgen, Bildhaftigkeit und Lernen, volume 78. Springer-Verlag, 2013.

Table 3
Comparison of the best results and different values of X for fastText

X      Y     Z     r_s     Difference from the best
500    50    30    0.710   10%
1000   300   130   0.763   4%
1500   200   50    0.775   3%
2000   400   70    0.790   1%
2500   600   90    0.797   0%
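The last column of Table 3 can be checked with a short calculation. The sketch below assumes that "difference from the best" means the relative shortfall of r_s from the best value, (r_best - r_s) / r_best, expressed in percent; this definition is an assumption, since the formula is not stated alongside the table, and the helper name `relative_shortfall` is hypothetical.

```python
# Rows of Table 3: (X, r_s, reported difference from the best, in percent).
rows = [
    (500, 0.710, 10),
    (1000, 0.763, 4),
    (1500, 0.775, 3),
    (2000, 0.790, 1),
    (2500, 0.797, 0),
]

best = max(r for _, r, _ in rows)


def relative_shortfall(r, best):
    """Percentage by which r falls short of the best correlation (assumed definition)."""
    return 100.0 * (best - r) / best


computed = {x: relative_shortfall(r, best) for x, r, _ in rows}
for x, r, reported in rows:
    print(f"X={x:4d}  r_s={r:.3f}  computed={computed[x]:.1f}%  reported={reported}%")
```

Under this assumed definition the computed values agree with the reported column to within one percentage point, so a human dictionary of 1000 to 1500 words already comes within a few percent of the result obtained with 2500 words.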