A semantic textual similarity measurement model based on the syntactic-semantic representation

Abstract

Measuring semantic textual similarity (STS) lies at the core of many applications in natural language processing (NLP). Most recent models consider either semantic information or syntactic information, but seldom combine the two in a unified model. Drawing on the knowledge in trained word vectors, this paper proposes a semantic-embedded dependency tree (SEDT) model based on word2vec and glove, which can be treated as a syntactic-semantic representation. Because the words in a sentence contribute differently to its meaning, the model is extended to an enhanced semantic-embedded dependency tree (E-SEDT). A modified partial tree kernel (MPTK) is proposed to automatically extract the syntactic-semantic patterns in this tree. Because syntactic information, semantic knowledge, and the contribution distribution of the word attention model are all considered, the model captures more comprehensive sentence semantics and improves the accuracy of STS results. Finally, SEDT/E-SEDT is applied to the SemEval semantic textual similarity tasks, and its performance is evaluated with two widely used measures: the Pearson correlation coefficient and the Spearman correlation coefficient. The experimental results show that SEDT/E-SEDT effectively improves the accuracy of sentence similarity judgments. Compared with other methods for calculating semantic similarity, including some neural network models, SEDT/E-SEDT obtains better performance on most datasets.

Keywords

Semantic textual similarity, sentence structural representation, structural kernel, word embedding, attention mechanism

1. Introduction

Measuring textual similarity is one of the basic tasks of natural language processing, and semantic similarity measurement plays an important role in many scenarios. In most current studies, the main goal is to build a scoring model that agrees with human judgment when calculating the similarity of short texts. Semantic textual similarity (STS) can be defined by a set of metrics over a set of documents; the main idea is to find the semantic similarities, which can be quantified by the semantic relations that exist between them [10, 37]. Early STS methods usually targeted short texts [26, 29], and few studies have been concerned with larger linguistic levels [3].

Measuring the semantic similarity of text pairs has been applied in many natural language processing fields. It can evaluate the output quality of machine translation systems [24, 47]. In Twitter search [38], it can accurately measure the semantic relatedness between concepts or entities. In information retrieval [8], it can retrieve the set of documents that the user wants. Semantic textual similarity is also widely used in paraphrase recognition [18], text classification [36, 48], question answering [19, 15], text summarization [43] and textual entailment [4].

A large number of STS solutions already exist; they can be categorized as topological/knowledge based, statistical/corpus based, string based, and so on. The core idea behind most solutions is to identify and align semantically similar or related terms in both sentences, and to aggregate these similarities into an overall similarity. Most STS methods treat input text pairs as feature vectors, where each element is a score that corresponds to a certain type of similarity. A system following this idea achieved the best results in the SemEval 2012 [3] shared tasks. However, this approach does not fully take into account the syntactic information of the sentences. Tree kernel methods, on the other hand, can effectively learn the syntactic information of the sentences when computing sentence similarity. But most structural kernel methods compare tree structures only by hard string matching, and these simple judgments lose semantic information. Some studies use soft matching to compare sentences, such as using LSA [13, 7] for semantic annotation, but these methods work by stacking semantic features: they need massive additional resources to generate the features and also lack expressive capacity for short texts. These limitations inspired our work: through a unified model, we can make full use of both the syntactic information of sentences and the semantic information of the words in them.

To obtain a more complete textual representation, the syntactic and semantic knowledge of the sentence must both be fully considered. A good way to introduce semantic information is to compare semantic distances using word vectors: the vectors generated by word embeddings reflect the semantic relations of words in a high-dimensional space. This paper proposes a semantic-embedded dependency tree (SEDT) for semantic textual similarity, which treats the input short text pairs as structural objects, encodes word vector knowledge into the structural representation, and then relies on the power of kernel learning to automatically extract the implicit feature patterns. Moreover, because the words in a sentence differ in importance, we extend the original semantic-embedded dependency tree to an enhanced semantic-embedded dependency tree (E-SEDT), which distinguishes the contributions of words of different importance when calculating sentence similarity. In this way, a more comprehensive sentence representation is obtained for calculating the similarity.

To evaluate the practical effects of SEDT/E-SEDT, we compare our methods with six similarity methods on the STS-12 [3] shared tasks. On the STS-13 [2] and STS-14 [1] shared tasks, SEDT/E-SEDT is also compared against six other state-of-the-art neural network approaches. The results show that E-SEDT achieves state-of-the-art results in two of the three STS tasks and the best results on 7 out of 14 datasets. The contributions of this paper are as follows.

• Considering the complexity of sentence semantics, we propose an enhanced semantic-embedded dependency tree that integrates syntactic information, semantic knowledge, and the contributions of words of different importance.
• A modified partial tree kernel (MPTK) is proposed based on the partial tree kernel. MPTK is a tree kernel function that can be calculated on the novel structural tree.
• We apply SEDT/E-SEDT to the SemEval semantic textual similarity tasks and evaluate its performance with some of the most common benchmarks. Experimental results show that SEDT/E-SEDT effectively improves the accuracy of sentence similarity judgments.

The rest of the paper is organized as follows. Section 2 surveys the related works. Section 3 introduces the overall process for STS calculation. Section 4 proposes the semantic-embedded dependency tree and its extended model. The performance evaluation is given in Section 5. Section 6 concludes the paper.

2. Related works

The STS methods currently under study include topological/knowledge based [28], statistical/corpus based and machine learning based methods. Topological/knowledge based methods include node-based [34], edge-based [21] and hybrid methods, which combine the node-based and edge-based approaches. Statistical/corpus based methods usually rely on a statistical model to estimate semantic similarity. Explicit Semantic Analysis (ESA) [16] is a vectorial representation of texts that uses a document corpus as the knowledge base and can be applied to individual words or entire documents. In this type of model, a word is represented as a column vector in the TF-IDF matrix of the text corpus. Latent Semantic Analysis (LSA) [13, 7] assumes that words that are close in meaning will occur in similar pieces of text. In these models, matrices containing the word co-occurrence counts per paragraph are constructed from a large corpus of text. Currently, the most popular approaches are machine learning models based on neural networks.

Recent work has moved away from handcrafted features and towards modeling with distributed representations and neural network architectures. Kalchbrenner et al. [22] introduced a convolutional neural network for sentence modeling that uses dynamic k-max pooling to better model inputs of varying sizes. Socher et al. [42] used a recursive neural network to model each sentence, recursively computing the representation of the sentence from the representations of its constituents in a binarized constituent parse. Kiros et al. [23] proposed skip-thought vectors, which train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Wang et al. [44] assign different weights to the word vectors when building the sentence embedding, which improves semantic textual similarity performance. A variety of other neural network models have been proposed for similarity tasks [45, 5]. Neural network methods have greatly improved results on the semantic textual similarity task. However, their training processes are often time-consuming, and the syntactic features of sentences often cannot be fully exploited.

Previous work shows the necessity of syntactic information in sentence semantics. Structured trees, which include syntactic information, are another way of representing sentences [12], but it is hard to measure similarity between such trees directly. Severyn et al. used the kernel trick to measure the similarity between two elements of the input space by mapping them into a feature space [40]. Collins et al. proposed the concept of tree kernels [9], which helps overcome the difficulty of comparing structured trees efficiently. Haussler et al. proposed the convolution kernel [20], which uses the substructures of a discrete structure (e.g. string, tree, or graph) to measure similarity. Collins and Duffy also proposed the convolution tree kernel [9] for natural language processing problems. Duffy proposed the SST kernel [14], which was tested with the voted perceptron on parse-tree re-ranking tasks. Through the most common tree kernels, such as the partial tree kernel (PTK) [31] and the smoothed partial tree kernel (SPTK) [11], structured trees can generate various specific structures that in turn represent syntactic or shallow semantic features.

Most of the studies described above regard the input text pairs as feature vectors in which each element is a score corresponding to a certain type of similarity, such as lexical, syntactic and semantic similarity metrics. These methods can be grouped into two categories: (1) those that view a text as a combination of words and calculate the similarity of two texts by aggregating the similarities of word pairs across the two texts, and (2) those that model a text as a whole and calculate the similarity of two texts by comparing the two models obtained. SEDT/E-SEDT belongs to the second category.

The following studies are the most relevant to our work. Pilehvar and Navigli [33] developed an algorithm called ADW for measuring semantic similarity, a unified approach that allows efficient comparison of language items at different linguistic levels. Using the personalized PageRank algorithm, they generate semantic signatures of language items from external semantic networks such as WordNet, and then compare the items on the basis of these signatures. However, the method cannot fully exploit grammatical information, because the semantic signatures of short texts are generated by simply adding lexical semantic signatures. Our method is similar in two respects: (1) both address the task of semantic textual similarity; and (2) both represent words or texts as vectors through modeling. But ADW does not involve word embedding knowledge or dependency structure information, which may cause some degree of semantic loss.

Severyn and Moschitti [39] defined a supervised approach to learning reordering models by exploiting the structural relationships between questions and candidate answer sections using sequence and tree kernels. Although the approach can automatically extract relational features between relevant questions and answers, hard string matching can only extract syntactic features, so semantic information is lost when calculating the similarity. Consequently, when two sentences express the same meaning with completely different syntax, the computed score is very low. SEDT/E-SEDT is similar to this method in making full use of the syntactic trees of sentences. But the method has two drawbacks: (1) it does not uniformly model dependency structure information and word embedding knowledge, and (2) it does not discriminate the importance of different words to the sentence semantics.

In conclusion, the above models do not give an effective way to fully fuse the semantics and the syntax of sentences. As a tentative improvement, we propose a semantic-embedded dependency tree that integrates both syntactic information and semantic knowledge. We represent a sentence as a dependency tree structure and propose a modified partial tree kernel function that calculates the similarity score of two structural trees together with their syntactic information. Moreover, because each word influences the sentence semantics differently, we also propose an enhanced semantic-embedded dependency tree to quantify the differences in word semantics.

3. Process overview

Modeling text similarity is complicated by the ambiguity and variability of language expression. It is difficult to model syntactic and semantic information effectively through a unified approach. As a solution, we design a semantic-embedded dependency tree (SEDT) by encoding word embedding knowledge into the dependency relations of the sentence. Because most existing models treat each word in a sentence equally, we design an enhanced semantic-embedded dependency tree (E-SEDT) to remedy this defect: in the textual similarity task, important words should be given more attention. As shown in Fig. 1, an attention weight for each word can be obtained by analyzing the dataset. We construct the E-SEDT tree of a sentence by integrating the word vectors, the word weights and the dependency relations of the sentence. For a sentence, this type of tree, which contains syntactic information, semantic knowledge and the weight distribution of the words in the sentence, is an effective syntactico-semantic representation.

Figure 1.

Construction process of the enhanced semantic-embedded dependency tree (E-SEDT).

Figure 2.

Measure the similarity process using the new sentence tree structure.

Figure 2 shows the overall process of measuring the similarity of two specific sentences. The given sentence pair is first converted into tree structures. To handle the SEDT and E-SEDT trees, a novel tree kernel function, MPTK, is proposed on the basis of PTK. Note that MPTK is a universal tree kernel function that can be calculated on both SEDT and E-SEDT without additional modification.

4. Structure modeling for STS

4.1 Tree structure

Previous work has shown the importance of encoding lexical semantic information about sentence pairs into their structural representations [6, 40]. In this section, we design and compare several syntactic and semantic structural representations of sentence pairs, including the dependency tree, the semantic-embedded dependency tree, and the enhanced semantic-embedded dependency tree. As the core module, an improved tree kernel is given to automatically extract the syntactico-semantic patterns of the new tree structure.

Dependency tree. Similarly to the passage reranking models [41], we represent a pair of sentences as two trees with lemmas at the leaf level and their dependency-relation tags at the preterminal level. That is, a sentence is represented as a dependency tree in which each leaf node represents a word of the sentence and each preterminal node represents the dependency relation between two words. The dependency-relation tags indicate the relationships between “head” words and the words that modify those heads. We first obtain the syntactic dependency relations of the words in a sentence; Figure 3a shows the dependency relations of an example sentence. We then use a variation of the dependency tree in which the dependency relations are altered so that the words are always at the leaf level. This reordering of the nodes, such that words do not form long chains as is typical in the standard dependency tree representation, is essential for PTK to extract meaningful fragments. Figure 3b shows the dependency tree (DT) of the example. We also add part-of-speech tags between the words and the nodes encoding their grammatical roles (provided by the original dependency parse tree). Figure 4 shows the dependency tree (DT) of the example with POS tags.

Figure 3.

The dependency relations and the corresponding dependency tree (DT).

Figure 4.

The dependency tree (DT) with POS tags.

Semantic-embedded dependency tree. In most semantic textual similarity tasks, the leaf nodes of dependency trees are compared by hard matching, which loses some of the semantic information of the lexical nodes. We propose a new tree structure representation of a sentence, the semantic-embedded dependency tree, which encodes word embedding knowledge into the tree: lemmas are represented by n-dimensional real-valued vectors, where each dimension represents a latent feature of the word that reflects its semantic and syntactic properties. On the basis of the dependency tree, we append “::” and the lexical vector generated by word2vec [30] or glove [32] from a large unlabeled corpus to each leaf, e.g., “joint::vec”. Figure 5 illustrates the SEDT representation of a sentence. Depending on the source of the word embedding knowledge, semantic-embedded dependency trees are divided into word2vec-embedded dependency trees and glove-embedded dependency trees.

Figure 5.

Semantic-embedded dependency tree (SEDT).

A word2vec-embedded dependency tree encodes word embedding knowledge trained with the word2vec tool.¹

¹
https://code.google.com/archive/p/word2vec/source.

Word2vec was introduced by Mikolov et al. in 2013 and implements two model architectures: skip-gram and CBOW [30]. In the skip-gram model, we are given a corpus of words $w$ and their contexts $c$. Considering the conditional probabilities $p(c|w;\theta)$ with a given corpus $T$, the goal of Eq. (1) is to find the parameter value $\theta$ that maximizes the corpus probability:

$\displaystyle\mathop{\textit{argmax}}\limits_{\theta}\prod_{w\in T}\left[\prod_{c\in C(w)}p(c|w;\theta)\right]$ (1)

where $C(w)$ is the set of contexts of word $w$.
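As an illustration of Eq. (1), the sketch below evaluates the logarithm of the corpus probability on a toy corpus. It assumes that $p(c|w;\theta)$ is a softmax over dot products of word and context vectors, which is a common parameterization but is not stated in the text above; the function name, the toy vectors and the (word, context) pairs are all hypothetical.

```python
import math


def log_corpus_probability(pairs, word_vecs, context_vecs):
    """Sum of log p(c|w; theta) over all (word, context) pairs.

    Assumption of this sketch: p(c|w; theta) is a softmax over the dot
    products between the word vector and every context vector.
    """
    total = 0.0
    for w, c in pairs:
        v_w = word_vecs[w]
        # Score of every candidate context for this word.
        scores = {k: sum(a * b for a, b in zip(v_w, v_k))
                  for k, v_k in context_vecs.items()}
        # Log of the softmax normalizer, shifted by the max for stability.
        m = max(scores.values())
        log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
        total += scores[c] - log_z
    return total


# Toy parameters theta: two-dimensional vectors (hypothetical values).
word_vecs = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
context_vecs = {"pet": [1.0, 0.0], "road": [0.0, 1.0]}

observed = [("cat", "pet"), ("dog", "pet"), ("car", "road")]
mismatched = [("cat", "road"), ("dog", "road"), ("car", "pet")]

good = log_corpus_probability(observed, word_vecs, context_vecs)
bad = log_corpus_probability(mismatched, word_vecs, context_vecs)
```

Because the logarithm is monotonic, maximizing this sum over $\theta$ is equivalent to maximizing the product in Eq. (1); with these toy vectors the observed pairs score higher than the mismatched ones.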

A glove-embedded dependency tree is similar to a word2vec-embedded dependency tree; the only difference is that the word embedding knowledge is generated with the glove tool.²

²
https://nlp.stanford.edu/projects/glove/.

Enhanced semantic-embedded dependency tree. Because different words in a sentence make different semantic contributions, we propose a method for assigning different lexical weights within sentences. By combining lemma weights with word embeddings, the enhanced semantic-embedded dependency tree includes not only syntactico-semantic knowledge but also lexical weight information. On the basis of the semantic-embedded dependency tree, we add “::wei” to each leaf node, e.g., “joint::wei::vec”. Figure 6 illustrates an E-SEDT representation of a sentence, and Fig. 7 shows the E-SEDT representation with POS tags.

Figure 6.

Enhanced semantic-embedded dependency tree (E-SEDT).

Figure 7.

Enhanced semantic-embedded dependency tree (E-SEDT) with POS tags.

The weights in E-SEDT are defined by the TF-IDF scheme. The TF part counts the number of times each word occurs in each sentence, which is called its term frequency. Treating each sentence as a “document”, the value of $\textit{IDF}_{w}$ is given by Eq. (2):

$\displaystyle\textit{IDF}_{w}=\textit{log}\left(\frac{1+N}{1+N_{w}}\right)$ (2)

where $N$ is the total number of sentences, $N_{w}$ is the number of sentences containing $w$, and 1 is added to avoid division by 0. In the experiments, $\textit{IDF}_{w}$ is calculated on the textual similarity datasets currently in use. Note that the TF-IDF enhanced semantic-embedded dependency tree includes the TF-IDF word2vec-embedded dependency tree and the TF-IDF glove-embedded dependency tree.
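A minimal sketch of Eq. (2), treating each tokenized sentence as a document. The function names and the toy sentences are hypothetical, and combining TF and IDF by multiplication is the usual TF-IDF convention, assumed here because the text does not spell out the product.

```python
import math
from collections import Counter


def idf_table(sentences):
    """IDF_w = log((1 + N) / (1 + N_w)) for every word in the collection."""
    n = len(sentences)
    sentence_freq = Counter()
    for tokens in sentences:
        sentence_freq.update(set(tokens))  # count each word once per sentence
    return {w: math.log((1 + n) / (1 + n_w)) for w, n_w in sentence_freq.items()}


def tfidf_weights(tokens, idf):
    """Weight of each word in one sentence: raw term frequency times IDF.

    Assumes every token of the sentence appears in the idf table.
    """
    tf = Counter(tokens)
    return {w: tf[w] * idf[w] for w in tf}


# Toy collection of three tokenized sentences (hypothetical data).
sentences = [
    ["a", "man", "plays", "guitar"],
    ["a", "woman", "plays", "piano"],
    ["a", "dog", "runs"],
]
idf = idf_table(sentences)
weights = tfidf_weights(sentences[0], idf)
```

Under Eq. (2), a word that occurs in every sentence (here "a") receives an IDF of log(1) = 0, so its leaf contributes nothing to a weighted similarity, while rarer words such as "guitar" receive larger weights.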

4.2 Tree structure pruning

In the similarity task, an efficient representation of the sentences is essential. A straightforward approach is to prune away all nodes that are stop words, as they presumably play an insignificant role in relating a sentence pair and do not carry important contextual information. Removing the stop words of a sentence is equivalent to pruning the dependency tree by cutting off the tree nodes of the corresponding lemmas together with their paths. For instance, Fig. 8a shows the dependency tree of an original sentence, and Fig. 8b shows the dependency tree of this sentence after structure pruning. The result is a concise version of the original sentence that preserves the most essential parts while reducing the complexity of the tree kernel computation.
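The pruning step can be sketched as a recursive filter over the tree. The `Node` class, the tag names in the example tree and the tiny stop-word list are hypothetical stand-ins; the sketch only assumes that lemmas sit at the leaves and that an internal node left without children is removed along with its path.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


STOP_WORDS = {"a", "an", "the", "is", "of"}  # illustrative subset only


def prune(node: Node, stop_words=STOP_WORDS) -> Optional[Node]:
    """Return a copy of the tree without stop-word leaves and empty paths."""
    if not node.children:  # a leaf holds a lemma
        return None if node.label.lower() in stop_words else node
    kept = [p for p in (prune(c, stop_words) for c in node.children)
            if p is not None]
    if not kept:  # every leaf below this node was a stop word
        return None
    return Node(node.label, kept)


def leaves(node: Node) -> List[str]:
    """Collect the leaf labels from left to right."""
    if not node.children:
        return [node.label]
    return [lemma for child in node.children for lemma in leaves(child)]


# Hypothetical tree for "the man play a guitar" with lemmas at the leaves.
tree = Node("ROOT", [
    Node("NSUBJ", [Node("DET", [Node("DT", [Node("the")])]),
                   Node("NN", [Node("man")])]),
    Node("VBZ", [Node("play")]),
    Node("DOBJ", [Node("DET", [Node("DT", [Node("a")])]),
                  Node("NN", [Node("guitar")])]),
])
pruned = prune(tree)
```

Here the `DET` branches disappear entirely because their only leaves are stop words, which is the "corresponding path" removal described above; the original tree is left unmodified.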

Figure 8.

An example of structure pruning on a syntactic tree.

4.3 Tree kernel

In general, tree kernels compute the number of common substructures between two trees $T_{1}$ and $T_{2}$ without explicitly considering the whole fragment space. Tree kernel models are a very effective means of automatic feature engineering for natural language texts. A tree kernel function over $T_{1}$ and $T_{2}$ is defined as $\textit{TK}(T_{1},T_{2})=\sum_{n_{1}\in N_{T_{1}}}\sum_{n_{2}\in N_{T_{2}}}\Delta(n_{1},n_{2})$, where $N_{T_{1}}$ and $N_{T_{2}}$ are the sets of nodes in $T_{1}$ and $T_{2}$. The essential distinction between different tree kernel functions is the $\Delta$ function, which calculates the similarity of subtrees. This section gives a simplified description of the partial tree kernel (PTK), on which the modified partial tree kernel (MPTK) in SEDT is based. The kernel score $\Delta_{\textit{PTK}}$ of PTK is computed according to the productions at the tree nodes $n_{1}$ and $n_{2}$. If the productions at $n_{1}$ and $n_{2}$ are different, the kernel score is given by Eq. (3):

$\displaystyle\Delta_{\textit{PTK}}(n_{1},n_{2})=0$ (3)

Otherwise,

$\displaystyle\Delta_{\textit{PTK}}(n_{1},n_{2})=\upsilon\left(\lambda^{2}+\sum_{p=1}^{l_{m}}\Delta_{p}(c_{n_{1}},c_{n_{2}})\right)$ (4)

where $\Delta_{p}$ counts the number of common sub-trees rooted in subsequences of exactly $p$ children (of $n_{1}$ and $n_{2}$), $c_{n_{1}}$ and $c_{n_{2}}$ are the lists of child nodes of $n_{1}$ and $n_{2}$ respectively, and $l_{m}=\textit{min}(\textit{length}(c_{n_{1}}),\textit{length}(c_{n_{2}}))$. $\upsilon$ and $\lambda$ are two decay factors: $\upsilon$ penalizes the height of the tree and $\lambda$ penalizes the length of the child sequences. Given the two child sequences $s_{1}a=c_{n_{1}}$ and $s_{2}b=c_{n_{2}}$, where $a$ and $b$ are the last children, $\Delta_{p}(s_{1}a,s_{2}b)$ is calculated as Eq. (5):

$\displaystyle\Delta_{p}(s_{1}a,s_{2}b)=\Delta(a,b)\times\sum_{i=1}^{|s_{1}|}\sum_{r=1}^{|s_{2}|}\lambda^{|s_{1}|-i+|s_{2}|-r}\times\Delta_{p-1}(s_{1}[1:i],s_{2}[1:r])$ (5)

where $s_{1}[1:i]$ and $s_{2}[1:r]$ are the child subsequences of $s_{1}$ and $s_{2}$ , which represent the set of elements from 1 to $i$ and from 1 to $r$ respectively.

4.4 Modified partial tree kernel

SEDT/E-SEDT are constructed with the word embeddings, dependency relations and attention weight of the words. The following describes how the modified partial tree kernel handles sentence pairs of new tree structure.

When the input is a semantic-embedded dependency tree, the function $\Delta$ of MPTK is defined recursively:

(1)
If the nodes $n_{1}$ and $n_{2}$ are both leaf nodes, the kernel score is calculated as Eq. (6):

$\displaystyle\Delta_{\textit{MPTK}}(n_{1},n_{2})=\textit{Weight}_{n_{1}}\times\textit{Weight}_{n_{2}}\times\textit{cos}(\textit{Vec}_{n_{1}},\textit{Vec}_{n_{2}})$ (6)

where $\textit{Weight}_{n_{1}}$ and $\textit{Weight}_{n_{2}}$ are the weights of $n_{1}$ and $n_{2}$ respectively, defined by the TF-IDF scheme with Eq. (2) (for SEDT, which carries no weights, the score reduces to the cosine term, as in Algorithm 1); $\textit{Vec}_{n_{1}}$ and $\textit{Vec}_{n_{2}}$ are the word vectors of $n_{1}$ and $n_{2}$ respectively; and $\textit{cos}(\textit{Vec}_{n_{1}},\textit{Vec}_{n_{2}})$ is the cosine similarity between the two vectors.
(2)
If the nodes $n_{1}$ and $n_{2}$ are both syntactic (non-leaf) nodes and their labels are the same, the kernel score is calculated as Eq. (7):

$\displaystyle\Delta_{\textit{MPTK}}(n_{1},n_{2})=\upsilon\left(\lambda^{2}+\sum_{p=1}^{l_{m}}\Delta_{p}(c_{n_{1}},c_{n_{2}})\right)$ (7)

The parameters appearing in Eq. (7) have the same meaning as the parameters in Eq. (4).
(3)
If the nodes $n_{1}$ and $n_{2}$ are both syntactic (non-leaf) nodes and their labels are different, the kernel score is calculated as Eq. (8):

$\displaystyle\Delta_{\textit{MPTK}}(n_{1},n_{2})=0$ (8)
(4)
If exactly one of the two nodes $n_{1}$ and $n_{2}$ is a leaf and the other is a syntactic node, the kernel score is calculated as Eq. (9):

$\displaystyle\Delta_{\textit{MPTK}}(n_{1},n_{2})=0$ (9)

Algorithm 1
MPTK for STS
Input: The node lists of $\textit{Tree}_{1}$ and $\textit{Tree}_{2}$: $T_{1}$, $T_{2}$; the numbers of nodes of $\textit{Tree}_{1}$ and $\textit{Tree}_{2}$: $N_{1}$, $N_{2}$; the word vector tag of node $n_{i}$: $\textit{Vec}_{n_{i}}$.
Output: The similarity score Sum.
initialize an intermediate matrix $M$: $M_{i,j}\leftarrow 0,\ 0\leqslant i<N_{1},\ 0\leqslant j<N_{2}$; $\textit{Sum}\leftarrow 0$
for each node $n_{i}$ in $T_{1}$ do
    for each node $n_{j}$ in $T_{2}$ do
        // $n_{i}$ and $n_{j}$ are both lemmas
        if $\textit{isleaf}(n_{i},n_{j})=1$ then
            // get the (weighted) cosine similarity
            $\textit{Sim}\leftarrow\textit{cos}(\textit{Vec}_{n_{i}},\textit{Vec}_{n_{j}})$
            if $\textit{isESEDT()}=1$ then
                $\textit{Sim}\leftarrow\textit{Sim}\times\textit{weight}_{n_{i}}\times\textit{weight}_{n_{j}}$
            end if
            $M[n_{i}][n_{j}]\leftarrow\textit{Sim}$
        else
            if $\textit{isleaf}(n_{i},n_{j})=0$ then
                // $n_{i}$ and $n_{j}$ are both syntactic nodes: get the similarity by hard matching
                $M[n_{i}][n_{j}]\leftarrow\textit{issame}(n_{i},n_{j})$
            else
                // one of $n_{i}$ and $n_{j}$ is a leaf and the other is a syntactic node
                $M[n_{i}][n_{j}]\leftarrow 0$
            end if
        end if
    end for
end for
for each element $M[n_{i}][n_{j}]\in M$ do
    if neither $n_{i}$ nor $n_{j}$ is a leaf and $n_{i}=n_{j}$ then
        $M[n_{i}][n_{j}]\leftarrow\upsilon(\lambda^{2}+\sum_{p=1}^{l_{m}}\Delta_{p}(c_{n_{i}},c_{n_{j}}))$
    end if
end for
for each element $M[n_{i}][n_{j}]\in M$ do
    $\textit{Sum}\leftarrow\textit{Sum}+M[n_{i}][n_{j}]$
end for
return Sum

The pseudo-code of the above calculation process is given in Algorithm 1. The function $\textit{isleaf}(n_{i},n_{j})$ returns the type of the node pair $n_{i}$ and $n_{j}$. If $n_{i}$ and $n_{j}$ are both leaf nodes, the function returns 1; in this case, the similarity score of the nodes is determined by their word vectors and weight tags. If both nodes are syntactic nodes, the function returns 0; any other return value indicates that the node types differ. The function $\textit{issame}(n_{i},n_{j})$ determines whether the node labels of $n_{i}$ and $n_{j}$ are the same. The parameter Sum denotes the similarity between the pair of structured trees. Finally, we use the normalized formula to get the final result score.
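The node-pair scoring in the first double loop of Algorithm 1 can be sketched as follows. This is a partial illustration under stated assumptions: it covers only the leaf/leaf, syntactic/syntactic and mixed cases, and it deliberately omits the recursive child-sequence term of Eq. (7) and the final normalization. The `TreeNode` class and all function names are hypothetical and are not taken from the authors' repository.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class TreeNode:
    label: str                             # lemma for leaves, tag otherwise
    vec: Optional[Sequence[float]] = None  # word vector, set only on leaves
    weight: float = 1.0                    # TF-IDF weight, used for E-SEDT

    @property
    def is_leaf(self) -> bool:
        return self.vec is not None


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def node_score(n1: TreeNode, n2: TreeNode, use_weights: bool) -> float:
    """Score one node pair as in the first loop of Algorithm 1."""
    if n1.is_leaf and n2.is_leaf:
        sim = cosine(n1.vec, n2.vec)       # soft matching of lemmas, Eq. (6)
        if use_weights:                    # E-SEDT only
            sim *= n1.weight * n2.weight
        return sim
    if not n1.is_leaf and not n2.is_leaf:
        return 1.0 if n1.label == n2.label else 0.0  # hard matching of tags
    return 0.0                             # one leaf, one syntactic node


def pairwise_sum(nodes1: List[TreeNode], nodes2: List[TreeNode],
                 use_weights: bool = False) -> float:
    """Fill the matrix M and return the sum of its entries."""
    matrix = [[node_score(a, b, use_weights) for b in nodes2] for a in nodes1]
    return sum(sum(row) for row in matrix)


# Toy node lists (hypothetical vectors and weights).
nodes1 = [TreeNode("NN"), TreeNode("guitar", vec=[1.0, 0.0], weight=2.0)]
nodes2 = [TreeNode("NN"), TreeNode("piano", vec=[3.0, 4.0], weight=1.5),
          TreeNode("VB")]
sedt_score = pairwise_sum(nodes1, nodes2, use_weights=False)
esedt_score = pairwise_sum(nodes1, nodes2, use_weights=True)
```

Setting `use_weights=False` corresponds to SEDT and `use_weights=True` to E-SEDT, which mirrors how a single kernel serves both tree types without modification.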
5. Experiments and results

This section presents a group of experiments to verify the effect of SEDT/E-SEDT. All source code of our experiments is available in our open-source repository on GitHub.³

³
https://github.com/yuquanle/MPTK.git.

5.1 Tasks and datasets

5.1.1 Word analogies datasets

We conduct these experiments on the word analogy task.⁴

⁴
https://code.google.com/archive/p/word2vec/source/source/browse/trunk/questions-words.txt.

This dataset consists of 19544 questions, all of the form “a is to b as m is to ?”. In this group of experiments, the questions are divided into a semantic subset and a syntactic subset. A question is answered by finding the word $n$ whose word vector $\textit{vector}(n)$ is closest to $\textit{vector}(b)-\textit{vector}(a)+\textit{vector}(m)$ according to the cosine similarity.
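The analogy procedure described above can be sketched in a few lines. The toy three-dimensional vectors and the function names are hypothetical; excluding the three question words from the candidates is an assumption of this sketch (it is common practice in analogy evaluation but is not stated in the text).

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def answer_analogy(a, b, m, vectors):
    """Return the word n closest to vector(b) - vector(a) + vector(m)."""
    target = [vb - va + vm
              for va, vb, vm in zip(vectors[a], vectors[b], vectors[m])]
    # Assumption: the question words themselves are not valid answers.
    candidates = [w for w in vectors if w not in (a, b, m)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy embeddings (hypothetical values chosen so the analogy holds exactly).
vectors = {
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "apple": [0.5, 0.5, 0.0],
}
answer = answer_analogy("man", "king", "woman", vectors)
```

Accuracy on the analogy task is then the fraction of questions for which the returned word equals the expected answer.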

5.1.2 Semantic textual similarity datasets

We test our methods on 14 textual similarity datasets, comprising all the datasets from the SemEval semantic textual similarity (STS) tasks (2012–2014) [3, 2, 1] except the STS 2013 SMT dataset, which is available only with permission and which we could not obtain. The goal of these tasks is to predict the similarity between two given sentences. The similarity score ranges from 0 to 5, where 0 means unrelated and 5 means complete semantic equivalence. We present the statistics of the STS datasets by year. Each year there are 4 to 6 STS tasks, as shown in Table 1. Note that tasks with the same name in different years are actually different tasks.

SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational semantic analysis systems, with the aim of comparing systems that can analyse diverse semantic phenomena in text. Each dataset contains many pairs of sentences (e.g. MSRvid dataset in 2012 contains 750 pairs of sentences). These datasets cover a wide range of domains such as news, web forum, images, glosses, twitter, and so on.

Table 1
Statistics of the provided datasets for the SemEval Semantic Textual Similarity Tasks (2012–2014)

Year	Task	Test pairs
2012	MSRpar	750
2012	MSRvid	750
2012	SMTeuroparl	459
2012	OnWN	750
2012	SMTnews	399
2012	STS’12 (total)	3108
2013	Headline	750
2013	OnWN	561
2013	FNWN	189
2013	SMT	–
2013	STS’13 (total)	1500
2014	Deft forum	450
2014	Deft news	300
2014	Headline	750
2014	Images	750
2014	OnWN	750
2014	Tweet news	750
2014	STS’14 (total)	3750

5.2 Experiment settings

5.2.1 Word embedding experiment

To generate high quality word vectors, we use the Wikipedia Extractor tool⁵ to process the large-scale 2017 Wikipedia (Wiki) dump⁶ for learning word embeddings. By removing the stop words and symbols and converting the characters to lowercase, we obtain a corpus containing about 2 billion tokens for training. The word2vec tool and the glove tool are applied to train the word embeddings in our experiment.

⁵
http://medialab.di.unipi.it/wiki/Wikipedia_Extractor.

⁶
https://dumps.wikimedia.org/enwiki/.
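The cleaning step applied to the extracted Wikipedia text (lowercasing, removing symbols and stop words) might look like the sketch below. The regular expression, the function name and the short stop-word list are assumptions made for illustration; the actual stop-word list and tokenization used in the experiments are not specified in the text.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "is", "and"}  # illustrative subset only


def preprocess(line, stop_words=STOP_WORDS):
    """Lowercase a line, keep alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", line.lower())
    return [t for t in tokens if t not in stop_words]


cleaned = preprocess("The Capital of France is Paris!")
```

Each cleaned line can then be written out as space-separated tokens, which is the plain-text input format expected by both training tools.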

In the word embedding experiments, we vary two key parameters, the number of iterations and the vector dimension, and evaluate the trained word embeddings on the word analogy datasets. For the iteration experiment, the number of iterations is set to 5, 10, 15, 30, 50 and 100, with the other parameters left at the default settings of the word2vec software or the glove tool. As shown in Fig. 9, the gain in total accuracy, which includes semantic accuracy and syntactic accuracy, is not obvious when the number of iterations exceeds 30. For the dimension experiment, the vector dimension is set to 50, 100, 200 and 300, with the other parameters left at the default settings of the word2vec software or the glove tool. As shown in Fig. 10, the accuracy gain is not obvious when the dimension exceeds 300. Based on this analysis, we set the dimension of both the word2vec and glove word embeddings to 300 ($\textit{Dim}=300$) to get the best result.

Figure 9. The number of iterations.

Figure 10. The dimension of the word vector.

5.2.2 STS experiment settings

To evaluate the effect of different word embedding techniques on performance, two techniques are employed in the experiment: word2vec and GloVe. The Stanford Parser⁷ is used to obtain the dependency parse of each sentence, and the dependency relations are then transformed into a dependency syntax tree. The final similarity scores between sentences are computed with the PTK kernel function provided in the uSVM-TK tool.⁸

⁷ https://nlp.stanford.edu/software/.
⁸ http://ikernels-portal.disi.unitn.it/repository/usvmtk/.

To measure accuracy, this experiment follows the STS tasks and evaluates the sentence similarity measure by its Pearson correlation with the gold standard scores. In addition, we report the Spearman rank correlation on the STS-12 tasks.

As mentioned in Section 2, similarity measurement has evolved from corpus-based, string-based, and other methods to the current neural network approaches, so the experiments are designed in two parts:

First, SEDT/E-SEDT is compared on STS-12 against five lexical resource-based measures: WUP [46], JCN [21], LIN [27], LCH [25], and RES [35]. We also choose Explicit Semantic Analysis (ESA) [16], one of the best similarity measures on STS-12, as a comparative method. ESA represents a term as a high-dimensional vector in which each dimension corresponds to a Wikipedia article and the weight is the TF-IDF value of the term in that article. To compute the similarity of sentence pairs with the five concept-level measures, this experiment implements the aggregation method proposed by Corley and Mihalcea [10], following the solution of Pilehvar and Navigli [33].

The aggregation method computes the similarity between sentences $T_{1}$ and $T_{2}$ by pairing each word $w$ in $T_{1}$ with its most similar word in $T_{2}$ and taking the idf-weighted average of these word-level similarities, as in Eq. (10):

$\displaystyle\textit{sim}(T_{1},T_{2})=\frac{\sum_{w\in T_{1}}\textit{maxSim}(w,T_{2})\,idf(w)}{\sum_{w\in T_{1}}idf(w)}$ (10)

where $\textit{maxSim}(w,T_{2})$ returns the similarity value between $w$ and its most similar word in $T_{2}$, and $idf(w)$ is the inverse document frequency of word $w$. The final similarity score is calculated as the average similarity in the two directions, i.e., the average of $\textit{sim}(T_{1},T_{2})$ and $\textit{sim}(T_{2},T_{1})$.
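Eq. (10) and the two-direction average can be sketched as below. The word-level similarity function is passed in as a parameter, so any of the concept-level measures could be plugged in. The toy similarity table, the toy idf values, and the default idf of 1.0 for unseen words are assumptions made for illustration.

```python
def directional_sim(t1, t2, word_sim, idf):
    """Eq. (10): idf-weighted average of each word's best match in t2."""
    num = 0.0
    den = 0.0
    for w in t1:
        # maxSim(w, T2): similarity between w and its most similar word in t2.
        max_sim = max((word_sim(w, v) for v in t2), default=0.0)
        weight = idf.get(w, 1.0)  # assumed default weight for unseen words
        num += max_sim * weight
        den += weight
    return num / den if den else 0.0


def sentence_sim(t1, t2, word_sim, idf):
    """Average of the two directional similarities."""
    return 0.5 * (directional_sim(t1, t2, word_sim, idf)
                  + directional_sim(t2, t1, word_sim, idf))


# Hypothetical word-level similarity table standing in for WUP, LIN, etc.
TOY_SIM = {frozenset(("dog", "puppy")): 0.8}


def toy_word_sim(a, b):
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


idf = {"dog": 2.0, "puppy": 2.0, "runs": 1.0}
score = sentence_sim(["dog", "runs"], ["puppy", "runs"], toy_word_sim, idf)
print(round(score, 4))
```

In this toy case each direction gives (0.8·2 + 1.0·1) / (2 + 1), so the rarer word "dog" dominates the score, which is the intended effect of the idf weighting.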

Secondly, SEDT/E-SEDT is compared against six neural network measures: RNN, iRNN, LSTM (in two variants), ST, and GloVe. ST and GloVe are unsupervised methods, where ST denotes the skip-thought vectors [23] and GloVe denotes the unweighted average of the GloVe vectors [32]. RNN, iRNN, and LSTM are supervised methods, where RNN denotes the classical recurrent neural network and iRNN denotes a variant whose activation is the identity and whose weight matrices are initialized to the identity. The LSTM is the version from [17], either with output gates (denoted LSTM(o.g.)) or without (denoted LSTM(no)).
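For orientation, the unweighted-average baseline can be sketched as follows: each sentence is represented by the mean of its word vectors, and two sentences are compared by the cosine of these means. The use of cosine similarity, the handling of out-of-vocabulary words, and the toy vectors are assumptions of this sketch, not details taken from the compared systems.

```python
import math


def average_vector(tokens, emb):
    """Unweighted mean of the vectors of the in-vocabulary tokens."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def avg_embedding_similarity(s1, s2, emb):
    """Cosine of the two averaged sentence vectors (0.0 if either is empty)."""
    u = average_vector(s1, emb)
    v = average_vector(s2, emb)
    if u is None or v is None:
        return 0.0
    return cosine(u, v)


# Toy 2-d vectors, made up for the example.
emb = {"cat": [1.0, 0.0], "dog": [0.8, 0.2], "sits": [0.0, 1.0]}
sim = avg_embedding_similarity(["cat", "sits"], ["dog", "sits"], emb)
print(round(sim, 3))
```

Because averaging discards word order and structure, this baseline uses semantic information only, which is the contrast with the tree-based representation evaluated here.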

Finally, we use the Pearson ($r$) and Spearman ($\rho$) correlations as the evaluation criteria to measure the agreement between the predicted scores and the ground-truth scores. The Pearson coefficient measures the linear correlation between two variables, while the Spearman coefficient is the Pearson coefficient computed on the ranks of the values and therefore measures how well their relationship is described by a monotonic function. Both take values between $-1$ and $+1$, where $+1$ indicates a total positive correlation, 0 no correlation, and $-1$ a total negative correlation. Both are widely used in STS research.
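The two coefficients can be written out in a few lines. The sketch below is a plain implementation for illustration (ties receive their average rank); the score lists are made-up values, not results from the experiments.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


gold = [0.0, 1.0, 2.0, 3.0, 5.0]   # made-up gold scores
pred = [0.1, 0.2, 0.4, 0.8, 1.6]   # made-up predictions, monotone in gold
print(round(pearson(gold, pred), 3), round(spearman(gold, pred), 3))
```

Here the predictions preserve the gold ordering but are not a linear function of it, so the Spearman value is 1 (up to rounding) while the Pearson value is lower.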

5.3 STS-12 results

Table 2 shows the performance of the proposed semantic-embedded dependency tree, the enhanced semantic-embedded dependency tree, the semantic-embedded dependency tree with POS tags, and the enhanced semantic-embedded dependency tree with POS tags, together with the six other similarity measures, on the five datasets of STS-12. The full names of the abbreviated items in Table 2 are given in Table 3. DT denotes the result of measuring similarity with the original dependency tree and PTK.

Table 2
Performance of our methods together with other similarity measures on the five datasets of the SemEval-2012 Semantic Similarity task in terms of the Pearson (r) and Spearman ( $\rho$ ) correlations. The right-most columns show the average (Avg) performance across the five datasets

System	MSRvid		MSRpar		SMTeuroparl		OnWN		SMTnews		Avg
	$\rho$	$r$	$\rho$	$r$	$\rho$	$r$	$\rho$	$r$	$\rho$	$r$	$\rho$	$r$
LCH	0.68	0.67	0.39	0.45	0.19	0.18	0.50	0.53	0.24	0.27	0.42	0.45
JCN	0.65	0.65	0.39	0.44	0.20	0.20	0.50	0.53	0.24	0.26	0.43	0.45
WUP	0.65	0.64	0.39	0.44	0.22	0.21	0.52	0.55	0.25	0.28	0.44	0.45
LIN	0.70	0.70	0.41	0.45	0.23	0.23	0.54	0.57	0.25	0.28	0.46	0.48
RES	0.74	0.73	0.41	0.46	0.25	0.25	0.55	0.59	0.27	0.30	0.48	0.50
ESA	0.75	0.74	0.43	0.44	0.38	0.48	0.62	0.62	0.33	0.40	0.53	0.56
DT	0.03	0.02	0.17	0.16	0.49	0.37	0.45	0.40	0.29	0.30	0.27	0.23
glove-EDT	0.69	0.69	0.48	0.55	0.61	0.56	0.59	0.57	0.40	0.50	0.57	0.58
w2v-EDT	0.72	0.73	0.51	0.54	0.56	0.52	0.61	0.64	0.47	0.55	0.59	0.61
tfidf-glove-EDT	0.72	0.73	0.48	0.54	0.60	0.51	0.59	0.56	0.40	0.48	0.57	0.58
tfidf-w2v-EDT	0.75	0.77	0.54	0.57	0.59	0.51	0.65	0.66	0.48	0.52	0.62	0.62
w2v-pos-EDT	0.69	0.69	0.48	0.49	0.57	0.52	0.61	0.64	0.47	0.57	0.56	0.58
tfidf-w2v-pos-EDT	0.70	0.70	0.49	0.52	0.58	0.52	0.62	0.65	0.45	0.51	0.57	0.58

The rightmost columns of Table 2 show the average performance over the five datasets (Avg). As can be seen from the table, SEDT/E-SEDT achieves satisfactory results on all the datasets. In particular, the best overall performance is obtained when the enhanced semantic-embedded dependency tree is constructed with word2vec knowledge (i.e., tfidf-w2v-EDT), according to both the Spearman (Avg 0.62) and Pearson (Avg 0.62) correlations. The overall performance of glove-EDT is slightly inferior to that of w2v-EDT, and the same holds for their weighted versions.

Traditional tree kernels often compare tree nodes by hard string matching. At the sentence level, however, applying the tree kernel function directly can only capture the similarity of the syntactic structure, because word pairs are matched only when their strings are identical. A lot of semantic knowledge is lost in this way: for sentences that use different words, even when they have the same meaning, the similarity score is likely to be zero. This is why the DT method, which uses the original dependency tree and PTK for the similarity measure, has the lowest score in the experiment.
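The difference between hard and embedding-based node matching can be illustrated at the level of a single pair of leaf words. The sketch below is not the MPTK itself; it only contrasts the two leaf comparisons, with made-up two-dimensional vectors and hypothetical function names.

```python
import math


def hard_match(w1, w2):
    """String matching: 1 if the two words are identical, else 0."""
    return 1.0 if w1 == w2 else 0.0


def soft_match(w1, w2, emb):
    """Cosine similarity of the word vectors; falls back to hard matching
    when a word has no vector (an assumption of this sketch)."""
    if w1 not in emb or w2 not in emb:
        return hard_match(w1, w2)
    u, v = emb[w1], emb[w2]
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


# Toy vectors, made up for the example.
emb = {"buy": [0.9, 0.1], "purchase": [0.8, 0.2]}
print(hard_match("buy", "purchase"))
print(round(soft_match("buy", "purchase", emb), 3))
```

With hard matching the synonym pair contributes nothing, whereas the embedding-based comparison gives it a score close to 1, which is the kind of semantic knowledge the DT baseline cannot use.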

As Table 2 shows, our approach does not perform well on some tasks, and we try to identify the reasons. For the STS'12 SMTeuroparl and STS'12 SMTnews tasks, particular properties of the data, such as a large number of numerical items or special characters, may weaken the performance of our approach. For example, in the STS'12 SMTeuroparl task, items like “5.30 pm” and “(A5-0323/2000)” account for around 10% of the total test sample; in the STS'12 SMTnews task, items like “5.2%” and “24 May” account for around 8% of the total test sample. Our approach currently lacks a strong ability to handle numerical items and special characters.

When we add POS tags to the original dependency tree, the experimental results are not better than those of E-SEDT. We speculate that the reason is as follows: we use the dependency relations between words to form the dependency tree, so the preterminal nodes in the tree already represent the grammatical relation between two words, and adding part-of-speech tags to the leaf nodes may interfere with the computation of the tree kernel. In addition, the dependency tree is built from the sentence after the stop words have been removed, which makes the part-of-speech tags less important.

Table 3
Description of the methods

Symbol	Meaning
glove-EDT	Use glove-embedded dependency tree and MPTK for STS.
w2v-EDT	Use word2vec-embedded dependency tree and MPTK for STS.
tfidf-glove-EDT	Use tfidf glove-embedded dependency tree and MPTK for STS.
tfidf-w2v-EDT	Use tfidf word2vec-embedded dependency tree and MPTK for STS.
w2v-pos-EDT	Use word2vec-embedded dependency tree with POS tags and MPTK for STS.
tfidf-w2v-pos-EDT	Use tfidf word2vec-embedded dependency tree with POS tags and MPTK for STS.

5.4 Results on three STS tasks

Table 2 shows that w2v-EDT and tfidf-w2v-EDT, which are based on word2vec, are superior to glove-EDT and tfidf-glove-EDT, which are based on GloVe. Hence, we select the two best methods (i.e., w2v-EDT and tfidf-w2v-EDT) to perform experiments on multiple STS tasks and compare them with the neural network approaches.

Table 4 provides the results for each task. Our methods achieve better or comparable performance compared with the others. tfidf-w2v-EDT obtains the best overall result in two of the three STS tasks and, together with w2v-EDT, our approach achieves the best result (ties included) on 8 of the 14 datasets. In general, tfidf-w2v-EDT performs better than w2v-EDT, although it decreases the performance in rare cases such as SMTeuroparl. Even where tfidf-w2v-EDT does not achieve the best performance, it usually does not differ much from the best method, for example on the OnWN and SMTnews datasets of 2012 and on the Images and Tweet news datasets of 2014. On the Tweet news dataset, for instance, tfidf-w2v-EDT scores 0.75, while the best method scores 0.77.

As can be seen from Table 4, our approach falls well short of the best method (by more than 0.05) on the FNWN and Deft forum datasets. The FNWN dataset contains FrameNet-WordNet gloss pairs, and the two sentences of a pair often differ greatly in length. This suggests that tfidf-w2v-EDT is best suited to comparing two texts of comparable length. Deft forum contains many sentences from forum posts, whose authors generally do not pay attention to syntactic rigor. Hence, the grammatical correctness of these sentences cannot be guaranteed, and they also contain a large number of colloquial terms and Internet abbreviations.

Table 4
Experimental results (Pearson’s r) on semantic textual similarity tasks (2012–2014). The highest score in each row is in boldface. See the main text for the description of the methods

	Result collected from [45]						Our approach
Tasks	RNN	iRNN	LSTM(no)	LSTM(o.g.)	ST	GloVe	w2v-EDT	tfidf-w2v-EDT
MSRpar	0.19	0.43	0.16	0.09	0.17	0.48	0.54	0.57
MSRvid	0.67	0.73	0.71	0.71	0.42	0.64	0.73	0.77
SMTeuroparl	0.41	0.47	0.42	0.44	0.35	0.46	0.52	0.51
OnWN	0.63	0.70	0.65	0.56	0.30	0.55	0.64	0.66
SMTnews	0.51	0.58	0.61	0.51	0.31	0.50	0.55	0.52
STS’12	0.48	0.58	0.51	0.46	0.31	0.53	0.60	0.61
headline	0.60	0.73	0.57	0.49	0.35	0.64	0.71	0.73
OnWN	0.55	0.69	0.69	0.50	0.10	0.49	0.67	0.82
FNWN	0.31	0.45	0.25	0.38	0.30	0.34	0.31	0.36
STS’13	0.48	0.63	0.50	0.46	0.25	0.49	0.56	0.64
Deft forum	0.42	0.49	0.44	0.46	0.13	0.27	0.39	0.41
Deft news	0.54	0.72	0.53	0.39	0.24	0.68	0.73	0.72
Headline	0.58	0.70	0.58	0.51	0.38	0.60	0.68	0.70
Images	0.68	0.78	0.69	0.63	0.51	0.61	0.72	0.74
OnWN	0.68	0.79	0.77	0.62	0.23	0.58	0.76	0.85
Tweet news	0.58	0.77	0.59	0.48	0.40	0.51	0.75	0.75
STS’14	0.58	0.71	0.60	0.52	0.31	0.54	0.67	0.70

6. Conclusion

As semantic-embedded dependency tree structures, SEDT and E-SEDT provide a novel combined model for measuring the semantic similarity of sentences based on a semantically enhanced tree kernel. The approach encodes word embeddings into the dependency tree structure to obtain a structured representation that contains semantic knowledge. This gives the learning algorithm a much richer representation from which to extract useful syntactic and semantic patterns. Experiments on multiple STS tasks show that this structure can effectively reflect the semantics of sentences. Moreover, we extend the semantic-embedded dependency tree with leaf weights to obtain the enhanced semantic-embedded dependency tree, which can distinguish the contributions of different words in a sentence. Experiments show that SEDT/E-SEDT outperforms the compared methods on most datasets. This also indicates that a method that considers both syntactic information and semantic knowledge can capture the practical semantics of sentences more fully.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant Nos. 61572176, L1624040) and the National Key Research and Development Program of China (2017YFB0202201).

References

1. Agirre et al., Semeval-2014 task 10: Multilingual semantic textual similarity, In International Workshop on Semantic Evaluation, 2014, pp. 81–91.

2. Agirre et al., *SEM 2013 shared task: Semantic textual similarity, In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, volume 1, 2013, pp. 32–43.

3. Agirre et al., Semeval-2012 task 6: a pilot on semantic textual similarity, In Joint Conference on Lexical and Computational Semantics, 2012, pp. 385–393.

4. Androutsopoulos and Malakasiotis, A survey of paraphrasing and textual entailment methods, Journal of Artificial Intelligence Research 38 (2010), 135–187.

5. Arora et al., A simple but tough-to-beat baseline for sentence embeddings, In ICLR, 2017.

6. Bilotti M.W. et al., Rank learning for factoid question answering with linguistic and semantic constraints, In ACM International Conference on Information and Knowledge Management, 2010, pp. 459–468.

7. Chien J.T. and Wu M.S., Adaptive Bayesian Latent Semantic Analysis, IEEE Press, 2008.

8. Coelho T.A.S. et al., Image retrieval using multiple evidence ranking, Knowledge and Data Engineering, IEEE Transactions on 16(4) (2004), 408–417.

9. Collins et al., Convolution kernels for natural language, In NIPS, volume 14, 2001, pp. 625–632.

10. Corley and Mihalcea, Measuring the semantic similarity of texts, In Proceedings of the ACL workshop on empirical modeling of semantic equivalence and entailment, Association for Computational Linguistics, 2005, pp. 13–18.

11. Croce et al., Semantic convolution kernels over dependency trees, In CIKM, 2011, pp. 2013–2016.

12. Croce et al., Structured lexical similarity via convolution kernels on dependency trees, In EMNLP, 2011, pp. 1034–1046.

13. Deerwester et al., Indexing by latent semantic analysis, Journal of the American Society for Information Science 41(6) (1990), 391.

14. Duffy and Duffy, New ranking algorithms for parsing and tagging: kernels over discrete structures, and the voted perceptron, In ACL, 2002, pp. 263–270.

15. Fader et al., Open question answering over curated and extracted knowledge bases, In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2014, pp. 1156–1165.

16. Gabrilovich and Markovitch, Computing semantic relatedness using wikipedia-based explicit semantic analysis, In IJCAI, volume 7, 2007, pp. 1606–1611.

17. Gers F.A. and Schraudolph N.N., Learning precise timing with lstm recurrent networks, JMLR.org (2003).

18. Glickman and Dagan, Acquiring lexical paraphrases from a single corpus, Recent Advances in Natural Language Processing III. John Benjamins Publishing, Amsterdam, Netherlands, 2004, pp. 81–90.

19. Golub and He, Character-level question answering with attention, arXiv preprint arXiv:1604.00727, 2016.

20. Haussler, Convolution kernels on discrete structures, Tech Rep 7 (1999), 95–114.

21. Jiang J.J. and Conrath D.W., Semantic similarity based on corpus statistics and lexical taxonomy, arXiv preprint cmp-lg/9709008, 1997.

22. Kalchbrenner et al., A convolutional neural network for modelling sentences, Eprint Arxiv 1 (2014).

23. Kiros et al., Skip-thought vectors, Computer Science (2015).

24. Lavie and Denkowski M.J., The meteor metric for automatic evaluation of machine translation, Machine Translation 23(2) (2009), 105–115.

25. Leacock and Chodorow, Combining local context and wordnet similarity for word sense identification, WordNet: An Electronic Lexical Database 49(2) (1998), 265–283.

26. et al., Sentence similarity based on semantic nets and corpus statistics, IEEE Transactions on Knowledge and Data Engineering 18(8) (2006), 1138–1150.

27. Lin et al., An information-theoretic definition of similarity, In ICML, volume 98, Citeseer, 1998, pp. 296–304.

28. Majumder et al., Semantic textual similarity methods, tools, and applications: A survey, Computacion y Sistemas 20(4) (2016).

29. McInnes B.T., Pedersen and Pakhomov S.V., Umls-interface and umls-similarity: Open source software for measuring paths and semantic similarity, Amia Annu Symp Proc 2009 (2009), 431–435.

30. Mikolov et al., Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781, 2013.

31. Moschitti, Efficient convolution kernels for dependency and constituent syntactic trees, In European Conference on Machine Learning, Springer, 2006, pp. 318–329.

32. Pennington et al., Glove: Global vectors for word representation, In EMNLP, volume 14, 2014, pp. 1532–1543.

33. Pilehvar M.T. and Navigli, From senses to texts: An all-in-one graph-based approach for measuring semantic similarity, Artificial Intelligence 228 (2015), 95–128.

34. Resnik, Wordnet and distributional analysis: A class-based approach to lexical discovery, In AAAI workshop on statistically-based natural language processing techniques, 1992, pp. 56–64.

35. Resnik, Using information content to evaluate semantic similarity in a taxonomy, arXiv preprint cmp-lg/9511007, 1995.

36. Rocchio J.J., Relevance feedback in information retrieval, 1971.

37. Rus et al., Semilar: The semantic similarity toolkit, In ACL 2013 Demo Track, 2013.

38. Salton et al., Automatic text structuring and summarization, Information Processing and Management 33(2) (1997), 193–207.

39. Severyn and Moschitti, Structural relationships for large-scale learning of answer re-ranking, In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, ACM, 2012, pp. 741–750.

40. Severyn et al., Building structures from classifiers for passage reranking, In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, ACM, 2013, pp. 969–978.

41. Severyn et al., Learning adaptable patterns for passage reranking, In CoNLL, 2013.

42. Socher et al., Dynamic pooling and unfolding recursive autoencoders for paraphrase detection, Advances in Neural Information Processing Systems 24 (2011), 801–809.

43. Wang et al., Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization, In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, ACM, 2008, pp. 307–314.

44. Wang et al., Learning sentence representation with guidance of human attention, In IJCAI, 2017, pp. 4137–4143.

45. Wieting et al., Towards universal paraphrastic sentence embeddings, Computer Science, 2015.

46. Wu and Palmer, Verbs semantics and lexical selection, In Proceedings of the 32nd annual meeting on Association for Computational Linguistics, Association for Computational Linguistics, 1994, pp. 133–138.

47. Zhang et al., Variational neural machine translation, arXiv preprint arXiv:1605.07869, 2016.

48. Zhang et al., A new feature selection approach to naive bayes text classifiers, International Journal of Pattern Recognition and Artificial Intelligence 30(2) (2016), 1650003.
