Measuring interpretable semantic similarity of sentences using a multi chunk aligner

Abstract

This work strengthens the existing Interpretable Semantic Textual Similarity (iSTS) method, which enables a user to understand the behaviour of an artificially intelligent system. The proposed iSTS method explains the similarities and differences between a pair of sentences. The objective of the iSTS problem is to formalize the alignment between a pair of text segments and to label the relationship between the text fragments with a relation type and a relatedness score. The overall objective of this work is to develop a 1:M multi chunk aligner for an iSTS method, trained on the SemEval 2016 Task 2 dataset. The obtained results outperform many state-of-the-art aligners that took part in the SemEval 2016 iSTS task.

Keywords

WordNet, interpretability, semantic similarity, Natural Language Processing, cosine similarity

1 Introduction

Assessing Semantic Textual Similarity (STS) between texts is a central problem in Natural Language Processing (NLP) because of its importance to a variety of applications, such as ad-hoc table retrieval, question answering, text classification, Natural Language Understanding (NLU) and conversational agents in Intelligent Tutoring Systems (ITS). Text similarity is measured at various levels: from words to phrases, from sentences to paragraphs, and finally between documents.

A metric over a set of documents defines the semantic similarity between them by measuring their direct and indirect relationships [11, 33]. These relationships can be measured and recognized by the presence of any semantic relationships. The similarity between short texts was reported in [22, 26], and similarity between two parallel sentences was introduced in the Semantic Evaluation (SemEval) workshop 1 . In SemEval, a pair of sentences is given as input, and a score ranging from 0 (different semantic meaning) to 5 (complete semantic equivalence) is assigned as the similarity score 2 between them. Since then, the STS problem has seen a large number of solutions in a relatively short time. The central idea behind most of these solutions is the identification and alignment of semantically similar or related words across the two sentences and the aggregation of these similarities to generate an overall similarity [16, 38].

This paper not only measures the similarity score but also explains why two sentences are related or unrelated. In the literature, this problem is known as Interpretable Semantic Textual Similarity (iSTS), which adds fine-grained information while evaluating the similarity between text snippets [2, 4]. Adding an explanatory layer helps an artificially intelligent agent to demonstrate its ability to a user who needs an explanation to understand its behaviour [14]. This explanatory layer is achieved by aligning the text segments of one sentence with those of the other; each alignment is assigned an alignment type and a relatedness score. Details about the alignment types and the relatedness score are reported in Section 2.

This paper is organized as follows. Section 2 describes the interpretable STS (iSTS) problem and its required components. Section 3 discusses related work, and the individual modules of the proposed method are discussed in Section 4. System performance and result analysis of the various components are discussed in Section 5, and Section 6 concludes the paper.

2 Defining the interpretable STS problem

An interpretable STS (iSTS) method can explain the differences and commonalities between two sentences. To understand this, consider the following two parallel sentences, taken from the headlines dataset of the SemEval 2015 iSTS pilot task 3 :

US drone strike kills 5 militants in Pakistan

Drone strike kills four suspected militants in Pakistan

The output of such a method would be something like the following: the two sentences talk about a drone strike in Pakistan conducted by the US. They differ in a number of ways, such as the number of militants killed (5 vs. 4) and the level of detail: the first sentence specifies a US drone, which is missing from the second, and the first refers to militants, whereas the second refers to suspected militants.

Such explanations come naturally to humans, but for an algorithm or a computational model this is a Natural Language Understanding (NLU) problem. In computation, the problem is conventionally named interpretable STS (iSTS), and an architecture comprising five modules to solve it is shown in Fig. 1.

Fig.1

Modules of an interpretable STS method.

Input handling and chunking: accepts a pair of sentences as input and processes them to find the individual chunks of each sentence.

Alignment: aligns the chunks of the two sentences, considering chunk pairs from the most similar to the least similar.

Classification: assigns a relation type to an alignment.

Scoring: measures the relatedness score between a pair of aligned chunks.

Explanatory layer: explains the similarities and differences by considering the relation type and similarity score of all chunks.

The alignment types can be one of the following:

Equivalent (EQUI): chunks convey an equivalent meaning (e.g., to the police vs. by police)

Opposite (OPPO): chunks are equivalent but convey opposite meaning (e.g., lower vs. higher)

More specific in sentence 1 (SPE1): the chunk in the first sentence conveys a more specific meaning than the chunk in the second (e.g., to American adoption ban vs. to adoption ban)

More specific in sentence 2 (SPE2): the chunk in the second sentence conveys a more specific meaning than the chunk in the first (e.g., Friday vs. midday Friday)

Similar (SIMI): chunks are similar in meaning and share similar attributes but are not in EQUI, OPPO, SPE1, or SPE2 (e.g., increases security vs. bracing)

Related (REL): chunks share minimum information (e.g., about filming vs. to play)

No alignment (NOALI): chunks are completely unrelated.

Alongside these relation types, two further tags can be attached to a label, each on its own or both together:

Factuality (FACT): when the aligned chunks differ in factuality, that is, in whether a statement is presented as fact or as speculation (e.g., 5 militants vs. four suspected militants), assigned as SIMI_FACT.

Polarity (POL): when the aligned chunks differ in the opinion they express, which can be positive, negative or neutral (e.g., Syrian regime vs. Syria), assigned as EQUI_POL.

Before assigning any of these labels, the relatedness score from 0 (no relation) to 5 (maximum similarity) will be measured first.

5, if the meanings of both the chunks are completely similar/equivalent.

4 or 3, if the chunks are very similar or closely related.

2 or 1, if chunks are similar or somehow related.

0 for unaligned chunks

For an aligned pair of chunks, the similarity score is always above 0; a chunk with no counterpart is left unaligned and receives the ‘NOALI’ label. After alignment, the following conditions must be checked:

NOALI should not have any score

EQUI should have a score 5 and if meaning of aligned chunk is completely opposite, EQUI must be replaced by OPPO

The other labels must have score greater than 0 and less than 5
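
The three checks above can be written as a small validation routine. The sketch below is illustrative only: the function name `check_alignment` is hypothetical, and it assumes that a missing score is represented by `None` and that a score of 0 is also acceptable for NOALI, in line with the score scale given earlier.

```python
ALIGNMENT_TYPES = {"EQUI", "OPPO", "SPE1", "SPE2", "SIMI", "REL", "NOALI"}


def check_alignment(label, score):
    """Return the list of rules violated by one (label, score) pair."""
    base = label.split("_")[0]  # drop an optional _FACT / _POL suffix
    if base not in ALIGNMENT_TYPES:
        return [f"unknown relation type: {label}"]
    if base == "NOALI":
        # Assumption: None and 0 both mean "no score".
        return [] if score in (None, 0) else ["NOALI should not have any score"]
    if base == "EQUI":
        return [] if score == 5 else ["EQUI should have a score of 5"]
    # Every other label needs a score strictly between 0 and 5.
    if score is None or not 0 < score < 5:
        return [f"{base} must have a score greater than 0 and less than 5"]
    return []
```

For instance, `check_alignment("SIMI_FACT", 3)` reports no violation, whereas `check_alignment("REL", 5)` does.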

3 Related work

Adding an explanatory layer was first studied as an important task for tutoring systems, where an Intelligent Tutoring System (ITS) interacts with students through natural language. In most cases, applications have focused on problem-dependent and question-dependent knowledge [5, 18]. However, some alternatives based on NLP techniques are also available [28]. The work reported in [28] belongs to the educational domain and closely resembles a textual entailment problem. It defines facets (words under some syntactic/semantic relation) in a student's response (called the hypothesis), which are linked to a reference answer. The link signals whether each facet in the response is entailed by the reference answer or not. The initial motivation for the proposed method is similar to this entailment work. To the best of our knowledge, interpretability is related to the Natural Language Understanding (NLU) problem and is especially useful in the field of ITS.

SemEval 2015 and 2016 took the first initiative towards the development of such a method. In these events, participating systems had to explain why two sentences are similar or unrelated by supplementing the similarity score with an explanatory layer. In 2015, it was introduced as a pilot task that restricted systems to 1:1 chunk alignment. For the unaligned chunks, two other class labels were considered: (i) not aligned and (ii) context alignment [4]. In 2016, this restriction was withdrawn and a 1:M (multi) aligner, in which one chunk can be aligned with multiple chunks, was considered. The proposed work is inspired by SemEval, and its results have been analyzed and compared against the methods reported in SemEval 2015 and 2016 [3].

In SemEval 2015 iSTS pilot task, a modular approach was proposed to identify the chunks [2]. At first, Stanford NLP parser was used to extract the information like part of speech analysis and the dependency structure. The Apache OpenNLP API was used to train the ixa–pipes–chunker to identify the system chunks. Further, to improve the chunking result, four rules were developed to handle the preposition, conjunction, and punctuations. For alignment, a ready–to–use monolingual word aligner was used, which provides all token to token alignment information of two sentences [36]. Finally, the Hungarian–Munkres algorithm was used to decide the maximum link ratio, when a token of one sentence was aligned with multiple tokens of other.

For classification, two approaches, a naive one and a machine learning one, were used to assign an alignment type. The naive approach directly assigns the equivalence (i.e. EQUI) tag to the alignments with the highest weight, and ‘not aligned’ to the unaligned chunks. To improve on the naive approach, a Support Vector Machine (SVM) was used as a classifier with features such as (i) Jaccard overlap; (ii) segment length; (iii) WordNet similarity among the segment heads and (iv) WordNet depth. In the naive approach, the relatedness score was assigned directly for ‘EQUI’ and ‘not aligned’ chunks. For the other tags, whose similarity score ranges from 1 to 4, a regression approach was used with resources such as (i) Euclidean distance between Collobert and Weston word vectors [10]; (ii) Euclidean distance between Mikolov word vectors and (iii) PPDB paraphrase database values.

A system called VRep [15] used the WordNet vector relatedness measure [30] with a threshold value to compute the similarity between two words. The similarity between chunks was then computed as the sum of the maximum word-to-word similarities, normalized by the number of words in the shorter chunk of the pair. Chunk similarity was the key to alignment and was computed between each pair of chunks of the two sentences. The pair with the highest chunk similarity score was selected, which prevents wrong chunk alignments.

To classify an alignment, a set of syntactic and semantic features for each chunk pair was considered. These features were extracted from the chunk pair itself; in the literature, NeRoSim [7] and SVCSTS [19] followed the same procedure for chunk alignment. NeRoSim focused on the semantic relationship, whereas SVCSTS focused more on the syntactic form, such as the number of words or the parts of speech in a chunk pair. VRep combined both feature sets, and the SemEval 2015 Task 2 test data was used to train the classifier. The JRIP algorithm [9], which creates a decision list for classification, was adopted through the WEKA machine learning tool.

In SemEval 2016, iUBC treated iSTS as a problem of classification and regression. To solve this, a set of RNNs and LSTMs was used to assign a relation type and measure the similarity score for an aligned pair of chunks [23]. The system is composed of three components: (i) input handling and chunking; (ii) alignment and (iii) joint classification and scoring.

For alignment, a token-token alignment matrix was initialized, in which each element indicates whether a connection exists. To initialize the token-token matrix, a weighted sum of lowercase token overlap, lemmatized token overlap, cosine similarity between pre-trained word vectors and the alignment prediction of a monolingual word aligner was taken into account. Further, the Hungarian-Munkres algorithm was used to find the strength of each segment connection and to build a chunk-chunk matrix by considering the chunk boundaries.

4 Proposed method

The proposed method is composed of five modules. The first module, preprocessing (Section 4.1), is responsible for reading the inputs and for part-of-speech tagging, named entity identification, tokenization and lower-case conversion. The second module, chunking (Section 4.2), identifies the individual chunks of each sentence. The third module, alignment (Section 4.3), describes the individual components and the features required to align the corresponding chunks of two sentences. The classification and scoring modules are described in Sections 4.5 and 4.4 respectively.

4.1 Preprocessing

The following five preprocessing steps are carried out:

Part-of-Speech (POS) tagging: the Apache OpenNLP 1.8.1 POS tagger 4 , based on a probability model, and the maximum-entropy-based Stanford bi-directional English tagger [37] are used here.

Named Entity (NE) Identification: Stanford Named Entity Recognizer (NER) [12] is used to recognize the names and further, these names are used for abbreviation normalization and case correction. In the given example, NEs are marked as ‘slash tags (/)’ with their respective tokens.

– UK/LOCATION alert on Syrian chemical arms

Tokenization and Case Correction: tokenization is performed using the OpenNLP tokenizer 5 , and the NEs and individual tokens are considered for case correction.

John Demjanjuk, convicted Nazi death camp guard, dies aged 91 (from dataset)

John Demjanjuk, convicted nazi death camp guard, dies aged 91 (after preprocessing)

Here in the given example, first two tokens are part of the NEs and these tokens remain unchanged.

Abbreviation Normalization: to normalize abbreviation, the NEs are considered and replaced with their full forms. This information is extracted from a gloss of individual WordNet synsets and the MIT Java WordNet interface 6 with WordNet 3.0 7 is used here.

UK alert on Syrian chemical arms (from dataset)

United Kingdom alert on Syrian chemical arms (after preprocessing)

Words with opposite meaning: the chunks are comparable, but the aligned chunks carry opposite meanings. Such chunks must be aligned with the ‘OPPO’ relation, and WordNet antonyms are used to find the opposite word pairs. The chunks lower and higher in the given example have opposite meanings.

China stocks close lower on Friday

China stocks close higher on Wednesday
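
The case correction and abbreviation normalization steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the named-entity decisions are passed in as a list of booleans rather than computed by the Stanford NER, and the `ABBREVIATIONS` dictionary is a hypothetical stand-in for the full forms that the paper extracts from WordNet glosses.

```python
# Hypothetical lookup; the paper derives full forms from WordNet synset glosses.
ABBREVIATIONS = {"UK": "United Kingdom"}


def preprocess(tokens, is_named_entity):
    """Lower-case ordinary tokens; keep named entities and expand abbreviations."""
    output = []
    for token, is_ne in zip(tokens, is_named_entity):
        if is_ne:
            # Named entities keep their case; known abbreviations are expanded.
            output.extend(ABBREVIATIONS.get(token, token).split())
        else:
            output.append(token.lower())
    return output


tokens = "John Demjanjuk , convicted Nazi death camp guard".split()
flags = [True, True] + [False] * 6  # only the first two tokens are NEs
print(" ".join(preprocess(tokens, flags)))
```

With these inputs, the two named-entity tokens are left unchanged and `Nazi` is lower-cased, mirroring the example given for case correction.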

4.2 Chunking

According to Abney, a chunk is an intra-clausal constituent including pre-head as well as post-head modifiers, but not PP-attachment or sentential elements [1]. The Stanford Dependency Parser [21] is used to process the source and translation sentences linguistically. The output of the parser, namely lower-cased token information, part-of-speech (POS) analysis and dependency structure, is recorded. Based on the POS tags, an automatic chunking algorithm is developed to identify the chunks (groups of words), following the CoNLL-2000 chunking shared task guidelines [35]. The proposed algorithm is summarized in Algorithm 1, and the chunking rules are listed below:

possessive NP constructions are split in front of the possessive marker

[Shinzo Abe] [is] [Japan] [’s prime minister]

an ADJP constituent inside an NP constituent becomes part of the NP chunk

[Gunman] [among 7 dead] [after Fla. apartment shootout]

two VP constituents do not overlap

[US allies] [get] [help] [to repel] [Iranian computer attacks]

adverbs/adverbial phrases are marked as a separate chunk if they are part of the VP chunk

[G20 Summit] [ends] [divided] [over Syria]

predicate adjectives of the verb are not part of the VP chunk

[Hundreds] [fall] [sick] [in Bangladesh factory]

ADVPs that contain an NP make two chunks

[Militants] [kill] [6 soldiers] [in northwest Pakistan]

all NP chunks inside a PP chunk are marked as PP

[Drone strike] [kills] [four suspected militants] [in Pakistan]

Algorithm 1 Chunking
procedure Chunking(parsetree, tokens)
    extract the phrase and pos from parsetree
    initialize the chunk ← “[”
    repeat the following steps for each position in pos:
        if pos = N and pos - 1 ∈ {N, DT, A, IN}
            add token to the current chunk
        if pos = IN and pos - 1 ∈ {IN, JJR}
            add token to the current chunk
        if pos = DT and pos - 1 ∈ {N, V}
            start a new chunk
        if pos = CC
            add token to the current chunk
        if pos = CD and pos - 1 ∈ {IN, V, R}
            add token to the current chunk
        if pos = A and pos - 1 ∈ {N, CD, R}
            add token to the current chunk
        if pos = TO and phrase = PP OR pos + 1 ∈ {N, A, R}
            add token to the current chunk
        if pos = V and pos - 1 ∈ {N, R, A}
            start a new chunk
    return chunks
end procedure

Penn Treebank POS tags [24] are used to implement this algorithm. To understand the flow of the algorithm, the following points need to be considered:

parsetree is the analysis of a sentence and holds (i) phrases; (ii) POS tags of tokens; (iii) lower-case tokens; and (iv) dependency structure

chunk is a segmented version of a sentence

phrase and pos are the collections of the phrases and of the POS tags of the individual tokens

N, V, A, R represent all POS tags of the noun, verb, adjective and adverb classes
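
The rules of Algorithm 1 can be sketched in Python as shown below. This is a literal reading of the listed rules, not the authors' implementation. Several points are assumptions: `pos - 1` and `pos + 1` are read as the tags of the previous and next tokens, a token that fails an "add" rule or carries an unlisted tag starts a new chunk, a DT or V token that does not meet its "start a new chunk" condition joins the current chunk, and the per-token phrase labels are an optional hypothetical input. The sketch reproduces the "US allies" example from the chunking rules but is not guaranteed to reproduce every chunking shown in this section.

```python
def coarse(tag):
    """Map a Penn Treebank tag to N, V, A or R, or return it unchanged."""
    for prefix, cls in (("NN", "N"), ("VB", "V"), ("JJ", "A"), ("RB", "R")):
        if tag.startswith(prefix):
            return cls
    return tag


def in_set(tag, allowed):
    """True if the tag, or its coarse class, is one of the allowed symbols."""
    return tag is not None and (tag in allowed or coarse(tag) in allowed)


def chunk(tokens, tags, phrases=None):
    """Group tokens into chunks using the POS rules of Algorithm 1."""
    phrases = phrases or [None] * len(tokens)
    chunks = []
    for i, (token, tag) in enumerate(zip(tokens, tags)):
        prev = tags[i - 1] if i > 0 else None
        nxt = tags[i + 1] if i + 1 < len(tags) else None
        cls = coarse(tag)
        if cls == "N":
            extend = in_set(prev, {"N", "DT", "A", "IN"})
        elif tag == "IN":
            extend = in_set(prev, {"IN", "JJR"})
        elif tag == "DT":
            extend = not in_set(prev, {"N", "V"})  # new chunk after N or V
        elif tag == "CC":
            extend = True
        elif tag == "CD":
            extend = in_set(prev, {"IN", "V", "R"})
        elif cls == "A":
            extend = in_set(prev, {"N", "CD", "R"})
        elif tag == "TO":
            extend = phrases[i] == "PP" or in_set(nxt, {"N", "A", "R"})
        elif cls == "V":
            extend = not in_set(prev, {"N", "R", "A"})  # new chunk after N, R, A
        else:
            extend = False  # assumption: unlisted tags open a new chunk
        if extend and chunks:
            chunks[-1].append(token)
        else:
            chunks.append([token])
    return [" ".join(c) for c in chunks]


tokens = "US allies get help to repel Iranian computer attacks".split()
tags = ["NNP", "NNS", "VBP", "NN", "TO", "VB", "JJ", "NN", "NNS"]
print(chunk(tokens, tags))
```

The POS tags in the usage example are written by hand for illustration; in the described pipeline they would come from the tagger.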

The system-generated chunks are referred to as ‘SYS’ chunks and are evaluated against chunks identified manually by human annotators, which serve as gold standard ‘GS’ chunks. These GS chunks are publicly available with the SemEval-2016 Task 2 dataset 8 , and details about the dataset are reported in Section 5.1. The headlines training dataset has 756 sentence pairs, and the proposed chunking method identifies the correct chunking for 572 of them. Based on the experimental results, three causes of error that affect the performance of the proposed chunking module are listed here. In these examples, errors are marked in boldface, and square brackets ([]) represent chunk boundaries.

punctuations - in alignment punctuations remain as unaligned chunks.

[Mall attackers] [used] [‘] [less is more] [’] [strategy] (SYS)

[Mall attackers] [used] [‘ less is more ’ strategy] (GS)

ADJP followed by NP - for this task ADJP considered as a different chunk

[South Africans mark] [ailing] [Mandela] [’s 95th birthday] (SYS)

[South Africans] [mark] [ailing Mandela] [’s 95th birthday] (GS)

phrasal verb - for this task chunks are identified by considering the POS tags only, so the two words of a phrasal verb such as strikes off end up in two different SYS chunks.

[6.8 quake strikes] [off Solomon Islands] (SYS)

[6.8 quake] [strikes off] [Solomon Islands] (GS)

4.3 Alignment

The proposed token-to-chunk multi aligner combines two primary components. The first component, the 1:1 token aligner, operates as a pipeline of alignment modules that differ in the types of word pairs they handle. The second component, the 1:M (multi) chunk aligner, uses all the 1:1 token alignment information and a set of features to lift the token alignments to chunk level and to handle the unaligned tokens or chunks. This alignment module is developed and verified on the SemEval 2015 and 2016 iSTS datasets, and the following rules are adopted to satisfy the criteria of the alignment task:

contextual meaning is key for the alignment, which means that the deep sense of the sentences is considered instead of surface meaning.

[Red double decker bus] [driving] [through the streets]

[Double decker passenger bus] [driving] [with traffic]

The surface representations of the sentences differ, but the chunks share the same deep meaning, so the chunks are aligned with each other.

after 1:1 token alignment, if alignment results suggest multiple alignments for a token, the strongest (based on similarity score) alignment will be considered

[Hundreds] [of Bangladesh clothes factory workers] [ill]

[Hundreds] [fall] [sick] [in Bangladesh Factory]

In the given example, ill can be aligned with fall or sick. But, similarity score between (ill vs. sick) is higher than (ill vs. fall). So, ill is aligned with sick.

during alignment, one chunk may be aligned with multiple chunks; the 1:M chunk aligner performs this task

[2 dead], [2 injured] [in Nevada middle school shooting]

[Nevada]: [2 dead], [2 hurt] [in middle school shooting]

Here, Nevada and in middle school shooting of the second sentence are aligned with in Nevada middle school shooting of the first.

For unaligned chunks, the following rules are adopted:

attach the unaligned chunk or group of chunks to the 1:1 aligned chunks

measure the similarity score of the newly aligned chunk; chunks remain unaligned if no similarity exists

punctuation is left unaligned

4.3.1 1:1 token aligner

The 1:1 token aligner is one of the primary components of the alignment module; this section discusses the alignment procedure for individual word pairs. Before the individual tokens of parallel sentences are aligned, each word pair is categorized into one of four groups: (i) identical words; (ii) named entities; (iii) content words; and (iv) stop words.

Named Entity (NE): NEs are aligned separately, which enables the alignment of full and partial mentions of an entity. Incomplete mentions are aligned to all terms of the complete entity. This module recognizes only the first letters of acronyms and aligns an abbreviation to all parts of the full form. The given example shows full and partial alignment of NEs. To avoid wrong alignments of partial entities, the order of tokens is one of the key features.

[North Korea] Postpones Family Reunions with [South]

[North Korea] ‘ postpones ’ family unions with [South Korea]

identical word sequence: a common word sequence in the two sentences S and T indicates that the same words occur in both sentences of the pair. We align identical words in such sequences of length n that contain at least one content word.

Palestinians clash with Israeli forces in West Bank, Jerusalem

Palestinians clash with security forces in W. Bank

aligning content words: content words are aligned based on contextual evidence. Two system components are used for this purpose: measuring the word similarity between word pairs and gathering contextual evidence.

4.3.2 Measuring word similarity

Three levels of similarity are defined to identify the semantic similarity between words. The first level is an exact word or lemma match, which receives a similarity score of 5. The next level covers non-identical word pairs and has a similarity score below 5. The WordNet relatedness score and the Paraphrase Database score are the keys to aligning a non-identical word pair. An average similarity score is measured to identify such word pairs, using the JCN distance [17], WUP similarity [39], Resnik score [33] and the Paraphrase Database (PPDB) score (a value in (0, 1)) from its largest lexical package [29]. Later, this similarity score is used to assign a similarity score to an alignment.
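
The levels described above can be sketched as a single scoring function. This is only an illustration under explicit assumptions: the function name `word_similarity`, the `lemma` callable and the `scorers` list are hypothetical, each scorer is assumed to return a value already normalized to [0, 1) or `None` when it has no entry for the pair, and the average is rescaled to the 0 to 5 range by multiplying by 5. The paper does not state how the individual measures are normalized before averaging.

```python
def word_similarity(w1, w2, lemma, scorers):
    """Return 5.0 for an exact or lemma match, else a rescaled average score."""
    if w1.lower() == w2.lower() or lemma(w1) == lemma(w2):
        return 5.0
    # Keep only the measures that know this word pair.
    scores = [s for s in (scorer(w1, w2) for scorer in scorers) if s is not None]
    if not scores:
        return 0.0  # assumption: no evidence means no similarity
    return 5.0 * sum(scores) / len(scores)


# Toy stand-ins for a lemmatizer and for the WordNet / PPDB measures.
toy_lemma = lambda w: {"kills": "kill", "killed": "kill"}.get(w, w)
toy_scorers = [lambda a, b: 0.8, lambda a, b: 0.4, lambda a, b: None]
print(word_similarity("ill", "sick", toy_lemma, toy_scorers))
```

Because every scorer is assumed to stay below 1, a non-identical pair always ends up with a score below 5.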

4.3.3 Gather contextual evidence

Contextual evidence is collected from two sources: syntactic dependencies and cosine similarity between word vectors. To understand how this module works, let S and T be the two given sentences and consider w_i ∈ S and w_j ∈ T as a candidate pair for alignment that satisfies the following criteria:

(w_i, w_j) share some semantic information; in this case, the existence of a WordNet relatedness score and a PPDB score is evidence for alignment

there exist w_s ∈ S and w_t ∈ T such that (w_i, w_s) in S have the same dependency as (w_j, w_t) in T

For the alignment of non-identical word pairs, dependencies are an important source of contextual evidence. If $(w_{i}, w_{s}) \in R$ and $(w_{j}, w_{t}) \in R$ for the same relation $(R)$, then (w_s, w_t) are also aligned. For this alignment module, the Stanford universal dependencies are adopted. An example of such an alignment is shown in Fig. 2, where water is aligned with sea because both are in the same dependency relation $(R)$ with boat.

Fig.2

Direct dependency between (w_i, w_j).
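
The direct-dependency criterion can be illustrated with a short sketch. It assumes that each parse is available as a collection of (head, relation, dependent) triples; this representation, the function name `dependency_evidence` and the toy triples in the usage example are hypothetical and do not reproduce the actual parser output behind Fig. 2.

```python
def dependency_evidence(w_i, w_j, deps_s, deps_t):
    """Find pairs (w_s, w_t) linked to (w_i, w_j) by the same relation R.

    deps_s and deps_t hold (head, relation, dependent) triples for S and T.
    """
    evidence = []
    for head_s, rel_s, dep_s in deps_s:
        for head_t, rel_t, dep_t in deps_t:
            if rel_s != rel_t:
                continue
            if head_s == w_i and head_t == w_j:
                # w_i and w_j govern w_s and w_t through the same relation.
                evidence.append((dep_s, dep_t, rel_s))
            elif dep_s == w_i and dep_t == w_j:
                # w_i and w_j depend on w_s and w_t through the same relation.
                evidence.append((head_s, head_t, rel_s))
    return evidence


# Toy triples: "boat" is related to "water" in S and to "sea" in T.
deps_s = [("boat", "nmod", "water")]
deps_t = [("boat", "nmod", "sea")]
print(dependency_evidence("boat", "boat", deps_s, deps_t))
```

In this toy case the already matching pair (boat, boat) yields (water, sea) as a candidate alignment supported by the shared relation.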

We have also measured the contextual evidence for intra–category alignment of content words in four major categories: noun, verb, adjective, and adverb. The Stanford POS tagger is used to identify the categories. Table 1 shows each equivalence type dependency for each lexical category of S and T. This list of dependency relation constitutes the key feature for a bi–directional alignment of (w_i, w_j) and (w_s, w_t). This intra–category alignment of content words is considered as equivalence type dependency and an example of such alignment is shown in Fig. 3. The dobj dependency in S is equivalent to the nsubj dependency in T, since they represent the same semantic relation.

Fig.3

Equivalence dependency between (w_s, w_t).

Table 1

Direct and equivalence dependency structures

POS (w_i, w_j)	POS (w_s, w_t)	Dependency Structure
verb	verb	nmod:into, nmod:against,mark
	noun	nsubj, dobj
noun	verb	infmod, rcmod acl
	noun	nmod
	adjective	amod, rcmod
adjective	adjective	conj_and, conj_or
adverb	adverb	conj_and, conj_or

aligning stop words: we use contextual evidence and a textual neighbourhood approach to align stop words. Some stop words are aligned as part of an identical word sequence. In the example given for aligning identical words, in of S and T is already aligned as part of the West Bank alignment. For the remaining stop words, we apply the dependency structure and the textual neighbourhood under the following assumptions.

First, only the already aligned pairs are examined, rather than all semantically similar word pairs. A window length of n = 3 is used for aligning stop words that appear next to identical words. Secondly, many stop words, such as determiners and modals, have very little effect on the dependency structures they take part in, so only exact matching of dependencies is considered here.

4.3.4 1:M (multi) chunk aligner

This is the second primary alignment component, which accepts all the 1:1 token-aligned pairs and the chunks of S and T as input. The fundamental goals of this module are as follows:

aligning the chunks by considering all 1:1 token alignment information and

leaving as few chunks as possible unaligned, by considering minimum to maximum contextual evidence between word pairs. The formation of contextual evidence is discussed in Sections 4.3.2 and 4.3.3.

the unaligned chunks of S can be aligned with any one of the aligned chunks of T, or vice versa

if alignment is possible on the basis of minimum similarity or contextual evidence, then a new relation and a new similarity score are assigned to this alignment. The assignment of a relation type and a similarity score is discussed in Sections 4.5 and 4.4.

if the above conditions fail, the chunks are left unaligned

To align the chunks of S and T, the system takes two inputs: all the chunks identified during the chunking of the individual sentences, and the output of the 1:1 token aligner. To move from token alignment to chunk alignment, the following feature set is considered:

Consider the boundary of each chunk of S and T: if a token t_i ∈ c_i of S is already aligned to a token of c_j of T, all tokens of c_i are aligned with c_j.

The cosine similarity between the chunks is the final consideration for aligning an unaligned fragment, with a threshold value th in the range 0.3 ≤ th < 0.7. The reasons for choosing this threshold are discussed in Section 5.3.

Table 2 shows an output of the token-to-chunk multi aligner. First, the individual tokens of S and T are aligned, where t_i and t_j represent the tokens of S and T. Then S and T are chunked and the output of the 1:1 token aligner is refined by considering the chunk boundaries. Finally, the unaligned chunks of S and T are considered for alignment.

Table 2
Output of a token to chunk multi aligner

Inputs	Russia condemns North Korean nuclear test (S)
	South Korea confirms that North Korea has conducted controversial third nuclear test (T)
1:1 Alignment	(North, North) (Korean, Korea) (nuclear, nuclear) (test, test)
Chunking	[Russia] [condemns] [North Korean nuclear test]
	[South Korea] [confirms] [that] [North Korea] [has conducted] [controversial third nuclear test]
1:M Alignment	(North Korean nuclear test, North Korea) (North Korean, controversial third nuclear test)
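
The two stages of the 1:M chunk aligner can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: token alignments are given as (position in S, position in T) pairs, chunks are lists of tokens, and `bow_cosine` is a toy bag-of-words cosine that stands in for the word-vector cosine similarity used in the paper. An unaligned chunk is attached to its most similar aligned chunk on the other side only when the similarity lies in [0.3, 0.7).

```python
from collections import Counter
from math import sqrt


def bow_cosine(a, b):
    """Toy chunk similarity: cosine over lower-cased token counts."""
    ca, cb = Counter(t.lower() for t in a), Counter(t.lower() for t in b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def chunk_index(chunks):
    """Map every token position to the index of the chunk containing it."""
    return [ci for ci, chunk in enumerate(chunks) for _ in chunk]


def align_chunks(chunks_s, chunks_t, token_pairs, sim=bow_cosine, low=0.3, high=0.7):
    """Return sorted (chunk of S, chunk of T) index pairs; 1:M links are allowed."""
    of_s, of_t = chunk_index(chunks_s), chunk_index(chunks_t)
    # Stage 1: a token alignment links the two chunks that contain the tokens.
    links = {(of_s[i], of_t[j]) for i, j in token_pairs}
    aligned_s = sorted({a for a, _ in links})
    aligned_t = sorted({b for _, b in links})
    # Stage 2: try to attach each unaligned chunk to an aligned chunk.
    for a, chunk in enumerate(chunks_s):
        if a not in aligned_s:
            best = max(aligned_t, key=lambda b: sim(chunk, chunks_t[b]), default=None)
            if best is not None and low <= sim(chunk, chunks_t[best]) < high:
                links.add((a, best))
    for b, chunk in enumerate(chunks_t):
        if b not in aligned_t:
            best = max(aligned_s, key=lambda a: sim(chunks_s[a], chunk), default=None)
            if best is not None and low <= sim(chunks_s[best], chunk) < high:
                links.add((best, b))
    return sorted(links)


chunks_s = [["Russia"], ["condemns"], ["North", "Korean", "nuclear", "test"]]
chunks_t = [["South", "Korea"], ["confirms"], ["that"], ["North", "Korea"],
            ["has", "conducted"], ["controversial", "third", "nuclear", "test"]]
token_pairs = [(2, 4), (3, 5), (4, 10), (5, 11)]  # the 1:1 alignment of Table 2
print(align_chunks(chunks_s, chunks_t, token_pairs))
```

On the sentences of Table 2, the third chunk of S is linked to both the fourth and the sixth chunk of T, which is the 1:M behaviour described above; with the toy similarity, the remaining chunks stay unaligned.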

4.4 Similarity/Relatedness score

Alignment scores are assigned either directly between the aligned chunks or as the average similarity/relatedness score over the tokens [20]. For direct assignment, 0 and 5 are assigned for the ‘NOALI’ and ‘EQUI’ relation types, respectively. To measure the chunk similarity, we have adopted the method described by [15], given in Equation 1.

$chunkSim (c_{1}, c_{2}) = \frac{\sum_{i = 1}^{n} \max_{j = 1}^{m} sim (w_{i}, w_{j})}{\min (n, m)}$ (1)

where c₁ and c₂ are two aligned chunks, and n and m are the number of tokens in c₁ and c₂ respectively. chunkSim takes two chunks (c₁, c₂) as input and computes the sum of the maximum word-to-word similarities, sim (w_i, w_j), normalized by the length of the shorter chunk. To do this, we have considered the following semantic similarity measures:

The WordNet relatedness score between content words, measured with the SEMILAR Toolkit [34].

The similarity score available for a word pair in the largest (XXXL) lexical paraphrase package of PPDB [29].

The cosine similarity between two word vectors, measured with the Google pre-trained Word2Vec model (300 dimensions) [27] through the Python gensim library [32].
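
Equation 1 can be written directly as a function. The sketch below reads the equation as a sum over the tokens of c₁ of the best match in c₂, divided by the length of the shorter chunk. The word-level measure `sim` is passed in as a parameter; the toy measure in the usage example is a hypothetical stand-in for the WordNet, PPDB and Word2Vec scores listed above.

```python
def chunk_sim(c1, c2, sim):
    """Sum of best word-to-word similarities, normalized by the shorter chunk."""
    if not c1 or not c2:
        return 0.0
    total = sum(max(sim(w1, w2) for w2 in c2) for w1 in c1)
    return total / min(len(c1), len(c2))


def toy_sim(w1, w2):
    """Hypothetical word similarity on the 0-5 scale."""
    if w1 == w2:
        return 5.0
    return {("5", "four"): 3.0}.get((w1, w2), 0.0)


print(chunk_sim(["5", "militants"], ["four", "suspected", "militants"], toy_sim))
```

Here the best matches are 3.0 for `5` and 5.0 for `militants`, so the chunk similarity is (3.0 + 5.0) / 2 = 4.0.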

4.5 Assign relation/Class label

This module takes a pair of chunks as input and provides a reason why the pair is aligned. It is inspired by the NeRoSim [7], VRep [15] and SVCSTS [20] methods, which classify a pair of chunks using features extracted from the chunks themselves. We have adopted all the semantic features (such as antonyms and synonyms) and syntactic features (such as word counts and parts of speech) used in NeRoSim and SVCSTS, and designed a joint feature set to assign the relation type. Table 3 lists the features required for assigning a relation to each aligned chunk pair.

Table 3
Relation Types with Required Feature Sets

Relation Type	Features
EQUI	(a) NP chunks of equal length
	(b) cosine similarity score between word vectors ≥ 0.7
	(c) Wu and Palmer relatedness score in the interval (0, 1)
	(d) similarity score of PP chunks (after removing stop words)
	(e) string matching for chunks of equal length
EQUI/SPE1/SPE2/SIMI	(a) named entity overlap score
OPPO	(a) whether the chunks contain antonyms
	(b) ratio of the number of adjectives in the chunk of S to the number of content words
SPE1/SPE2	(a) difference in the number of content words
	(b) whether the chunks contain equivalent nouns
SPE1	(a) percentage of unmatched content words in chunk2 ≤ 0
	(b) difference in the number of verbs
SIMI	(a) whether the chunks contain numerical quantities
	(b) percentage of unmatched content words ≥ 0.33
REL/SPE1/SPE2	(a) the chunks contain an adjective and an adverb
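Two of the features in Table 3 can be sketched as follows. This is only an illustration under stated assumptions: the stop word list is a tiny hypothetical one, a "content word" is approximated as any non-stop word, and "matched" means exact lowercased string equality. None of these choices is specified in the text.

```python
# Hypothetical, deliberately small stop word list (assumption).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "that", "has", "is"}


def content_words(chunk):
    """Lowercased tokens of the chunk that are not stop words."""
    return [t.lower() for t in chunk if t.lower() not in STOP_WORDS]


def unmatched_content_ratio(chunk, other):
    """Fraction of content words in `chunk` with no identical word in `other`."""
    words = content_words(chunk)
    if not words:
        return 0.0
    other_words = set(content_words(other))
    return sum(1 for w in words if w not in other_words) / len(words)


def content_word_count_diff(chunk1, chunk2):
    """Difference in the number of content words between two chunks."""
    return len(content_words(chunk1)) - len(content_words(chunk2))


chunk1 = ["North", "Korean", "nuclear", "test"]
chunk2 = ["controversial", "third", "nuclear", "test"]
ratio = unmatched_content_ratio(chunk2, chunk1)
diff = content_word_count_diff(chunk1, chunk2)
```

For this toy pair, half of the content words of the second chunk are unmatched, which is above the 0.33 level listed for SIMI in Table 3; how the features are combined into a final label is not shown here.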

5 Experiment results

5.1 Dataset information

The SemEval-2016 Task 2 dataset is used for experimentation. It comprises sentence pairs from news headlines (tagged as Headlines) and image descriptions (tagged as Images). The Headlines dataset was gathered by the Europe Media Monitor engine from several different news sources (from April 2nd, 2013 to July 28th, 2014), as described in [8]. The Images dataset is a part of the PASCAL VOC-2008 dataset [31], which consists of 1000 images with around 10 descriptions each. Statistics of the dataset are reported in Table 4.

Table 4
Statistics of Dataset with Relation Type

Dataset	Split	Sentence Pairs	EQUI	REL	SIMI	SPE1	SPE2	OPPO	NOALI
Headlines	Train	756	1323	128	324	200	193	19	22651
Headlines	Test	375	686	99	158	107	108	13	11343
Images	Train	750	1029	67	340	245	229	3	27152
Images	Test	375	624	32	185	150	152	1	13641

5.2 Evaluation measures

The Word Alignment Evaluation method is adopted here, which is based on the F1 score of the precision and recall of token alignments [25]. It has been argued in the literature that F1 is a better measure than Alignment Error Rate [13]. Precision and recall are computed from the token–token alignments shared by the system (SYS) and gold standard (GS) files: for precision, the number of shared alignments is divided by the number of alignments in the SYS file, and for recall it is divided by the number of alignments in the GS file.
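The precision, recall and F1 computation described above can be sketched as below, with alignments represented as sets of token index pairs. The representation, the toy data and the function name are assumptions made for illustration; the official SemEval evaluation script may differ in its details.

```python
def alignment_f1(sys_pairs, gs_pairs):
    """Precision, recall and F1 over token-token alignment pairs."""
    sys_pairs, gs_pairs = set(sys_pairs), set(gs_pairs)
    common = len(sys_pairs & gs_pairs)  # alignments present in both files
    precision = common / len(sys_pairs) if sys_pairs else 0.0
    recall = common / len(gs_pairs) if gs_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy (source token index, target token index) pairs.
sys_alignments = {(3, 7), (4, 7), (5, 11), (6, 12)}
gs_alignments = {(3, 7), (4, 8), (5, 11), (6, 12), (2, 5)}
p, r, f1 = alignment_f1(sys_alignments, gs_alignments)
```

Here three of the four system alignments also appear in the gold standard, which contains five alignments in total, so precision is 3/4 and recall is 3/5.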

The performance of the proposed method is measured by four distinct evaluation metrics: (i) F1–Ali (correct alignment), (ii) F1–Type (correct alignment with relation type), (iii) F1–Score (correct alignment with relatedness score) and (iv) F1–Type+Score (correct alignment with both relation type and relatedness score).

5.3 Significance of threshold (th) value

From the analysis of the train and test datasets, it is clear that the ‘EQUI’ type occurs more frequently than the other relation types (detailed statistics of the dataset are listed in Table 4). The features required for each alignment type are listed in Table 3, and cosine similarity is one of the features that we have used for this task. The cosine similarity score between two chunks is computed with the Google pre-trained Word2Vec model.

The goal of the alignment module is to align as many chunks of S and T as possible. For this purpose, all unaligned chunks are compared against the aligned chunks of each sentence to find any semantic relation. When a semantic relation is available, the 1:M multi chunk aligner extracts the cosine similarity score from the Google pre-trained vectors. To assign correct alignments (by considering minimum evidence), an experiment was conducted to fix a threshold value (th). We found that when th is set to a similarity score of 0.7 or above, the aligned chunks share the strongest semantic relation, which is valid for the ‘EQUI’ type. If, however, the system uses a th value below 0.3 (cosine similarity score), chunks with no semantic relationship also get aligned. This simple heuristic achieves an F1–Ali score of 0.95.
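The role of th can be illustrated with a small sketch. The toy 3-dimensional vectors below stand in for the 300-dimensional Word2Vec vectors, and the text does not state how a chunk-level vector is formed from word vectors, so the vectors are simply given directly. The function names and example values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def should_align(vec1, vec2, th=0.7):
    """Align two chunks only when their cosine similarity reaches th."""
    return cosine(vec1, vec2) >= th


# Toy vectors (assumed values, not taken from the pre-trained model).
related = ([0.9, 0.1, 0.0], [0.8, 0.2, 0.1])
unrelated = ([0.9, 0.1, 0.0], [0.1, 0.2, 0.9])

strict_related = should_align(*related)               # th = 0.7
strict_unrelated = should_align(*unrelated)           # th = 0.7
loose_unrelated = should_align(*unrelated, th=0.1)    # th below 0.3
```

With th = 0.7 only the related pair is aligned, while a threshold below 0.3 also lets the unrelated pair through, which mirrors the behaviour described above. With gensim, a word-level cosine score could instead be obtained from a loaded `KeyedVectors` object through its `similarity` method.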

5.4 Result analysis

In this work, a 1:M (multi) chunk aligner is proposed, which is based on two major modules. The proposed chunk aligner is verified on data from two individual domains. Below, we report the performance of all modules and compare the performance of the proposed aligner against other state-of-art alignment methods. The results of the proposed method and of the other state-of-art methods are listed in Table 5.

Table 5
Results of Headline (H) and Image (I) Dataset with other State-of-Art Methods

Method	Ali (H)	Ali (I)	Type (H)	Type (I)	Score (H)	Score (I)	Score+Type (H)	Score+Type (I)
UBC	0.78	0.89	0.52	0.61	0.70	0.80	0.50	0.60
VRep	0.89	0.85	0.60	0.55	0.80	0.77	0.60	0.55
DTSim	0.93	0.91	0.70	0.69	0.84	0.86	0.70	0.67
Proposed Method	0.95	0.93	0.71	0.72	0.85	0.88	0.70	0.68
iUBC	0.91	0.90	0.70	0.69	0.84	0.84	0.70	0.67

5.4.1 Comparative analysis

We have compared the performance of the proposed alignment method against the alignment methods reported in SemEval 2016 Task 2 [3]. UBC [2] adopted the monolingual word aligner [36] for 1:1 token alignment and treated chunk alignment as an assignment problem solved with the Hungarian–Munkres algorithm. In VRep [15], chunks were aligned according to the similarity score between each pair of chunks, which was measured with the vector relatedness measure [30]. iUBC [23] proposed a token–token aligner that used similarity metrics such as lowercased token overlap, stemmed or lemmatized token overlap, and cosine similarity between Mikolov’s pre-trained word vectors [27]. DTSim [6] proposed an aligner driven by different semantic relations and scores, built on a 1:1 token aligner [7].

The proposed aligner uses features such as word vectors and adopts the dependency structures reported in [36] to measure the contextual evidence between a pair of words. It differs from the aligners mentioned above in that it measures three levels of similarity score between words and divides the 1:1 alignment task into four sub-modules. For this task, minimum contextual evidence is considered when aligning an unaligned chunk. The majority of the state-of-art methods use two matrices for alignment, which adds computational cost. The proposed method considers only the chunk boundary until all tokens of each chunk are aligned, which reduces the computational burden.

6 Conclusion

In this paper, we have proposed a token-to-chunk multi aligner as part of an interpretable semantic textual similarity method. The proposed work is inspired by SemEval 2016 Task 2, which addressed the issues related to interpreting the similarity of two sentences. Three levels of word similarity and the contextual similarity of words are considered for 1:1 token alignment. The proposed method has been evaluated on the SemEval 2016 Task 2 dataset, and the experimental results show that the proposed aligner outperforms many state-of-art aligners that were part of SemEval 2016 Task 2. All comparisons were made against state-of-art methods developed on the same datasets.

Footnotes

wordnetcode.princeton.edu/3.0/WordNet-3.0.tar.gz

Acknowledgment

The work presented here falls under Research Project Grant No. YSS/2015/000988 and is supported by the Department of Science & Technology (DST) and the Science and Engineering Research Board (SERB), Govt. of India.

References

1. Abney S.P., Parsing By Chunks, Dordrecht, Springer Netherlands, 1992, pp. 257–278.

2. Agirre, Gonzalez-Agirre, Lopez-Gazpio, Maritxalar, Rigau and Uria, UBC: Cubes for English semantic textual similarity and supervised approaches for interpretable STS, In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado, 2015, pp. 178–183. Association for Computational Linguistics.

3. Agirre, Gonzalez-Agirre, Lopez-Gazpio, Maritxalar, Rigau and Uria, SemEval-2016 task 2: Interpretable semantic textual similarity, In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, California, 2016, pp. 512–524. Association for Computational Linguistics.

4. Agirre, Banea, Cardie, Cer, Diab, Gonzalez-Agirre, Guo, Lopez-Gazpio, Maritxalar, Mihalcea, Rigau, Uria and Wiebe, SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability, In Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval’15, Denver, Colorado, 2015, pp. 252–263.

5. Aleven, Popescu and Koedinger K.R., Pedagogical content knowledge in a tutorial dialogue system to support self-explanation, Working Notes of the AIED 2001 Workshop Tutorial Dialogue Systems (2001).

6. Banjade, Maharjan, Niraula N.B. and Rus, DTSim at SemEval-2016 task 2: Interpreting similarity of texts based on automated chunking, chunk alignment and semantic relation prediction, In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 2016, pp. 809–813. Association for Computational Linguistics.

7. Banjade, Niraula N.B., Maharjan, Rus, Stefanescu, Lintean and Gautam, NeroSim: A system for measuring and interpreting semantic textual similarity, In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado, ACL, pp. 164–171.

8. Clive, van der Goot, Blackler, Garcia and Horby, Europe media monitor–system description, In EUR Report 22173-En, 2005.

9. Cohen W.W., Fast effective rule induction, In Proceedings of the Twelfth International Conference on Machine Learning, ICML’95, 1995, pp. 115–123. Morgan Kaufmann Publishers Inc.

10. Collobert and Weston, A unified architecture for natural language processing: Deep neural networks with multitask learning, In Proceedings of the 25th International Conference on Machine Learning, New York, NY, USA, ACM, pp. 160–167.

11. Corley and Mihalcea, Measuring the semantic similarity of texts, In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, EMSEE ’05, Stroudsburg, PA, USA, 2005, pp. 13–18. Association for Computational Linguistics.

12. Finkel J.R., Grenager and Manning, Incorporating non-local information into information extraction systems by Gibbs sampling, In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, Stroudsburg, PA, USA. ACL, 2005, pp. 363–370.

13. Fraser and Marcu, Measuring word alignment quality for statistical machine translation, In Computational Linguistics, volume 33, Cambridge, MA, USA, MIT Press, 2007, pp. 293–303.

14. Grosan and Abraham, Rule–Based Expert Systems, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 149–185.

15. Henry and Sands, VRep at SemEval-2016 task 1 and task 2: A system for interpretable semantic similarity, In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, 2016, pp. 577–583.

16. Islam and Inkpen, Semantic text similarity using corpus-based word similarity and string similarity, In ACM Trans Knowl Discov Data, volume 2, New York, NY, USA, ACM, 2008, pp. 10:1–10:25.

17. Jiang J.J. and Conrath D.W., Semantic similarity based on corpus statistics and lexical taxonomy, In Proceedings of International Conference Research on Computational Linguistics, 1997.

18. Jordan P.W., Makatchev, Pappuswamy, VanLehn and Albacete P.L., A natural language tutorial dialogue system for physics, In Proceedings of the Nineteenth International Florida Artificial Intelligence Research Society Conference, 2006, pp. 521–526.

19. Karumuri, Vuggumudi V.K.R. and Chitirala S.C.R., UMDuluth-BlueTeam: SVCSTS – a multilingual and chunk level semantic similarity system (SemEval 2015), In Proceedings of the 9th International Workshop on Semantic Evaluation, 2015a, pp. 107–110. Association for Computational Linguistics.

20. Karumuri, Vuggumudi V.K.R. and Chitirala S.C.R., UMDuluth-CS8761-12: A novel machine learning approach for aspect based sentiment analysis, In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado, 2015b, pp. 742–747. Association for Computational Linguistics.

21. Klein and Manning C.D., Accurate unlexicalized parsing, In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, ACL ’03, pp. 423–430. Association for Computational Linguistics.

22. Li, McLean, Bandar Z.A., O’Shea J.D. and Crockett, Sentence similarity based on semantic nets and corpus statistics, IEEE Transactions on Knowledge and Data Engineering 18(8) (2006), 1138–1150.

23. Lopez-Gazpio, Agirre and Maritxalar, iUBC at SemEval-2016 task 2: RNNs and LSTMs for interpretable STS, In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, California, 2016, pp. 771–776. Association for Computational Linguistics.

24. Marcus M.P., Marcinkiewicz M.A. and Santorini, Building a large annotated corpus of English: The Penn Treebank, Computational Linguistics - Special Issue on Using Large Corpora: II 19(2) (1993), 313–330.

25. Melamed I.D., Manual annotation of translational equivalence: The Blinker project, University of Pennsylvania, Technical report, 1998.

26. Mihalcea, Corley and Strapparava, Corpus-based and knowledge-based measures of text semantic similarity, In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1, Boston, Massachusetts, AAAI Press, 2006, pp. 775–780.

27. Mikolov, Sutskever, Chen, Corrado G.S. and Dean, Distributed representations of words and phrases and their compositionality, In Proceedings of the 26th International Conference on Neural Information Processing Systems – Volume 2, NIPS’13, USA, Curran Associates Inc, pp. 3111–3119.

28. Nielsen R.D., Ward and Martin J.H., Recognizing entailment in intelligent tutoring systems, In Natural Language Engineering, volume 15, New York, NY, USA, Cambridge University Press, 2009, pp. 479–501.

29. Pavlick, Rastogi, Ganitkevitch, Durme B.V. and Callison-Burch, PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification, In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers), 2015, pp. 425–430. Association for Computational Linguistics.

30. Pedersen, Patwardhan and Michelizzi, WordNet::Similarity: Measuring the relatedness of concepts, In Demonstration papers at Human Language Technology conference / North American chapter of the Association for Computational Linguistics Annual Meeting, HLT/NAACL ’04, Stroudsburg, PA, USA, 2004, pp. 38–41. Association for Computational Linguistics.

31. Rashtchian, Young, Hodosh and Hockenmaier, Collecting image annotations using Amazon’s Mechanical Turk, In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, CSLDAMT ’10, Stroudsburg, PA, USA, 2010, pp. 139–147. Association for Computational Linguistics.

32. Rehurek and Sojka, Software framework for topic modelling with large corpora, In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, University of Malta, 2010, pp. 46–50.

33. Resnik, Using information content to evaluate semantic similarity in a taxonomy, In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI ’95, San Francisco, CA, USA, Morgan Kaufmann Publishers Inc, 1995, pp. 448–453.

34. Rus, Lintean, Banjade, Niraula and Stefanescu, SEMILAR: The semantic similarity toolkit, In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, 2013, pp. 163–168. Association for Computational Linguistics.

35. Sang E.F.T.K. and Buchholz, Introduction to the CoNLL-2000 shared task: Chunking, In Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning - Volume 7, ConLL ’00, pp. 127–132. Association for Computational Linguistics.

36. Sultan M.A., Bethard and Sumner, Back to basics for monolingual alignment: Exploiting word similarity and contextual evidence, Transactions of the Association for Computational Linguistics 2 (2014), 219–230.

37. Toutanova, Klein, Manning C.D. and Singer, Feature-rich part-of-speech tagging with a cyclic dependency network, In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03, Edmonton, Canada, 2003, pp. 173–180. Association for Computational Linguistics.

38. Šarić, Glavaš, Karan, Šnajder and Bašić B.D., TakeLab: Systems for measuring semantic text similarity, In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, SemEval ’12, Stroudsburg, PA, USA, 2012, pp. 441–448. Association for Computational Linguistics.

39. Wu and Palmer, Verb semantics and lexical selection, In Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, ACL’94, 1994, pp. 133–138.

