Learning pairwise patterns in Community Question Answering

Abstract

In recent years, forums offering community Question Answering (cQA) services have gained popularity on the web, as they offer a new opportunity for users to search and share knowledge. In fact, forums allow users to freely ask questions and expect answers from the community. Although the idea of receiving a direct, targeted response from other users is very attractive, it is not rare to see long threads of comments, where only a small portion of them are actually valid answers. In many cases users start conversations, ask for other information, and discuss things that are not central to the original topic. Therefore, finding the desired information in a long list of answers might be very time-consuming.

Designing automatic systems to select good answers is not an easy task. In many cases the question and the answer do not share a large textual content, and approaches based on measuring the question-answer similarity will often fail. A more intriguing and promising approach would be trying to define valid question-answer templates and use a system to understand whether any of these templates is satisfied for a given question-answer pair. Unfortunately, the manual definition of these templates is extremely complex and requires a domain-expert.

In this paper, we propose a supervised kernel-based framework that automatically learns from training question-answer pairs the syntactic/semantic patterns useful to recognize good answers. We carry out a detailed experimental evaluation, where we demonstrate that the proposed framework achieves state-of-the-art results on the Qatar Living datasets released in three different editions of the Community Question Answering Challenge of SemEval.

Keywords

Community Question Answering, Kernel methods, Structured Language Learning

1 Introduction

The huge amount of data that is stored and constantly produced on the Web represents an endless source of information. Unfortunately, only a small percentage of this data has a structured form, such as a database schema, that can be easily accessed and queried. Most data consists of documents written in natural language, whose treatment requires sophisticated Information Retrieval (IR) and Natural Language Processing (NLP) techniques in order to extract meaningful information and transform it into usable knowledge.

Several applications of these technologies are already part of our everyday life. On a daily basis, millions of users retrieve documents using search engines, or employ automatic machine translation systems to translate texts from one language to another. Spell checkers are implemented in most of the text editors, and spam filtering is automatically performed by email services.

Prior attempts to tackle language-processing tasks were based on the direct hand coding of large sets of rules, whose manual definition is an extremely tedious and labor-intensive process that requires a deep knowledge of the linguistic phenomena characterizing a specific task. For example, early automatic machine translation models were based on thousands of rules, leading to overly complicated systems that were consequently hard to maintain and adapt to new languages or domains [49].

The usage of large corpora became more and more common for tackling several Computational Linguistics problems [15], including Machine Translation [13], and parsing [16].

A paradigmatic shift has characterized modern NLP techniques: the rule-based engineering was successfully replaced by the application of Statistical Machine Learning (ML) methods, able to automatically extract and learn such rules through the analysis of large collections of texts.

Nowadays, the most common approaches to tackle NLP tasks are based on supervised learning techniques, which learn to extract knowledge from labeled training data. The training material is a set of pairs, consisting of an input textual object, such as a document or a sentence, and a desired output value, e.g., a document category or a predicate label. The goal of the learning algorithm is to analyze the data and to estimate a decision function that will be used to provide predictions on future unseen instances, i.e., input documents and sentences.

In this framework, a key problem is the definition of expressive representations for the input objects that should allow the learning algorithm to automatically extract meaningful information from the data. The problem of modeling input data is even more acute when the instances are not individual objects but pairs of objects, such as text and hypothesis in Recognizing Textual Entailment, or question and answer in Question Answering.

In this work, we will describe a kernel-based framework in which sentence pairs are modeled using a structured representation derived from their parse trees. The framework automatically induces from training question-answer pairs the syntactic-semantic patterns to be used for future pairs, in order to establish whether a sentence answers a given question. In [25], we initially proposed this framework to solve other tasks involving sentence pairs, namely Paraphrase Identification, which is the problem of understanding whether two text snippets convey the same information, and Recognizing Textual Entailment, which concerns determining if a text implies a hypothesis. Instead, in this work we tackle the task of Answer Selection in community Question Answering (cQA), demonstrating the flexibility of the proposed solution. We will extensively describe and motivate the system we adopted to win the 2016 and 2017 editions of the cQA challenge of SemEval [45, 46]. To better understand the effectiveness of the proposed solution, we will study the impact of different individual kernels and their combinations. Furthermore, we will perform additional experiments on the dataset from the 2015 edition of the cQA challenge [44], reaching new state-of-the-art results on that dataset.

The rest of the paper is organized as follows. Section 2 discusses the task of answer selection in cQA. Section 3 describes different approaches to model the lexical, syntactic and semantic information of individual texts. It aims at providing all the background material for understanding the following sections. Section 4 describes the task of community Question Answering and the proposed solution. Section 5 reports the experimental results on three editions of the community Question Answering challenge of SemEval. Finally, Section 7 draws the conclusions and describes future work.

2 Answer Selection in Community Question Answering

Community Question Answering (cQA) is a variant of a typical QA setting put in a Web forum context, where users can freely post questions and expect answers from other users, i.e., the community. This kind of interaction allows users to share opinions, experiences and ideas. Online forums such as Quora 1 , Yahoo! Answers 2 , or Stackoverflow 3 are an endless source of information that is largely accessible.

Most forums are moderated only indirectly via the community, and usually there are no restrictions on who can post and answer a question, or on what questions can be asked. This means that users can freely ask any question and can expect a variety of answers. However, it is often the case that many answers are only poorly related to the actual question, and some even change the topic. This is especially common for long threads where, as the thread progresses, users start talking to each other, instead of trying to answer the initial question. Therefore, a user has to go through all possible answers and to make sense of those. This is a real problem, as a question can have hundreds of answers, the vast majority of which would not satisfy the users’ information needs. Thus, finding the desired information in a long list of answers might be very time-consuming. Automatic systems for answer selection can largely improve the user experience.

Recently, three consecutive editions of SemEval focused on the task of selecting the relevant answers in cQA [44–46]. In particular, given a question and several community answers, the task was to classify each comment as

relevant (good),

potentially useful (potential); and

bad or irrelevant (bad). This class also includes dialogues and non-English comments.

The corpus used in the competition is extracted from the Qatar Living forum 4 . A simplified example is shown in Fig. 1, where answers 2 and 3 are good, Answer 1 is potentially useful, and Answer 4 is bad 5 .

Fig. 1

Example from SemEval-2016 Task 3.

Examples are organized in threads, and both questions and comments have a subject (a brief description) and a body, as well as some meta-information, such as the question category (according to the Qatar Living taxonomy), and the user identifier. This kind of information is usually available in every forum; so, even if the models described in the following sections have been developed specifically for the Qatar Living case, they can be effectively applied to different cQA domains and to different corpora.

Table 1 reports some statistics of the Qatar Living datasets of the 2015, 2016 and 2017 editions of the SemEval cQA Challenge [44–46]. In the 2017 edition, only a test set was released, and the datasets from the previous year were intended to be used as training material. The potential class, which is an intermediate class between good and bad, is the least frequent, while the good and bad classes are fairly balanced.

Table 1

Class distribution in the Qatar Living datasets.

Dataset	#Good	#Pot	#Bad
train 2015	8035	1645	6702
dev 2015	866	184	569
test 2015	991	167	789
train 2016	6651	3110	8139
dev 2016	818	413	1209
test 2016	1329	456	1485
test 2017	1523	0	1407

3 Modeling linguistic information in Computational Language Learning

The performance of machine learning algorithms heavily depends on the representation of the data they are provided with. The traditional ML approach consists in modeling instances as feature vectors. In NLP and IR [4, 54], the most common representation based on feature vectors is the Bag-of-Words (BoW) model [30], where each feature corresponds to a specific word in the dictionary. Each document is then mapped into a vector whose feature values express the occurrence of specific terms in the document. This kind of representation is very simple and neglects some important linguistic information: the words are considered mere symbols whose semantics is completely ignored. Furthermore, the BoW model is unable to capture syntactic information, which instead is central in the approach we will propose. The following sections will describe some alternative representations that we will use in this work. The aim is to provide all the background material to the reader in order to properly understand the model we propose in Section 4 for solving the answer selection task in community Question Answering.

3.1 Generalizing lexical information: Distributional models of lexical semantics

The BoW model does not generalize lexical information and cannot match semantically related words, such as synonyms. For instance, a pure direct matching between words cannot capture any similarity between “buy a car” and “purchase an automobile”, and a lexical generalization is needed to capture the quasi-synonymic usage of car vs. automobile and buy vs. purchase. This kind of information can be extracted from large scale lexicons, e.g., WordNet [21, 40], but their coverage is usually not satisfactory. Moreover, some tasks require a domain-specific notion of similarity that lexicons, which are usually general-purpose, cannot provide. These limitations can be overcome by defining word representations, i.e., vectors embodying the semantics of words. These methods have a significant impact in a variety of applications, such as document classification [55], question answering [66], information retrieval [37], named entity recognition [68], and parsing [61].

Usually words are represented as vectors in a low dimensional space, such that similar, or semantically related concepts are close. In these semantic spaces, the notion of word similarity is captured through cosine similarity, inner product, or other vectorial operations.
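As a minimal sketch of how similarity is measured in such a semantic space, the snippet below computes the cosine similarity between word vectors. The three-dimensional vectors in `toy_vectors` are invented purely for illustration and are not taken from any trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for this example only.
toy_vectors = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

sim_related = cosine(toy_vectors["car"], toy_vectors["automobile"])
sim_unrelated = cosine(toy_vectors["car"], toy_vectors["banana"])
```

With these toy values, `sim_related` is much larger than `sim_unrelated`, which is the behavior one would expect from vectors of quasi-synonyms in a real semantic space.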

In Computational Linguistics there is a wide literature regarding techniques for automatically generating word vectors. These methods rely on the Distributional Hypothesis, which has been synthesized by Firth (see [26]):

a word is characterized by the company it keeps

The underlying intuition behind Distributional Semantic Models is that the meaning of a word can be described by the set of textual contexts in which it appears. In fact, words appearing in similar contexts tend to be linked by some type of semantic relation, either paradigmatic (e.g., synonymy, hyperonymy, antonymy) or syntagmatic (e.g., meronymy, conceptual and phrasal association), as observed in [53].

The two main approaches for generating word vectors differ in how the distributional hypothesis is exploited. In the classic count-vector-based models 6 , the word-context co-occurrences are globally counted from a large scale corpus; in the context-predicting models (more commonly known as neural language models), the vector estimation problem is treated as a learning task, in which the values of the word vectors have to be set in order to optimize a cost function defined on the available corpus. In both approaches, a good approximation of the word distributional information can be achieved only if a sufficient amount of observations is gathered. Several large-scale corpora are available for English, e.g., the British National Corpus (BNC) [3], made of 100 million words, or the ukWaC corpus [6], made of 2 billion words.

A deep comparison of the effectiveness of different word embedding strategies on downstream tasks can be found in [7].

In this work, we will adopt Word2Vec [39], which is one of the most popular context-predicting models, thanks to its simplicity (the model is a neural network with a single hidden layer) and to the quality of the produced word embeddings.

3.2 Exploiting the syntactic information

Syntactic information is necessary for a complete understanding of the semantics of texts. For instance, two sentences containing the same words can have opposite meanings, as in Vettel passed Hamilton and Hamilton passed Vettel. In this case, a BoW model will result in two identical vectors, and only a syntactic analysis can clarify who is ahead.
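The limitation can be checked with a few lines of code. The sketch below builds a simple term-count BoW representation (whitespace tokenization and lowercasing are simplifying assumptions) and shows that the two sentences are indistinguishable.

```python
from collections import Counter


def bag_of_words(sentence):
    """Map a sentence to its term counts, ignoring word order."""
    return Counter(sentence.lower().split())


bow_a = bag_of_words("Vettel passed Hamilton")
bow_b = bag_of_words("Hamilton passed Vettel")

# Word order is lost, so the two representations coincide.
same_representation = bow_a == bow_b
```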

A possible approach for capturing syntax can be the manual definition of an artificial feature set that should emphasize the syntactic and semantic aspects useful to solve a target problem. Referring to the previous examples, some informative features can be extracted from their syntactic parse trees in Fig. 2.

Fig. 2

Constituency trees of the sentences Vettel passed Hamilton and Hamilton passed Vettel.

For instance, an artificial feature can be the syntactic path between a given word and the predicate (i.e., the word evoking the underlying situation, in this case passed). The two sentences have different paths from the argument Vettel to the predicate passed, reflecting the different semantic roles of Vettel: he is ahead in Vettel passed Hamilton, and behind in Hamilton passed Vettel.

These kinds of features are basically the building blocks of the underlying rules characterizing an NLP problem, and the task of determining how to exploit these features to generate robust predictive models is left to the learning algorithm. Therefore, the feature engineering approach can be somehow considered a simplification of the development of the rule-based systems mentioned in Section 1. However, the definition of meaningful features is still a rather expensive and complicated process that requires a domain expert; moreover, every task has specific patterns that must be considered, making a manual feature engineering an extremely complex and not portable process.

Kernel methods [18, 52] are an elegant and efficient alternative to the manual definition of features, thanks to their capability to operate directly on structured data, such as sequences, trees or graphs. In NLP, sentences can be naturally represented through their syntactic parse trees. Then, instead of trying to design a synthetic feature space, tree kernels [17] can be employed to directly operate on the parse trees of sentences, evaluating the tree fragments shared by the input trees. This operation is equivalent to a dot product in the implicit feature space of all possible tree fragments (which, of course, include most of the artificial features that can be engineered, such as the predicate-argument path, see for instance [28]). The dimensionality of such a space is extremely large and an explicit representation is not viable due to computational and memory usage issues: the number of different sub-fragments in a tree is combinatorial in the number of tree nodes; therefore, even a small tree can generate up to thousands of billions of features. Tree kernels lead to state-of-the-art performance in several NLP tasks where syntactic information is crucial, including Question Classification [19], Semantic Role Labeling [43], Paraphrase Identification and Recognizing Textual Entailment [25].

3.2.1 Tree Kernels for NLP

Tree Kernels (TKs) make it possible to estimate the similarity among texts by directly comparing their parse trees.

The underlying idea is that the similarity between two trees T₁ and T₂ can be derived from the number of shared tree fragments. Let the set $T = \{t_{1}, t_{2}, \dots, t_{| T |}\}$ be the space of all the possible substructures and χ_i (n_j) be an indicator function that is equal to 1 if the target t_i is rooted at the node n_j and 0 otherwise. A tree-kernel function over T₁ and T₂ is defined as follows: $TK (T_{1}, T_{2}) = \sum_{n_{1} \in N_{T_{1}}} \sum_{n_{2} \in N_{T_{2}}} Δ (n_{1}, n_{2}),$ (1) where $N_{T_{1}}$ and $N_{T_{2}}$ are the sets of nodes of T₁ and T₂ respectively, and $Δ (n_{1}, n_{2}) = \sum_{k = 1}^{| T |} χ_{k} (n_{1}) χ_{k} (n_{2}),$ (2) which computes the number of common fragments rooted in the n₁ and n₂ nodes.

Different tree kernels can be defined according to the types of tree fragments considered in the evaluation of the matching structures. In the Subtree Kernel [71], valid fragments are subtrees, i.e., any node of a tree along with all its descendants. Subset trees are exploited by the Subset Tree Kernel [17], which is usually referred to as Syntactic Tree Kernel (STK); they are more general structures since their leaves can be non-terminal symbols. The subset trees satisfy the constraint that grammatical rules cannot be broken. For example, the constituency tree in Fig. 3a has [VP [VBD NP]] as a subset tree. Instead, [VP [VBD]] is not a valid subset tree because the rule VP → VBD NP cannot be split. This strict constraint imposed by the STK may be problematic, especially when the training dataset is small and only a few syntactic tree configurations can be observed. The Partial Tree Kernel (PTK) relaxes this constraint by considering partial trees, i.e., fragments generated by the application of partial production rules, and usually leads to higher accuracy, as shown in [42]. Examples of different kinds of tree fragments are illustrated in Fig. 3.

Fig. 3

a) Constituency parse tree for the sentence Vettel passed Hamilton, b) some subtrees, c) some subset trees, d) some partial trees.

Given a sentence s with |s| words, $s [\vec{I_{s}}]$ defines the subsequence of s that includes the words corresponding to the indices $\vec{I_{s}} = \{i_{1}^{s}, \dots, i_{l (\vec{I})}^{s}\}$, where $1 \leq i_{1}^{s} < i_{2}^{s} < \dots < i_{l (\vec{I})}^{s} \leq | s |$, and $l (\vec{I_{s}})$ is the number of indices in $\vec{I_{s}}$.

We define $d (\vec{I}) = i_{l (\vec{I})} - i_{1}$ as the length of the subsequence in the original sentence (i.e., gaps are included in this count). The PTK computation [42] is carried out by the following Δ_PTK function:

$Δ_{PTK} (n_{1}, n_{2}) = \begin{cases} 0, & \text{if the labels of } n_{1} \text{ and } n_{2} \text{ differ} \\ μ \left( λ^{2} + \sum_{\vec{I}_{1}, \vec{I}_{2} : l (\vec{I}_{1}) = l (\vec{I}_{2})} λ^{d (\vec{I}_{1}) + d (\vec{I}_{2})} \prod_{k = 1}^{l (\vec{I}_{1})} Δ_{PTK} (c_{n_{1}} (i_{k}^{1}), c_{n_{2}} (i_{k}^{2})) \right), & \text{otherwise} \end{cases}$ (3)

where c_n (k) is the k-th child of the node n, and λ and μ are decay factors in (0, 1] that penalize large child subsequences (that can include gaps) and deep partial trees, respectively.
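To make the recursion concrete, the following sketch evaluates Eq. (3) literally, by enumerating all pairs of equal-length child subsequences, and then sums Δ_PTK over all node pairs as in Eq. (1). It is a naive illustration with exponential cost, not the efficient dynamic-programming algorithm of [42]; the `Node` class and the default values of `lam` and `mu` are assumptions introduced for this example.

```python
from itertools import combinations


class Node:
    """Minimal tree node (hypothetical helper for this sketch)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def nodes(tree):
    """Return all nodes of a tree in pre-order."""
    result = [tree]
    for child in tree.children:
        result.extend(nodes(child))
    return result


def delta_ptk(n1, n2, lam=0.4, mu=0.4):
    """Naive evaluation of Eq. (3); lam and mu defaults are arbitrary."""
    if n1.label != n2.label:
        return 0.0
    total = lam ** 2
    max_len = min(len(n1.children), len(n2.children))
    for length in range(1, max_len + 1):
        for idx1 in combinations(range(len(n1.children)), length):
            for idx2 in combinations(range(len(n2.children)), length):
                # d(I) = last index - first index, so gaps are penalized
                gap = (idx1[-1] - idx1[0]) + (idx2[-1] - idx2[0])
                prod = 1.0
                for i, j in zip(idx1, idx2):
                    prod *= delta_ptk(n1.children[i], n2.children[j], lam, mu)
                    if prod == 0.0:
                        break  # one mismatching child zeroes the product
                total += (lam ** gap) * prod
    return mu * total


def ptk(t1, t2, lam=0.4, mu=0.4):
    """Eq. (1): sum Delta over all pairs of nodes of the two trees."""
    return sum(delta_ptk(a, b, lam, mu) for a in nodes(t1) for b in nodes(t2))


def sentence_tree(subject, obj):
    """Constituency tree of '<subject> passed <obj>' (hand-built)."""
    return Node("S", [
        Node("NP", [Node("NNP", [Node(subject)])]),
        Node("VP", [
            Node("VBD", [Node("passed")]),
            Node("NP", [Node("NNP", [Node(obj)])]),
        ]),
    ])


tree_a = sentence_tree("Vettel", "Hamilton")
tree_b = sentence_tree("Hamilton", "Vettel")
similarity = ptk(tree_a, tree_b)
```

Unlike the BoW representation, the kernel is sensitive to where each word sits in the structure: the fragments of `tree_a` and `tree_b` only partially overlap, so `ptk(tree_a, tree_b)` is lower than `ptk(tree_a, tree_a)`.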

The computational complexity of a PTK computation is $O (η δ^{2} | N_{T_{1}} | | N_{T_{2}} |)$ [42], where η is the largest subsequence of children to be considered, and δ is the maximal number of children observed in the two trees. However, the average running time tends to be linear in the number of nodes for natural language syntactic trees [42]. Moreover, it is worth noting that the number of nodes in a tree is bounded and much lower than the combinatorial number of its possible sub-fragments.

4 Modeling questions and answers for answer selection

In the previous section, we introduced different approaches for modeling individual texts. However, in NLP several tasks are defined over pairs of texts. For instance, Paraphrase Identification (PI) consists of determining whether two sentences are paraphrases or not (see [20]); Recognizing Textual Entailment (RTE) (see [27] for a detailed description) is defined as a directional relation extraction between two text fragments, text and hypothesis. The relation holds whenever the truth of the hypothesis follows from the text. In this section we will propose a framework for modeling these kinds of tasks. In particular, we will focus on the task of Answer Selection in community Question Answering. Our model consists of a multi-kernel SVM operating on different features and data representations that we will describe in the following sections.

4.1 A shallow intra-pair similarity approach

A very simple approach to Answer Selection consists of evaluating the similarity between the question and the answer. For instance, answer A₂ in Fig. 1 contains some relevant keywords appearing in the question, such as gold and sell.

The similarity between two sentences can be captured in different ways and at different levels (lexical, syntactic and semantic). Various similarity measures have been proposed in the literature, and many of them were originally developed for the Semantic Textual Similarity task [1].

In the experiments reported in Section 5, we will adopt the following similarity measures:

Lexical Similarities: Cosine similarity, Jaccard coefficient [33] and containment measure [12] of n-grams of word lemmas (n = 1, 2, 3, 4 was used in all experiments); Longest common substring measure [29], Longest common subsequence measure [2], and Greedy String Tiling [76].

Syntactic Similarities: Cosine similarity of n-grams of part-of-speech tags. It considers a shallow syntactic similarity (n = 1, 2, 3, 4 was used in all experiments); Partial tree kernel [42] between the parse trees of the sentences.

Semantic Similarities: Cosine similarity between additive representations of word embeddings generated by applying word2vec [39] to the entire Qatar Living corpus from SemEval 2015 7 . Five features are derived considering (i) only nouns, (ii) only adjectives, (iii) only verbs, (iv) only adverbs and (v) all the above words.
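As an illustration of the first group, the sketch below computes cosine similarity, Jaccard coefficient and containment over word n-grams for n = 1, ..., 4. It assumes the tokens have already been lemmatized; the direction of the containment measure (normalizing by the first argument) and the example sentences are assumptions made for this example, not details taken from the system.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def jaccard(a, b):
    """Size of the intersection over size of the union of n-gram sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def containment(a, b):
    """Fraction of the distinct n-grams of a that also occur in b."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a) if set_a else 0.0


def cosine_counts(a, b):
    """Cosine similarity between n-gram count vectors."""
    count_a, count_b = Counter(a), Counter(b)
    dot = sum(count_a[g] * count_b[g] for g in count_a)
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexical_features(q_tokens, a_tokens, max_n=4):
    """Three similarity scores for each n-gram order from 1 to max_n."""
    feats = []
    for n in range(1, max_n + 1):
        grams_q, grams_a = ngrams(q_tokens, n), ngrams(a_tokens, n)
        feats.append(cosine_counts(grams_q, grams_a))
        feats.append(jaccard(grams_q, grams_a))
        feats.append(containment(grams_q, grams_a))
    return feats


# Hypothetical question and answer, invented for illustration.
question = "where can i sell gold".split()
answer = "you can sell gold at the souq".split()
features = lexical_features(question, answer)
```

Each score becomes one dimension of the feature vector of the question-answer pair.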

Each similarity generates a feature, so that each question-answer pair can be converted into a feature vector. Then a supervised learning algorithm, such as an SVM, can learn how to combine these features for solving the classification problem. We define LK_sim as a linear kernel operating on the above described intra-pair similarities. A system based on the direct comparison between question and answer provides a strong and efficient baseline. However, it has some inherent weaknesses that are evident if we consider the answer A₃ from Fig. 1: this comment is definitely good, although it does not contain words similar to the ones in the question.

4.2 Learning from sentence pairs

In question answering, many good question-answer pairs can be generalized to some latent answering template. For instance, a very high-level generalization of the question in Fig. 1 could be Can I do something?, while valid answers can be Yes you can …, or No you cannot …, which match A₂ and A₃, respectively 8 .

In a general-purpose forum, such as Qatar Living, a large variety of these question-answering templates can be considered. Their manual definition through hand-coded rules is extremely complex and time-consuming, and probably it will lead to a poor and low-coverage solution.

With the assumption that the targeted question-answer templates characterize multiple examples in the dataset, we propose a kernel-based approach to automatically learn syntactic-semantic patterns from training data. These patterns can somehow correspond to the question-answering templates we aim at capturing. The intuition we follow is quite simple: an answer is probably valid for a given question if similar question-answer pairs are labeled as good in the training set. For instance, the following question-answer pairs from the Qatar Living dataset have a very similar structure, although they treat completely different topics:

Good Question-Answer Pair 2

Q₂: Which is the best beach here in Qatar?

A^{Q₂}: Sealine Resort is the best option

Good Question-Answer Pair 3

Q₃: Which is the best bank around?

A^{Q₃}: CQB is the best option

This is a paradigmatic shift with respect to the similarity approach described in the previous section: sentences are not only compared with the other element of their pair, but inter-pair analysis is also performed.

4.3 Kernels on question-answer pairs

Pairs which can be associated with a common question-answering template share a similar syntactic structure, as illustrated in Fig. 4 (for the moment ignore the red REL prefixes). Therefore, high syntactic similarity between the two questions and between the two answers can be assumed to be a good indicator that the two pairs respect the same underlying template. We opted for a shallow parse tree representation, i.e., a tree where words have their POS tags as fathers, which are then grouped into chunks. In fact, the quality of more complex tree representations, such as full constituency trees, is largely affected by the noisy nature of the text in community Question Answering. Furthermore, sentences can be very long, so a full syntactic parse tree may be too large for a tree kernel computation. Finally, shallow parsing is significantly more efficient.

Fig. 4

Shallow trees of two question-answer pairs exhibiting the same answering structure.

The tree kernel methods described in Section 3.2 can effectively establish the syntactic similarity between two sentences. We can extend this evaluation to question-answer pairs using a simple tree kernel combination. In particular, given two pairs p_i = 〈q_i, a_i〉 and p_j = 〈q_j, a_j〉, and a tree kernel TK, such as the PTK, the following kernel on pairs can be defined 9 : ${TK}^{+} (p_{i}, p_{j}) = TK (q_{i}, q_{j}) + TK (a_{i}, a_{j}),$ (4)

which simply sums the tree kernel similarities between the two questions and between the two answers. It is worth noting that the core difference between the approach discussed in Section 4.1 and the proposed TK⁺ is that the former evaluates intra-pair similarities, i.e., questions are compared with their answers, while the latter performs an inter-pair analysis, i.e., questions are compared with questions and answers with answers.
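The structure of Eq. (4) can be sketched as a higher-order function that lifts any kernel on single texts to a kernel on pairs. In the snippet below, `shared_token_kernel` is a toy stand-in for a tree kernel (it counts matching token pairs, i.e., a dot product of term-count vectors) and is used only to keep the example self-contained; in the framework described here TK would be a tree kernel such as the PTK applied to shallow parse trees.

```python
def pair_kernel(base_kernel):
    """Lift a kernel on texts to a kernel on (question, answer) pairs, Eq. (4)."""

    def tk_plus(p_i, p_j):
        (q_i, a_i), (q_j, a_j) = p_i, p_j
        # Inter-pair comparison: question vs. question, answer vs. answer.
        return base_kernel(q_i, q_j) + base_kernel(a_i, a_j)

    return tk_plus


def shared_token_kernel(x, y):
    """Toy stand-in for a tree kernel: number of matching token pairs."""
    return float(sum(1 for tx in x for ty in y if tx == ty))


tk_plus = pair_kernel(shared_token_kernel)

# The two good pairs quoted above, lowercased and whitespace-tokenized.
pair_2 = ("which is the best beach here in qatar".split(),
          "sealine resort is the best option".split())
pair_3 = ("which is the best bank around".split(),
          "cqb is the best option".split())

score = tk_plus(pair_2, pair_3)
```

Even with this crude base kernel, the two pairs obtain a non-zero similarity although beaches and banks are unrelated topics, because the question never gets compared with its own answer.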

As suggested in [25] the text pairs representations are enriched by using a special tag, i.e., the red REL prefix in Fig. 4, to mark the words appearing in both the texts of a pair. In order to involve the syntactic part of the trees, the REL tag is propagated to the father and grandfather of the matching words, i.e., the POS tag and chunk nodes. This is a simple yet effective strategy to establish intra-pair relations and to allow the learning algorithm to distinguish the words having a counterpart in the other element of its pair.
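A possible reading of this marking step is sketched below. A shallow tree is represented as a list of chunks, each holding (POS tag, word) pairs; this data layout, the hand-written tags, and the choice of matching lowercased word forms without any stop-word filtering are assumptions of the example, not details confirmed by the source.

```python
def mark_rel(tree, other_tree):
    """Prefix REL- to POS tags of shared words and to their chunk labels."""
    other_words = {word.lower() for _, tokens in other_tree for _, word in tokens}
    marked = []
    for chunk_label, tokens in tree:
        new_tokens = []
        chunk_has_match = False
        for pos, word in tokens:
            if word.lower() in other_words:
                new_tokens.append(("REL-" + pos, word))  # father of the word
                chunk_has_match = True
            else:
                new_tokens.append((pos, word))
        # Propagate the tag to the grandfather (the chunk node).
        new_label = "REL-" + chunk_label if chunk_has_match else chunk_label
        marked.append((new_label, new_tokens))
    return marked


# Hand-annotated shallow trees for the pair <Q3, A^{Q3}> (illustrative tags).
question = [
    ("NP", [("WDT", "Which")]),
    ("VP", [("VBZ", "is")]),
    ("NP", [("DT", "the"), ("JJS", "best"), ("NN", "bank")]),
    ("ADVP", [("RB", "around")]),
]
answer = [
    ("NP", [("NNP", "CQB")]),
    ("VP", [("VBZ", "is")]),
    ("NP", [("DT", "the"), ("JJS", "best"), ("NN", "option")]),
]

marked_question = mark_rel(question, answer)
marked_answer = mark_rel(answer, question)
```

After marking, a tree kernel can only match a REL-tagged node with another REL-tagged node, so fragments containing these nodes encode the fact that a word has a counterpart in the other text of its pair.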

The proposed tree kernel combination implicitly generates a huge feature space that includes relational features, such as [S [NP-REL [JJ-REL] [NN]] [PP [IN]]], [S [NP [NNP]] [VP] [REL-NP [REL-JJS]]]. Those features are interesting pairwise patterns, which a kernel-based learning algorithm, such as SVM, can capture and exploit to create robust classification models.

An interesting aspect of kernel methods is that kernels can be easily combined to create more sophisticated models. In particular, we can combine TK⁺ with a kernel operating on the intra-pair similarity feature vectors described in Section 4.1, for instance a Linear Kernel (LK). This will result in a stronger kernel that can leverage both the intra-pair and inter-pair approaches.

4.4 Additional task-specific features

The explicit similarity features proposed in Section 4.1, as well as the implicit tree kernel features described in Section 4.2 have a large applicability in other tasks such as Paraphrase Identification or Textual Entailment Recognition. Instead, in this section we will present some features, proposed in [8], which have been developed specifically for the cQA scenario.

In particular, forty-four Boolean features express whether a comment:

includes URLs or emails (2 feats.). In fact, exploring the training data, it is possible to notice that many good comments suggest visiting a Web site or contacting an email address;

contains the words “yes”, “sure”, “no”, “can”, “neither”, “okay”, and “sorry”, as well as symbols ‘?’ and ‘@’ (9 feats.);

starts with “yes” (1 feat.). This feature, as well as some features from the previous group, indicates the presence of a direct answer to a yes/no question;

includes a sequence of three or more repeated characters, or a word longer than fifteen characters (2 feats.). These features signal the presence of non-proper English words that are usually associated with a very informal text, which is typical of conversations, i.e., bad comments;

belongs to one of the categories of the forum (Socializing, Life in Qatar, etc.) (26 binary feats.). Some categories are less “serious” than others, in the sense that they often contain “silly” questions or jokes, which obviously attract bad comments 10 . Therefore, the category represents a sort of a priori information of the comment quality;

has been posted by the same user who posted the question; such a comment can include a question (i.e., it contains a question mark), an acknowledgment (e.g., it contains thank*, acknowl*), or none of them (4 feats.). Usually users do not answer themselves: their comments are typically requests for further information or acknowledgments to the users who provided helpful answers.
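The text-only groups in the list above (URLs and emails, keywords and symbols, initial "yes", repeated characters and long words) can be sketched as follows. The regular expressions and feature names are illustrative assumptions; the category and same-user features are omitted because they require thread metadata.

```python
import re

# Simplified patterns, chosen for illustration only.
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(\.[\w-]+)+\b")
REPEAT_RE = re.compile(r"(\S)\1{2,}")  # same character three or more times
KEYWORDS = ["yes", "sure", "no", "can", "neither", "okay", "sorry"]


def comment_features(text):
    """Boolean features computed from the comment text alone."""
    tokens = re.findall(r"[a-z']+", text.lower())
    feats = {
        "has_url": bool(URL_RE.search(text)),
        "has_email": bool(EMAIL_RE.search(text)),
        "has_question_mark": "?" in text,
        "has_at_sign": "@" in text,
        "starts_with_yes": bool(tokens) and tokens[0] == "yes",
        "has_repeated_chars": bool(REPEAT_RE.search(text)),
        # Length is measured on alphabetic tokens, so URLs do not count.
        "has_long_word": any(len(token) > 15 for token in tokens),
    }
    for word in KEYWORDS:
        feats["has_" + word] = word in tokens
    return feats


helpful = comment_features(
    "Yes, you can. See https://example.com or mail me at info@example.com"
)
chatty = comment_features("sooooo funny hahahahahahahahaha")
```

Note that the repeated-character pattern is read literally here, so sequences such as "..." also trigger it; a real system may restrict it to letters.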

A particular aspect that characterizes Community Question Answering is that the comments in a thread typically reflect an underlying concrete discussion, which is more than a sequence of independent answers 11 . For instance, users reply to each other, add additional information to someone else’s answers, ask for further details, or tease other users.

We modeled thread-level dependencies by designing specific features that are able to capture the dependencies between the answers in the same thread.

The following notation will be adopted: q is the question posted by user u_q, c is a comment from user u_c, in the comment thread.

Four features indicate whether c appears in the proximity of a comment by u_q. The assumption is that an acknowledgment or further questions by u_q in the thread could signal a good answer. More specifically, they test if among the comments following c there is one by u_q (i) containing an acknowledgment, (ii) not containing an acknowledgment, (iii) containing a question, and, (iv) if among the comments preceding c there is one by u_q containing a question.

The value of these four features —a propagation of the information captured by the feature group 6 — depends on the distance k, in terms of the number of comments, between c and the closest comment by u_q:

$f(c) = \begin{cases} \max(0,\ 1.1 - 0.1\,k) & \text{if a comment by } u_{q} \text{ exists} \\ 0 & \text{if no comments by } u_{q} \text{ exist,} \end{cases}$ (5) that is, the closer c is to the comment by u_q, the higher the value assigned to this feature 12 .
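As a minimal sketch of Eq. (5), the decay can be written in a few lines of Python. The function names are hypothetical, and the helper that computes k looks at every comment by u_q in the thread, which is a simplification: each of the four features described above restricts the search to the following or preceding comments of a specific type.

```python
def proximity_feature(k):
    """Value of Eq. (5) given the distance k to the closest comment by u_q.

    k is None when no (qualifying) comment by u_q exists in the thread.
    """
    if k is None:
        return 0.0
    return max(0.0, 1.1 - 0.1 * k)


def distance_to_closest(authors, idx, asker):
    """Distance, in number of comments, between comment idx and the nearest
    comment written by the asker (hypothetical, simplified helper)."""
    distances = [abs(j - idx) for j, author in enumerate(authors)
                 if author == asker and j != idx]
    return min(distances) if distances else None
```

With these constants the value is close to 1 for an adjacent comment and drops to 0 once the distance exceeds 10 comments.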

Six other features try to model potential dialogues, which ultimately correspond to bad comments, by identifying interlacing comments between two users. These dialogue features identify conversation chains: $u_{i} \to \dots \to u_{j} \to \dots \to u_{i} \to \dots \to [u_{j}]$. Comments by other users can appear in between the nodes of this “pseudo-conversation” chain. Three boolean features consider whether a comment is at the beginning, in the middle, or at the end of such a chain. Three more boolean features fire in those cases in which u_j = u_q, i.e., the user who asked the question is one of the participants in these emerging conversations.

Another interesting aspect is whether a user u_i has been particularly active in a question thread. One boolean feature captures whether u_i wrote more than one comment in the current thread. Three more features identify the first, the middle and the last comments by u_i. One extra feature counts the total number of comments written by u_i.
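For illustration only, the five user-activity features could be computed along these lines. The function name and dictionary keys are hypothetical, and setting the first/middle/last flags only for users with more than one comment is an assumption, since the text does not specify this detail.

```python
def activity_features(authors, idx):
    """User-activity features for the comment at position idx.

    authors: list with the author of each comment, in thread order.
    """
    user = authors[idx]
    positions = [j for j, author in enumerate(authors) if author == user]
    multiple = len(positions) > 1
    return {
        "multiple": multiple,                                   # more than one comment
        "first": multiple and idx == positions[0],              # first comment by the user
        "middle": multiple and positions[0] < idx < positions[-1],
        "last": multiple and idx == positions[-1],              # last comment by the user
        "count": len(positions),                                # total comments by the user
    }
```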

It can be empirically observed that the likelihood of a comment being good decreases with its position in the thread. Therefore, another real-valued feature was included: max(20, i)/20, where i represents the position of the comment in the thread.

We define LK_qa as a linear kernel operating on the features specific to the task described above.

5 Experimental evaluation

5.1 Evaluation measures

In the 2015 edition [44], the cQA challenge was a classification task whose official measure was the macro-F₁ over the three classes, namely good, bad and potential. Instead, in the 2016 and 2017 editions [45, 46], the organizers defined the task as a re-ranking problem where the participants had to provide a ranking of the comments in each QA thread. In this case, the official evaluation measure was the Mean Average Precision (MAP) and there was no distinction between the potential and bad classes. This change was partially motivated by the generally low performance observed in detecting the potential comments. In fact, as discussed in [10], the boundaries of this intermediate class are not well defined. We also performed experiments with the 2-class setting on the 2015 edition. In addition to the MAP, we will also report the Mean Reciprocal Rank (MRR) and the Average Recall (AvgRec). In solving the task, we will use a point-wise approach, where the scores of a binary classifier are used to rank the comments. Thus, we will also provide the standard classification metrics, namely accuracy, precision, recall and F₁ w.r.t. the good class.

Finally, to have a comparison with the 2015 challenge results, we will also run a 3-class setting experiment on the 2015 data. In this case we will report the accuracy and macro-F₁.
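To make the ranking measures concrete, the sketch below computes MAP and MRR from classifier scores in the point-wise setting described above. It is an illustrative implementation, not the official SemEval scorer: the function names are hypothetical, and assigning a value of 0 to threads without any good comment is an assumption.

```python
def rank_thread(scores, gold):
    """Order the gold labels of one thread by decreasing classifier score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [gold[i] for i in order]


def average_precision(ranked):
    """ranked: booleans in ranked order, True for a good comment."""
    hits, total = 0, 0.0
    for rank, is_good in enumerate(ranked, start=1):
        if is_good:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(ranked):
    """Inverse of the rank of the first good comment (0 if there is none)."""
    for rank, is_good in enumerate(ranked, start=1):
        if is_good:
            return 1.0 / rank
    return 0.0


def mean_over_threads(metric, threads):
    """threads: list of (scores, gold) pairs, one per question thread."""
    values = [metric(rank_thread(scores, gold)) for scores, gold in threads]
    return sum(values) / len(values)
```

Calling `mean_over_threads(average_precision, threads)` gives the MAP, and passing `reciprocal_rank` instead gives the MRR.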

5.2 Text preprocessing

As we can observe in Fig. 1, the text is very informal and inherently noisy. It contains spelling errors, special characters, non-standard word forms, grammatical mistakes, usage of multilingual words, and so on. Standard NLP methods are usually affected by noise in text and there is a broad literature discussing text cleaning [62, 63] and the development of NLP algorithms robust to noise [5, 11].

In addition to general text normalization techniques, such as spelling correction and HTML tag removal, the forum setting requires some specific treatment.

Both questions and answers have a subject and a body. The subject can have three different forms:

(i) it is a synthetic sentence summarizing the content of the post, as in Fig. 1 for Q;

(ii) it is a copy of the first part of the body;

(iii) it has the form “RE:question-subject” (only for comments). This happens for all the comments in Fig. 1, where the subject was omitted for simplicity.

Therefore, in case (i) the subject is highly informative; in case (ii) it is only a useless redundancy; in case (iii) the subject introduces spurious question-comment similarities that can significantly impact the system accuracy. We used simple rules to detect cases (ii) and (iii) and remove the uninformative subjects.
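The exact rules are not listed here; the following sketch shows one plausible way to implement them. The function name, the case-insensitive matching and the handling of trailing dots in truncated subjects are all assumptions made for illustration.

```python
def clean_subject(subject, body):
    """Return the subject if it looks informative, otherwise an empty string."""
    s = subject.strip()
    # Case (iii): the subject only echoes the question subject ("RE: ...").
    if s.lower().startswith("re:"):
        return ""
    # Case (ii): the subject is a (possibly truncated) copy of the start of the body.
    prefix = s.rstrip(". ").lower()
    if prefix and body.strip().lower().startswith(prefix):
        return ""
    # Case (i): keep the subject.
    return s
```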

Furthermore, several users put a signature at the end of their posts. For instance, many comments from user U947 end with the following sentence:

dEV,∥eVeRyTHinG dEsIRAblE, Is eITHEr eXPEnSIVe, bAnnED, ilLOgiCal oR SeEING(maRRiED to) sOMEonE ElsE......;-)

User signatures are completely uncorrelated with the thread topic, and introduce a significant amount of noise in the data. In fact, they produce irrelevant lexical overlaps between different comments of a given author, and drastically reduce the similarity between related comments 13 . To detect and remove user signatures we applied a simple procedure. We grouped posts by user, and compared the posts in each group to verify the presence of common final parts. Those parts can be considered signatures to be removed. Using this algorithm we discovered and removed a signature from about 10% of the comments.

A manual analysis of the retrieved user signatures showed that the proposed approach is extremely precise in detecting user signatures, i.e., no core part was erroneously considered a signature. When a signature is present in a single comment of the dataset, it cannot be detected with such a procedure. This problem can be partially mitigated by removing the text appearing after some “signature separator” 14 in a comment.
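The grouping-and-comparison procedure can be sketched as follows. This is only an illustration under stated assumptions: the function names, the character-level comparison and the minimum length threshold `min_len` are hypothetical, and two identical posts by the same user would be treated entirely as a signature, a case that a real implementation would need to handle.

```python
from collections import defaultdict
from itertools import combinations


def common_suffix(a, b):
    """Longest string that both a and b end with."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return a[len(a) - n:]


def find_signatures(posts, min_len=20):
    """posts: iterable of (user, text) pairs. Returns {user: signature}."""
    by_user = defaultdict(list)
    for user, text in posts:
        by_user[user].append(text.rstrip())
    signatures = {}
    for user, texts in by_user.items():
        best = ""
        # Compare every pair of posts by the same user and keep the longest
        # shared ending that is long enough to be a plausible signature.
        for a, b in combinations(texts, 2):
            suffix = common_suffix(a, b)
            if len(suffix) >= min_len and len(suffix) > len(best):
                best = suffix
        if best:
            signatures[user] = best
    return signatures


def strip_signature(text, signature):
    """Remove the signature from the end of a post, if present."""
    text = text.rstrip()
    if signature and text.endswith(signature):
        return text[:-len(signature)].rstrip()
    return text
```

As noted above, a user with a single post never yields a pair to compare, so such signatures remain undetected by this kind of procedure.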

5.3 Models and Parameterization

We train our models using the C-SVM algorithm [14] within KeLP 15 , a Kernel-based Machine Learning platform [24] that implements the tree kernels described in Section 3.2. In the 3-class setting we used a one-vs-all schema. With this setting, the data is rather imbalanced, as the potential class occurs far less often than the other classes. We took this into account by re-weighting the regularization parameter for the positive class, i.e., $C_{p} = \frac{\#\,\text{negatives}}{\#\,\text{positives}} C$ , following [41].
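As a small worked example of this re-weighting, the function below computes C_p for one one-vs-all sub-problem. It only reproduces the arithmetic of the formula; the function name is hypothetical and no KeLP or LIBSVM call is involved.

```python
def positive_class_cost(labels, positive, base_c=1.0):
    """C_p = (#negatives / #positives) * C for a one-vs-all sub-problem."""
    positives = sum(1 for label in labels if label == positive)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive examples for class %r" % (positive,))
    return base_c * negatives / positives
```

For instance, with 5 good, 4 bad and 1 potential comment, the potential-vs-rest classifier would use C_p = 9C, so that errors on the rare class are penalized more heavily.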

In the 2-class setting, we used a binary SVM trained to recognize good answers.

PTK was used with the default parameters, i.e., λ = μ = 0.4, while we selected the hyper-parameter C = 1 of SVM using a tuning stage on the 2015 dev. set.

We generated shallow trees from the question-answer pairs by using the OpenNLP 16 chunker.

To study the impact of the various features proposed in Section 4, we tested different kernels:

LK_sim is a Linear Kernel operating on the intra-pair features defined in Section 4.1;

LK_qa is a Linear Kernel operating on the task specific features defined in Section 4.4;

TK⁺ is the tree kernel combination performing the inter-pair analysis described in Section 4.3.

Finally, we linearly combined these kernels to create a more robust approach.
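A linear combination of kernels is itself a valid kernel, so it can be plugged directly into the SVM. The sketch below illustrates the idea under several assumptions: each example is a tuple holding one representation per kernel, the weights default to 1 (the actual weights are not stated here), and a plain linear kernel stands in for the tree kernels, which are not reimplemented. The `normalized` wrapper applies the usual cosine-style kernel normalization.

```python
import math


def linear_kernel(x, y):
    """Dot product between two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def normalized(kernel):
    """Wrap a kernel so that k(x, x) = 1 for every non-null x."""
    def k(x, y):
        denom = math.sqrt(kernel(x, x) * kernel(y, y))
        return kernel(x, y) / denom if denom else 0.0
    return k


def combine(kernels, weights=None):
    """Weighted sum of kernels, each applied to its own representation."""
    weights = weights or [1.0] * len(kernels)

    def k(x, y):
        return sum(w * kern(xv, yv)
                   for w, kern, (xv, yv) in zip(weights, kernels, zip(x, y)))
    return k
```

A combination such as LK_sim+LK_qa would then correspond to `combine([linear_kernel, linear_kernel])` applied to examples of the form `(sim_features, qa_features)`.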

When testing on the 2015 and 2016 test sets, we trained the models by using the training and development sets of the corresponding year. Instead, for the 2017 test set, we used all the 2016 datasets as training material.

5.4 Results

Table 2 shows the results on the 2015 test set in the 3-class setting. The bottom part of the table reports the results of the three best systems at SemEval-2015. The JAIST team [67] won the 2015 competition using an SVM operating on several features, including intra-pair similarities, alignment-based features, topic-based features and some heuristics. HITSZ-ICRC [32] adopted a hierarchical classifier trained to first identify bad comments and then to distinguish the good answers from the potential ones. They employed intra-pair similarities and several linguistic features, such as the counts of named entities, punctuation marks or specific POS tags in the comments. Finally, the QCRI system [48] used a preliminary version of the features described in Sections 4.1 and 4.4.

Table 2
Results on the 2015 test set with the original 3-class setting.

model	macro-F₁	Acc
LK_sim	45.79	49.77
LK_qa	53.69	70.62
TK⁺	52.98	66.92
LK_sim+LK_qa	56.28	65.49
LK_sim+LK_qa+TK⁺	58.47	73.91
Majority Baseline	22.36	50.46
JAIST	57.19	72.52
HITSZ-ICRC	56.41	68.67
QCRI	53.74	70.50

All the proposed kernels are significantly above the Majority Baseline (i.e., always good).

The LK_sim model is the weakest model. This confirms the issue we highlighted at the end of Section 4.1, which justifies our inter-pair approach: many good comments directly answer without repeating the words (and, more generally, the concepts) expressed in the question.

The LK_qa model, instead, is rather accurate; this is not surprising, as it is based on a set of manually defined features that have been specifically developed to capture some frequent phenomena appearing in general-purpose fora.

The TK⁺ model reaches similar results: this is notable considering that it does not exploit task- or domain-specific features, but implements a general approach that can be used in every task defined on text pairs. Furthermore, this result is even more remarkable if we consider that the intra-pair reasoning, which represents the most straightforward approach to solving this problem, is basically neglected, as the TK⁺ model applies an intra-pair similarity only in establishing the REL tags on the parse trees of the question-answer pairs.

Finally, we defined kernels of increasing complexity by linearly combining the individual kernels. Each combination improves the macro-F₁ over its individual components, demonstrating that the information provided by the different kernels is largely independent. Therefore, a joint learning model (i.e., an SVM equipped with a kernel combination) can take the best from these three different sources of information. The LK_sim+LK_qa is competitive with the best systems, while the LK_sim+LK_qa+TK⁺ achieves the new state-of-the-art on this dataset.

Similar results are obtained in the 2-class setting, reported in Table 3. According to MAP, the best individual kernel is TK⁺. Again, linearly combining the three kernels leads to state-of-the-art results, which largely outperform the best systems at SemEval-2015 17 . Furthermore, our best model achieves higher accuracy and F₁ than MaxEnt+GraphCut [36] and FCCRF [35], which are two systems developed after the SemEval-2015 challenge. They both approach the answer selection in cQA as a graph task, where all the comments in a question thread should be globally classified.

Table 3

Results on the 2015 test set with the 2-class setting.

model	MAP	AvgRec	MRR	Prec	Rec	F₁	Acc
LK_sim	72.47	86.83	74.78	66.32	64.38	65.34	65.23
LK_qa	77.61	91.67	79.74	72.46	88.70	79.76	77.09
TK⁺	78.92	92.55	82.03	75.80	79.01	77.37	76.48
LK_sim+LK_qa	77.43	91.74	79.22	76.56	85.37	80.73	79.25
LK_sim+LK_qa+TK⁺	79.91	93.49	82.02	80.93	83.96	82.42	81.77
JAIST	-	-	-	80.23	77.73	78.96	79.10
HITSZ-ICRC	-	-	-	75.91	77.13	76.52	76.11
QCRI	-	-	-	74.33	83.05	78.45	76.97
FCCRF	-	-	-	77.30	86.20	81.50	80.50
MaxEnt+GraphCut	-	-	-	78.30	82.93	80.55	79.80

Finally, Tables 4 and 5 report the official results of the 2016 and 2017 SemEval challenges, respectively. We participated with the LK_sim+LK_qa+TK⁺ model under the pseudonym KeLP. On both occasions, we won the competition. In the 2016 edition, 12 systems participated in the challenge and, apart from KeLP, the best two were ConvKN [9], which jointly used convolutional neural networks and kernel methods, and SemanticZ [38], which relied on several semantic similarity features based on fine-tuned word embeddings and topic similarities.

Table 4

Results on the 2016 test set with the 2-class setting.

model	MAP	AvgRec	MRR	Prec	Rec	F₁	Acc
LK_sim	66.02	78.59	73.91	66.31	18.51	28.94	63.06
LK_qa	73.55	84.95	81.71	68.39	39.73	50.26	68.04
TK⁺	75.08	85.30	82.09	73.39	42.14	53.54	70.28
LK_sim+LK_qa	75.42	86.63	82.95	72.71	49.51	58.91	71.93
KeLP (i.e., LK_sim+LK_qa+TK⁺)	79.19	88.82	86.42	76.96	55.30	64.36	75.11
ConvKN	77.66	88.05	84.93	75.56	58.84	66.16	75.54
SemanticZ	77.58	88.14	85.21	74.13	53.05	61.84	73.39
chronological baseline	59.53	72.60	67.83	-	-	-	-
all good baseline	-	-	-	40.64	74.57	52.55	45.26
all bad baseline	-	-	-	-	-	-	59.36

Table 5

Results on the 2017 test set with the 2-class setting.

model	MAP	AvgRec	MRR	Prec	Rec	F₁	Acc
LK_sim	75.79	83.99	84.90	75.73	18.84	30.18	54.68
LK_qa	84.33	90.69	91.07	82.08	49.64	61.87	68.19
TK⁺	82.34	89.50	89.80	82.64	45.96	59.07	66.89
LK_sim+LK_qa	85.85	91.87	91.31	84.52	55.94	67.33	71.77
KeLP (i.e., LK_sim+LK_qa+TK⁺)	88.43	93.79	92.82	87.30	58.24	69.87	73.89
Beihang-MSRA	88.24	93.87	92.34	51.98	100.00	68.40	51.98
IIT-UHH	86.88	92.04	91.20	73.37	74.52	73.94	72.70
chronological baseline	72.61	79.32	82.37	-	-	-	-
all good baseline	-	-	-	51.98	100.00	68.40	51.98
all bad baseline	-	-	-	-	-	-	48.02

In the 2017 edition, there were 14 systems participating in the challenge, and the best systems apart from KeLP were Beihang-MSRA and IIT-UHH. Beihang-MSRA [23] is based on a gradient boosted regression tree model that uses several features, including various NLP features, neural network based matching and task-specific features. IIT-UHH [47] is an SVM-based system that makes use of textual, domain-specific, word embedding and topic-modeling features.

The results obtained on the 2017 data are significantly higher than those obtained on the other two years in terms of ranking metrics. This is because there are many threads with mostly good comments or mostly bad comments; consequently, re-ordering them has less impact on the initial MAP. This is also confirmed by the high results achieved by the chronological baseline. Instead, the accuracy is in line with that observed in the previous years, demonstrating the consistency of the proposed model.

In order to get an intuition about the effect of the proposed inter-pair approach, we show a comparison between LK_sim and TK⁺ in Fig. 5. All the comments are classified correctly by the TK⁺ model, and incorrectly by LK_sim. This confirms that the TK⁺ model effectively learned valid question-answer patterns, such as Q:“Is there any …?” A:“Try …”. In contrast, the intra-pair similarities cannot capture any useful information from these question-answer pairs, which are consequently all classified as bad.

Fig. 5

A real question-comment thread from the 2016 test set (ID Q386_R4). The class label before the arrow corresponds to the prediction provided by LK_sim, while the right-hand label is the prediction provided by TK⁺, which always provides the correct class.

6 Related work

The problem of selecting the relevant text passages (i.e., those containing good answers) has been tackled in QA research, either for non-factoid QA or for passage reranking. Usually, automatic classifiers are applied to the answer passages retrieved by a search engine to derive a relative order [31, 77].

One research direction has been to try to match the syntactic structure of the question to that of the candidate answer. For example, [75] proposed a probabilistic quasi-synchronous grammar to learn syntactic transformations from the question to the candidate answers. [31] used an algorithm based on Tree Edit Distance (TED) to learn tree transformations in pairs. [74] developed a probabilistic model to learn tree-edit operations on dependency parse trees. [77] applied linear chain conditional random fields (CRFs) with features derived from TED to learn associations between questions and candidate answers. Although these approaches are accurate and inject some structure in the models, they cannot express the same richness of relational features we introduce thanks to our inter-pair kernels.

Regarding the use of syntactic structure, [73], for example, proposed a retrieval model for finding similar questions based on the similarity of syntactic trees. In contrast, we proposed a tagging solution similar to the one for passage reranking used in [56, 59], which was further improved in [69, 70], exploiting relational tags and linked open data.

Another important research direction has been on using neural network models for question-answer similarity [22, 72]. For instance, [65] used neural attention over a bidirectional long short-term memory (LSTM) neural network in order to generate better answer representations given the questions.

Regarding the community scenario of QA, one early work is that of [50], who tried to determine whether a question is “solved” or not, given its associated thread of comments. As a first step in the process, they performed a comment-level classification, considering four classes: problem, solution, good feedback, and bad feedback.

7 Conclusion

This paper describes a kernel-based method for tackling the answer selection task in community question answering. The novel aspect of the proposed approach is its capability to automatically learn pairwise question-answer patterns from training question-answer pairs. For instance, the model can recognize valid question-answer templates such as answers to Yes or No questions, e.g., Q:“Can I...?” A:“Sure …, Yes …, You can …, No …”, or answers to a request for a recommendation, e.g., Q:“Which is the best...?” A:“Try …, Go to …”. This allows the model to properly solve those cases in which the question and the answer do not share similar words. We represent question-answer instances as pairs of linked parse trees, and apply a combination of tree kernels to perform an inter-pair reasoning, which leads to state-of-the-art results on three different editions of the community question answering challenge of SemEval.

This is a paradigm shift with respect to most other approaches in the literature, which rely on an intra-pair comparison between the question and the answer.

In the future, we plan to apply this approach to other question-answer datasets and to investigate universal dependencies in order to define cross-language solutions.

Footnotes

The complete thread is available at

These methods are rather old but the related terminology was recently introduced in the context of neural networks (see []).

In the task of answer selection in community question answering, the problem is establishing whether a comment directly answers a question, without verifying the correctness of the provided information. This means that the sentence A₂ is to be considered good , although the user who is replying is ignoring the fact that a gold certificate is required.

The tree kernels TK are usually normalized applying the formula $N_{TK} (x, y) = \frac{TK (x, y)}{\sqrt{TK (x, x) TK (y, y)}}$ . In all the experiments performed within this paper, we used this kernel normalization.

For instance in the Qatar Living Lounge category the following question has been posted: “Google: does anyone know how to find Google? I tried to Google it but found myself stuck in a paradox loop.”. Obviously many ironic comments answered this question.

The task organizers report that some comments in the threads were discarded due to disagreement in the annotation process. The extent of discarded comments is unknown.

The constants in equation 5 are chosen so that the information is propagated to 10 comments, with a linear decay factor that depends on their distance k from the comment by u_q.

For instance, after the kernel normalization process, the possible similarity in the core part of two comments is hidden by the presence of different user signatures.

For instance, a horizontal line such as “——–”.

This comparison is not strictly fair as the SemEval-2015 systems were trained for the 3-class setting, and here we remapped their outputs to the 2-class setting. The system outputs are publicly available at .

References

1. Agirre, Banea, Cardie, Cer, Diab, Gonzalez-Agirre, Guo, Lopez-Gazpio, Maritxalar, Mihalcea, Rigau, Uria and Wiebe, Semeval-2015 task 2: Semantic textual similarity, english, spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado, Association for Computational Linguistics, 2015, pp. 252–263. URL http://www.aclweb.org/anthology/S15-2045.

2. Allison and Dix T.I., A bit-string longest-common-subsequence algorithm, Inf Process Lett 23(6) (1986), 305–310. ISSN 0020-0190. URL http://dl.acm.org/citation.cfm?id=8871.8877.

3. Aston, Burnard, The BNC Handbook: Exploring the British National Corpus with SARA. Edinburgh University Press, Scotland, 1998.

4. Baeza-Yates R.A., Ribeiro-Neto, Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999. ISBN 020139829X.

5. Baldwin, de Marneffe M.C., Han, Kim Y.-B., Ritter and Xu, Shared tasks of the 2015 workshop on noisy user-generated text: Twitter lexical normalization and named entity recognition, In Association for Computational Linguistics (ACL), 2015.

6. Baroni, Bernardini, Ferraresi and Zanchetta, The wacky wide web: A collection of very large linguistically processed web-crawled corpora, Language Resources and Evaluation 43(3) (2009), 209–226.

7. Baroni, Dinu and Kruszewski, Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors, In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, 2014, pp. 238–247. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P/P14/P14-1023.

8. Barrón-Cedeño, Filice, Martino G.D.S., Joty, Màrquez, Nakov and Moschitti, Thread-level information for comment classification in community question answering, In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Beijing, China, 2015, pp. 687–693. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P15-2113.

9. Barrón-Cedeño, Martino G.D.S., Joty, Moschitti, Al-Obaidli, Romeo, Tymoshenko and Uva, Convkn at semeval-2016 task 3: Answer and question selection for question answering on arabic and english fora, In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, California, 2016, pp. 896–903. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S16-1138.

10. Barrón-Cedeño, Martino G.D.S., Filice and Moschitti, On the use of an intermediate class in boolean crowdsourced relevance annotations for learning to rank comments, In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, 2017, pp. 1209–1212.

11. Basili, Lopresti D.P., Ringlstetter, Roy, Schulz K.U., Venkata Subramaniam, Summary of the 4th workshop on analytics for noisy unstructured text data (AND), In CIKM, ACM, 2010, pp. 1965–1966.

12. Broder, On the resemblance and containment of documents, In Proceedings of the Compression and Complexity of Sequences 1997, SEQUENCES ’97, Washington, DC, USA, 1997, p. 21. IEEE Computer Society. ISBN 0-8186-8132-2. URL http://dl.acm.org/citation.cfm?id=829502.830043.

13. Brown P.F., Cocke, Pietra S.A.D., Pietra V.J.D., Jelinek, Lafferty J.D., Mercer R.L. and Roossin P.S., A statistical approach to machine translation, Comput Linguist 16(2) (1990), 79–85. ISSN 0891-2017. URL http://dl.acm.org/citation.cfm?id=92858.92860.

14. Chang C.-C. and Lin C.-J., LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2 (2011), 27:1–27:27.

15. Church K.W. and Mercer R.L., Introduction to the special issue on computational linguistics using large corpora, Comput Linguist 19(1) (1993), 1–24. ISSN 0891-2017. URL http://dl.acm.org/citation.cfm?id=972450.972452.

16. Church K.W., A stochastic parts program and noun phrase parser for unrestricted text, In Proceedings of the Second Conference on Applied Natural Language Processing, ANLC ’88, Stroudsburg, PA, USA, 1988, pp. 136–143. Association for Computational Linguistics. doi: 10.3115/974235.974260. URL https://doi.org/10.3115/974235.974260.

17. Collins and Duffy, Convolution kernels for natural language, In Proceedings of Neural Information Processing Systems (NIPS’2001), 2001, pp. 625–632.

18. Cristianini, Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, United Kingdom, 2000.

19. Croce, Moschitti and Basili, Structured lexical similarity via convolution kernels on dependency trees, In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, Stroudsburg, PA, USA, 2011, pp. 1034–1046. Association for Computational Linguistics. ISBN 978-1-937284-11-4. URL http://dl.acm.org/citation.cfm?id=2145432.2145544.

20. Dolan, Quirk and Brockett, Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources, In Proc of COLING ’04, Stroudsburg, PA, USA, 2004. doi: 10.3115/1220355.1220406. URL https://doi.org/10.3115/1220355.1220406.

21. Fellbaum, WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 1998, ISBN 978-0-262-06197-1.

22. Feng, Xiang, Glass M.R., Wang and Zhou, Applying deep learning to answer selection: A study and an open task, In Proceedings of the Workshop on Automatic Speech Recognition and Understanding, ASRU ’15, Scottsdale, Arizona, USA, 2015, pp. 813–820.

23. Feng, Wu, Li and Zhou, Beihang-msra at semeval-2017 task 3: A ranking system with neural matching features for community question answering, In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics, 2017, pp. 280–286. doi:10.18653/v1/S17-2045. URL http://www.aclweb.org/anthology/S17-2045.

24. Filice, Castellucci, Croce, Martino G.D.S., Moschitti and Basili, KeLP: A Kernel-based Learning Platform in java, In The workshop on Machine Learning Open Source Software (MLOSS): Open Ecosystems, Lille, France, International Conference of Machine Learning, 2015.

25. Filice, Martino G.D.S. and Moschitti, Structural representations for learning relations between pairs of texts, In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, 2015, pp. 1003–1013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P15-1097.

26. Firth J.R., A synopsis of linguistic theory 1930-1955, In Studies in Linguistic Analysis, Oxford: Philological Society, 1957, pp. 1–32.

27. Giampiccolo, Magnini, Dagan and Dolan, The third pascal recognizing textual entailment challenge, In Proc of the ACL-PASCAL RTE ’07 Workshop, ACL, 2007, pp. 1–9. URL http://dl.acm.org/citation.cfm?id=1654536.1654538.

28. Gildea and Palmer, The necessity of parsing for predicate argument recognition, In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, Stroudsburg, PA, USA, 2002, pp. 239–246. Association for Computational Linguistics. doi: 10.3115/1073083.1073124. URL https://doi.org/10.3115/1073083.1073124.

29. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, New York, NY, USA, 1997. ISBN 0-521-58519-8.

30. Harris, Distributional structure. In Katz J.J. and Fodor J.A., editors, The Philosophy of Linguistics. Oxford University Press, 1964.

31. Heilman and Smith N.A., Tree edit models for recognizing textual entailments, paraphrases, and answers to questions, In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT ’10, Los Angeles, California, USA, 2010, pp. 1011–1019. ISBN 1-932432-65-5. URL http://dl.acm.org/citation.cfm?id=1857999.1858143.

32. Hou, Tan, Wang, Zhang, Xu and Chen, HITSZ-ICRC: Exploiting classification approach for answer selection in community question answering, In Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval ’15, Denver, Colorado, USA, 2015, pp. 196–202. URL http://www.aclweb.org/anthology/S15-2035.

33. Jaccard, Étude comparative de la distribution florale dans une portion des Alpes et des Jura, Bulletin de la Société Vaudoise des Sciences Naturelles 37 (1901), 547–579.

34. Jeon, Croft W.B. and Lee J.H., Finding similar questions in large question and answer archives, In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, CIKM ’05, Bremen, Germany, 2005, pp. 84–90. ISBN 1-59593-140-6. doi: 10.1145/1099554.1099572. URL http://doi.acm.org/10.1145/1099554.1099572.

35. Joty, Màrquez and Nakov, Joint learning with global inference for comment classification in community question answering, In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, 2016, pp. 703–713. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/N16-1084.

36. Joty S.R., Barrón-Cedeño, Martino G.D.S., Filice, Màrquez, Moschitti and Nakov, Global thread-level inference for comment classification in community question answering, In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, 2015, pp. 573–578. URL http://aclweb.org/anthology/D/D15/D15-1068.pdf.

37. Manning C.D., Raghavan, Schütze, Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008. ISBN 0521865719, 9780521865715.

38. Mihaylov and Nakov, Semanticz at semeval-2016 task 3: Ranking relevant answers in community question answering using semantic similarity based on fine-tuned word embeddings, In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, California, 2016, pp. 879–886. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S16-1136.

39. Mikolov, Chen, Corrado and Dean, Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013. URL http://arxiv.org/abs/1301.3781.

40. Miller, Beckwith, Fellbaum, Gross and Miller, Introduction to wordnet: An on-line lexical database, International Journal of Lexicography 13(4) (1990), 235–312.

41. Morik, Brockhausen and Joachims, Combining statistical learning with a knowledge-based approach - a case study in intensive care monitoring, In ICML, San Francisco, CA, USA, 1999, pp. 268–277. Morgan Kaufmann Publishers Inc. ISBN 1-55860-612-2.

42. Moschitti, Efficient convolution kernels for dependency and constituent syntactic trees, In ECML, Berlin, Germany, 2006.

43. Moschitti, Pighin and Basili, Tree kernels for semantic role labeling, Computational Linguistics 34 (2008).

44. Nakov, Màrquez, Magdy, Moschitti, Glass and Randeree, Semeval-2015 task 3: Answer selection in community question answering, In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado, 2015, pp. 269–281. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S15-2047.

45. Nakov, Màrquez, Moschitti, Magdy, Mubarak, Freihat A.A., Glass and Randeree, SemEval-2016 task 3: Community question answering, In Proceedings of SemEval-2016, 2016.

46.

Nakov, Hoogeveen, Màrquez, Moschitti, Mubarak, Baldwin and Verspoor, SemEval-2017 Task 3: Community question answering, In Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval ’17, Vancouver, Canada, 2017. Association for Computational Linguistics.

47.

Nandi, Biemann, Yimam S.M., Gupta, Kohail, Ekbal and Bhattacharyya, IIT-UHH at SemEval-2017 Task 3: Exploring multiple features for community question answering and implicit dialogue identification, In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, 2017, pp. 90–97. Association for Computational Linguistics.

48.

Nicosia, Filice, Barrón-Cedeño, Saleh, Mubarak, Gao, Nakov, Martino G.D.S., Moschitti, Darwish, Màrquez, Joty and Magdy, QCRI: Answer selection for community question answering – experiments for Arabic and English, In Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval ’15, Denver, Colorado, USA, 2015, pp. 203–209. URL http://www.aclweb.org/anthology/S15-2036.

49.

Nirenburg, Knowledge-based machine translation, Machine Translation 4(1) (1989), 5–24. ISSN 0922-6567. doi: 10.1007/BF00367750. URL https://dx.doi.org/10.1007/BF00367750.

50.

and Liu, Finding problem solving threads in online forum, In Proceedings of 5th International Joint Conference on Natural Language Processing, IJCNLP ’11, pp. 1413–1417, Chiang Mai, Thailand, 2011. URL http://www.aclweb.org/anthology/I11-1164.

51.

Radlinski and Joachims, Query chains: Learning to rank from implicit feedback, In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD ’05, Chicago, Illinois, USA, 2005, pp. 239–248. ISBN 1-59593-135-X. doi: 10.1145/1081870.1081899. URL http://doi.acm.org/10.1145/1081870.1081899.

52.

Müller K.R., Mika, Rätsch, Tsuda and Schölkopf, An introduction to kernel-based learning algorithms, IEEE Transactions on Neural Networks 12(2) (2001), 181–201.

53.

Sahlgren, The Word-Space Model. PhD thesis, Stockholm University, 2006.

54.

Salton and McGill M.J., Introduction to modern information retrieval, McGraw-Hill computer science series. McGraw-Hill, 1983. ISBN 9780070544840. URL http://books.google.it/books?id=7f5TAAAAMAAJ.

55.

Sebastiani, Machine learning in automated text categorization, ACM Comput Surv 34(1) (2002), 1–47. ISSN 0360-0300. doi: 10.1145/505282.505283. URL http://doi.acm.org/10.1145/505282.505283.

56.

Severyn and Moschitti, Structural relationships for large-scale learning of answer re-ranking, In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’12, Portland, Oregon, USA, 2012, pp. 741–750. ISBN 978-1-4503-1472-5. doi: 10.1145/2348283.2348383. URL http://doi.acm.org/10.1145/2348283.2348383.

57.

Severyn and Moschitti, Automatic feature engineering for answer selection and extraction, In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP ’13, Seattle, Washington, USA, 2013, pp. 458–467.

58.

Severyn and Moschitti, Learning to rank short text pairs with convolutional deep neural networks, In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, Santiago, Chile, 2015, pp. 373–382. ISBN 978-1-4503-3621-5.

59.

Severyn, Nicosia and Moschitti, Learning adaptable patterns for passage reranking, In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, 2013, pp. 75–83.

60.

Shen and Lapata, Using semantic roles to improve question answering, In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL ’07, Prague, Czech Republic, 2007, pp. 12–21. URL http://www.aclweb.org/anthology/D/D07/D07-1002.

61.

Socher, Bauer, Manning C.D. and Ng A.Y., Parsing with compositional vector grammars, In Proceedings of the ACL Conference, 2013.

62.

Sproat, Black A.W., Chen S.F., Kumar, Ostendorf and Richards, Normalization of non-standard words, Computer Speech & Language 15(3) (2001), 287–333.

63.

Subramaniam L.V., Roy, Faruquie T.A. and Negi, A survey of types of text noise and techniques to handle noisy text, In Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data, AND ’09, New York, NY, USA, 2009, pp. 115–122. ACM. ISBN 978-1-60558-496-6. doi: 10.1145/1568296.1568315. URL http://doi.acm.org/10.1145/1568296.1568315.

64.

Surdeanu, Ciaramita and Zaragoza, Learning to rank answers on large online QA collections, In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics and the Human Language Technology Conference, ACL-HLT ’08, Columbus, Ohio, USA, 2008, pp. 719–727. URL http://aclweb.org/anthology/P08-1082.

65.

Tan, Xiang and Zhou, LSTM-based deep learning models for non-factoid answer selection, arXiv preprint arXiv:1511.04108 (2015).

66.

Tellex, Katz, Lin, Fernandes and Marton, Quantitative evaluation of passage retrieval algorithms for question answering, In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), ACM Press, 2003, pp. 41–47.

67.

Tran Q.H., Tran, Vu, Nguyen and Pham S.B., JAIST: Combining multiple features for answer selection in community question answering, In Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval ’15, Denver, Colorado, USA, 2015, pp. 215–219. URL http://www.aclweb.org/anthology/S15-2038.

68.

Turian, Ratinov and Bengio, Word representations: A simple and general method for semi-supervised learning, In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, Stroudsburg, PA, USA, 2010, pp. 384–394. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1858681.1858721.

69.

Tymoshenko and Moschitti, Assessing the impact of syntactic and semantic structures for answer passages reranking, In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, New York, NY, USA, 2015, pp. 1451–1460. ACM. ISBN 978-1-4503-3794-6. doi: 10.1145/2806416.2806490. URL http://doi.acm.org/10.1145/2806416.2806490.

70.

Tymoshenko, Moschitti and Severyn, Encoding semantic resources in syntactic structures for passage reranking, Association for Computational Linguistics (ACL) 1 (2014), 664–672. ISBN 9781632663962.

71.

Vishwanathan S.V.N. and Smola A.J., Fast kernels on strings and trees, In Proceedings of Neural Information Processing Systems, 2002, pp. 569–576.

72.

Wang and Nyberg, A long short-term memory model for answer sentence selection in question answering, In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, ACL-IJCNLP ’15, Beijing, China, 2015, pp. 707–712.

73.

Wang, Ming and Chua T.-S., A syntactic tree matching approach to finding similar questions in community-based QA services, In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’09, Boston, Massachusetts, USA, 2009, pp. 187–194. ISBN 978-1-60558-483-6.

74.

Wang and Manning C.D., Probabilistic tree-edit models with structured latent variables for textual entailment and question answering, In Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10, Beijing, China, 2010, pp. 1164–1172. URL http://dl.acm.org/citation.cfm?id=1873781.1873912.

75.

Wang, Smith N.A. and Mitamura, What is the Jeopardy model? A quasi-synchronous grammar for QA, In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL ’07, Prague, Czech Republic, 2007, pp. 22–32. URL http://www.aclweb.org/anthology/D/D07/D07-1003.

76.

Wise M.J., YAP3: Improved detection of similarities in computer program and other texts, In Proceedings of the Twenty-seventh SIGCSE Technical Symposium on Computer Science Education, SIGCSE ’96, New York, NY, USA, 1996, pp. 130–134. ACM. ISBN 0-89791-757-X. doi: 10.1145/236452.236525. URL http://doi.acm.org/10.1145/236452.236525.

77.

Yao, Durme B.V., Callison-Burch and Clark, Answer extraction as sequence tagging with tree edit distance, In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’13, 2013, pp. 858–867. URL http://aclweb.org/anthology/N13-1106.

Table 2 Results on the 2015 test set with the original 3-class setting.

| model | macro-F1 | Acc |
|---|---|---|
| LK_sim | 45.79 | 49.77 |
| LK_qa | 53.69 | 70.62 |
| TK+ | 52.98 | 66.92 |
| LK_sim + LK_qa | 56.28 | 65.49 |
| LK_sim + LK_qa + TK+ | 58.47 | 73.91 |
| Majority Baseline | 22.36 | 50.46 |
| JAIST | 57.19 | 72.52 |
| HITSZ-ICRC | 56.41 | 68.67 |
| QCRI | 53.74 | 70.50 |
