Abstract
Question Answering (QA) is a significant and challenging task in Natural Language Processing. QA aims to extract an exact answer from a relevant text snippet or document. The motivation behind QA research is the need of users of state-of-the-art search engines, who expect an exact answer rather than a list of documents that probably contain the answer. In this paper, we consider one particular issue in QA: gathering and scoring answer evidence collected from relevant documents. Evidence is a text snippet in a large corpus that supports the answer. Evidence Scoring (ES) requires several effective features and relations to be extracted for a machine learning algorithm. These features include various lexical, syntactic and semantic features. In addition, new structural features are extracted from the dependency features of the question and the supporting document. Experimental results show that structural features perform better, and that accuracy increases when these features are combined with the other features. To score the evidence for an existing question-answer pair, the Logical Form Answer Candidate Scorer (LFACS) technique is used. Furthermore, an algorithm is designed for learning answer evidence.
Introduction
A QA system aims to find a concise answer to a natural language question. Unlike search engines, which return a list of relevant documents, QA systems retrieve only the requested information (the exact answer). For example, given the question “Which sportswoman was made the brand ambassador of the newly formed state of Telangana?”, an ideal QA system would answer Sania Mirza. A QA system therefore saves time and provides the required information on any device. Recently, QA has drawn attention for its use of knowledge bases to improve performance. True Knowledge [1] is a web-based QA system, and IBM’s Watson [2] is a state-of-the-art QA system that defeated human champions on the Jeopardy game show. Wolfram Alpha [3] gives access to the world’s facts and data and calculates answers across a range of topics. START is another system designed to answer questions asked in natural language. Gathering evidence that supports the answer to a particular question is an important part of QA. QA has three stages: question analysis, document retrieval and answer analysis [4]. After a question has been analyzed and a set of documents containing candidate answers has been retrieved, the QA architecture gathers a broad set of evidence and uses a scoring section to score these pieces of evidence. For this purpose, we used a dataset of question-option-answer pairs from the famous Indian game show Kaun Banega Crorepati (KBC), which is similar to the well-known game show Jeopardy. Supporting target documents were collected manually from the web.
We now focus on option scoring using question and document features. Question features are collected to produce an intermediate form of the question, called the Question Feature Form (QFF), and document features are collected to produce an intermediate form of the document, called the Document Feature Form (DFF). These intermediate forms are mapped to the options to learn supporting evidence. The feature scoring sections use ranked text passages. Each passage is associated with a candidate answer, which is one of the four provided options. Treating all passages as evidence, an Intermediate Feature Form (IFF) is produced for each passage. The IFF identifies the passages that are most closely associated with the question for a given option. The final evidence score is calculated by merging the scores of all features. Figure 1 shows the overall architecture of the evidence scoring system. It has three parts: the first covers feature extraction and IFF score generation, the second covers gathering answer evidence, and the third is the merging and ranking section.
The calculated score is then used to design a learning algorithm that learns answer evidence. The learning algorithm is useful for gathering evidence for complex KBC questions from a question-option-answer dataset. The proposed approach is similar to learning a semantic parser. Semantic parsers map utterances to formal meaning representations expressed as logical forms. Our approach instead uses question features to build an intermediate form that is mapped to ranked documents. For meaning representations, almost every parser uses a predefined set of constants. In this work, instead of using logical terms, we focus on question features, which provide a significant indication for gathering supporting evidence from the documents. Let F be the set of features in the question (Q), and let R be the set of ranked documents in the document collection (D). Also, let Y be an intermediate feature expression that can be executed against R to return the evidence for a particular answer, E = EXECUTE(Y, R). The purpose is to build a feature function that maps a natural language question Q to an intermediate feature form Y. We assume access to data containing question-option pairs (Q_i), (O_i1, O_i2, O_i3, O_i4), where i = 1 ... n, and a ranked document set R. The learning algorithm estimates the parameters of a linear model for ranking the possible entries in E. Word meanings are not learned during parsing; instead, they are validated by features that are part of the learning model.
Related work
For evidence retrieval, Supporting Evidence Retrieval (SER) [5] inserts the candidate answer into the original question to form a proposition, and then uses the DeepQA search techniques [6, 7] to retrieve the passages most closely related to that proposition. The scores from this phase (including all evidence-scoring components) are calculated and combined using a statistical model that is later used in answer ranking [8]. The Indri passage retrieval algorithm is also used for finding and generating candidates [9]. It uses a predicate-argument structure (PAS) for the syntactic portions of the graph, derived from an English Slot Grammar (ESG) parse [5]. The Skip-Bigram algorithm used for evidence gathering was first introduced in [10] for machine translation, where system translations are matched against gold standards. Both the question and the passage text are mapped to their graph representations using a similar mapping procedure.
Finding the most appropriate question representation model is one of the challenges in semantic parsing [5, 12]. The proposed approach focuses on obtaining a Logical Feature Form (LFF) score from question features, which is used for gathering evidence in ranked documents. Several efforts have been made to build question representation models using Combinatory Categorial Grammar (CCG) [13, 14] and dependency trees [17, 18]. In some approaches, direct supervision is replaced by latent variables [16, 17] or distant supervision [19]. These latent meaning representations are used to train the semantic parser from utterances and denotations. QA research can be broadly categorized into machine learning-based approaches [1, 13] and knowledge-based approaches [15, 20]. The performance of machine-learning approaches depends greatly on the effectiveness of the feature extraction process. In this work, the denotation of the question and document is replaced by an LFF score (which includes structural features of both) that is used to retrieve document evidence. The authors of [19] produce the final logical form in two steps: first mapping the utterance to a domain-independent logical form, and then applying ontology matching. We employed Dependency-based Compositional Semantics (DCS) features (mainly structural features) and produced an intermediate QFF. Other work [1, 21] attempted to map text to a structured form, and some authors [17, 22] attempted to map natural language predicate-argument triples to structured RDF triples. The authors of [19] created a semantic parser for structured knowledge-base (KB) relations. Precision and recall are used as performance metrics for semantic parsers to calculate the accuracy of the logical form (LF). Our system learns from the QFF and document evidence. We compare our evidence scoring system with systems that learn from question-answer pairs [20].
The system learns with morphological features, syntactic parse trees, a set of semantic features, and structural features. Feature selection methods [23] are used to obtain the optimal features for IFF generation.
Feature extraction
A feature extraction vector F = f(Q_i, z) is defined, where Q_i is the given question and z is the value obtained from the various features. It is the essential part of the intermediate QFF model. Features are divided into lexical, syntactic, semantic and structural features. The structural features are similar to the features used for DCS trees. In DCS, structural features matter because the question is represented as a tree. In our case, features are weighted by their relevance, which is calculated by the feature selection methods discussed in a later section. Open-domain QA systems use wide-coverage parsers. We use the CCG parser suggested by [12], chosen for its speed and accuracy, to parse answer candidates. We integrate the parser into a QA system for evidence gathering, scoring and learning. Parsers are applied to the questions for two reasons: 1) question features allow the parser to deal with extraction cases, which are an important part of question parsing for intermediate form generation, and 2) possible answers from the ranked documents can be compared with the options. The answer extraction component is simplified if the same parser is used for both questions and target documents. Parsing is done on the questions of the KBC dataset. In the initial stage of parsing, the results were below expectations because the question structures were infrequent. For example, there were no “what” questions of the common form “What is the name of the first president of India?” for the parser to learn from, whereas in KBC this is a very common form of wh-question. The KBC question set contains fewer questions of similar types beginning with “How” or “Who”. Because creating a new dataset is expensive, an existing alternative, CCG Lexical Category (CLC) annotation, is used with the KBC dataset. CLC annotation is easier than annotating a question with its full derivation, and it can be done with available tools and resources.
In our case, the question is annotated with a supertagger that uses the output of the Stanford dependency parser. This tagger is sufficient to give high parsing accuracy on complex KBC questions. For example, Fig. 2 shows the dependency parse of the question “In a 2014 film, Vidya Balan’s character Bilkis Ahmed is also known as what other name?”.
Lexical features
CCG’s lexical or morphological features are the POS categories of words, which are associated with the forward and backward CCG operation rules. In this work, we considered a few important lexical features (denoted L_e). Lexical features allow the feature space to reason about the denotation of unobserved words. Unlike small-domain questions, large-scale open-domain questions are impossible to annotate with limited training data. An n-gram is defined as a sequence of n items in the question. An n-gram with n = 1 is referred to as a unigram; similarly, n-grams with n = 2 and n = 3 are referred to as bigrams and trigrams. A wh-word can appear anywhere in the question; in our example question it appears at position 14. In our case both the wh-count and the wh-position are used, for example (what, 1, 14); this feature identifies whether the wh-word is at the start or elsewhere in the question. Word shape and question length are also used as lexical features. Each lexical feature is explained in Table 1, along with its relevance relative to the other features of the same category. Feature relevance is measured by feature selection methods.
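The lexical features described above can be illustrated with a minimal sketch. It assumes simple whitespace tokenization (which places the wh-word of the example question at position 14) and a collapsed word-shape encoding; the function names, the wh-word list and the shape scheme are illustrative assumptions, not the extraction code used in this work.

```python
# Hypothetical helper names; whitespace tokenization is an assumption.
WH_WORDS = {"what", "which", "where", "who", "whom", "whose", "when", "why", "how"}


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def word_shape(token):
    """Collapse a token to a coarse shape: uppercase -> X, lowercase -> x, digit -> d."""
    shape = []
    for ch in token:
        if ch.isupper():
            symbol = "X"
        elif ch.islower():
            symbol = "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch
        # Keep only one symbol per run of identical symbols.
        if not shape or shape[-1] != symbol:
            shape.append(symbol)
    return "".join(shape)


def lexical_features(question):
    """Extract n-grams, wh-word count and 1-based positions, shapes and length."""
    tokens = question.split()
    normalized = [t.strip(".,?!\"'").lower() for t in tokens]
    wh_hits = [(w, i + 1) for i, w in enumerate(normalized) if w in WH_WORDS]
    return {
        "unigrams": ngrams(normalized, 1),
        "bigrams": ngrams(normalized, 2),
        "trigrams": ngrams(normalized, 3),
        "wh_count": len(wh_hits),
        "wh_positions": wh_hits,
        "word_shapes": [word_shape(t) for t in tokens],
        "question_length": len(tokens),
    }


question = ("In a 2014 film, Vidya Balan's character Bilkis Ahmed "
            "is also known as what other name?")
features = lexical_features(question)
print(features["wh_count"], features["wh_positions"])  # 1 [('what', 14)]
```

A different tokenizer (for example one that splits punctuation and possessives into separate tokens) would shift the reported wh-position, so the position feature is only meaningful relative to a fixed tokenization.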
Syntactic features
Part of Speech (POS) tags and headwords are the most commonly used syntactic features (denoted S_y). A headword [25, 26] is usually defined as the most descriptive word in a question, or the word that defines the purpose of the question. We also extracted syntactic features such as tagged unigrams, the headword tag and the focus word, which are explained in Table 1.
Semantic features
For semantic features (denoted S_e), we require a third-party database such as WordNet [27], or a dictionary, to extract the semantic relations of the question. The most commonly used semantic features are the headword’s hypernyms, related words, and named entities. WordNet is a lexical database of English words. It provides a lexical hierarchy that links a word to higher-level semantics, particularly hypernyms. For example, a hypernym of the word “city” is “municipality”, whose hypernym is “urban area”, and so on. Named entities are another very important semantic feature used in some studies [28]. They are semantic categories that can be assigned to a word in a given sentence. These features are also shown in Table 1.
Proposed structural features
We define structure matching operators on the ranked documents (R) that produce a novel feature vector defined by the features in the output of the dependency parse. These structural features (denoted S_t) capture complex relations present in R and the distinctive constants available in the parsing results. The QFF(y) produced for a question contains one composite feature function. Structural features allow the model to adapt to all ranked documents that contain evidence supporting the question structure. Each feature captures properties of (Q, R) that describe the details of an exact occurrence while allowing generalization to occurrences that share common features. Figures 2 and 3 show the question’s structural features. Some relations, such as the one between Vidya Balan and Bilkis Ahmed, cannot be identified directly; Fig. 4 shows this relation. The relation contributes to structural evidence.
Selection of relevant features
Irrelevant features are eliminated, and feature selection methods retain the essential features. A reduced feature vector with relevant features improves computation speed and increases the accuracy of machine-learning methods [15]. Among the various feature selection methods, we used the Minimum Redundancy Maximum Relevance (mRMR) feature selection method suggested by [23, 24]. mRMR [29] is a feature selection method used to identify the discriminant features of a class. mRMR selects features that have high dependency on the class attribute and minimum dependency on each other. Relevant features are not always non-redundant. The nonlinear correlation between two attributes is measured by their mutual information [30]. The mutual information of two variables X and Y can be calculated from their probabilities P(X) and P(Y) and their joint probability P(X, Y), as in Equation (1).
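For reference, the standard definition of mutual information for two discrete variables is given below. It is assumed here that Equation (1) takes this standard form, with the sums running over the values x of X and y of Y.

```latex
I(X; Y) = \sum_{x \in X} \sum_{y \in Y} P(x, y) \, \log \frac{P(x, y)}{P(x) \, P(y)}
```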
mRMR selects features that have high mutual information with the class (maximum relevance) and eliminates features that have high mutual information with already selected features (minimum redundancy). This mutual information is used to assign a relevance value to each feature. The relevance of each feature calculated by this method is shown in Table 1. To calculate the relevance of a feature relative to the rest of the features, in Equation (1) we take X to be the primary feature and Y to be the set of remaining features.
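The selection principle can be sketched as a small greedy procedure. This is a minimal illustration of the common "relevance minus mean redundancy" variant of mRMR on discrete features, estimated from raw counts; the function names and the toy data are assumptions, and the exact criterion used to produce Table 1 may differ.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) of two equal-length discrete sequences."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return mi


def mrmr(features, labels, k):
    """Greedily pick k feature names: high relevance to labels, low mean redundancy.

    features maps a feature name to its list of discrete values (one per example).
    """
    relevance = {name: mutual_information(vals, labels)
                 for name, vals in features.items()}
    selected = []
    remaining = set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(mutual_information(features[name], features[s])
                         for s in selected) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "b" duplicates "a", so it is relevant but fully redundant.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy = {
    "a": [0, 0, 0, 1, 1, 1, 1, 1],
    "b": [0, 0, 0, 1, 1, 1, 1, 1],
    "c": [0, 1, 0, 0, 1, 0, 1, 1],
}
print(mrmr(toy, labels, 2))  # ['a', 'c']
```

In the toy data the duplicate feature is skipped in favour of a weaker but non-redundant one, which is the behaviour that "relevant features are not always non-redundant" refers to.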
For the given question-option-answer dataset of KBC, this work is divided into the following subtasks: A) question parsing, B) feature extraction and C) intermediate QFF generation, followed by scoring the evidence and learning this evidence with the QFF. First, consider the results of the question parsing and feature extraction subtasks. In the question parsing phase, the dependency features of the question and passage are collected using the Stanford Dependency Parser. In the feature extraction phase, various features are extracted and the relevant ones are selected. We now go into the details of the intermediate QFF and its score, which will help in designing a learning algorithm for answer evidence. Figure 5 shows the evidence gathered using the structural features.
Intermediate QFF and DFF score
A logical form is used to query a knowledge base. The intermediate QFF generated in this work is not concerned with querying any KB; instead, it is used to represent the question by its QFF weight, which is then mapped to the DFF weight. Structural features are given more importance in this work, and the remaining features contribute equally. Let L_e be the lexical feature, S_y the syntactic feature, S_e the semantic feature and S_t the structural feature. The QFF and DFF scores are given by Equation (2).
The QFF and DFF scores can be compared for a question and a document, as they are relative to each other. One can also use multiple regression techniques to compare QFF and DFF. In this work, the complete dataset has questions paired with options and answers, and the documents containing the answer evidence are ranked. The equation simply expresses that the lexical, syntactic and semantic features contribute equally, while the structural features carry more weight in calculating the QFF and DFF scores.
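One way a score with these properties could be formed is sketched below. This is not Equation (2): the linear form, the weight value of 2.0 and the ranking of options by the closeness of DFF to QFF are all hypothetical choices made only to illustrate "equal contribution, with more weight on structural features".

```python
def feature_form_score(lexical, syntactic, semantic, structural,
                       structural_weight=2.0):
    """Combine per-category feature scores into a single real value.

    Lexical, syntactic and semantic scores contribute equally; the structural
    score is scaled by structural_weight (a hypothetical value, not from the paper).
    """
    return lexical + syntactic + semantic + structural_weight * structural


def rank_options(qff, dff_by_option):
    """Order options by how close their document score (DFF) is to the QFF.

    Using the absolute difference as the mapping between QFF and DFF is an
    assumption made for this sketch.
    """
    return sorted(dff_by_option, key=lambda option: abs(qff - dff_by_option[option]))


qff = feature_form_score(0.1, 0.2, 0.3, 0.4)
ranking = rank_options(qff, {"A": 0.9, "B": 1.3, "C": 2.5, "D": 0.2})
```

With such a form, the same function can be applied to question features and to document features, so that the two resulting values are directly comparable.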
Q/E pairs: (Q_i, E_i), i = 1 ... n, E_i ∈ Evidence
Ranked documents: R
DEV(Q, F): computes the derivations of Q, F ∈ Feature
YIELD(d): returns the QFF yield of derivation d
EXEC(y, R): calculates the execution of y in R
Margin: δ
Number of iterations: N
For t = 1 to N, for i = 1 ... n:
ε = {d : d ∈ DEV(Q_i, F); EXEC(YIELD(d), YIELD(d_R)) = E_i}
ε0 = {d : d ∈ DEV(Q_i, F); EXEC(YIELD(d), YIELD(d_R)) ≠ E_i}
ε* = PARSE(Q, E)
ε0* = {d : d ∈ ε0; ∃ c ∈ ε* s.t. (ε(c) φ − ε(d) φ) < δ}
If (|ε*| > 0) and (|ε0*| > 0)
Then:
Having extracted the prominent features for scoring the evidence for the provided options, this section describes an evidence learning algorithm for domain-independent KBC data. It learns options and domain-independent evidence. Algorithm 1 above estimates the learning parameter φ from a set (Q_i), (O_i1, O_i2, O_i3, O_i4), i = 1 … n, of questions, where each Q_i is paired with its options O_i. The dependency parsing derivation (d) generated by the parser is connected with QFF(y) = YIELD(d), which can be mapped to R (the relevant documents). A QFF derivation (d) of Q_i is correct if it supports the particular evidence, that is, EXECUTE(YIELD(d), YIELD(d_R)) = E_i, where d_R stands for the DFF derivation and E_i is the evidence for a particular option. The algorithm computes a learning parameter φ that maintains a margin of δ between correct and incorrect evidence.
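The general shape of such margin-based learning can be sketched as follows. The update step is not given in the text, so the perceptron-style update used here (add the features of the best correct derivation, subtract the average features of the margin violators) is an assumption, as are the function names, the data layout and the margin value of 1.0.

```python
def dot(weights, feats):
    """Linear score of a sparse feature dict under the current weights."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def learn_evidence(examples, n_iterations=5, margin=1.0):
    """Estimate weights so correct derivations outscore incorrect ones by a margin.

    examples is a list of (candidates, gold) pairs, where candidates is a list
    of (evidence_label, feature_dict) pairs and gold is the correct label.
    The update rule is a hypothetical perceptron-style step, not the paper's.
    """
    weights = {}
    for _ in range(n_iterations):
        for candidates, gold in examples:
            correct = [f for label, f in candidates if label == gold]
            incorrect = [f for label, f in candidates if label != gold]
            if not correct or not incorrect:
                continue
            # Highest-scoring correct derivation under the current weights.
            best = max(correct, key=lambda f: dot(weights, f))
            # Incorrect derivations that are not separated by the margin.
            violators = [f for f in incorrect
                         if dot(weights, best) - dot(weights, f) < margin]
            if not violators:
                continue
            for name, value in best.items():
                weights[name] = weights.get(name, 0.0) + value
            for f in violators:
                for name, value in f.items():
                    weights[name] = weights.get(name, 0.0) - value / len(violators)
    return weights


good = {"structural": 1.0, "lexical": 0.2}
bad = {"structural": 0.1, "lexical": 0.3}
weights = learn_evidence([([("A", good), ("B", bad)], "A")])
```

On this single toy example the weights stop changing once the correct derivation outscores the incorrect one by at least the margin, which mirrors the role of δ described above.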
Evidence scoring
In this section, all evidence scores are merged to provide a single evidence score. This score is the combination of the L_e, S_y, S_e and S_t feature scores. The QFF and DFF values in Table 2, calculated from Equations (2) and (3), are treated as the final evidence score and used to score the provided options.
Data set, experimental setup and result
Before setting up the experiments, the difference between the intermediate feature score used in the proposed work and a traditional logical form should be made clear. Consider the utterance “What is the highest point in Florida?” from the Geo dataset, which has the logical form answer(A,highest(A,(place(A),loc(A,B),const(B,stateid(florida))))). Now consider the utterance “In a 2014 film, Vidya Balan’s character Bilkis Ahmed is also known by what other name?” from the KBC dataset: it has a Logical Form Score (LFS), not an LF. The LFS in this case is a number, not a representation: say, LFS = 1.183.
Dataset used
To decide the prominent question features and to produce logical forms, we used the publicly available TREC Question Classification (TQC) dataset and the KBC question dataset. In TQC, questions are tagged with their category, which is useful in deciding the answer type. The answer type can be used in gathering answer evidence. To make our experiments more reliable, we developed a more consistent dataset, the KBC dataset. It consists of question-option pairs and relevant documents. The questions in the KBC dataset span various domains, including movies, sports, geography and so on. About 1000 questions were collected for the KBC dataset. Each question has four options, one answer and at least three relevant documents.
Evaluation metrics
Precision, Recall and F-measure are used for evaluating the performance of the feature extraction techniques (F_x) and evidence gathering (E). Precision (P) for F_x is the ratio of the number of correctly classified features to the total number of features classified. Similarly, Recall (R) for F_x is the ratio of the number of correctly classified features to the total number of features that actually belong to that class. The F-measure, which aggregates P and R, is given in Equation (3).
The F-measure is used to describe the performance of F_x (the extracted features) for evidence gathering.
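For reference, the usual definitions of these metrics are given below, assuming that Equation (3) is the balanced F-measure (the harmonic mean of P and R). The symbols TP, FP and FN (true positives, false positives and false negatives) are introduced here for illustration only.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F = \frac{2 \cdot P \cdot R}{P + R}
```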
Results and discussions
The feature vector (F) formed after question processing is then used to calculate the individual question and document LFS. The LFS is used in the evidence learning algorithm. In our experiments, the LFACS technique and the LFF score (which includes the new structural features) are used for scoring the relevant documents.
Question and document features
Feature extraction is evaluated using ten-fold cross-validation. For feature extraction, linear SVM and Naïve Bayes Multinomial (suggested in the work of Basant et al.) are used with their default settings in WEKA. In the proposed work, each question is parsed to produce output in the form of dependency features. These dependency features are the backbone of the structural features. Wh-words (what, which, where, who, when) must be handled at the first stage of parsing to give a concrete idea of the upcoming document evidence. The headword is also important, as it indicates the lexical answer type. At this point, the accuracy of the headword extraction algorithm is critical.
Accuracy of evidence gathering and scoring
In Passage Term Match (PTM), question terms are matched to passage terms; grammatical connections and word order are not considered. The Skip-Bigram (SB) technique assigns a score by matching pairs of words that are related or nearly related. In Textual Alignment (TA), a score is given by comparing the words and word order of the passage with those of the question, with the question focus replaced by the candidate answer. In the LFACS technique, the score reflects how well the structure of the question can be mapped to the passage. Table 3 shows the results of the passage scorers from a system that has all four scorers. SB and LFACS give the best results (with high P and low R). PTM and TA do not show a significant impact on this dataset. The LFS is useful for matching a question with the document that has sufficient evidence to support its option. The focus is not only on scoring documents by supporting evidence but also on designing an Evidence Learning Algorithm (ELA). ELA learns the logical form score for particular features and evidence. This learning algorithm is discussed in the previous section, along with the experimental settings that regulate it. The more training examples are available to a learning algorithm, the more accuracy it can attain. The question is how many training examples are needed to reach a high level of precision; in our case, the highest precision is reached when all training examples are used. ELA alternates between updating the positive and negative candidate sets and updating the parameters for I iterations.
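The two simplest passage scorers described above, PTM and SB, can be sketched on plain token sequences. This is a deliberately simplified illustration: the Skip-Bigram scorer here pairs tokens that lie within a small window of each other, rather than terms that are linked or nearly linked in a syntactic graph, and the function names, the tokenizer and the window size are assumptions.

```python
import re


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def passage_term_match(question, passage):
    """Fraction of distinct question terms that occur in the passage."""
    q_terms = set(tokenize(question))
    p_terms = set(tokenize(passage))
    if not q_terms:
        return 0.0
    return len(q_terms & p_terms) / len(q_terms)


def skip_bigrams(tokens, max_skip=2):
    """Ordered token pairs separated by at most max_skip intervening tokens."""
    pairs = set()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + 2 + max_skip]:
            pairs.add((left, right))
    return pairs


def skip_bigram_score(question, passage, max_skip=2):
    """Fraction of question skip-bigrams that also occur in the passage."""
    q_pairs = skip_bigrams(tokenize(question), max_skip)
    p_pairs = skip_bigrams(tokenize(passage), max_skip)
    if not q_pairs:
        return 0.0
    return len(q_pairs & p_pairs) / len(q_pairs)


ptm = passage_term_match("who wrote hamlet", "Hamlet was written by Shakespeare")
sb = skip_bigram_score("red big car", "red very big car")
```

PTM ignores word order entirely, whereas the skip-bigram score rewards passages that keep question terms in the same relative order even when other words intervene.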
We use I = 5 (similar to the settings of learning-DCS) as the default value. The calculation of the available evidence is based on the search size, where every intermediate form has at most J structural evidence features; the default value is J = 20. We compared the proposed evidence scoring work with SER [2]. The Skip-Bigram technique has the highest accuracy in SER, which is 3.63% lower than our combined-feature accuracy. One main reason for this difference is the inclusion of structural features, which are more informative than the other features used for evidence gathering.
Conclusion
The proposed work focuses on gathering evidence for question-option pairs from ranked documents. In this approach, the question is first parsed, and lexical, syntactic, semantic and structural features are extracted. These features are used to form an intermediate QFF. Unlike other logical forms, the QFF is computed as a single real value that represents the question. Similarly, the document’s intermediate form, called the DFF, is calculated and mapped to the question’s QFF score. Once the QFF and DFF have been computed, evidence gathering is complete, and the evidence is scored against the provided options. The Evidence Learning Algorithm (ELA) learns the parameter φ from the evidence and the relevant features. The proposed work is compared with Supporting Evidence Retrieval (SER), which uses Passage Term Match, Skip-Bigram and Textual Alignment. The proposed QFF and DFF model uses lexical, syntactic, semantic and structural features. The combination of features gives better results than SER. Although the difference in results is not large, the approach remains competitive, because building the structural features of a question and its relevant documents is easy and efficient.
