Multi-head attention based candidate segment selection in QA over hybrid data

Abstract

Question answering over tabular and textual data (TAT-QA) is a task proposed in recent years in the field of QA. At present, most QA systems return answers from a single form of data, such as knowledge graphs, tables, or texts. However, hybrid data containing both structured and unstructured content is far more pervasive in real life than any single form. Recent work on TAT-QA suffers mainly from a high error rate when extracting supporting evidence from tabular and textual content. This paper aims to address the failure of evidence extraction on such complex and realistic hybrid data. First, we propose two metrics to evaluate evidence extraction on hybrid data, i.e. wrong evidence ratio (WER) and missing evidence ratio (MER). Second, we use a candidate extractor to obtain supporting evidence related to the question. Third, an origin selector is designed to determine where the answer to the question comes from. Finally, the loss of the origin selector is added to the final loss function, which improves evidence extraction performance. Experimental results on the TAT-QA dataset show that our model outperforms the best baseline in terms of F1, WER and MER, which demonstrates its effectiveness.

Keywords

Question answering on tabular and textual data, Wrong Evidence Ratio, Missing Evidence Ratio, multi-head attention

1. Introduction

Question answering (QA) is the task of answering a question given certain background information, which may be a sub-graph of a knowledge graph [1], a segment of text [2], or a table [3]. Unstructured text data contains implicit knowledge and can be easily accessed through the Internet; structured tabular data is small in quantity but good in quality. Structured knowledge graphs are more conducive to multi-hop QA. Triples in a knowledge graph can be extracted from both structured and unstructured data; however, building a knowledge graph requires heavy manual effort. In the real world, hybrid data containing both textual and tabular data is pervasive in scenarios such as weather forecast scripts, scientific literature, and financial reports, in which the textual data complements the tabular data. QA based on hybrid data has attracted great attention from academia and industry in recent years [4, 5].

Figure 1.

An example of TAT-QA.

Existing QA systems mainly predict answers from a single type of data source; e.g. [6, 7, 8] only use unstructured text to answer questions, and [9, 10, 11, 12] use semi-structured tables. Since hybrid data is quite common, QA on hybrid data has been proposed recently. [13, 14] obtain knowledge from incomplete knowledge graphs and related texts to answer questions, where the incomplete knowledge graphs are obtained by randomly hiding knowledge graph triples. However, existing large-scale QA datasets were originally designed to use either structured or unstructured knowledge during annotation, which has greatly hindered the development of QA on hybrid data. A hybrid QA dataset containing about 1700 questions was constructed to develop and evaluate QA systems [15]. [2] proposed a large-scale QA dataset called HybridQA, in which the supporting information is acquired from hybrid data and the answer is obtained by reasoning. [4] proposed a new dataset containing both tabular and textual data, named TAT-QA, where numerical reasoning such as addition, subtraction, multiplication, division, counting, comparison/sorting, and their compositions is usually required to infer the answer. [4] also proposed the corresponding baseline model TAGOP, which uses sequence tagging to extract relevant information from hybrid data and applies a set of symbolic operations over that information to obtain the answer. However, the performance of TAGOP, which achieved 51% and 58% in terms of EM and F1 respectively, lags far behind that of both humans and state-of-the-art KBQA systems.

To demonstrate the difficulty of QA on hybrid data, an example is shown in Fig. 1. To answer the question "How much does Leasehold improvements account for the net book value of SJ Facility Leasehold improvements in 2019?", one needs to get Leasehold improvements in 2019, i.e. 3,897 thousand, from the table and the net book value of SJ Facility leasehold improvements, i.e. 0.9 million, from the text. The error analysis in [4] shows that 55% of errors come from Wrong Evidence, which means that the model obtained the wrong supporting evidence, and 29% come from Missing Evidence, which means that key evidence is incomplete. Real examples of these two error types from [4] are shown in Table 1.

Table 1

Examples from TAGOP

Wrong evidence	Question: How much did the level 2 OFA change by from 2018 year end to 2019 year end?
	Ground truth: 375–2,032
	Prediction: 1,941–2,032
Missing evidence	Question: How many years did adjusted EBITDA exceed $4,000 million?
	Ground truth: count (2017, 2018, 2019)
	Prediction: count (2017, 2018)

From the above, one can conclude that the low performance of current QA models on hybrid data is attributable to two factors: (1) answer candidate segments are scattered across heterogeneous data, and one must determine whether the candidates come from structured data, unstructured data, or both; (2) even if the model knows where the candidates come from, a single sequence tagging model such as an RNN is not enough to extract supporting evidence from different data sources, so different submodules should be used to handle different data formats. Both pose challenges for candidate segment extraction. Thus, the key to improving QA on hybrid data is to enhance candidate segment extraction.

To address the challenges mentioned above, this paper uses a multi-head attention mechanism to obtain rich semantic information related to the question from hybrid data, and a two-layer FFN to determine where the answer to the question comes from. The loss of this FFN classifier for answer origin is added to the final loss function. In addition, Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP) are commonly used to evaluate candidate sentence selection models on textual data; in hybrid data, however, textual spans, tabular cells or their compositions are extracted instead of candidate sentences. To obtain a quantitative evaluation of overall extraction performance, we propose two metrics for candidate segment extraction on hybrid data, i.e. WER and MER. The main contributions of this paper are as follows:

• We propose a candidate extractor, composed of two multi-head attention modules, to obtain supporting evidence related to the question from tabular and textual data respectively. Each module obtains information from different representation sub-spaces, and each head focuses on different positions of the input sequence, so the extractor learns multi-view relevant information from both tabular and textual data;

• We use an origin selector to predict where the answer comes from, which narrows the candidate selection scope and thus reduces the computational cost of subsequent steps. The origin selector is trained by minimizing its own loss term;

• We propose two metrics, i.e. WER and MER, both of which evaluate the overall performance of candidate segment extraction on hybrid data. Experimental results show that our model outperforms TAGOP in terms of F1, WER and MER on TAT-QA.

2. Related work

Some work on QA focuses only on text data, such as machine reading comprehension (MRC), e.g. BiDAF [16], Gated Attention Reader [17], and QANet [18]; some work is based on tabular data, with related datasets including SPIDER [10] and TabFact [19]; other work takes a knowledge graph as input for question answering [14]. Structured information in tables or knowledge graphs has the disadvantage of low knowledge coverage, although its quality can be guaranteed. Textual data has the advantage of being abundant and readily available, but it is noisy in most cases and its knowledge is hard to acquire. QA on hybrid data has only started in recent years. [5] proposed a large-scale hybrid QA dataset named HybridQA, in which the answer to a question must be obtained by simple reasoning over knowledge from heterogeneous information. [4] proposed TAT-QA, a hybrid dataset from the financial field, which contains a large number of questions that require numerical reasoning, such as addition, subtraction, multiplication, division, counting, comparison, and sorting.

Supporting evidence extraction is a key sub-task in QA, which selects the sentences or critical evidence that contain or verify the right answer to a given question. Most researchers use deep neural networks for supporting evidence extraction. [20] considers the semantic relationship between each question and answer using distributed representations. Since the attention mechanism allows a model to selectively focus on critical parts of textual data, it is widely applied in evidence extraction. [21] proposes a model that combines a basic convolutional neural network architecture with an attention mechanism. Unlike [21], which aims at the representation of answers and questions, [22] proposes an attention-based CNN to model a pair of sentences for answer selection. A sequential attention mechanism [23] is used to obtain the representation of candidate sentences through multiple steps of attention. To address missing clue annotation and multi-hop reasoning in multiple-choice reading comprehension, [24] proposes an evidence sentence extraction model based on a combination of multiple modules. As far as we know, supporting evidence extraction in QA on hybrid data has been largely ignored in existing research.

3. Model

3.1 Task description

The task of QA over hybrid data is to answer a question given both tabular and textual data related to it. In this paper, we focus on supporting evidence extraction over hybrid data. The goal is to extract the supporting evidence related to the question precisely and completely. Specifically, we are given a table $T=\{t_{1},t_{2},\ldots,t_{n}\}$ , a text fragment $D=\{d_{1},d_{2},\ldots,d_{m}\}$ and a question $q=\{q_{1},q_{2},\ldots,q_{Q}\}$ , where $t_{i}$ is a word in the table flattened by rows. Supporting evidence for the question may come from cells in the table, from spans in the text, or from both. Next, we give an overview of the complete QA model over hybrid data with the supporting evidence module.

3.2 Model overview

The framework of our model is shown in Fig. 2. The model consists of two parts: the encoder with candidate extractor (3.3) and the reasoning layer with origin selection mechanism (3.4). Our supporting evidence extraction module is composed of an origin selector and a candidate extractor: the origin selector predicts in which type of data source the candidate segments are located, and the candidate extractor performs the extraction on each type of data source. The candidate extractor is the key part of the encoder, and the origin selector is incorporated into the reasoning layer.

In the encoder with candidate extractor, we first apply RoBERTa [25] to encode the question, the tabular data and the textual data, obtaining the context representation of the question with its background information as well as the representation of each sub-token. Then, in the candidate extractor, we use two multi-head attention modules to extract supporting evidence related to the question from textual and tabular data respectively, since multi-head attention [26] can obtain information from different sub-spaces. Finally, a tag predictor predicts the label of each sub-token. The reasoning layer with origin selection mechanism contains three classifiers and one origin selector; the classifiers are the operator classifier, the number order classifier and the scale predictor. The number order classifier and the scale predictor are designed according to [4]. The origin selector is designed in this paper as a gate function that predicts whether the supporting evidence comes from textual data, tabular data or both. The operator classifier predicts the operator needed to obtain the answer, chosen from eight aggregation operations instead of ten. The scale predictor predicts the scale of the answer, and the number order classifier predicts the order of the two input numbers when that order matters for the final result. In short, the encoder with candidate extractor extracts supporting evidence related to the question, and the reasoning layer with origin selection mechanism reasons over the evidence segments to return the final answer.

Figure 2.

The structure of our model.

3.3 Encoder with candidate extractor

In this section, we first construct a sequence with length of $L$ , which is composed of the question, the flattened table by row [27] and the text fragment, according to the input format of RoBERTa:

$\displaystyle V_{[\textit{CLS}]};V_{q};V_{[\textit{SEP}]};V_{t};V_{[\textit{SEP}]};V_{d};V_{[\textit{SEP}]}=\textit{RoBERTa}([\textit{CLS}];q_{1},q_{2},\ldots,q_{Q};[\textit{SEP}];t_{1},t_{2},\ldots,t_{n};[\textit{SEP}];d_{1},d_{2},\ldots,d_{m};[\textit{SEP}]),$ (1)

where the numbers of tokens in the question, sub-tokens in the tabular data and sub-tokens in the textual fragment are $Q$ , $n$ , and $m$ respectively; $V_{[\textit{CLS}]}$ and $V_{[\textit{SEP}]}$ are the representations of [CLS] and [SEP]; $V_{q}$ is a matrix in which each column is the representation of one word in the question. Likewise, each column of matrix $V_{t}$ and matrix $V_{d}$ is the representation of one sub-token; the dimension of each of these vectors is $d$ .
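The input construction above can be illustrated with a minimal sketch. It only shows how the question, the row-flattened table and the text are concatenated in the order of Eq. (1); the function names are hypothetical, whitespace-level tokens stand in for RoBERTa sub-tokens, and the literal strings `[CLS]` and `[SEP]` follow the paper's notation rather than the special tokens of any particular tokenizer.

```python
def flatten_table_by_row(table):
    """Return the cells of a table (a list of rows) in row-major order."""
    return [str(cell) for row in table for cell in row]


def build_input_sequence(question_tokens, table, text_tokens):
    """Concatenate question, row-flattened table and text as in Eq. (1).

    Hypothetical helper: a real system would run the model's own tokenizer
    and truncate each part to its maximum length.
    """
    table_tokens = flatten_table_by_row(table)
    return (["[CLS]"] + list(question_tokens) + ["[SEP]"]
            + table_tokens + ["[SEP]"]
            + list(text_tokens) + ["[SEP]"])


example = build_input_sequence(
    ["How", "many"],
    [["Year", "2019"], ["Revenue", "10"]],
    ["Revenue", "grew"],
)
```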

Then we use candidate extractor to extract candidate segments related to question from hybrid data, in which two multi-head attention modules are applied to textual and tabular data respectively:

$\displaystyle T_{q}=\textit{MultiAttention}_{t}(V_{t},V_{q},V_{q}),$ (2)

$\displaystyle D_{q}=\textit{MultiAttention}_{d}(V_{d},V_{q},V_{q}),$ (3)

where $\textit{MultiAttention}_{t}$ and $\textit{MultiAttention}_{d}$ are used to extract cells and spans related to the question from tabular data and textual data respectively. Both $T_{q}\in\mathbb{R}^{n\times d}$ and $D_{q}\in\mathbb{R}^{m\times d}$ are matrices, and $h$ is the number of heads.

Specifically, in $\textit{MultiAttention}_{t}$ the table representation $V_{t}$ serves as the query and the question representation $V_{q}$ serves as the key and the value. The $i$ -th head is calculated as:

$\displaystyle \textit{head}_{i}=\textit{Attention}(V_{t}\cdot W_{i}^{t},V_{q}\cdot W_{i}^{q},V_{q}\cdot W_{i}^{v}),$ (4)

where $W_{i}^{t},W_{i}^{q},W_{i}^{v}\in\mathbb{R}^{d\times d_{h}}$ are trainable parameters.

$\displaystyle d_{h}=d/h$ (5)

$\displaystyle\textit{MultiAttention}_{t}(V_{t},V_{q},V_{q})=\textit{Concat}(\textit{head}_{1},\ldots,\textit{head}_{h})\cdot W^{h},$ (6)

where $W^{h}\in\mathbb{R}^{hd_{h}\times d}$ is a trainable parameter. Note that in the candidate extractor, the multi-head attention modules for tabular data and textual data do not share parameters.
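As an illustration of Eqs. (2) to (6), the following NumPy sketch computes the table-side attention with the table representation as query and the question representation as key and value. It is a sketch under stated assumptions, not the authors' implementation: `Attention` is assumed to be the scaled dot-product attention of [26], tokens are stored as rows (so the output has shape $n\times d$ as stated for $T_{q}$), the sizes are toy values chosen so that $d$ is divisible by $h$, and all weights are random stand-ins for trained parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(queries, keys, values, w_q, w_k, w_v, w_o):
    """Multi-head cross-attention.

    queries: (n, d); keys, values: (Q, d)
    w_q, w_k, w_v: (h, d, d_h) per-head projections; w_o: (h * d_h, d)
    """
    heads = []
    for wq_i, wk_i, wv_i in zip(w_q, w_k, w_v):
        q = queries @ wq_i                        # (n, d_h)
        k = keys @ wk_i                           # (Q, d_h)
        v = values @ wv_i                         # (Q, d_h)
        scores = q @ k.T / np.sqrt(q.shape[-1])   # (n, Q), scaled dot product (assumed)
        heads.append(softmax(scores) @ v)         # (n, d_h)
    return np.concatenate(heads, axis=-1) @ w_o   # (n, d)


rng = np.random.default_rng(0)
n, Q, d, h = 5, 4, 6, 3          # toy sizes, with d divisible by h
d_h = d // h                     # Eq. (5)
V_t = rng.normal(size=(n, d))    # table sub-token representations
V_q = rng.normal(size=(Q, d))    # question token representations
W_t = rng.normal(size=(h, d, d_h))
W_q = rng.normal(size=(h, d, d_h))
W_v = rng.normal(size=(h, d, d_h))
W_h = rng.normal(size=(h * d_h, d))

T_q = multi_head_attention(V_t, V_q, V_q, W_t, W_q, W_v, W_h)   # Eq. (2)
```

The text-side module would be a second call with its own, separately initialized weights and `V_d` as the query.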

Finally, the representation of each sub-token is fed into the tag predictor, which predicts the label of each sub-token. Specifically, a sub-token that provides supporting evidence for the answer is marked as $I$ , and all others are marked as $O$ . This step is analogous to slot filling or schema linking, whose effectiveness has been demonstrated in dialogue systems [28, 29] and semantic parsing [30]. For sub-token $k$ , the tag predictor is calculated as follows:

$\displaystyle C_{k}^{\textit{tag}}=\textit{softmax}(\textit{GELU}(W_{\textit{tag}}\cdot V_{t}^{k}+b_{\textit{tag}})),$ (7)

where $V_{t}^{k}\in\mathbb{R}^{d\times 1}$ is the representation of sub-token $k$ , $W_{\textit{tag}}\in\mathbb{R}^{2\times d}$ is a trainable parameter, and GELU is the Gaussian Error Linear Units activation function [31].

3.4 Reasoning layer with origin selection mechanism

Once the supporting evidence has been extracted, it is natural to compose these segments into the final answer, and this composition requires further reasoning. From the question-answer pairs we conclude that there are generally eight aggregation operations: None, Sum, Count, Average, Multiplication, Division, Difference, and Change ratio. Unlike [4], we put Span-in-text, Cell-in-table and Spans in a single origin selector, since these three extraction modes are quite different in nature from the eight aggregation operators. We argue that separating the two is helpful for supporting evidence extraction, because the choice of extraction mode and the choice of aggregation operation are independent of each other. The experimental results in Section 4 also support this argument.

In order to predict whether supporting evidence comes from textual or tabular data or both, we take the representation of [CLS] as the input of origin selector. The origin selector probability is calculated as follows:

$\displaystyle C^{\textit{origin}}=\textit{softmax}(\textit{GELU}(W_{\textit{origin}}\cdot V_{[\textit{CLS}]}+b_{\textit{origin}})),$ (8)

where $W_{\textit{origin}}\in\mathbb{R}^{3\times d}$ is a trainable parameter.

To get the correct answer to the question, an operator classifier is used to predict the aggregation operator that should be applied to the candidate segments. Specifically, we take the vector representation of [CLS] obtained from RoBERTa as input, and the operator classifier is calculated as follows:

$\displaystyle C^{op}=\textit{softmax}(\textit{GELU}(W_{op}\cdot V_{[\textit{CLS}]}+b_{op})),$ (9)

where $W_{op}\in\mathbb{R}^{8\times d}$ is a trainable parameter. Note that there are eight types of aggregation operator. When the None operator is predicted, no aggregation is performed, and only the span indicated by the origin selector's prediction is returned.
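Equations (8) and (9) share the same form, a linear layer followed by GELU and softmax. A minimal NumPy sketch of such a head is given below, with the assumptions stated plainly: the weights are random stand-ins for trained parameters, the hidden size is a toy value, GELU is computed with its common tanh approximation, and the order of the three origin classes is chosen arbitrarily for illustration.

```python
import numpy as np


def gelu(x):
    """GELU activation, tanh approximation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()


def classifier_head(weight, bias, features):
    """softmax(GELU(W . x + b)), the form used in Eqs. (8) and (9)."""
    return softmax(gelu(weight @ features + bias))


rng = np.random.default_rng(0)
d = 8                                   # toy hidden size
v_cls = rng.normal(size=d)              # stands in for the [CLS] representation

W_origin, b_origin = rng.normal(size=(3, d)), np.zeros(3)
W_op, b_op = rng.normal(size=(8, d)), np.zeros(8)

c_origin = classifier_head(W_origin, b_origin, v_cls)   # Eq. (8), 3 origin classes
c_op = classifier_head(W_op, b_op, v_cls)               # Eq. (9), 8 operators

# Assumed class order, for illustration only.
ORIGINS = ("Span-in-text", "Cell-in-table", "Spans")
predicted_origin = ORIGINS[int(np.argmax(c_origin))]
```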

Similar to [4], the scale predictor and the number order classifier are used to predict the scale of the answer and, for Difference, Division and Change ratio, the order of the operands. The probabilities are formulated as:

$\displaystyle C^{\textit{scale}}=\textit{softmax}(\textit{GELU}(W_{\textit{scale}}\cdot[V_{[\textit{CLS}]};T_{q};D_{q}]+b_{\textit{scale}})),$ (10)

$\displaystyle C^{\textit{order}}=\textit{softmax}(\textit{GELU}(W_{\textit{order}}\cdot[V_{t}^{k_{1}};V_{t}^{k_{2}}]+b_{\textit{order}})),$ (11)

where $V_{t}^{k_{1}}$ and $V_{t}^{k_{2}}$ are the representations of operands $k_{1}$ and $k_{2}$ , respectively. $W_{\textit{scale}}\in\mathbb{R}^{5\times 3d}$ and $W_{\textit{order}}\in\mathbb{R}^{2\times 2d}$ are trainable parameters, and [;] denotes the concatenation of vectors.

The origin selector predicts the data source of the answer in the hybrid data, and the operator classifier predicts which operation to perform on the extracted evidence. For operators whose result depends on the order of the operands, i.e. Division, Difference and Change ratio, we further use the number order classifier to predict the order of the two input numbers. The scale is obtained from the scale predictor; the numerical or string prediction is multiplied by or concatenated with this scale to form the final prediction, which is compared with the ground-truth answer.

Supporting evidence is obtained from the encoder with candidate extractor (3.3), while the operator, the number order and the scale are obtained from the reasoning layer with origin selection mechanism (3.4). The final answer to the question is then composed by applying the predicted operator, number order and scale to the supporting evidence.
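The composition step can be sketched as a small dispatch over the eight operators. This is an illustrative sketch only: the function name and the `keep_input_order` flag (standing in for the number order classifier's decision) are hypothetical, scale handling is omitted, and the formula used for Change ratio, $(a-b)/b$, is an assumption since the text does not define it.

```python
import math


def apply_operator(operator, operands, keep_input_order=True):
    """Combine extracted evidence with one of the eight aggregation operators."""
    if operator == "None":
        return list(operands)            # no aggregation: return the evidence itself
    if operator == "Count":
        return len(operands)             # counting works on any kind of evidence
    values = [float(x) for x in operands]
    if operator == "Sum":
        return sum(values)
    if operator == "Average":
        return sum(values) / len(values)
    if operator == "Multiplication":
        return math.prod(values)
    if operator in ("Difference", "Division", "Change ratio"):
        # Order-sensitive operators take exactly two operands.
        a, b = values if keep_input_order else values[::-1]
        if operator == "Difference":
            return a - b
        if operator == "Division":
            return a / b
        return (a - b) / b               # assumed definition of Change ratio
    raise ValueError(f"unknown operator: {operator}")


# The missing-evidence example of Table 1: counting three extracted years.
n_years = apply_operator("Count", ["2017", "2018", "2019"])
```

Keeping the order decision as a separate flag mirrors the description above, where the operator and the operand order are predicted by different classifiers.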

3.5 Training

Considering the loss caused by origin selector, we redesign the loss function for our model:

$\displaystyle\textit{Loss}=\textit{NLL}(\log(C^{\textit{tag}}),G^{\textit{tag}})+\textit{NLL}(\log(C^{\textit{origin}}),G^{\textit{origin}})+\textit{NLL}(\log(C^{op}),G^{op})+\textit{NLL}(\log(C^{\textit{scale}}),G^{\textit{scale}})+\textit{NLL}(\log(C^{\textit{order}}),G^{\textit{order}}),$ (12)

where $\textit{NLL}(\cdot)$ represents the negative log-likelihood loss, and $G^{\textit{tag}}$ , $G^{\textit{origin}}$ , $G^{op}$ , and $G^{\textit{scale}}$ are the ground-truth labels derived from the supporting evidence annotated in the dataset. $G^{\textit{order}}$ is needed only when the ground-truth operator is Difference, Division or Change ratio; it is obtained by mapping the two operands extracted from the corresponding ground-truth derivation to the input sequence [4]. If the order of the operands is the same as their order in the input sequence, $G^{\textit{order}}=1$ , and 0 otherwise. We apply the Adam optimizer to minimize the redesigned loss function.
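The summed loss can be written out as a short NumPy sketch. Several details here are assumptions rather than facts from the paper: the function names and dictionary keys are hypothetical, the tag term is averaged over sub-tokens, and a term is simply skipped when no gold label is supplied for it (as would be the case for the order term when the operator is not order-sensitive).

```python
import numpy as np


def nll(probs, gold):
    """Negative log-likelihood of the gold class(es).

    probs: (C,) for a single prediction or (k, C) for k sub-tokens.
    gold:  a class index or a sequence of k class indices.
    The mean over predictions is an assumed reduction.
    """
    probs = np.atleast_2d(np.asarray(probs, dtype=float))
    gold = np.atleast_1d(np.asarray(gold))
    picked = probs[np.arange(len(gold)), gold]
    return float(-np.log(picked).mean())


def total_loss(predictions, gold):
    """Sum of NLL terms over every head that has a gold label, as in Eq. (12)."""
    return sum(nll(predictions[name], gold[name]) for name in gold)


predictions = {
    "tag": [[0.9, 0.1], [0.2, 0.8]],       # two sub-tokens, classes (O, I) assumed
    "origin": [0.7, 0.2, 0.1],
    "order": [0.4, 0.6],
}
gold = {"tag": [0, 1], "origin": 0}         # no order label: that term is skipped
loss = total_loss(predictions, gold)
```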

4. Experimental results and analysis

4.1 Dataset and metrics

We use the TAT-QA [4] dataset, a large-scale QA dataset containing tabular and textual data, to evaluate the effectiveness of our model. Because the samples are extracted from financial reports, which contain many numerical arithmetic operations, numerical reasoning such as Counting, Summing, Multiplication, Subtraction, Division, Comparison/Sorting, and their compositions is usually required to obtain the correct answer. There are 16,552 questions in TAT-QA in total: the answers to 7,431 come from table data, 3,902 from text data, and 5,219 from both table and text. TAT-QA is randomly divided into training, test and validation sets with a ratio of 8:1:1.

Traditional evaluation metrics for candidate extraction are Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP), which are appropriate for candidate sentence selection models on textual data. However, textual spans, tabular cells and their compositions are more complex, so we propose two novel metrics to evaluate the overall performance of candidate segment extraction on hybrid data, i.e. Wrong Evidence Ratio (WER) and Missing Evidence Ratio (MER). In addition, we use the popular numeracy-focused F1 [6] and Exact Match (EM) to evaluate the performance of our model. Specifically, for sample $i$ , $\textit{WER}_{i}$ and $\textit{MER}_{i}$ are calculated as follows:

$\displaystyle\textit{WER}_{i}=1-\frac{\textit{the total of correct spans or cells in prediction}}{\textit{the total of spans or cells in prediction}}$ (13)

$\displaystyle\textit{MER}_{i}=1-\frac{\textit{the total of correct spans or cells}}{\textit{the total of spans or cells in groundtruth}}.$ (14)

Therefore, $\textit{WER}_{\mathbb{S}}$ and $\textit{MER}_{\mathbb{S}}$ are used to evaluate the average performance of Wrong Evidence Ratio and Missing Evidence Ratio over a target dataset $\mathbb{S}$ :

$\displaystyle\textit{WER}_{\mathbb{S}}=\frac{\sum_{k=1}^{K}\textit{WER}_{k}}{K}$ (15)

$\displaystyle\textit{MER}_{\mathbb{S}}=\frac{\sum_{k=1}^{K}\textit{MER}_{k}}{K},$ (16)

where $K$ is the number of samples in dataset $\mathbb{S}$ .

Unlike the precision and recall in F1, WER and MER are soft evaluation metrics that can be measured on a single sample or on a sample set, whereas precision and recall cannot be evaluated on one sample. Both WER and MER take values in [0, 1].

From the definitions above, WER measures how well the model avoids wrong evidence: the lower the WER, the better the model recognizes the right evidence. MER measures how completely the model recognizes all the right evidence: the lower the MER, the more supporting evidence the model extracts.
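A minimal sketch of the two metrics follows, applied to the two examples in Table 1. It reads WER as the fraction of predicted spans or cells that are wrong and MER as the fraction of ground-truth spans or cells that are missed, which is the reading consistent with "lower is better" for both. Matching evidence by exact string equality and returning 0 for an empty prediction or empty ground truth are assumptions made for illustration, and the function names are hypothetical.

```python
def wrong_evidence_ratio(predicted, gold):
    """Fraction of predicted spans/cells that are not in the ground truth."""
    predicted, gold = set(predicted), set(gold)
    if not predicted:
        return 0.0                       # assumed convention for empty predictions
    return len(predicted - gold) / len(predicted)


def missing_evidence_ratio(predicted, gold):
    """Fraction of ground-truth spans/cells that were not predicted."""
    predicted, gold = set(predicted), set(gold)
    if not gold:
        return 0.0                       # assumed convention for empty ground truth
    return 1.0 - len(predicted & gold) / len(gold)


def dataset_average(per_sample_values):
    """Average a per-sample metric over a dataset, as in Eqs. (15) and (16)."""
    values = list(per_sample_values)
    return sum(values) / len(values)


# (prediction, ground truth) pairs taken from Table 1.
samples = [
    (["1,941", "2,032"], ["375", "2,032"]),           # wrong evidence
    (["2017", "2018"], ["2017", "2018", "2019"]),     # missing evidence
]
wer = dataset_average(wrong_evidence_ratio(p, g) for p, g in samples)
mer = dataset_average(missing_evidence_ratio(p, g) for p, g in samples)
```

In the first sample one of two predicted cells is wrong and one of two gold cells is missed; in the second nothing predicted is wrong, but one of three gold items is missed.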

4.2 Baseline model

(1) BERT-RC [32] is a reading comprehension (RC) QA model on SQuAD.
(2) NumNet $+$ V2 [33] is a numerical reasoning model for RC, which encodes comparison information among numbers with a numerically-aware graph neural network and performs numerical reasoning over the numbers in the question and the textual data.
(3) TaPas for WTQ [27] is an end-to-end QA model that does not generate logical forms; it extends BERT's architecture and is jointly pre-trained on text segments and tables crawled from Wikipedia.
(4) HyBrider [5] is a QA model that processes tabular and textual data from Wikipedia; it links the question to relevant information with a linking module and then performs multi-hop reasoning.
(5) TAGOP [4] is the baseline model on the TAT-QA dataset, which includes two modules: evidence extraction and answer reasoning. It employs sequence tagging to extract information from hybrid data and performs numerical reasoning with a set of aggregation operations to obtain the final answer.

Note that BERT-RC and NumNet $+$ V2 perform QA over textual data only, TaPas for WTQ performs QA over tabular data only, and HyBrider and TAGOP perform QA over hybrid data.

4.3 Experiment setup

We apply RoBERTa-Large to initialize the representations of the question, the row-flattened table data and the text data, where the maximum length of a question is set to 46, and the maximum length of a row-flattened table and text is set to 463. The output dimension of RoBERTa-Large is 1024. Our model achieves the best results when the number of heads is set to 3 and 2 for the multi-head attention modules over table and text, respectively. We set the maximum number of epochs to 50 and apply the Adam optimizer to minimize the overall loss function with a learning rate of 0.0005.

4.4 Experimental results

We compared our model with other baselines in terms of F1, EM, WER and MER on TAT-QA dataset. Experimental results are shown in Table 2.

Table 2
Experimental results on TAT-QA

Method	EM	F1	MER	WER
BERT-RC	9.5	17.9	52.6	55.6
NumNet $+$ V2	38.1	48.3	51.2	53.5
TaPas for WTQ	18.9	26.5	49.1	52.7
HyBrider	6.6	8.3	46.7	45.2
TAGOP	55.2	62.7	36.4	34.9
Ours	56.5	64.1	33.6	32.7

The experimental results show that our model achieves the best performance on the TAT-QA dataset, which indicates that the candidate extractor can effectively extract supporting evidence related to questions from tabular and textual data. Specifically, the two multi-head attention modules in the candidate extractor provide multi-level semantic information related to the question from hybrid data, and the origin selector in the reasoning layer locates candidate segments in hybrid data. WER and MER are used to evaluate the performance of evidence extraction from hybrid data. BERT-RC and NumNet $+$ V2 are nearly on par at extracting candidate segments in terms of both MER and WER, while the final result of NumNet $+$ V2 is better than that of BERT-RC, which shows that NumNet $+$ V2 outperforms BERT-RC in numerical reasoning. According to Table 2, TaPas for WTQ lacks the ability to process hybrid data. The poor EM and F1 of HyBrider can be attributed to ignoring the interdependence of tabular and textual data, although it extracts supporting evidence comparatively well in terms of MER and WER. TAGOP obtains the best performance among all baseline methods and is designed specifically for the TAT-QA dataset, yet it is still far behind human experts. According to the error analysis in [4], 55% of its errors come from wrong evidence and 29% from missing evidence, which means that TAGOP lacks the capability of extracting supporting evidence.

Besides, we further investigate the performance of our model and TAGOP in terms of different selection of operator, as shown in Table 3.

Table 3

The performance of our model and TAGOP in terms of different selection of operator

Operator	Validation accuracy (our model)	Validation accuracy (TAGOP)	Test accuracy (our model)	Test accuracy (TAGOP)
Span-in-text	93.1	92.3	92.2	91.6
Cell-in-table	91.8	91.2	87.1	86.7
Spans	97.1	96.8	94.2	93.8
Sum	86.5	86.0	76.9	76.2
Count	93.5	93.8	99.8	100.0
Average	99.7	100.0	99.8	100.0
Multiplication	33.7	33.3	0.1	0.0
Division	76.9	76.5	88.1	87.5
Difference	96.9	96.6	96.8	96.6
Change ratio	96.8	96.1	95.8	95.3
Other	0.0	0.0	0.0	0.0

Specifically, Table 3 shows the accuracy of the operator classifier and origin selector in the proposed model and in TAGOP on the validation set and test set. Our model is slightly better than TAGOP on Span-in-text, Cell-in-table and Spans, which benefits from the origin selector and reflects the difference between extraction mode and aggregation operation. Our model performs on par with TAGOP on the aggregation operators, which indicates that our model retains strong classification ability. Both our model and TAGOP perform worse on the Multiplication operator than on the other operators, because questions answered through multiplication account for only 0.3% of the samples in the TAT-QA dataset [4].

4.5 Ablation experiment

4.5.1 Effect of multi-attention module

The candidate extractor obtains supporting evidence related to the question from hybrid data, using two multi-head attention modules to obtain multi-level semantic information from the table and the text respectively. To evaluate the effectiveness of each part of the model, we conducted a series of ablation experiments on the TAT-QA dataset. The results are shown in Table 4. Specifically, we define three variant models:

• w/o multi-head attention module over table removes the multi-head attention module that extracts supporting evidence related to the question from tabular data, which means that the supporting evidence from the table is predicted directly from the output of RoBERTa.
• w/o multi-head attention module over text removes the multi-head attention module over textual data.
• w/o multi-head attention modules removes both multi-head attention modules in the candidate extractor.

Table 4
Experimental results of model analysis

Model	F1	WER	MER
Full Model	64.1	32.7	33.6
w/o multi-head attention module over table	63.1	35.2	36.0
w/o multi-head attention module over text	63.6	34.5	35.1
w/o multi-head attention modules	62.5	36.1	36.3

The second row in Table 4 shows that the multi-head attention module over table brings improvements of 1.0, 2.5 and 2.4 in F1, WER and MER; the third row shows that the multi-head attention module over text brings improvements of 0.5, 1.8 and 1.5 in F1, WER and MER. This implies that both components are effective for supporting evidence extraction, especially the multi-head attention module over table. The fourth row shows that the full model outperforms the model without both components in terms of F1, WER and MER, by 1.6, 3.4 and 2.7 respectively.

4.5.2 Effect of the head number of multi-head attention

Table 5 shows the impact of the number of heads when two multi-head attention modules are applied to extract supporting evidence in the candidate extractor (3.3). We set the number of heads to around 3, since a large number of heads was empirically found to decrease the final F1. We conduct a series of ablation experiments on head number selection with the configurations $<$ 2, 3 $>$ , $<$ 2, 2 $>$ , $<$ 3, 2 $>$ and $<$ 3, 3 $>$ , where $<$ 2, 3 $>$ denotes that the head numbers of the multi-head attention modules over table and text are 2 and 3 respectively.

Table 5
Effect of the head number of multi-head attention

Head number selection with configuration	MER	WER	F1
$<$ 2, 3 $>$	35.4	33.7	63.4
$<$ 2, 2 $>$	36.3	34.5	62.5
$<$ 3, 2 $>$	33.6	32.7	64.1
$<$ 3, 3 $>$	37.1	35.2	62.8
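
As a small worked check of Table 5, the sketch below picks the best head-number configuration under each metric. The data structure and the helper name `best_config` are illustrative; each key is a pair (heads over table, heads over text).

```python
# Table 5 rows keyed by (heads over table, heads over text).
TABLE5 = {
    (2, 3): {"MER": 35.4, "WER": 33.7, "F1": 63.4},
    (2, 2): {"MER": 36.3, "WER": 34.5, "F1": 62.5},
    (3, 2): {"MER": 33.6, "WER": 32.7, "F1": 64.1},
    (3, 3): {"MER": 37.1, "WER": 35.2, "F1": 62.8},
}


def best_config(metric: str) -> tuple:
    """Return the configuration that is best for the given metric.

    F1 is maximised; the error ratios MER and WER are minimised.
    """
    pick = max if metric == "F1" else min
    return pick(TABLE5, key=lambda cfg: TABLE5[cfg][metric])


for metric in ("F1", "WER", "MER"):
    print(metric, best_config(metric))
```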

4.6 Case study

To demonstrate the capability of extracting complete and correct evidence from hybrid data, we sample a typical case from the experimental results, as shown in Fig. 3. TAGOP extracts supporting evidence related to the question from the hybrid data, i.e. Belgium, France, Germany and Russia, but this evidence is incomplete and leads to the wrong answer, 4. Our model extracts evidence that is both correct and complete, i.e. Belgium, France, Germany, Russia and South Korea, so the correct answer, 5, follows naturally. This shows that our candidate extractor can capture complete and correct evidence related to the question.

Figure 3. Case analysis on TAT-QA.
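
The case in Fig. 3 can be restated as a comparison of evidence sets. The sketch below is illustrative only: it assumes that the gold evidence is exactly the five items described as complete in the case study and that the answer is the number of evidence items. The function `compare_evidence` is a hypothetical helper, and these per-example set differences are not the definitions of WER and MER.

```python
from typing import List, Set, Tuple


def compare_evidence(predicted: List[str], gold: List[str]) -> Tuple[Set[str], Set[str]]:
    """Return (missing, wrong): gold items not predicted, and
    predicted items that are not in the gold evidence."""
    p, g = set(predicted), set(gold)
    return g - p, p - g


# Assumption: the complete evidence listed in the case study is the gold set.
gold = ["Belgium", "France", "Germany", "Russia", "South Korea"]
tagop = ["Belgium", "France", "Germany", "Russia"]
ours = ["Belgium", "France", "Germany", "Russia", "South Korea"]

for name, pred in (("TAGOP", tagop), ("ours", ours)):
    missing, wrong = compare_evidence(pred, gold)
    # Assumption: the answer is the count of extracted evidence items.
    print(name, "answer:", len(pred), "missing:", missing, "wrong:", wrong)
```

Under these assumptions, one missing evidence item is enough to change the count from 5 to 4, which is why incomplete extraction produces a wrong answer even when every extracted item is correct.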

5. Conclusion

We proposed a new QA model over hybrid data with two components: a candidate extractor, realized by two multi-head attention mechanisms over tabular and textual data respectively, which obtains rich semantic information related to a given question; and an origin selector, which determines where the question’s answer comes from. To evaluate evidence extraction failures on more complex and realistic hybrid data, we proposed two metrics, WER and MER. Experiments show that our model outperforms TAGOP by 2.2 and 2.8 in terms of WER and MER respectively on the TAT-QA dataset, which indicates that our model extracts less wrong evidence and more complete supporting evidence from hybrid data.

In this paper, we mainly focus on supporting evidence extraction from hybrid data. In the future, we will do further research on numerical reasoning; evidence extraction and numerical reasoning are both indispensable components of QA over financial reports. Besides, existing work focuses on single-layer reasoning, while multi-loop reasoning is more common in most cases. We will consider multi-loop numerical reasoning in QA over hybrid data as our future work.

Acknowledgments

This work was supported by the National Key Research and Development Program (Grant No. 2022QY0300-01) and the Natural Science Foundation of Shanxi Province (Grant Nos. 202203021221021, 20210302123468, 202203021221001).
