A Chinese automatic question answering technique based on semantic similarity

Abstract

This paper addresses Chinese automatic question answering in the open domain. In addition to the traditional association between questions and questions, it adds the correlation between questions and answers, using the cosine similarity between a question and an answer as their semantic similarity. A bidirectional long short-term memory network (BiLSTM) is applied to the questions and to the answers to capture the association between contexts, and an attention mechanism is added to relate questions to answers. Experimental verification shows that the accuracy of automatic question answering with the proposed method reaches 70%.

1. Introduction

With the explosive growth of web data, there is an increasing demand for automatic question answering technology. Users want to ask questions by entering natural language sentences and have the system return highly accurate answers automatically. At present, research on Chinese automatic question answering is mostly concentrated in professional fields, such as medicine and agriculture. There is relatively little research on automatic question answering for open domains. Different datasets require different processing methods, and the same model performs differently on datasets from different domains. What is needed most in real life is automatic question answering for the encyclopedic open field. This paper focuses on an open-domain dataset, uses a deep learning model, and generalizes it to achieve automatic question answering that is closer to real application scenarios.

2. Related works

Research on automatic question answering started earlier abroad and achieved a certain degree of development. However, because the results were not yet practical, the field stagnated for a time. In recent years, with the application of Google's word2vec word vector tool and the development of deep learning, the field of automatic question answering has begun a new round of attempts [1, 2, 3, 4, 5]. For example, RNNs were used to strengthen the relationship between preceding and following text in question-answer matching [6]. LSTM networks were later introduced to strengthen relevant memory and to mitigate gradient explosion and gradient vanishing [7, 8]. On this basis, BiLSTM [9, 10] was proposed, and attention mechanisms were added later [11, 12, 13, 14]. Wang [15] proposed an inner attention mechanism to address the attention bias problem. In this approach, an attention mapping is performed before the sentence vector is fed into the LSTM network to obtain the sentence representation, which effectively reduces the bias. In 2016, dos Santos proposed a bidirectional attention mechanism in the paper “Attentive Pooling Networks” [16], which can significantly improve the performance of discriminative models in pair-wise ranking or classification by jointly learning the representations of the two inputs and their similarity measure. After 2015, CNNs, which are widely used in computer vision, were also added to the design of automatic question answering networks, using value-shared weights instead of position-shared weights in convolutional networks to combine different matching signals. Yu et al. [17] used a CNN to extract semantic features of text without any manual feature extraction, so the approach can be used in any domain.
Lin [18] proposed a BiLSTM network that extracts deeper features by feeding the data in both forward and reverse order and stacking the outputs. Neubig et al. [19] trained a convolutional neural network on a text dataset and then used transfer learning to apply the trained model to a Q&A dataset. Research on semantic correlation is gradually deepening, and the division of deeper network layers is becoming more and more detailed. Other researchers have focused on how to add external knowledge to the network to compensate for the lack of background knowledge [20, 21, 22].

Research on Chinese question answering systems started relatively late and progressed slowly. The lexical and grammatical structure of Chinese is more complex than that of English, and word segmentation dictionaries are updated much more slowly than new words emerge on the Internet. This makes Chinese word segmentation difficult and leaves research short of basic resources. Nevertheless, through the continued efforts of Chinese researchers, much progress has been made. In 2016, Zheng et al. designed an automated question answering system for wheat pests and diseases, to solve the problems farmers encounter during wheat cultivation in a timely and convenient manner [23]. Products such as Huawei Cloud and Taobao turn their everyday communication with customers into a question corpus and, through learning and training, develop their own smart customer service. Wang et al. used a classifier to train on Chinese triples and solved the problem of extracting answers from Chinese knowledge bases by adding lexical interpretation resources to text preprocessing [24]. Zhang et al. constructed an attention-based convolutional neural network model to learn the relationship between questions and entities, and modeled character-level interactions between question and answer sequences to improve the accuracy and effectiveness of the model [25]. In general, however, automatic Chinese question answering for open domains is not yet mature, and model accuracy and generalization ability still need to be improved.

Therefore, this paper studies automatic question-answer matching on open-domain corpora. The matching degree between questions and answers is enhanced by finding the semantic similarity between them. An automatic question answering model is built and trained, and its accuracy is improved through continuous iterative optimization of the parameters. The remainder of this paper is organized as follows. Section 3 explains the semantic-similarity-based automatic question answering approach in detail: which algorithm is used to solve the problems arising in each part of the question answering system, and the content, mathematical meaning, and usage of each algorithm. Section 4 covers the implementation and validation of the model, including an evaluation designed for the corpus to assess correctness and stability. Section 5 concludes the paper.

3. Automatic question answering based on semantic similarity

The overall architecture of the automatic question answering model is shown in Fig. 1. First, the text data is pre-processed, converted into word vectors, and fed into the model. In the model, a BiLSTM network is applied to the questions and to the answers to capture the association between contexts, and the sentences are semantically represented to obtain distributed representations of the question and the answer. After that, an attention mechanism dynamically adjusts the distance between questions and answers, and each sentence representation is converted into a one-dimensional vector. Finally, the correlation between a question and an answer is obtained by calculating the cosine similarity of these one-dimensional vectors, and the final answer is selected accordingly.

Figure 1.

Automatic question answering model architecture diagram.

3.1 Word vector calculation

Before the subsequent calculation, words need to be converted into word vectors [26]. This paper uses the word2vec model. The text is processed before model training [27] so that it meets the format requirements of the model.

Word2vec is a distributed representation encoding [28]. It finds correlations between words by characterizing their contextual relationships; by mapping words into a vector space, word similarity becomes a comparison of distances between words in that space. There are two approaches to associating contextual relationships. One takes the context as input and predicts the missing word, which is called CBOW in word2vec. The other takes a word as input and predicts the words around it, which is called skip-gram. This paper uses skip-gram; the principle is shown in Fig. 2.

Figure 2.

Skip-gram.

Initially, each word is one-hot encoded, and the goal is to represent the words with as few dimensions as possible. In skip-gram, the input is the one-hot encoding of a word, and the output is its context words, also one-hot encoded. The task of the neural network is then to find a mapping between them. In word2vec the hidden layer is linear, which greatly simplifies earlier language models, so we only need to find a set of hidden layer weights that represents this mapping. Because the one-hot encoding of each word is unique, multiplying it by the hidden layer weight matrix selects a unique row of that matrix. After model training, the rows of the hidden layer weight matrix are therefore the word vectors of the words.
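
The lookup described above can be illustrated with a minimal sketch. The vocabulary, the window size, and the weight values below are illustrative assumptions, not the settings used in this paper. The sketch only shows that multiplying a one-hot vector by the hidden layer weight matrix returns one row of that matrix, and that skip-gram training pairs are (center word, context word) pairs taken within a window.

```python
def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def project(one_hot_vec, weights):
    """Multiply a 1 x V one-hot vector by a V x d weight matrix."""
    dim = len(weights[0])
    return [
        sum(one_hot_vec[v] * weights[v][k] for v in range(len(weights)))
        for k in range(dim)
    ]


def skipgram_pairs(tokens, window):
    """List (center, context) training pairs within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical 3-word vocabulary with 2-dimensional word vectors.
hidden_weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
vector = project(one_hot(1, 3), hidden_weights)  # selects row 1 of the matrix
pairs = skipgram_pairs(["w0", "w1", "w2"], window=1)
```

In practice the multiplication is never carried out explicitly; the row is read directly from the matrix, which is why the trained weight matrix can serve as the word vector table.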

3.2 BiLSTM network

With the word vectors available, the next step is to build a deep model for training. The LSTM network [29] was briefly introduced above, and many variants have been derived from the LSTM architecture according to the needs of different corpora. In automatic question answering, not only does the question influence the answer, but the answer also serves as a kind of feedback for the question. Within each sentence, the correlation between earlier and later parts does not only propagate forward: later words also act backward on preceding words, phrases, or even sentences. Especially in Chinese, people often modify earlier text because of key words that come later; some speakers state the result first, while others explain it and then give the reasons behind it. A plain LSTM handles such expressions poorly because it only propagates forward along the time axis. A BiLSTM network is therefore introduced, which combines a forward LSTM and a backward LSTM.

Figure 3.

BiLSTM network.

As Fig. 3 shows, BiLSTM adds a reverse pass to LSTM. It can be understood as reversing the input sequence and computing the output again in the same way as the LSTM. The final result is a stack of the forward LSTM results and the reverse LSTM results, so that the model takes contextual information from both directions into account. In this paper, the questions and answers are vectorized separately to obtain question embeddings and answer embeddings. These embeddings are then fed into the same BiLSTM network to obtain the hidden layer outputs $h_{q}$ and $h_{a}$, and the salient features of $h_{q}$ and $h_{a}$ are extracted through maximum pooling. The questions and answers are fed into the same network because sharing parameters yields higher accuracy.
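
The data flow of this step can be sketched as follows. This is a simplified illustration under stated assumptions: a toy elementwise recurrent cell stands in for the LSTM (the gates and weight matrices are omitted), and the function names, parameter values, and input vectors are hypothetical. It shows only the reversal of the sequence, the stacking of both directions at each time step, the use of one shared encoder for question and answer, and maximum pooling over time.

```python
import math


def recurrent_pass(inputs, w_in, w_rec):
    """Toy elementwise recurrent pass standing in for an LSTM.

    h_t = tanh(w_in * x_t + w_rec * h_{t-1}); a real LSTM adds gates and
    weight matrices, which are omitted here to keep the sketch short.
    """
    hidden = [0.0] * len(inputs[0])
    outputs = []
    for x in inputs:
        hidden = [math.tanh(w_in * xi + w_rec * hi) for xi, hi in zip(x, hidden)]
        outputs.append(hidden)
    return outputs


def bidirectional_encode(inputs, forward_params=(1.0, 0.5), backward_params=(1.0, 0.5)):
    """Run a forward and a reversed pass, then stack them per time step."""
    forward = recurrent_pass(inputs, *forward_params)
    # Reverse the sequence, encode it, and re-align the outputs with time.
    backward = recurrent_pass(inputs[::-1], *backward_params)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def max_pool(outputs):
    """Keep the largest value of every feature across all time steps."""
    return [max(column) for column in zip(*outputs)]


# Hypothetical embeddings: 2 features per time step.
question = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]
answer = [[0.2, 0.1], [-0.3, 0.4]]
h_q = bidirectional_encode(question)  # the same encoder is used for both inputs
h_a = bidirectional_encode(answer)
o_q = max_pool(h_q)
o_a = max_pool(h_a)
```

Because pooling collapses the time axis, questions and answers of different lengths end up as vectors of the same size, which is what the later cosine comparison requires.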

3.3 Attention-based QA-LSTM

Even an LSTM network has difficulty preserving sentence information across long dependencies. Therefore, this paper adds an attention mechanism that focuses limited attention on key information, so as to save resources and quickly obtain the most useful information. With it, the model can grasp the key of the task more efficiently, reduce network parameters, and speed up iterative updates of the network.

Figure 4 shows the decomposition of the Attention principle.

Figure 4.

Attention schematic.

The attention principle in this article is divided into 3 main steps. In the first step, the attention score $s_{i}$ is obtained by performing a series of operations on query and key, such as summing, multiplying or calculating cosine similarity. Here we use a dot product for the calculation.

$\displaystyle s_{i}={Q}^{T}{k}_{i}$ (1)

In the second step, the obtained scores are normalized, which is done here using the softmax function, so that the weight ratio for each score is obtained.

$\displaystyle\alpha_{i}=\text{softmax}\left({{s}_{i}}\right)=\frac{\text{exp}\left({{s}_{i}}\right)}{\sum_{{j}=1}^{N}\text{exp}\left({{s}_{j}}\right)}$ (2)

In the third step, the Value is finally weighted and summed by the weighting ratio.

$\displaystyle\text{Attention}\left({\left({{K},{V}}\right),{Q}}\right)=\sum_{{i}=1}^{N}{\alpha}_{i}{v}_{i}$ (3)
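
The three steps in Eqs (1) to (3) can be sketched in a few lines. This is a minimal illustration with hypothetical query, key, and value vectors; subtracting the largest score before exponentiation is a common numerical safeguard that does not change the result of Eq. (2).

```python
import math


def dot(u, v):
    """Dot product used as the attention score, Eq. (1)."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Normalize scores into weights that sum to one, Eq. (2)."""
    peak = max(scores)  # subtracted for numerical stability only
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Weighted sum of the values, Eq. (3)."""
    weights = softmax([dot(query, key) for key in keys])
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


# Hypothetical inputs: both keys score equally, so the values are averaged.
weights, context = attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[2.0, 0.0], [0.0, 2.0]],
)
```

A key that scores higher against the query receives a larger weight, so its value contributes more to the output vector.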

In this paper, the attention mechanism is added to better extract the relationship between the question and the answer in long dialogues, so that the question and the answer are no longer isolated. Propagating information between questions and answers over long distances requires certain dependencies, and the fixed width of the hidden vector of the bidirectional LSTM model becomes a bottleneck. The attention mechanism alleviates this problem by dynamically adjusting the distance between the information-rich parts of questions and answers.

Before the maximum pooling or average pooling layer, each BiLSTM output vector is multiplied by a softmax weight. Specifically, at each time step $t$, the BiLSTM output vector of the answer is $h_{a}\left(t\right)$, and the question embedding output by the BiLSTM is $o_{q}$. The updated vector $\widetilde{{h}_{a}}\left(t\right)$ for each answer is computed as follows:

$\displaystyle{m}_{a,q}\left({t}\right)=\tanh\left({{W}_{{am}}{h}_{a}\left({t}\right)+{W}_{{qm}}{o}_{q}}\right)$ (4)

$\displaystyle{s}_{{a},q}\left({t}\right)\propto\text{exp}\left({w_{ms}^{T}{m}_{a,q}\left({t}\right)}\right)$ (5)

$\displaystyle\tilde{h}_{a}(t)=h_{a}(t)s_{a,q}(t)$ (6)

$W_{am}$, $W_{qm}$ and $w_{ms}$ are parameters of the attention mechanism, which gives more weight to certain words of the answer, with the weights computed according to the question information.
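
Eqs (4) to (6) can be traced with the following sketch. It assumes that the proportionality in Eq. (5) is resolved by a softmax over the time steps of the answer, which matches the softmax weight mentioned in the text; the matrices and vectors are hypothetical values chosen so that the result is easy to check by hand.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def attend_answer(h_a, o_q, W_am, W_qm, w_ms):
    """Reweight every answer time step by its relevance to the question."""
    projected_q = matvec(W_qm, o_q)
    logits = []
    for h_t in h_a:
        # Eq. (4): combine the answer state and the question embedding.
        m_t = [math.tanh(a + q) for a, q in zip(matvec(W_am, h_t), projected_q)]
        logits.append(sum(w * m for w, m in zip(w_ms, m_t)))
    # Eq. (5): assumed here to be a softmax over the time steps.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    s = [e / total for e in exps]
    # Eq. (6): scale each answer state by its attention weight.
    attended = [[s_t * x for x in h_t] for s_t, h_t in zip(s, h_a)]
    return s, attended


# Hypothetical parameters: with W_am set to zero every time step gets the
# same score, so the two answer states are simply halved.
h_a = [[1.0, 2.0], [3.0, 4.0]]
o_q = [0.5, -0.5]
W_am = [[0.0, 0.0], [0.0, 0.0]]
W_qm = [[1.0, 0.0], [0.0, 1.0]]
w_ms = [1.0, 1.0]
s, attended = attend_answer(h_a, o_q, W_am, W_qm, w_ms)
```

The reweighted states are then passed to the pooling layer in place of the original BiLSTM outputs.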

3.4 Cosine similarity

In this paper, similarity calculation is mainly used to define the loss function. In neural network training, the gap between predicted and true values drives back-propagation, which iteratively updates the weights toward more suitable values. Many loss functions exist, but most of them are designed for classification. In this paper, similarity calculation is used to define a loss function suited to matching questions and answers in the corpus. The specific processing method is as follows.

By processing the sequence in both directions, the question and the answer are output as two independent vectors that use information from the preceding and following text. The cosine similarity is then used to measure the distance between questions and answers [30]. The loss between them is calculated by the following function:

$\displaystyle L=\text{max}\left\{{0,M-\textit{cosine}\left({q,a_{+}}\right)+\textit{cosine}\left({q,a_{-}}\right)}\right\}$ (7)

$a_{+}$ is the correct answer given in the training set for the corresponding question, while $a_{-}$ is an incorrect answer randomly selected from the training set. $M$ is a constant that acts as the margin in this hinge-style loss; in this paper it is associated with the number of correct answers for each question, so that the problem can be generalized, since a question may have several correct answers. For each question in the training set, similar correct answers may be found after each training update iteration. In this way, questions are answered on the basis of semantic similarity.

Based on this formula, this paper improves the choice of $a_{-}$ by randomly selecting wrong answers from the same category in the training set, since each question in the “baike2018qa” corpus used in this paper is assigned a category. The aim is that a wrong answer from the same category is not completely unrelated to the question. Providing wrong answers within the same field better enables the machine to learn the distance between correct and wrong answers.
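
The loss in Eq. (7) and the same-category negative sampling can be sketched as follows. The margin value, the category labels, and the vectors are illustrative assumptions, since the paper does not state the value of $M$; the fallback to any other answer when a category has no second entry is also an assumption of this sketch.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def hinge_loss(q, a_pos, a_neg, margin):
    """Eq. (7): zero once the correct answer beats the wrong one by `margin`."""
    return max(0.0, margin - cosine(q, a_pos) + cosine(q, a_neg))


def sample_negative(answers, categories, positive_index, rng):
    """Pick a wrong answer from the same category as the correct one."""
    category = categories[positive_index]
    candidates = [
        i for i, c in enumerate(categories)
        if c == category and i != positive_index
    ]
    if not candidates:  # assumed fallback: any other answer
        candidates = [i for i in range(len(answers)) if i != positive_index]
    return answers[rng.choice(candidates)]


# Hypothetical sentence vectors and category labels.
answers = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
categories = ["science", "science", "life"]
q = [1.0, 0.0]
a_neg = sample_negative(answers, categories, positive_index=0, rng=random.Random(0))
well_separated = hinge_loss(q, answers[0], a_neg, margin=0.2)
confused = hinge_loss(q, a_neg, answers[0], margin=0.2)
```

Only pairs in which the wrong answer comes within the margin of the correct one produce a nonzero loss, so training concentrates on the answers that are hardest to tell apart.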

In addition, there are two ways to aggregate the word-level outputs of the BiLSTM layer: average pooling and maximum pooling. Their effects are compared in the experimental results of this paper.

4. Model implementation and experimental validation

4.1 Selection of data set

Although there are some excellent domain-specific corpora, they cannot be used as the experimental corpus because our model targets open domains. Most questions should therefore concern basic knowledge from various fields. The questions should not be phrased too colloquially, and Internet slang, network-specific symbols, and buzzwords should appear in the corpus as little as possible. According to these requirements, the “baike2018qa” corpus was selected. It contains 1.5 million pre-filtered, high-quality questions and answers, and each question belongs to one category. In total there are 492 categories, 434 of which occur 10 or more times. The training set contains 1,425,000 samples and the test set 45,000.

4.2 Evaluation metrics

This paper uses two evaluation metrics. During model training, we calculate the cosine similarity of the standard question-answer pairs as the benchmark and define the loss function accordingly. On the validation set, we compare the sentence similarity against a given threshold for all questions. If the threshold is exceeded, the match is considered successful; otherwise it is considered failed. When the numbers of keywords in the question and answer sentences differ, we construct a maximum matching of the keywords of each sentence against the other sentence, and finally normalize the result.

(1) loss:

$\displaystyle M-\textit{cosine}\left(q,a_{+}\right)+\textit{cosine}\left({q,a_{-}}\right)$ (8)

(2) Sentence similarity (n_similarity):

$\displaystyle{n}_{\textit{similarity}}=\frac{\sum_{i}^{n}\max\left\{{\sum_{j}^{m}\textit{similarity}\left({q_{i},a_{j}}\right)}\right\}+\sum_{j}^{m}\max\left\{{\sum_{i}^{n}\textit{similarity}\left({q_{i},a_{j}}\right)}\right\}}{n+m}$ (9)
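
One possible reading of Eq. (9) is sketched below. It assumes, following the description of maximum matching above, that each keyword is matched to its most similar keyword in the other sentence, that the best matches are summed in both directions, and that the sum is divided by $n+m$. The keyword vectors are hypothetical, and the threshold is left as a parameter because its value is not given in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def n_similarity(question_vecs, answer_vecs):
    """Normalized sum of best keyword matches in both directions."""
    sims = [[cosine(q, a) for a in answer_vecs] for q in question_vecs]
    best_for_questions = sum(max(row) for row in sims)
    best_for_answers = sum(max(column) for column in zip(*sims))
    return (best_for_questions + best_for_answers) / (
        len(question_vecs) + len(answer_vecs)
    )


def is_match(score, threshold):
    """A pair counts as matched when its score exceeds the threshold."""
    return score > threshold


# Hypothetical keyword vectors: two question keywords, one answer keyword.
question_vecs = [[1.0, 0.0], [0.0, 1.0]]
answer_vecs = [[1.0, 0.0]]
score = n_similarity(question_vecs, answer_vecs)  # (1 + 0 + 1) / 3
```

Dividing by $n+m$ keeps the score between -1 and 1 regardless of how many keywords each sentence contains, so a single threshold can be applied to all pairs.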

4.3 Experimental results and comparative analysis

In the experiments, two models, QA-BiLSTM and Attention-BiLSTM, are compared, each with both average pooling and maximum pooling.

The test set in this paper consists of two parts. The first part is the more than 40,000 questions and answers provided by “baike2018qa”, whose overall form is consistent with the training data; the fields include everyday-life encyclopedia topics as well as some scientific and cultural knowledge. The second part consists of 1,000 manually added items: encyclopedia-type questions are entered, and whether the answers are relevant is judged manually.

The criterion used for evaluation is Mean Average Precision (MAP), which is the ratio of the total number of correct responses to the total number of questions.

$\displaystyle\textit{MAP}=\frac{\sum_{i=1}^{n}\textit{sgn}\left({{q}_{i},{a}_{i}}\right)}{n}$ (10)
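
As defined in Eq. (10), the criterion is the fraction of questions that receive a correct response. The short sketch below assumes that $\textit{sgn}\left(q_{i},a_{i}\right)$ equals 1 when the answer returned for question $i$ is judged correct and 0 otherwise; the list of judgements is hypothetical.

```python
def correct_ratio(judgements):
    """Share of questions whose returned answer was judged correct."""
    if not judgements:
        raise ValueError("at least one judgement is required")
    return sum(1 for correct in judgements if correct) / len(judgements)


# Hypothetical judgements for four questions: three correct, one wrong.
ratio = correct_ratio([True, False, True, True])
```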

The final accuracy of our model is shown in Table 1.

Table 1

Experimental results

Model	Validation	Test
QA-BiLSTM (max pooling)	68.7%	65.2%
QA-BiLSTM (average pooling)	65.2%	62.3%
Attention-BiLSTM (max pooling)	70.1%	66.6%
Attention-BiLSTM (average pooling)	67.3%	64.9%

The difference between maximum pooling and average pooling can be seen in the bar charts shown in Figs 5 and 6.

Figure 5.

Difference between maximum pooling and average pooling.

Figure 6.

Model comparison.

In the choice of pooling layer, maximum pooling handles the question and answer information better. This is in line with our intuition: the most likely answer is selected each time, rather than averaging over several similar answers, which tends to blur the answer information. After adding the attention mechanism, the accuracy of the whole model improves. For long questions and answers, the attention mechanism better retains the relevant information, emphasizes the key content, and filters out secondary information, which improves question answering performance.

5. Conclusions

The purpose of an automatic question answering system is to enable a machine to "understand" text content as a human does, and then answer related questions with textual information. The model in this paper is designed to find the best match within a given corpus: for a user’s question, the question with the highest semantic similarity is found in the database and the corresponding answer is returned. We consider the interconnections between words, strengthen contextual associations through the BiLSTM model, and add an attention mechanism to question-answer matching so that questions and answers are no longer isolated. Finally, the model is rewarded and penalized according to the cosine similarity.

Although similarity calculation is incorporated into the method of this paper, little deep semantics is taken into account. If the answer is a simple repetition of the question, the cosine distance between question and answer is small and the matching degree is the highest. The method therefore places high demands on the user’s phrasing and on the corpus. How to add deep semantic understanding, so that the machine can truly mine the semantics of texts and build associations between text semantics through deep models, remains a direction for future work.

Acknowledgments

This work was supported by The National Key Research and Development Program of China No. 2019YFC0507800, Beijing Natural Science Foundation No. 19000550381, and Scientific Research Project of Beijing Educational Committee No. KM201710011006.

References

1. Xie and Li L.J., Research on sentence semantic similarity calculation based on Word2vec, Computer Science 44(9) (2017), 256–260.

2. Minaee and Liu, Automatic question-answering using a deep similarity neural network, IEEE Global Conference on Signal and Information Processing (2017), 923–927.

3. Bao and Liu, HHH: An online medical chatbot system based on knowledge graph and hierarchical bi-directional attention, Proceedings of the Australasian Computer Science Week (2020), 1–10.

4. LeCun, Bengio and Hinton, Deep learning, Nature 521(7553) (2015), 436–444.

5. Zhang et al., Question answering over knowledge base with neural attention combining global knowledge information, https://arxiv.org/abs/1606.00979, 2016.

6. Zhang W.E. et al., Generating textual adversarial examples for deep learning models: a survey, https://arxiv.org/abs/1901.06796, 2019.

7. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997), 1735–1780.

8. Bordes et al., Large-scale simple question answering with memory networks, https://arxiv.org/abs/1506.02075, 2015.

9. Devlin et al., BERT: pre-training of deep bidirectional transformers for language understanding, Computing Research Repository (2018), 1–14.

10. Shi M.F. et al., Question categorization of community question answering by combining Bi-LSTM and CNN with attention mechanism, Computer Systems Applications 27(9) (2018), 157–162.

11. Kadlec et al., Text understanding with the attention sum reader network, Proceedings of the 54th Annual Meeting of Association for Computational Linguistics (2016), 908–918.

12. Dhingra et al., Gated-attention readers for text comprehension, Proceedings of the 55th Annual Meeting of Association for Computational Linguistics 1 (2017), 1832–1846.

13. Hao et al., An end-to-end model for question answering over knowledge base with cross-attention combining global knowledge, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (2017), 221–231.

14. Sun et al., Open domain question answering using early fusion of knowledge bases and text, Proceedings of the Conference on Empirical Methods in Natural Language Processing (2018), 4231–4242.

15. Wang, Liu and Zhao, Inner attention based recurrent neural networks for answer selection, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (2016), 1288–1297.

16. Santos C.D. et al., Attentive pooling networks, https://arxiv.org/abs/1602.03609v1, 2016.

17. Yu et al., Deep learning for answer sentence selection, Computer Science, 2014.

18. Lin and Wang, ICRC-HIT: A deep learning based comment sequence labeling system for answer selection challenge, International Workshop on Semantic Evaluation (2015), 210–214.

19. Neubig, Neural machine translation and sequence-to-sequence models: a tutorial, https://arxiv.org/abs/1703.01619, 2017.

20. Chung, Lee and Glass J.R., Supervised and unsupervised transfer learning for question answering, Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2018), 1585–1594.

21. Deng et al., Knowledge as a bridge: improving cross-domain answer selection with external knowledge, Proceedings of the International Conference on Computational Linguistics (2018), 3295–3305.

22. Min, Seo and Hajishirzi, Question answering through transfer learning from large fine-grained supervision data, Proceedings of the Annual Meeting of the Association for Computational Linguistics (2017), 510–517.

23. Zheng et al., Construction and implementation of an ontology-based question and answer system for wheat pests and diseases, Henan Agricultural Science 45(6) (2016), 143–146.

24. Wang, Zhang and Liu, A deep learning approach for question answering over knowledge base, Natural Language Understanding and Intelligent Applications (2016), 885–892.

25. Zhang et al., An attention-based word-level interaction model: relation detection for knowledge base question answering, IEEE Access (2018), 75429–75441.

26. Pennington, Socher and Manning, Glove: Global vectors for word representation, Proceedings of the Conference on Empirical Methods in Natural Language Processing (2014), 1532–1543.

27. and Yang, Word embedding for understanding natural language: a survey, in: Guide to Big Data Applications, Dordrecht: Springer, 2018, pp. 83–104.

28. Mikolov et al., Efficient estimation of word representations in vector space, Computer Science, 2013.

29. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997), 1735–1780.

30. Jiang et al., An improved calculation of semantic similarity of words based on Zhiwang, Chinese Journal of Informatics 22(5) (2008), 84–89.