Abstract
With the rapid growth of online reviews, fast and effective access to summary information has become a hot research topic. In this paper, a summary sentence extraction model based on a hierarchical attention network is proposed. The model has an encoder-decoder structure and introduces a two-layer attention mechanism. The sentence encoder uses an attention mechanism, with aspect words as additional input, to obtain a vectorized representation of each sentence. The review document encoder uses an attention mechanism to capture the contextual association between preceding and following sentences. During decoding, a sentence output component composed of a GRU network decides whether each sentence should be selected as a candidate sentence. A greedy algorithm is used to eliminate redundant results, and the final summary is then obtained with a sentence sorting method. The experimental results show that the proposed model outperforms the benchmark models.
Introduction
According to the 42nd “Statistical Report on the Development of China’s Internet Network” released by the China Internet Network Information Center (CNNIC) [4], China’s Internet industry is booming. As of June 2018, the number of Internet users in China reached 802 million, the Internet penetration rate reached 57.7%, and the number of new netizens increased by 29.68 million in the first half of the year. The development trend is shown in Fig. 1. Among these netizens, the proportion of online shopping users and of users using online payment is 71.0%, and online shopping and online payment have become applications with high usage among netizens.

Summary model.
As the user base continues to expand, a large number of online reviews are generated every day, and these reviews are of great value to consumers, businesses, and governments. Potential consumers usually pay attention to other users’ reviews of a product before purchasing it; checking reviews in advance is an important way to understand the real situation of a product or service, and by browsing subjective reviews they can get a basic understanding of opinions on the product or service they are interested in. For enterprises, reviews are the primary channel for understanding customer feedback, product advantages, and potential problems; by analyzing opinionated reviews, a company can understand and grasp the preferences and needs of most users. Through continuous monitoring of online public opinion, government departments can grasp people’s attitudes and tendencies towards hot issues in life.
However, the amount of data submitted by these users is huge and still increasing. It is impossible to read every review carefully to obtain an overall understanding of a product, yet reading only a small part easily leads to a partial and inaccurate judgment. Therefore, how to enable potential consumers to quickly find valuable information in massive reviews, and how to enable enterprises and relevant government departments to make effective use of the massive reviews generated, has become an urgent problem.
Automatic text summarization is a research hotspot that continues to attract wide attention. Automatic summarization techniques can be divided into two major categories: extractive summarization and abstractive summarization.
Extractive summarization selects important sentences from the text; the extracted sentences are then smoothed and organized into a summary. For example, Rada [17] treats each sentence in the text as a node in a graph, constructs the graph from the similarity between sentences, and iteratively calculates the node weights with the PageRank algorithm; the weights serve as the basis for evaluating sentence importance, so that summary sentences can be extracted. Wong [15] combined support vector machines and naive Bayes classification with word relevance, extracted multiple features and designed weight rules for them, and obtained an importance ranking of the sentences in the text; the method showed good results on the data set provided by DUC2001 (Document Understanding Conference, 2001).
With the development of deep learning, abstractive summarization based on neural networks has developed rapidly; in particular, many methods built on the seq2seq (Sequence to Sequence) neural network model have been proposed for text summarization [3, 24]. However, such methods tend to repeat content they have already generated. See et al. [1] proposed a model that fuses seq2seq with a pointer-generator network: on the one hand, the seq2seq model retains the ability to generate abstractive text, and on the other hand, the pointer network copies words directly from the original text. This alleviates inaccurate information in the summary, the poor handling of out-of-vocabulary words, and the high repetition rate of summaries.
To help users understand the main content of massive reviews, it is necessary to further condense the review information and summarize it as a subset of the review sentences. In recent years, the rapid development of deep neural networks, with their powerful representation ability, has pushed forward the limits of machine intelligence in fields such as image classification and machine translation. With the help of deep neural networks, automatic text summarization has also made remarkable progress. However, because of the long-distance dependency problem, the network loses a considerable amount of information by the time it produces its output, resulting in a summary that is not accurate enough. The attention mechanism is therefore applied, which helps the model capture important document content and improves the quality of the summary [11, 21]. In this paper, a summary model Bi-HAN (Binary-HAN) based on Hierarchical Attention Networks (HAN) [5] is proposed for summarizing review information. The model introduces the attention mechanism at two levels, the sentence layer and the review document layer. The GRU is extended, and external knowledge from the LDA model is used to introduce the aspect words identified in the previous chapter into the input word vectors, so that the model pays different amounts of attention to sentences of different importance in the review document and can obtain summary information that is relevant to the current aspect.
Model
Automatic text summarization can generally be divided into two categories: extractive and abstractive. Extractive summarization identifies the important sentences in the original text and extracts them to form a summary.
The model based on the hierarchical attention network proposed for this task includes a sentence encoder, a document encoder, a sentence output component, and a sentence sorting component; their relationship in the model is shown in Fig. 1.
Sentence Encoder: uses the word vectors to obtain a vectorized representation of the sentence;
Document Encoder: consists of a bidirectional GRU, which is combined with the sentence information to obtain a representation of the review document;
Sentence Output Component: selects sentences according to the sentence information and the encoder output that fuses the aspect feature information;
Sentence Sorting Component: sorts the sentences in the summary by their importance weights.
In the model, the sentence encoder represents each sentence as a fixed-length embedded vector built from word vectors, and the sentence vectors are used to build the document vector [19, 20]. The attention mechanism is introduced at the sentence and document levels respectively. The weight of each time step is calculated first, and the weighted sum of the vectors of all time steps is used as the feature vector; the output of the last time step is then used as the feature vector for softmax classification, which determines whether the sentence should be selected for the summary.
Sentence encoder
A bidirectional recurrent neural network is used to encode the sentence, and the attention mechanism is used to identify the important words in the sentence. The overall structure is shown in Fig. 2.

Sentence Encoder structure diagram.
First, let s_i be the i-th sentence. It is made up of words, represented as [x_1, ⋯ , x_T], where x_t is the t-th word. Before being sent to the neural network, each word is replaced with a word embedding e_i obtained from the pre-trained word vectors trained on the Google News data set by Mikolov [21].
Since a review is composed of multiple sentences and a sentence is composed of multiple words, a hierarchical structure from the word layer through the sentence layer to the document layer can be used to model the review. To obtain a better document representation, a bidirectional GRU (gated recurrent unit) is used to model the context information of words and sentences in the document. Each word is embedded into a hidden state by the GRU network, h_t = GRU(h_{t-1}, x_t). The gating mechanism controls how information passes from the previous time step to the next, and it is expected to preserve the information flow over longer sequences. The update gate is given by Equation (1), the reset gate by Equation (2), the new memory by Equation (3), and the final memory by Equation (4).
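For reference, one common formulation of the four GRU steps named above (update gate, reset gate, new memory, final memory) is sketched below. The weight names W, U and b, and the convention used in the final interpolation, are assumptions taken from the standard GRU literature; the notation of Equations (1) to (4) in this paper may differ.

```latex
\begin{align*}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) && \text{(update gate)}\\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) && \text{(reset gate)}\\
\tilde{h}_t &= \tanh\left(W_h x_t + r_t \odot \left(U_h h_{t-1}\right) + b_h\right) && \text{(new memory)}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(final memory)}
\end{align*}
```

Here \sigma is the logistic sigmoid and \odot is element-wise multiplication; when z_t is close to 0 the previous state h_{t-1} is carried forward almost unchanged, which is how the gate preserves information over long sequences.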
A bidirectional GRU model is used, so that the hidden state of each word combines the forward and backward context.
Because not every word contributes equally to the sentence representation, the words that carry more weight in expressing the meaning of the sentence should be identified. An aspect-based attention mechanism is used to distinguish different words and construct an aspect-related sentence representation; the structure of the attention mechanism is similar to that in [12].
The aspect words extracted by the LDA model are introduced: the top 30 words are selected and encoded as a continuous real-valued vector u_w, which is added to the input word vector to enrich the input information and improve the ability to capture important features of the words. u_w and h_t are mapped to a transform space, and the importance of the t-th word is converted into a normalized weight α_t by the softmax function.
W_a, W_w, b_a and u_s are parameters of the attention mechanism. After the normalized attention weights are obtained, the hidden states h_t of the encoder are weighted by the corresponding weights and summed to obtain the representation vector of the i-th sentence, Equation (10), which is used to measure the importance of the i-th sentence.
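The attention step described above can be sketched as follows. The scoring form u_t = tanh(W_a h_t + W_w u_w + b_a), followed by a softmax over u_s · u_t, is an assumption inferred from the parameters named in the text (W_a, W_w, b_a, u_s); the helper names and the toy dimensions are hypothetical, and the exact equations of the model may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def aspect_attention(H, u_w, W_a, W_w, b_a, u_s):
    """Pool word hidden states H (T x d) into one sentence vector.

    Assumed scoring: u_t = tanh(W_a h_t + W_w u_w + b_a), score_t = u_s . u_t.
    Returns (sentence_vector, attention_weights).
    """
    aspect_part = matvec(W_w, u_w)  # identical for every word in the sentence
    scores = []
    for h_t in H:
        hidden_part = matvec(W_a, h_t)
        u_t = [math.tanh(a + b + c)
               for a, b, c in zip(hidden_part, aspect_part, b_a)]
        scores.append(sum(p * q for p, q in zip(u_s, u_t)))
    alpha = softmax(scores)  # normalized weight of each word
    dim = len(H[0])
    # Weighted sum of the hidden states gives the sentence representation.
    s = [sum(alpha[t] * H[t][j] for t in range(len(H))) for j in range(dim)]
    return s, alpha


# Toy usage with two words and 2-dimensional hidden states.
H = [[1.0, 0.0], [0.0, 1.0]]
s, alpha = aspect_attention(
    H, u_w=[1.0],
    W_a=[[1.0, 0.0], [0.0, 1.0]], W_w=[[0.0], [0.0]],
    b_a=[0.0, 0.0], u_s=[1.0, 0.0],
)
```

The same pooling can be reused at the document level by passing sentence hidden states instead of word hidden states.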
The document encoder takes the sentence vector sequence s = [s_1, ⋯ , s_N] produced by the above sentence encoder as input. The same network structure as the sentence encoder is used, as in Equation (11) and Equation (12). In a review text, a sentence may comment on a given topic in many ways, and the order of the sentences affects how important each sentence is as a description of that topic, so the forward and backward hidden states of each sentence are spliced together.
The above process uses a bidirectional GRU model to incorporate the context information of each sentence into its hidden-layer representation h_i. The aspect factor is also considered in the document representation, and the attention mechanism is introduced. As mentioned in the previous section, the aspect words extracted by the LDA [7] model are introduced and encoded as a continuous real-valued vector u_w. a_i is a scalar value representing the importance of the i-th sentence, and the review document vector w is obtained as the weighted sum of the sentence vectors.
Each sentence of the review document is represented by the concatenation of the corresponding sentence vector and the document vector, so that both sentence-level and document-level context are taken into account when predicting whether the sentence belongs in the summary [2]. The sentence output component consists of a GRU network that labels each sentence in the review corpus as selected or not selected.
Given the review document representation and the hidden states H of the review document encoder, the sentence output component combines the hidden state of the decoder with the hidden state of the encoder at the corresponding position to predict whether the label of the i-th sentence is 1 or 0:
In order to estimate the degree of inconsistency between the predicted value of the model and the true value, the loss function used in the model training process is as follows:
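A common choice of loss for a 0/1 sentence-selection label is binary cross-entropy. The sketch below assumes that loss; it is an illustration only and may differ from the loss function actually used in this model. The function name and the `eps` clipping constant are hypothetical.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels.

    probs:  predicted probability that each sentence is selected
    labels: gold label of each sentence (1 = in summary, 0 = not)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Confident correct predictions give a lower loss than uncertain ones.
good = binary_cross_entropy([0.9, 0.1], [1, 0])
unsure = binary_cross_entropy([0.5, 0.5], [1, 0])
```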
Whether a sentence can be selected for the summary is determined by the following expression:
Redundant sentences often appear in the summarization process, and the common sentence filtering methods for this problem are the greedy algorithm and ILP (Integer Linear Programming). Generally speaking, the greedy algorithm scores lower than ILP because it maximizes greedily at each step, whereas ILP solves the selection problem exactly under its constraints. However, the greedy algorithm provides a good balance between performance and computational cost. This section uses a greedy algorithm to filter sentences and eliminate redundant information. Highly relevant sentences are iteratively added to the summary, and a sentence s_t is added only when it satisfies the bigram-overlap condition, where bigram-overlap(s_t, S) represents the bigram overlap between the sentence s_t and the current summary S. The threshold λ for selecting duplicate content on bigrams follows Ren [16] and is set to 0.65; its value on the data sets is also discussed experimentally.
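The greedy filtering step can be sketched as below. The acceptance condition bigram-overlap(s_t, S) < 1 − λ is an assumption: it is chosen to match the later observation that a small λ weakens redundancy elimination and a large λ leaves fewer candidates. The overlap definition (share of the sentence's bigrams already in the summary) and the `max_sentences` cap are also hypothetical.

```python
def bigrams(tokens):
    """Set of adjacent token pairs in a tokenized sentence."""
    return set(zip(tokens, tokens[1:]))


def bigram_overlap(sentence, summary):
    """Fraction of the sentence's bigrams already present in the summary."""
    sent_bi = bigrams(sentence)
    if not sent_bi:
        return 0.0
    summ_bi = set()
    for s in summary:
        summ_bi |= bigrams(s)
    return len(sent_bi & summ_bi) / len(sent_bi)


def greedy_select(ranked_sentences, lam=0.65, max_sentences=3):
    """Add sentences in ranked order, skipping those too similar to the summary.

    ranked_sentences: tokenized sentences, most relevant first.
    Assumed condition: keep s_t only if bigram_overlap(s_t, S) < 1 - lam.
    """
    summary = []
    for sent in ranked_sentences:
        if len(summary) >= max_sentences:
            break
        if bigram_overlap(sent, summary) < 1.0 - lam:
            summary.append(sent)
    return summary


ranked = [
    ["the", "room", "was", "clean"],
    ["the", "room", "was", "clean", "today"],  # near-duplicate of the first
    ["staff", "were", "friendly"],
]
selected = greedy_select(ranked)
```

In this toy input the second sentence shares three of its four bigrams with the first, so it is skipped at λ = 0.65.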
Experiment DataSet
The public data sets from TripAdvisor and Amazon used by Wang [10] are adopted. Table 1 shows some general statistics of the data sets.
Properties of datasets
Four models were selected as baselines: 1) the NNSE model [12], which performs extractive summarization with a neural network; 2) LEAD [21], a simple method that always selects the first 3 sentences of a document as the summary; 3) LexRank [9], a graph-based summarization method based on eigenvector centrality; 4) Logistic Regression, hereinafter referred to as LR, used to predict whether a sentence is included in the summary.
Since 90% of the sentences in the data set do not exceed 15 words and 90% of the review documents do not exceed 30 sentences, the sentence length is set to 15 and the review document length is set to 30 for the aforementioned review document encoder and sentence output component. The network parameters are optimized with the RMSProp optimizer [22] during training. The GRU unit size is set to 600, dropout [23] to 0.2, and the learning rate to 0.01.
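The fixed lengths above imply that each review is truncated or padded to 30 sentences of 15 tokens before encoding. A minimal sketch of that preprocessing is given below; the padding token, the right-padding choice, and the function name are assumptions, since the text specifies only the two length limits.

```python
MAX_WORDS = 15   # sentence length from the text
MAX_SENTS = 30   # review document length from the text
PAD = "<pad>"    # hypothetical padding token


def pad_document(doc):
    """Truncate or right-pad a tokenized review to MAX_SENTS x MAX_WORDS.

    doc: list of sentences, each a list of tokens.
    """
    padded = []
    for sent in doc[:MAX_SENTS]:
        sent = sent[:MAX_WORDS]
        padded.append(sent + [PAD] * (MAX_WORDS - len(sent)))
    # Fill missing sentences with all-padding rows.
    while len(padded) < MAX_SENTS:
        padded.append([PAD] * MAX_WORDS)
    return padded


review = [["good", "hotel"], ["breakfast", "was", "cold"]]
grid = pad_document(review)
```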
Evaluation method of summary
The ROUGE [6] evaluation method is used to measure the quality of the summary. ROUGE-N (N = 1, 2, 3, 4) measures the degree of n-gram overlap between the candidate summary and the reference summary, and thus the informativeness of the summary. In the experiment, ROUGE-1, ROUGE-2 and ROUGE-L are reported. Because Owczarzak [14] pointed out the effectiveness of ROUGE-2 in automatic summary evaluation, it is used as the main evaluation indicator.
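To make the n-gram overlap concrete, the following is a minimal sketch of the recall form of ROUGE-N for a single reference summary. It is a simplified illustration, not the official ROUGE toolkit: it assumes whitespace-tokenized input and omits stemming, stopword handling, and multi-reference aggregation. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n):
    """Share of reference n-grams that also appear in the candidate.

    Counts are clipped, so a repeated n-gram in the reference is matched
    at most as many times as it occurs in the candidate.
    """
    ref = ngrams(reference, n)
    cand = ngrams(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


candidate = "the room was very clean".split()
reference = "the room was clean".split()
r1 = rouge_n_recall(candidate, reference, 1)
r2 = rouge_n_recall(candidate, reference, 2)
```

In this toy pair every reference unigram appears in the candidate, while the inserted word "very" breaks one of the three reference bigrams, which illustrates why ROUGE-2 is stricter than ROUGE-1.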
Results
Summary content
The Bi-HAN model was compared with NNSE, LEAD, LR and LexRank. The verification results of all models are shown in Table 2.
ROUGE scores on the two datasets
Verification on the two data sets shows that both the benchmark models and the Bi-HAN model are stable across data sets: an algorithm that scores low on one data set also scores low on the other. This is because the two review data sets differ little in the statistical characteristics of their text, and it also reflects the stability of the algorithms.
As seen from Table 2, where the best result for each of the three evaluation indicators is shown in bold, NNSE obtained the highest ROUGE-1 score on the TripAdvisor data set, but its ROUGE-2 score is about 1.1% lower than that of Bi-HAN and its ROUGE-L score is about 2.9% lower. In comparison, the Bi-HAN model is more stable. On both the TripAdvisor and the Amazon data sets, the Bi-HAN model achieved the highest ROUGE-2 score, reflecting better results than the other baseline methods.
In particular, Bi-HAN with the redundancy removal step shows a slight further improvement, which is underlined in the table. It can be seen that the Bi-HAN model’s use of the attention mechanism in sentence encoding and document encoding improves its ability to capture the document theme, and the selected sentences are more representative.
To further test the effect of this model and analyze the role played by each component, Table 3 shows the evaluation indicators of the model described in this chapter after removing the attention mechanism at different levels. Taking the ROUGE-2 indicator as an example, in the experiment on the TripAdvisor data set the score drops by 0.54 (a decrease of 6.79%) without the sentence-layer attention mechanism, and by 0.47 (a decrease of 5.91%) without the document-layer attention mechanism. In the same experiment on the Amazon data set, the scores drop by 0.68 and 0.59, decreases of 8.56% and 7.43%, respectively. On both data sets, removing the sentence-layer attention mechanism causes the larger drop in the evaluation score. It can be inferred that, in the two-layer attention model, the sentence-layer attention mechanism plays the relatively more important role. The external input of this mechanism is the aspect information extracted in the previous chapter, which confirms that the aspect information is beneficial for summary extraction.
Performance comparison after model reduction
The value of λ mentioned in Section 3.3 affects whether a sentence will be selected for the summary. If the value is too small, the redundancy elimination effect is weakened and as many sentences as possible are selected for the summary; if the value is too large, fewer candidates satisfy the requirement. From the redundancy screening condition, the value is related to the length of the sentences in the corpus. Because general news corpora and online review corpora have different sentence-level statistical characteristics, the relationship between λ and the evaluation scores was verified on the experimental data. Figure 3 shows this relationship: the abscissa is the value of λ in steps of 0.05, and the ordinate is the value of ROUGE-1 and ROUGE-2 (in %); the upper curve shows ROUGE-1 and the lower curve shows ROUGE-2. It can be seen that ROUGE-1 reaches its highest score when λ is about 0.6, and ROUGE-2 reaches its highest score when λ is between 0.6 and 0.7. Therefore, λ is set to 0.65.

The relation between λ and ROUGE-1/2.
In this paper, a summary model for review document data is designed based on the hierarchical attention network. The GRU is extended with aspect knowledge in the form of aspect vocabulary, and the attention mechanism is used to identify the words with higher weights in a sentence. These words form the sentence representation, from which a representation of the text is then formed by a second attention mechanism. The model is compared with a series of unsupervised and supervised summarization baselines. The experimental results show that the model performs better than most of the baselines; it uses the recurrent structure to capture context information, and the attention mechanism introduced in the two encoders effectively improves the extraction of salient summary sentences. Finally, the model was tested with and without the redundancy removal step, and the results show that redundancy removal benefits the summary of the reviews.
The ROUGE evaluation method is based on word correspondence and does not consider the semantic level. If the words are different, the calculated score will not be high even if the semantics are close. As summarization techniques advance, cases in which the same meaning is expressed with different wording will increase. Therefore, the evaluation method for summaries still has shortcomings, and further consideration of the evaluation method is needed.
Acknowledgments
This research is supported by Scientific Research Project of Beijing Educational Committee (No. KM201810017005).
