A unified framework for abstractive summarization over prompt language model and pointer mechanism

Abstract

In this paper, we propose a unified framework for abstractive summarization that combines a prompt language model with a pointer mechanism. Current methods usually employ an encoder-decoder architecture to condense and paraphrase a document. To better paraphrase a document, we propose an abstractive summarization model that uses only a topic-sensitive decoder. Our model has a prompt input module, a text decoder, and a pointer mechanism. We apply our model to the Xsum, Gigaword, and CNN/DailyMail summarization datasets. Experimental results show that our model achieves state-of-the-art results on the Xsum dataset and comparable results on the other two datasets.

Keywords

Abstractive summarization, masked language model, pointer mechanism, text decoder

1 Introduction

Summarization is the task of producing a condensed text from an input text that contains the core meaning of the original. Automatic text summarization can be extractive or abstractive. Researchers first concentrated on extractive methods, which identify the salient text units in a document and take them as the summary [1–6].

The abstractive method uses artificial intelligence to understand the semantics of the original document and generate an informative summary [6–14]. Because sequence-to-sequence models are slow, this method is usually combined with the extractive method. Bottom-up abstractive summarization [15] is currently the state-of-the-art method, combining extractive summarization with pointer-generator abstractive summarization [11]. These methods use an encoder-decoder architecture for abstractive summarization and a pointer network to solve the out-of-vocabulary (OOV) problem when generating new words. Chen [12] used reinforcement learning to select important sentences and generate summaries. In this work, we address the abstractive summarization problem.

The contributions of this work are as follows. First, to better generate new words for news documents, we propose a unified framework for the abstractive summarization task. Second, we propose a prompt language model for abstractive summarization which incorporates a pointer mechanism to solve the OOV problem. Third, we propose a topic-sensitive decoder for abstractive summarization. Our model is applied to the Xsum, Gigaword, and CNN/DailyMail datasets. Our results on the abstractive summarization task are state-of-the-art on the Xsum dataset and are comparable on all ROUGE metrics [16] of the other two datasets.

The remainder of this paper is organized in seven parts. Section 2 reviews related work, and Section 3 gives the background. Our model is presented in Section 4. Section 5 describes our experiments. The empirical results of abstractive summarization are presented in Section 6. Sections 7 and 8 present our conclusions and acknowledgements, respectively.

2 Related work

Automatic text summarization has been studied for years, and consists of extractive summarization and abstractive summarization. Sentence scoring and sentence selection are the two key techniques for traditional extractive summarization. Unsupervised and supervised methods have been proposed to model and score sentences. Term frequency and TF*IDF weights can be used as features for sentence scoring. These unsupervised methods do not require model training or data annotation. The maximal marginal relevance (MMR) method proposed by [17] is used in sentence selection. The sentence selected by the MMR method has the maximal score and is minimally redundant with previously selected sentences. TextRank, based on weighted graphs, is an unsupervised method proposed by [18]. LexRank, as proposed by [19], uses the TextRank method to rank sentences. Recently, deep neural networks have been applied to extractive summarization. Cheng and Lapata [20] treated extractive summarization as a sequence labeling task. Nallapati et al. [6] proposed an extractive summarization model called SummaRuNNer with more features. Latent semantic analysis [21, 22] is a general theory of acquiring similarity and knowledge representation and achieving powerful inductive effects by extracting the right number of dimensions to represent objects and contexts.

Early work on abstractive summarization [23] viewed summarization as a statistical machine translation problem. The model proposed by [23] first selected salient content and then realized the surface form using a statistical model. Calculating the probabilities of candidate abstract terms and candidate surface realizations enables one to choose the most likely summary for an article. Because no evaluation mechanism was available, the results of this method were not reported. After the DUC-2003 and DUC-2004 competitions, summarization research was formalized and standardized. Zajic [24] implemented an abstractive summarization system called TOPIARY, which combined sentence compression and unsupervised topic discovery. TOPIARY performed best at DUC-2004.

Rush [25] first used neural networks for abstractive summarization and applied them to the headline generation task. The core of the model is the standard feedforward neural network language model (NNLM) [26]. The method achieved a state-of-the-art result on the DUC-2004 and Gigaword datasets with the ROUGE metrics. Lopyrev [9] also applied an encoder-decoder architecture based on LSTM to abstractive summarization. They simplified the attention mechanism, but did not compare their experimental results with those of other methods. Chopra [7] continued the work of [25], changing the encoder to a convolutional attention-based encoder and changing the decoder to a recurrent neural network (RNN)-based decoder. They evaluated their model on Gigaword and DUC-2004 with the ROUGE metrics and found that it performed slightly better than [25]. Nallapati [6] applied an attentional encoder-decoder based on RNN to summarization and showed its results on two English corpora. The encoder was a bidirectional GRU-RNN, and the decoder was a unidirectional GRU-RNN in [27]. Using the large vocabulary trick (LVT), they restricted the decoder vocabulary of each mini-batch to words in the source documents of that batch. They used a feature-rich encoder with one embedding vector each for POS, NER tags, and discretized TF and IDF values, concatenated with word-based embeddings. A novel switching decoder/pointer architecture handled the OOV problem. Hu [8] developed a large dataset for Chinese short-text summarization. They applied an attentional encoder-decoder model based on RNN, and the experimental results were not particularly good, probably because they used the representations of characters rather than words as the encoder’s input. The neural sequence-to-sequence model has two shortcomings: it can reproduce factual details inaccurately, and it can generate repeated words.
To address these problems, [11] proposed a hybrid pointer-generator network that can copy words from the source text via pointing, and used coverage to keep track of what had been summarized. Chen [12] applied reinforcement learning to summarization, selecting salient sentences and abstractively rewriting them to generate a concise overall summary.

3 Background: Abstractive summarization

Abstractive document summarization aims to condense a document and retain the salient information. Abstractive summarization consists of document compression and document paraphrasing. With the input of a source document $D = \{x_{1}, x_{2}, \dots, x_{n}\}$, where $x_{i}$ ($1 \le i \le n$) represents one token in the source document, abstractive summarization rewrites D as $\hat{D} = \{{tg}_{1}, {tg}_{2}, \dots, {tg}_{m}\}$, where ${tg}_{j}$ ($1 \le j \le m$) represents one token of the summary and $m \ll n$.

The abstractive summarization model can be defined by Equation (1).

$f (x) = p ({tg}_{j} \mid vocab, {tg}_{1, j - 1}, D, θ),$ (1)

where tg_j is the word to predict and vocab is the vocabulary list of all documents. The abstractive summarization problem is to learn a function f (x) parametrized by θ that maximizes the probability of generating the correct sequences. The generation model will generate one word at a time and will be aware of the previously generated words. Our new abstractive summarization model is based on the pointer generator network [11]. The new generation model can be defined by Equation (2).

$p ({tg}_{j}) = p_{gen} \cdot p ({tg}_{j} \mid vocab, {tg}_{1, j - 1}, D, θ) + (1 - p_{gen}) \cdot attn\_dist,$ (2)

where p_gen is the generation probability, and attn_dist is the attention distribution between tg_{j-1} and the source document D, which represents the copy probability, i.e., the probability of copying a word from the source document.
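As a rough illustration of Equation (2), the sketch below mixes a generator distribution with a copy distribution. It assumes that the attention weights over source positions are accumulated onto the corresponding source words, as in the pointer-generator network [11]; the function name `final_distribution` and the toy numbers are illustrative assumptions, not part of the reported implementation.

```python
def final_distribution(p_gen, vocab_dist, attn_dist, source_tokens):
    """Mix generation and copy probabilities in the spirit of Equation (2)."""
    # Generation part: p_gen * p(tg_j | vocab, ...).
    mixed = {word: p_gen * prob for word, prob in vocab_dist.items()}
    # Copy part: (1 - p_gen) * attention mass, accumulated per source word.
    for token, weight in zip(source_tokens, attn_dist):
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_gen) * weight
    return mixed


# Toy example (hypothetical values): "zyx" is not in the generator vocabulary.
vocab_dist = {"the": 0.5, "cat": 0.3, "sat": 0.2}
source_tokens = ["the", "zyx", "sat"]
attn_dist = [0.1, 0.7, 0.2]
mixed = final_distribution(0.6, vocab_dist, attn_dist, source_tokens)
```

In this toy case the out-of-vocabulary word "zyx" receives probability only through the copy term, which is the behavior the pointer mechanism is meant to provide.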

4 A unified framework for abstractive summarization

The architecture of the unified framework for abstractive summarization model is shown in Fig. 1. The whole model consists of five parts: (1) Masked prompt input layer: masked prompt input of abstractive summarization model; (2) Decoder Layer: a transformer decoder-based language model to predict the next word of the text input; (3) Topic Layer: the topic-level representation of the document inferred by a topic model; (4) Generation Layer: a topic-sensitive decoder with a pointer mechanism to generate the summarization of the document; and (5) a joint training loss function to optimize the topic and decoder models.

Fig. 1

Overview of unified framework for abstractive summarization model.

The transformer decoder-based language model extracts the text word features. The topic model provides additional latent topic features for the decoder to find words that are more related to the document content.

Formally, $D_{k} = \{x_{1}, x_{2}, \dots, x_{n}\}$ is the word sequence of document k, and E_k is the embedding representations of the words, where n is the sequence length and V is the vocabulary size. The one-hot representations of the source document words are fed into the transformer decoder-based language model to obtain the word and document representations. The latter is used to infer the topic-level text representations. In the decoding phase, the topic representation and current target words are fed into the decoder to generate the words.

4.1 Masked prompt input of abstractive summarization model

The input of the abstractive summarization model is shown in Fig. 2. The input module of the language model has three parts: tokens of the source document, prompt tokens, and target tokens of the source document.

Fig. 2

Prompt input module of abstractive summarization model.

The special token [SEP] separates the source and target tokens. To better generate words and train the language model, we mask input words in the source document using the strategy proposed by [28]: a word chosen for masking is replaced with [MASK] 80% of the time, replaced with a random word 10% of the time, and kept unchanged 10% of the time. Only words in the source document are masked; the target and prompt tokens are not. The full token sequence is the input of the embedding layer.
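The 80/10/10 rule above can be sketched as follows. This is a minimal illustration, not the reported implementation: the selection rate `select_prob=0.15` is an assumption borrowed from the usual setting of [28] and is not stated in this paper, and the function name `mask_source` is hypothetical.

```python
import random

MASK = "[MASK]"


def mask_source(tokens, vocab, select_prob=0.15, rng=None):
    """Apply the 80/10/10 masking rule to source-document tokens only."""
    rng = rng or random.Random()
    masked = []
    for token in tokens:
        # Assumed selection step: most tokens are left untouched.
        if rng.random() >= select_prob:
            masked.append(token)
            continue
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK)               # 80%: replace with [MASK]
        elif roll < 0.9:
            masked.append(rng.choice(vocab))  # 10%: replace with a random word
        else:
            masked.append(token)              # 10%: keep the word unchanged
    return masked


example = mask_source(["police", "closed", "the", "road"],
                      vocab=["river", "city"], rng=random.Random(0))
```

Prompt and target tokens would bypass this function entirely, since only the source document is masked.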

In [29], the authors tested GPT-2’s ability to perform summarization on the CNN/DailyMail dataset. The token TL;DR: often precedes the summary of a long article, so they appended TL;DR: after the article to induce summarization behavior. In our framework, we choose TL;DR: as the prompt token and also design additional prompt tokens to induce summarization behavior. Tokens that can steer the behavior of the model and draw out the knowledge in the source document are chosen as effective prompt tokens.
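A minimal sketch of how the three input parts might be assembled is given below. The ordering (source tokens, then [SEP], then prompt tokens, then target tokens) is an assumption inferred from the length relation L = L₁ + L₂ + 1 used later in the paper; the helper name `build_input` is hypothetical.

```python
def build_input(source_tokens, target_tokens,
                prompt_tokens=("TL;DR:",), sep="[SEP]"):
    """Concatenate source, separator, prompt, and target tokens (assumed order)."""
    return list(source_tokens) + [sep] + list(prompt_tokens) + list(target_tokens)


source = ["police", "closed", "the", "road"]   # L1 = 4
target = ["road", "closed"]                    # with the prompt, L2 = 3
tokens = build_input(source, target)           # total length L1 + L2 + 1
```

At inference time the target part would be empty and filled in token by token by the decoder.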

4.2 Language model based on transformer decoder

Because we use the decoder only in our model, we use the decoder of transformer [30] to capture the feature information of the document words. The input of the decoder is the one-hot representation of the document words. We use the embedding layer to obtain the distributed representations of the document words. The transformer can avoid information loss when training the language model for a long text. In the transformer decoder layer, the word in the current position can attend to the words at and before the current position. When predicting the next word, the hidden state of the current position will be computed by the embeddings of the source document, prompt input, and generated words. The decoder model is defined as Equations (3)–(10). The output of the decoder layer is H.

${head}_{i} = Attention (E W_{i}^{E_{1}}, E W_{i}^{E_{2}}, E W_{i}^{E_{3}}),$ (3)

where $W_{i}^{E_{1}} \in ℝ^{d_{e} \times d_{q}}$, $W_{i}^{E_{2}} \in ℝ^{d_{e} \times d_{k}}$, and $W_{i}^{E_{3}} \in ℝ^{d_{e} \times d_{v}}$.

$Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V,$ (4)

where $Q = E W_{i}^{E_{1}}$, $K = E W_{i}^{E_{2}}$, and $V = E W_{i}^{E_{3}}$.

$attention\_out = ({head}_{1}; \dots; {head}_{h}) W,$ (5)

where $W \in ℝ^{h d_{v} \times d_{e}}$ and h = 8.

$E_{2} = E + attention\_out$ (6)

$E_{3} = FeedForward (E_{2})$ (7)

$H = E + E_{3}, \quad H \in ℝ^{B \times L \times d_{e}},$ (8)

where L is the whole length of the input tokens.

$H = \{h_{1}, h_{2}, \dots, h_{L}\}, \quad H \in ℝ^{B \times L \times d_{e}}$ (9)

$L = L_{1} + L_{2} + 1,$ (10)

where L₁ is the length of the source document and L₂ is the length of the target document and the prompt tokens.
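To make Equation (4) and the "at and before the current position" constraint concrete, here is a single-head sketch in plain Python. It is illustrative only: it omits the learned projections, the multi-head concatenation of Equation (5), and batching, and the function names are assumptions.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Position i may only attend to positions 0..i.
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        row = softmax(scores) + [0.0] * (len(K) - i - 1)
        weights.append(row)
        outputs.append([sum(row[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights


# Toy embeddings used directly as Q, K, and V (no projection matrices).
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(E, E, E)
```

The first position can only attend to itself, so its attention row is a one-hot vector and its output equals its own value vector.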

4.3 Topic model on abstractive summarization task

There is an association between the word distribution and its topic distribution in LDA-style topic models. We can infer a latent topic distribution through the topic model. We employ a variational autoencoder (VAE) [31] as our topic model to obtain the topic representations of the document. The topic model p (Z | E_d) compresses the document embedding E_d into a continuous hidden vector $Z_{k} \sim N (μ, σ^{2})$, as defined by Equations (11)–(13).

$μ = f_{μ} (l (E_{d})),$ (11)

$\log σ^{2} = f_{σ} (l (E_{d})),$ (12)

where E_d is the embedding representation of the document, l is the layer normalization function, and f_μ (·) and f_σ (·) are simple linear transformations. The topic model generates the parameters $μ_{d}$ and $σ_{d}^{2}$, which parameterize the normal distribution $N (μ_{d}, σ_{d}^{2})$. We can then sample the latent topic distribution Z_d for the document embedding representation E_d. We sample Z as in Equation (13).

$Z = μ + ɛ \cdot σ,$ (13)

where $ɛ \sim N (0, I)$.

Given the document representation E_d, the topic model generates the latent topic distribution Z_d to provide topic-level representations. The topic model can be made to approximate a standard normal distribution by minimizing the Kullback-Leibler (KL) divergence, which is defined by Equation (14).

$KL (q (Z) \parallel p (Z \mid E_{d})) = \frac{1}{2} (- \log σ^{2} + μ^{2} + σ^{2} - 1),$ (14)

where q ( Z ) is a standard normal distribution.
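The sampling step of Equation (13) and the closed form on the right-hand side of Equation (14) can be sketched as below. Two assumptions are made loudly here: the per-dimension terms are summed over the latent dimensions, and the right-hand side is read as the standard closed form of the KL divergence between a diagonal Gaussian N(μ, σ²) and a standard normal. The function names are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample Z = mu + eps * sigma with eps ~ N(0, I), as in Equation (13)."""
    return [m + rng.gauss(0.0, 1.0) * math.exp(0.5 * lv)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Right-hand side of Equation (14), summed over dimensions (assumption)."""
    return sum(0.5 * (-lv + m * m + math.exp(lv) - 1.0)
               for m, lv in zip(mu, log_var))


mu = [0.0, 1.0]
log_var = [0.0, 0.0]                      # sigma^2 = 1 in both dimensions
z = reparameterize(mu, log_var, random.Random(0))
kl = kl_to_standard_normal(mu, log_var)   # only the second dimension contributes
```

The divergence is zero exactly when μ = 0 and σ² = 1, which is why minimizing it pulls the topic distribution toward a standard normal.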

4.4 Generation layer with topic and pointer mechanism

The traditional decoder of the abstractive summarization has two parts: the target sequence embedding layer and the attention mechanism. In our abstractive summarization model, the inputs of the decoder are the document word embeddings $E_{d} \in ℝ^{N \times d_{e}}$, current target word embedding E_tg, and topic embedding representation $Z_{d} \in ℝ^{d_{e}}$. N represents the number of tokens in the whole input. The traditional decoder model is defined by Equations (15)–(19).

$E_{{tg}_{j}} = f (concat (embedding ({tg}_{j}), {context}_{j - 1})),$ (15)

where f (·) is the nonlinear function LSTM, $E_{{tg}_{j}} \in ℝ^{d_{e}}$, and d_e is the dimension of the embedding. The attention distribution a^t is calculated as in [32], as defined by Equations (16) and (17).

$e_{j}^{t} = v^{T} \tanh (W_{tg} E_{{tg}_{j}} + W_{d} E_{d} + b_{attn})$ (16)

$a^{t} = softmax (e^{t}),$ (17)

where v, W_tg, W_d, and b_attn are learned parameters. The attention distribution is viewed as the probability distribution over the source words. Then the context text representation is calculated as Equation (18).

${context}_{j} = \sum_{i} a_{i}^{t} E_{i}.$ (18)

With the context vector, we can produce the vocabulary distribution defined by Equation (19).

$P_{vocab} = softmax (V^{'} (V [E_{{tg}_{j}}, {context}_{j}] + b) + b^{'}),$ (19)

where V, V′, b, and b′ are the weight and bias parameters. In our abstractive summarization model, the generation probability has two parts. P_1vocab is the generation probability of the unmasked tokens of the source document, defined by Equation (20), and P_2vocab is the generation probability of the generated words and prompt tokens, defined by Equation (21).

$P_{1 vocab} (i + 1) = softmax (V h_{i} + b)$ (20)

$P_{2 vocab} (j + 1) = softmax (V [h_{j}, Z_{d}] + b),$ (21)

where h_i is the hidden state of token i, which is also the output of the transformer decoder. The core function of abstractive summarization is to condense the source document and generate a fluent text that retains the main meaning. We should therefore look up the source document and find the words that retain the topic information. With this topic-sensitive generator, the generation model generates words that are close to the source document’s salient content.
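The topic-sensitive projection of Equation (21) amounts to concatenating the hidden state with the topic vector before the output layer. The sketch below shows this with plain Python lists; the weights are toy values chosen so that only the topic part influences the logits, and the function name is an assumption.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topic_sensitive_vocab_dist(h_j, z_d, V, b):
    """softmax(V [h_j, Z_d] + b), following the form of Equation (21)."""
    features = list(h_j) + list(z_d)          # concatenation [h_j, Z_d]
    logits = [sum(w * x for w, x in zip(row, features)) + bias
              for row, bias in zip(V, b)]
    return softmax(logits)


h_j = [0.2, -0.1]                             # decoder hidden state (toy)
z_d = [1.0, 0.5]                              # topic vector (toy)
V = [[0.0, 0.0, 1.0, 0.0],                    # word 0 reads topic dimension 0
     [0.0, 0.0, 0.0, 1.0],                    # word 1 reads topic dimension 1
     [0.0, 0.0, 0.0, 0.0]]                    # word 2 ignores all features
b = [0.0, 0.0, 0.0]
dist = topic_sensitive_vocab_dist(h_j, z_d, V, b)
```

With these toy weights the word tied to the strongest topic dimension gets the highest probability, which is the intended effect of conditioning generation on Z_d.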

We adopt a pointer mechanism [11] to solve the OOV problem. The generation model is defined in Equation (2), where p (tg_j | vocab, tg_1,j-1, D, θ) = P_2vocab (j). In Equation (2), p_gen, defined by Equation (22), is the probability of generating a new word through the original generation model. We obtain attn_dist from the multi-head attention results in the transformer decoder, as defined by Equation (23), where B denotes the batch size. With the pointer mechanism, the summaries generated by our model contain few OOVs.

$p_{gen} = Sigmoid (W_{p_{gen}} H [L_{1} + 1 :] + b_{p_{gen}}),$ (22)

where $p_{gen} \in ℝ^{B \times 1}$, $W_{p_{gen}}$ is the weight parameter, and $b_{p_{gen}}$ is the bias parameter.

$attn\_dist = attention\_out [B, 0, L_{1} + 1 :, : L_{1}],$ (23)

where $attn\_dist \in ℝ^{B \times L_{2} \times L_{1}}$. In our pointer mechanism, the extended vocabulary is the union of the original vocabulary and all words appearing in the source document. The probability p (w_j | vocab, w_1,j-1, D, θ) is zero when w_j is out-of-vocabulary, in which case the pointer generation model selects a word from the source document. The pointer mechanism thus determines whether to generate a word from the vocabulary or to copy a word from the source document.
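The indexing in Equations (22) and (23) can be illustrated for a single example (no batch dimension) as follows. This is a sketch under assumptions: the hidden states and attention matrix are toy values, the weight vector is set to zero for simplicity, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def generation_probabilities(H, w, b, L1):
    """p_gen for each position after the separator, as in Equation (22)."""
    return [sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)
            for h in H[L1 + 1:]]


def copy_attention(attention, L1):
    """Rows for prompt/target positions, columns for source positions (Eq. 23)."""
    return [row[:L1] for row in attention[L1 + 1:]]


L1, L2 = 2, 2
L = L1 + L2 + 1                               # Equation (10)
H = [[0.1 * i, 0.2 * i] for i in range(L)]    # toy hidden states
# Toy causal attention: position i attends uniformly to positions 0..i.
attention = [[1.0 / (i + 1) if j <= i else 0.0 for j in range(L)]
             for i in range(L)]
p_gen = generation_probabilities(H, [0.0, 0.0], 0.0, L1)
attn_dist = copy_attention(attention, L1)
```

Note that the sliced rows need not sum to one, because each target position also attends to the separator and to earlier target positions; whether the copy distribution is renormalized is not stated in the text.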

During training, the loss is the negative log likelihood of the target word ${tg}_{i}^{*}$ of the unmasked source document and the target word ${tg}_{j}^{*}$ of the prompt tokens and the summarization tokens. The loss function is defined as Equation (24).

$loss = \sum_{i \in L_{1}^{*}} - \log P ({tg}_{i}^{*}) + \sum_{j = 1}^{L_{2}} - \log P ({tg}_{j}^{*}),$ (24)

where $L_{1}^{*}$ is the collection of the subscripts of unmasked tokens in the source document. The second part of Equation (24) is the loss of the target words of prompt tokens and the target summarization tokens.
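A small numeric sketch of the two-part negative log-likelihood in Equation (24) follows. The probabilities are toy values standing in for the model's predicted probability of each correct token, and the function name is an assumption.

```python
import math


def summarization_loss(source_probs, unmasked_positions, target_probs):
    """Sum of negative log-likelihoods over the two parts of Equation (24)."""
    # First term: unmasked source-document positions only.
    source_term = sum(-math.log(source_probs[i]) for i in unmasked_positions)
    # Second term: all prompt and summary positions.
    target_term = sum(-math.log(p) for p in target_probs)
    return source_term + target_term


# Position 1 of the source is treated as masked and therefore excluded.
loss = summarization_loss(source_probs=[0.5, 0.9, 0.25],
                          unmasked_positions=[0, 2],
                          target_probs=[0.5, 1.0])
```

Here the loss equals 4 · ln 2, since the included probabilities are 0.5, 0.25, 0.5, and 1.0.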

4.5 Joint learning

To update the parameters of the topic model and the neural abstractive summarization model jointly, the loss combines the cross-entropy loss with the KL loss, as defined by Equation (25).

$L = L_{NAS} + λ {KL}_{loss},$ (25)

where $L_{NAS}$ is the cross-entropy loss of the neural abstractive summarization model, KL_loss is the KL loss function of the topic model, and λ balances the two.

5 Experiments

We evaluate the proposed approach on the task of single document summarization. We discuss the datasets used in the training and evaluation processes and describe the implementation of our model. We report the ROUGE values of different models for comparison.

5.1 Dataset

We use three datasets in our experiments. The statistics of the datasets are shown in Table 1. The Xsum dataset [33] consists of professionally written one-sentence summaries, and contains 204,045 training pairs and 11,334 testing pairs. The average word count in a document is 378, and the average number of words in a headline is 8. We truncate the source documents to 80 tokens and truncate the headline to 20 tokens.

Table 1
Datasets descriptions

Dataset	Training set size	Testing set size	Words in article	Words in reference summarizations
Xsum	204,045	11,334	378	8
Gigaword	3,803,957	1,951	31	8
CNN/DailyMail	286,817	11,487	766	53

The Gigaword dataset produced by the Linguistic Data Consortium (LDC) [34] is a comprehensive archive of newswire text data in English acquired over several years, consisting of 3,803,957 training pairs and 1,951 testing pairs. The average number of words in articles is 31 and the average number of words in target titles is 8.

We evaluate the proposed approach on the CNN/DailyMail dataset [13, 14]. In all, the corpus contains 286,817 training pairs, 13,368 validation pairs, and 11,487 test pairs, as defined by their scripts. The source documents in the training set have 766 words spanning 29.74 sentences on average, while the summaries consist of 53 words and 3.72 sentences. We truncate the source document to 400 tokens and truncate the summary to 80 tokens.

5.2 Experimental settings

Our model is based on the decoder structure of Transformer [30]. We set the number of decoder layers to 3. The dimensions of the word embeddings and the hidden states are both set to 768. The dimension of the topic representation is set to 768. We use the Adam optimizer with a learning rate of 0.0003. The batch size is 16, and the model is trained for two epochs. The vocabulary size is set to 50,000. On the Xsum dataset, the maximum length of the document is 80 and the maximum length of the summary is 20. On the Gigaword dataset, the maximum length of the document is 30 and the maximum length of the summary is 8. On the CNN/DailyMail dataset, the maximum length of the document is 400 and the maximum length of the summary is 80. In the decoding phase, the beam size is set to 4. The model is trained on a 16-GB Tesla V100 GPU. We employ a pointer mechanism [11] to solve the OOV problem.
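For convenience, the settings above can be collected in one place, for example as a small configuration object. The values are those listed in this section; the class and field names are hypothetical and do not come from the reported implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    # Dataset-specific truncation lengths (in tokens).
    max_source_len: int
    max_summary_len: int
    # Settings shared by all three datasets.
    decoder_layers: int = 3
    hidden_size: int = 768
    topic_size: int = 768
    learning_rate: float = 3e-4
    batch_size: int = 16
    epochs: int = 2
    vocab_size: int = 50000
    beam_size: int = 4


CONFIGS = {
    "Xsum": ExperimentConfig(max_source_len=80, max_summary_len=20),
    "Gigaword": ExperimentConfig(max_source_len=30, max_summary_len=8),
    "CNN/DailyMail": ExperimentConfig(max_source_len=400, max_summary_len=80),
}
```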

5.3 Evaluation

We adopt the ROUGE [16, 35] metrics to evaluate the performance of the summarization models and report the ROUGE-1, ROUGE-2, and ROUGE-L values. We use the ROUGE-1 value and ROUGE-2 value to assess the informativeness of summarization. The ROUGE-L value is used to assess a summary’s fluency.
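To show what these metrics measure, the following is a simplified sketch of ROUGE-N and ROUGE-L F1 on whitespace tokens. It is not the official ROUGE toolkit [16]: it applies no stemming or stopword handling, and it uses a plain F1 instead of the toolkit's configurable F-measure, so its numbers would not match the reported scores exactly.

```python
from collections import Counter


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, candidate_total, reference_total):
    if overlap == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(candidate, reference, n):
    """F1 over clipped n-gram overlap between candidate and reference."""
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    overlap = sum((cand & ref).values())
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l_f1(candidate, reference):
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))


candidate = "the cat sat on the mat".split()
reference = "the cat lay on the mat".split()
r1 = rouge_n_f1(candidate, reference, 1)
r2 = rouge_n_f1(candidate, reference, 2)
rl = rouge_l_f1(candidate, reference)
```

In this toy pair, five of six unigrams and three of five bigrams overlap, and the longest common subsequence has length five.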

5.4 Baselines

Xsum dataset: We compare our model against baselines implemented by ourselves or by others. The lead baseline takes the first sentence as the summary; pointer-generator [11] is a seq2seq model with a pointer-generator network; pointer-generator + coverage [11] adds a coverage mechanism to address the repetition problem; BERTSUMABS [36] applies BERT [28] to text summarization; and T-CONVS2S [37] is a neural abstractive summarization model based entirely on convolutional neural networks.

Gigaword dataset: The lead baseline, implemented by ourselves, takes the first sentence as the summary; ControlCopying [38] controls over-copying during both the training and decoding stages of a neural summarization model; ProphetNet [39] optimizes n-step future prediction.

CNN/DailyMail dataset: Pointer-generator + coverage [11] incorporates the coverage mechanism on the news dataset; ProphetNet [39] predicts future n-grams for sequence-to-sequence pretraining; GSum [40] is a general framework for guided neural abstractive summarization.

6 Results and analysis

Table 2 shows our main results on the Xsum dataset, with the lead baseline at the top and our model at the bottom. We observe that the pointer model and coverage mechanism are effective in addressing the OOV and repetition problems, so we employ the pointer model in our model. We evaluate our models with the ROUGE metrics [16]. The F₁ scores for ROUGE-1, ROUGE-2, and ROUGE-L are reported in Table 2; they measure the unigram overlap, bigram overlap, and longest common subsequence, respectively, between the generated summary and the reference summary. Table 2 shows that our AUF-ABS model achieves state-of-the-art results. The ROUGE-1, ROUGE-2, and ROUGE-L values of our model are 45.22, 23.74, and 41.59, respectively, which far exceed those of the T-CONVS2S model (31.89, 11.54, and 25.75).

Table 2
ROUGE results on Xsum test set. We report ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L) F1 scores

Dataset	Models	R_1_F	R_2_F	R_L_F
Xsum	Lead baseline	35.55	13.47	28.54
Xsum	Pointer-generator (our implementation) [11]	36.44	15.66	33.42
Xsum	Pointer-generator + coverage (our implementation) [11]	39.53	17.28	36.38
Xsum	BERTSUMABS [36]	38.76	16.33	31.15
Xsum	T-CONVS2S [37]	31.89	11.54	25.75
Xsum	AUF-ABS (our model)	45.22	23.74	41.59

Table 3 shows the results on the Gigaword dataset. The dataset we use is a simplified (compressed) version. Compared with ProphetNet, the ROUGE values of our model are lower (for example, a ROUGE-1 of 35.24 versus 39.51), which we attribute largely to the simplified dataset.

Table 3

ROUGE results on Gigaword test set. We report ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L) F1 scores

Dataset	Models	R_1_F	R_2_F	R_L_F
Gigaword	ControlCopying (full version) [38]	39.08	20.47	36.69
Gigaword	ProphetNet (full version) [39]	39.51	20.42	36.69
Gigaword	AUF-ABS (simplified version)	35.24	17.78	33.02

Table 4 shows the results on the CNN/DailyMail dataset. Because this is a long-document summarization dataset, our language model is weaker than the other methods here: its ROUGE-1 of 35.72 is about 10 points below that of GSum (45.94).

Table 4

ROUGE results on CNN/DailyMail test set. We report ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L) F1 scores

Dataset	Models	R_1_F	R_2_F	R_L_F
CNN/DailyMail	ProphetNet	44.20	21.17	41.30
CNN/DailyMail	GSum	45.94	22.32	42.48
CNN/DailyMail	AUF-ABS	35.72	15.65	30.11

6.1 Relation between extractive and abstractive summarization

Current research covers both extractive and abstractive summarization, which are closely related. Datasets for extractive summarization lack labels, which restricts the training process.

From Table 2, we can see that the lead baseline achieves a good result. Although the first sentence contains important information, there is also important information in the middle and at the end of the document.

The core of extractive summarization is to select important sentences and remove unimportant ones, which compresses the document. The core of abstractive summarization is to select the salient words of the source document and paraphrase them. Some methods combine the two approaches [15]. To implement end-to-end abstractive summarization, we propose a unified framework that can be used with different abstractive summarization datasets. The effect of our model can be seen in Tables 2 to 4.

6.2 How abstractive is our model?

Tables 5–7 show the proportions of novel n-grams for different models. Unlike our model, other models neglect the length of the source document when calculating the proportion. Table 5 shows that, because of our pointer mechanism, the novel n-gram proportion of our model is lower than that of T-CONVS2S. The lower rate of novel n-grams shows that our model copies more words from the source document than T-CONVS2S; in other words, our model has a lower degree of abstraction. However, this does not mean that our model is less effective, as shown by the ROUGE values on these datasets. Table 5 also shows that when the whole source document is used instead of a truncated one (length=230), the rate of novel n-grams is lower, indicating a lower measured degree of abstraction.
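The proportion of novel n-grams, and its dependence on how much of the source document is considered, can be sketched as follows. This is an illustrative reading of the metric: counting n-gram occurrences (not distinct types) and truncating by token count are assumptions, and the function name is hypothetical.

```python
def novel_ngram_ratio(summary, source, n, max_source_len=None):
    """Share of summary n-grams that do not occur in the (truncated) source."""
    if max_source_len is not None:
        source = source[:max_source_len]
    source_ngrams = {tuple(source[i:i + n])
                     for i in range(len(source) - n + 1)}
    summary_ngrams = [tuple(summary[i:i + n])
                      for i in range(len(summary) - n + 1)]
    if not summary_ngrams:
        return 0.0
    novel = sum(1 for gram in summary_ngrams if gram not in source_ngrams)
    return novel / len(summary_ngrams)


source = "a b c d e".split()
summary = "a b x d".split()
whole_1gram = novel_ngram_ratio(summary, source, 1)
whole_2gram = novel_ngram_ratio(summary, source, 2)
truncated_1gram = novel_ngram_ratio(summary, source, 1, max_source_len=2)
```

Truncating the source can only shrink the set of source n-grams, so the novel rate for the truncated source is never lower than for the whole document, which matches the direction of the "whole document" and "length=230" rows in Table 5.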

Table 5
Proportion of novel n-grams in summaries generated by various models on Xsum test set


Dataset	Models	1_gram	2_gram	3_gram	4_gram
Xsum	ORACLE	31.84	76.89	93.22	92.38
Xsum	My oracle (whole document)	14.89	71.21	91.39	89.69
Xsum	My oracle (length=230)	17.4	73.0	91.9	90.3
Xsum	BERTSUMABS [36]	-	-	-	-
Xsum	T-CONVS2S [37]	30.73	79.18	94.10	98.03
Xsum	AUF-ABS (whole document)	11.40	67.89	89.40	86.91
Xsum	AUF-ABS (length=230)	13.05	68.69	89.55	87.09

Table 6

Proportion of novel n-grams in summaries generated by various models on the Gigaword test set

Dataset	Models	1_gram	2_gram	3_gram	4_gram
Gigaword	Oracle (simplified version)	14.83	63.03	82.60	99.97
Gigaword	ControlCopying (full version)[38]	-	-	-	-
Gigaword	ProphetNet (full version)[39]	-	-	-	-
Gigaword	AUF-ABS (simplified version)	43.92	86.61	95.46	97.50

Table 7

Proportion of novel n-grams in summaries generated by various models on the CNN/DailyMail test set

Dataset	Models	1_gram	2_gram	3_gram	4_gram
CNN/DailyMail	ProphetNet [39]	-	-	-	-
CNN/DailyMail	GSum [40]	-	-	-	-
CNN/DailyMail	AUF-ABS (whole document)	2.3	16.9	31.6	31.4
CNN/DailyMail	ORACLE (whole document)	8.1	43.4	64.8	64.1
CNN/DailyMail	AUF-ABS (length=400)	33.2	63.7	72.1	71.9
CNN/DailyMail	ORACLE (length=400)	25.4	64.6	78.5	77.4

Table 6 shows the proportion of novel n-grams on the simplified Gigaword dataset. The novel 1-gram, 2-gram, and 3-gram rates of our model are higher than those of ORACLE, which means that our model has a higher degree of abstraction. However, the novel 4-gram rate of our model is lower than that of ORACLE, because the titles in this dataset are short.

We also provide the proportion of novel n-grams on the CNN/DailyMail dataset.

From Tables 6 and 7, we can find that our model’s summaries contain a higher rate of novel n-grams than the ORACLE model, indicating a higher degree of abstraction. From Table 5, however, we can find that our model’s summaries contain a lower rate of novel n-grams than the ORACLE model, indicating a lower degree of abstraction; the model copies more tokens from the source document.

Figure 3 shows two examples of summaries on the Xsum dataset. The dataset contains many news articles accompanied by titles. Most words of the summary can be found in the document, so the proportion of novel 1-grams is lower on the Xsum dataset than on the other datasets. We can also find that most words in the summary appear in the first sentence of the document. By incorporating the topic information and pointer mechanism into our model and using the prompt language model, the feature extraction ability and the experimental results on the Xsum dataset are better than those of the other models.

Fig. 3

Examples of summaries produced by our model on Xsum dataset (bold denotes novel words).

6.3 Does the prompt language model suit all summary datasets?

Figure 4 shows examples of abstractive summaries produced by our model on the CNN/DailyMail dataset, and Fig. 5 shows examples of abstractive summaries produced by our model on the Gigaword dataset. The reference summaries on the CNN/DailyMail dataset contain multiple sentences, while the reference summaries on the other two datasets contain just one sentence. The summary generated by the language model on the CNN/DailyMail dataset is long and contains multiple sentences. As a result, the experimental results of language models on long-text summary datasets are not better than those of traditional sequence-to-sequence models. This problem is left for future work. From Fig. 5, we can find that the generated summary still contains out-of-vocabulary tokens (UNK), because the training and testing sets of the Gigaword dataset themselves contain UNK tokens. The pointer mechanism in our model reduces the OOV problem but does not eliminate it.

Fig. 4

Examples of abstractive summaries produced by our model on the CNN/DailyMail dataset.

Fig. 5

Examples of summaries produced by our model on the Gigaword dataset.

Language models have developed quickly, and the prompt language model has been applied to many kinds of natural language processing tasks. It is also effective for summarization. Tables 2 to 4 show that the prompt language model is more effective on short-text summarization. On long-text summarization it is much weaker, as the experimental results on the CNN/DailyMail dataset show.

Why is the prompt language model more effective on short-text summarization datasets? One reason is that current feature-extraction models are restricted by document length, and extracting text features from long documents remains difficult. The other reason is that, in long documents, the important information is not confined to the first few sentences.

7 Conclusion

We introduce a unified framework based on a prompt language model and a pointer mechanism. Our model is also a masked language model. The decoder-only model shows strong feature-extraction ability, and with the pointer mechanism it generates more appropriate summaries. We applied the model to three summarization datasets, achieving state-of-the-art results on the Xsum dataset and results comparable to those of other models on the other two datasets.

Acknowledgements

We thank the anonymous reviewers for their helpful comments. This work was supported by the National Natural Science Foundation of China (No. 61862060).

Table 1
Dataset descriptions

Dataset | Training set size | Testing set size | Words in article | Words in reference summary
Xsum | 204,405 | 11,334 | 378 | 8
Gigaword | 3,803,957 | 1,951 | 31 | 8
CNN/DailyMail | 286,817 | 11,487 | 766 | 53