Abstract
In this paper, we propose a unified framework for abstractive summarization that uses a prompt language model and a pointer mechanism. Abstractive summarization models usually include a text encoder and a text decoder, and current methods employ this encoder-decoder architecture to condense and paraphrase a document. To better paraphrase a document, we propose an abstractive summarization model that uses only a topic-sensitive decoder. Our model has a prompt input module, a text decoder, and a pointer mechanism. We apply our model to the Xsum, Gigaword, and CNN/DailyMail summarization datasets. Experimental results show that it achieves state-of-the-art results on the Xsum dataset and comparable results on the other two datasets.
Introduction
Summarization is the task of producing, from an input text, a condensed text that retains the core meaning of the original. Automatic text summarization can be extractive or abstractive. Researchers first concentrated on extractive methods, which identify the salient text units in a document and use them as the summary [1–6].
The abstractive method uses artificial intelligence to understand the semantics of the original document and generate an informative summary [6–14]. Because sequence-to-sequence models are slow, this method is usually combined with the extractive method. Bottom-up abstractive summarization is currently the state-of-the-art method [15], combining extractive summarization with pointer-generator abstractive summarization [11]. These methods use an encoder-decoder architecture for abstractive summarization and a pointer network to solve the out-of-vocabulary (OOV) problem when generating new words. Chen [12] used reinforcement learning to select important sentences and generate summaries. In this work, we address the abstractive summarization problem.
The contributions of this work are as follows. First, to better generate new words for news documents, we propose a unified framework for the abstractive summarization task. Second, we propose a prompt language model for abstractive summarization which incorporates a pointer mechanism to solve the OOV problem. Third, we propose a topic-sensitive decoder for abstractive summarization. Our model is applied to the Xsum, Gigaword, and CNN/DailyMail datasets. Our results on the abstractive summarization task are state-of-the-art on the Xsum dataset and are comparable on all ROUGE metrics [16] of the other two datasets.
The remainder of this paper is organized in seven parts. Section 2 shows related work, and section 3 shows the background. Our model is presented in section 4. Section 5 describes our experiments. The empirical results of abstractive summarization are presented in section 6. Sections 7 and 8 present our conclusions and acknowledgements, respectively.
Related work
Automatic text summarization has been studied for years and comprises extractive summarization and abstractive summarization. Sentence scoring and sentence selection are the two key techniques for traditional extractive summarization. Unsupervised and supervised methods have been proposed to model and score sentences. Term frequency and TF*IDF weights can be used as features for sentence scoring; such unsupervised methods do not require model training or data annotation. The maximal marginal relevance (MMR) method proposed by [17] is used in sentence selection: the sentence selected by MMR has the maximal score and is minimally redundant with previously selected sentences. TextRank, proposed by [18], is an unsupervised method based on weighted graphs. LexRank, as proposed by [19], uses the TextRank method to rank sentences. Recently, deep neural networks have been applied to extractive summarization. Cheng and Lapata [20] treated extractive summarization as a sequence labeling task. Nallapati et al. [6] proposed an extractive summarization model called SummaRuNNer with more features. Latent semantic analysis [21, 22] is a general theory of acquiring similarity and knowledge representation and achieving powerful inductive effects by extracting the right number of dimensions to represent objects and contexts.
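The MMR selection rule mentioned above can be illustrated with a short sketch. It is not the implementation of [17]: the Jaccard word-overlap similarity, the trade-off weight lam, and the function names are assumptions made for illustration only.

```python
def jaccard(a, b):
    """Word-overlap similarity between two tokenized sentences (an assumed choice)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def mmr_select(sentences, relevance, k, lam=0.7):
    """Greedily pick k sentence indices, trading relevance against redundancy.

    sentences: list of token lists; relevance: one score per sentence.
    lam is a hypothetical trade-off weight (1.0 means relevance only).
    """
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to any already selected sentence.
            redundancy = max(
                (jaccard(sentences[i], sentences[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With lam below 1, a near-duplicate of an already selected sentence is penalized even if its relevance score is high, which is the behavior the paragraph describes.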
The early work of abstractive summarization [23] viewed summarization as a statistical machine translation problem. The model proposed by [23] first selected salient content and then realized the surface using a statistical model. The calculation of the probabilities of candidate abstract terms and candidate surface realizations enables one to choose the most likely summarization for an article. Because no evaluation mechanism was available, results for this method were not reported. After the DUC-2003 and DUC-2004 competitions, summarization research was formalized and standardized. Zajic [24] implemented an abstractive summarization system called TOPIARY, which combined sentence compression and unsupervised topic discovery. TOPIARY performed best at DUC-2004.
Rush [25] first used neural networks for abstractive summarization and applied them to the headline generation task. The core of the model is the standard feedforward neural network language model (NNLM) [26]. The method achieved state-of-the-art results on the DUC-2004 and Gigaword datasets with the ROUGE metrics. Lopyrev [9] also applied an LSTM-based encoder-decoder architecture to abstractive summarization and simplified the attention mechanism, but did not compare the experimental results with those of other methods. Chopra [7] continued the work of [25], changing the encoder to a convolutional attention-based encoder and the decoder to a recurrent neural network (RNN)-based decoder. They evaluated their model on Gigaword and DUC-2004 with the ROUGE metrics and found that it performed slightly better than [25]. Nallapati [6] applied an RNN-based attentional encoder-decoder to summarization and showed its results on two English corpora. The encoder was a bidirectional GRU-RNN, and the decoder was a unidirectional GRU-RNN in [27]. To handle the large vocabulary, they used the large vocabulary trick (LVT), restricting the decoder vocabulary of each mini-batch to words in the source documents of that batch. They used a feature-rich encoder with one embedding vector each for POS, NER tags, and discretized TF and IDF values, concatenated with word-based embeddings. A novel switching decoder/pointer architecture handled the OOV problem. Hu [8] developed a large dataset for Chinese short-text summarization. They applied an RNN-based attentional encoder-decoder model, but the experimental results were not particularly good, probably because they used character representations rather than word representations as the encoder’s input. Neural sequence-to-sequence models have two shortcomings: they can reproduce factual details inaccurately, and they can generate repeated words.
To address these problems, [11] proposed a hybrid pointer-generator network that can copy words from the source text via pointing, and used coverage to keep track of what had been summarized. Chen [12] applied reinforcement learning to summarization, selecting salient sentences and abstractively rewriting them to generate a concise overall summary.
Background: Abstractive summarization
Abstractive document summarization aims to condense a document and retain the salient information. Abstractive summarization consists of document compression and document paraphrasing. Given a source document $D = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ ($1 \le i \le n$) represents one token in the source document, abstractive summarization rewrites $D$ as a sequence of target tokens, the summary.
The abstractive summarization model can be defined by Equation (1).
The architecture of the unified framework for the abstractive summarization model is shown in Fig. 1. The whole model consists of five parts: (1) Masked prompt input layer: the masked prompt input of the abstractive summarization model; (2) Decoder layer: a transformer decoder-based language model that predicts the next word of the text input; (3) Topic layer: the topic-level representation of the document inferred by a topic model; (4) Generation layer: a topic-sensitive decoder with a pointer mechanism that generates the summary of the document; and (5) Joint training loss: a loss function that optimizes the topic and decoder models together.

Overview of unified framework for abstractive summarization model.
The transformer decoder-based language model extracts the text word features. The topic model provides additional latent topic features for the decoder to find words that are more related to the document content.
Formally,
The input of the abstractive summarization model is shown as Fig. 2. The input module of the language model has three parts: tokens of the source document, prompt tokens, and target tokens of the source document.

Prompt input module of abstractive summarization model.
The special token [SEP] separates the source and target tokens. To better generate words and train the language model, we mask the input words in the source document using the strategy proposed by [28]. For each word selected for masking, we replace it with [MASK] 80% of the time, replace it with a random word 10% of the time, and keep it unchanged 10% of the time. Only words in the source document are masked; the target and prompt tokens are not. All tokens are then fed to the embedding layer.
In [29], the authors tested GPT-2’s ability to perform summarization on the CNN/DailyMail dataset. A prompt token named TL;DR: usually appears before the summary of a long article, so they added TL;DR: after the article to induce summarization behavior. In our framework, we use TL;DR: as a prompt and also design additional prompt tokens to induce summarization behavior. We consider a set of tokens an effective prompt if it can steer the behavior of the model and draw out the knowledge in the source document.
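The masking scheme can be sketched as follows. This is a minimal illustration, not our training code: the 15% selection rate is the value used in [28] and is assumed here, and the helper names mask_source_tokens and build_input, as well as the exact position of [SEP] relative to the prompt tokens, are hypothetical.

```python
import random

MASK = "[MASK]"
SEP = "[SEP]"


def mask_source_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Mask source tokens with the 80/10/10 rule; mask_prob is an assumed rate."""
    if rng is None:
        rng = random.Random()
    masked = []
    for tok in tokens:
        if rng.random() >= mask_prob:
            masked.append(tok)  # token not selected for masking
            continue
        r = rng.random()
        if r < 0.8:
            masked.append(MASK)  # 80%: replace with [MASK]
        elif r < 0.9:
            masked.append(rng.choice(vocab))  # 10%: replace with a random word
        else:
            masked.append(tok)  # 10%: keep the word unchanged
    return masked


def build_input(source, prompt, target, vocab, rng=None):
    """Assumed layout: masked source, [SEP], prompt tokens, target tokens."""
    masked_source = mask_source_tokens(source, vocab, rng=rng)
    # Prompt and target tokens are never masked.
    return masked_source + [SEP] + prompt + target
```

Passing a seeded random.Random instance as rng makes the masking reproducible, which is convenient when comparing prompts on the same masked inputs.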
Because our model is decoder-only, we use the decoder of the transformer [30] to capture the feature information of the document words. The input of the decoder is the one-hot representation of the document words, and an embedding layer maps it to distributed representations. The transformer can avoid information loss when training the language model on a long text. In the transformer decoder layer, the word at the current position can attend only to the words at and before that position. When predicting the next word, the hidden state of the current position is computed from the embeddings of the source document, the prompt input, and the words generated so far. The decoder model is defined as Equations (3)-(10). The output of the decoder layer is H.
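The restriction that each position attends only to itself and to earlier positions can be illustrated with a small, dependency-free sketch. The function names are hypothetical, and the code shows only the masking of attention scores for a single head, not the full decoder of Equations (3)-(10).

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, allowed):
    """Softmax over the allowed positions; disallowed positions get weight 0."""
    top = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(score_matrix):
    """Turn raw attention scores (n x n) into causally masked attention weights."""
    n = len(score_matrix)
    mask = causal_mask(n)
    return [masked_softmax(score_matrix[i], mask[i]) for i in range(n)]
```

For a sequence laid out as source, prompt, and generated tokens, this mask means a generated token can use every source and prompt position but no future token.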
There is an association between the word distribution and its topic distribution in LDA-style topic models. We can infer a latent topic distribution through the topic model. We employ a variational autoencoder (VAE) [31] as our topic model to obtain the topic representations of the document. The topic model p (
Given the document representation
where q (
The traditional decoder of the abstractive summarization has two parts: the target sequence embedding layer and the attention mechanism. In our abstractive summarization model, the inputs of the decoder are the document word embeddings
where V, , b, and are the weight parameters and the bias parameters. In our abstractive summarization model, the generation probability has two parts. P1vocab is the generation probability of the un-masked tokens of the source document defined as Equation (20), and P2vocab is the generation probability of the tokens of the generated words and prompt tokens defined as Equation (21).
We adopt a pointer mechanism [11] to solve the OOV problem. The generation model is defined in Equation (2), where $p(tg_j \mid \text{vocab}, tg_{1,j-1}, D, \theta) = \text{P2vocab}(j)$. In Equation (2), $p_{gen}$ is defined as Equation (22) and is the probability of generating a new word through the original generation model. We obtain attn_dist from the multi-head attention results in the transformer decoder, where attn_dist is defined by Equation (23) and B denotes the batch size. With the pointer mechanism, the summaries generated by our model contain only a few OOV tokens.
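How a generation probability and an attention distribution combine can be sketched with the standard pointer-generator mixture of [11]. This is an illustration under that assumption rather than a transcription of Equation (2); the function name and the dictionary representation of the distributions are hypothetical.

```python
def final_distribution(p_gen, vocab_dist, attn_dist, source_tokens):
    """Mix generating from the vocabulary with copying from the source.

    p_gen: probability of generating from the vocabulary.
    vocab_dist: dict mapping vocabulary words to generation probabilities.
    attn_dist: one attention weight per source token (sums to 1).
    Words that occur only in the source (OOV words) receive copy probability.
    """
    final = {word: p_gen * prob for word, prob in vocab_dist.items()}
    for token, weight in zip(source_tokens, attn_dist):
        # Repeated source tokens accumulate their attention weights.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final
```

Because copy probability is assigned to source tokens directly, a word outside the fixed vocabulary can still be produced whenever it appears in the source document.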
During training, the loss is the negative log likelihood of the target word
To update all model parameters of both the topic model and the neural abstractive summarization model, the loss combines the cross-entropy loss with the KL loss, as defined in Equation (25).
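A joint objective of this kind can be sketched as below. The sketch makes several assumptions that are not stated above: the VAE posterior is a diagonal Gaussian with a standard normal prior, the negative log likelihood is averaged over target positions, and the two terms are combined with a hypothetical weight kl_weight. It should not be read as the exact form of Equation (25).

```python
import math


def nll(target_probs):
    """Average negative log likelihood of the reference target words."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def gaussian_kl(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, I)) for a diagonal Gaussian (assumed form)."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar)
    )


def joint_loss(target_probs, mu, logvar, kl_weight=1.0):
    """Cross-entropy term plus weighted KL term; kl_weight is hypothetical."""
    return nll(target_probs) + kl_weight * gaussian_kl(mu, logvar)
```

Minimizing the sum updates the decoder through the cross-entropy term and keeps the inferred topic distribution close to the prior through the KL term.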
We evaluate the proposed approach on the task of single document summarization. We discuss the datasets used in the training and evaluation processes and describe the implementation of our model. We report the ROUGE values of different models for comparison.
Dataset
We use three datasets in our experiments. The statistics of the datasets are shown in Table 1. The Xsum dataset [33] consists of professionally written one-sentence summaries, and contains 204,045 training pairs and 11,334 testing pairs. The average word count in a document is 378, and the average number of words in a headline is 8. We truncate the source documents to 80 tokens and truncate the headline to 20 tokens.
Datasets descriptions
The Gigaword dataset produced by the Linguistic Data Consortium (LDC) [34] is a comprehensive archive of newswire text data in English acquired over several years, consisting of 3,803,957 training pairs and 1,951 testing pairs. The average number of words in articles is 31 and the average number of words in target titles is 8.
We evaluate the proposed approach on the CNN/DailyMail dataset [13, 14]. In all, the corpus contains 286,817 training pairs, 13,368 validation pairs, and 11,487 test pairs, as defined by their scripts. The source documents in the training set have 766 words spanning 29.74 sentences on average, while the summaries consist of 53 words and 3.72 sentences. We truncate the source document to 400 tokens and truncate the summary to 80 tokens.
Our model is based on the decoder structure of the Transformer [30]. We use 3 decoder layers. The dimensions of the word embeddings and the hidden states are both set to 768, and the dimension of the topic representation is also set to 768. We use the Adam optimizer with a learning rate of 0.0003. The batch size is 16, and we train for two epochs. The vocabulary size is set to 50000. The maximum lengths of the document and the summary are 80 and 20 on the Xsum dataset, 30 and 8 on the Gigaword dataset, and 400 and 80 on the CNN/DailyMail dataset. In the decoding phase, the beam size is set to 4. The model is trained on a 16-GB Tesla V100 GPU. We employ a pointer mechanism [11] to solve the OOV problem.
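For reference, the settings listed above can be collected in a small configuration object. The values come from this section; the class name, field names, and dataset keys are hypothetical and do not correspond to a released code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    decoder_layers: int = 3
    embed_dim: int = 768
    hidden_dim: int = 768
    topic_dim: int = 768
    learning_rate: float = 3e-4  # Adam optimizer
    batch_size: int = 16
    epochs: int = 2
    vocab_size: int = 50000
    beam_size: int = 4
    max_doc_len: int = 80
    max_summary_len: int = 20


# (maximum document length, maximum summary length) per dataset
DATASET_LIMITS = {
    "xsum": (80, 20),
    "gigaword": (30, 8),
    "cnn_dailymail": (400, 80),
}


def config_for(dataset):
    """Return the shared settings with the dataset-specific length limits."""
    doc_len, summary_len = DATASET_LIMITS[dataset]
    return TrainConfig(max_doc_len=doc_len, max_summary_len=summary_len)


def truncate(tokens, limit):
    """Keep at most `limit` leading tokens."""
    return tokens[:limit]
```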
Evaluation
We adopt the ROUGE [16, 35] metrics to evaluate the performance of the summarization models and report the ROUGE-1, ROUGE-2, and ROUGE-L values. We use the ROUGE-1 and ROUGE-2 values to assess the informativeness of a summary, and the ROUGE-L value to assess its fluency.
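To make the three scores concrete, the following sketch computes simplified ROUGE-N and ROUGE-L F1 values on pre-tokenized text. It is an illustration only and is not the official ROUGE toolkit of [16]: it omits stemming, stopword handling, and the summary-level variant of ROUGE-L, and all function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """F1 over clipped n-gram matches between candidate and reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference):
    """F1 based on the longest common subsequence."""
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))
```

For the candidate "the cat sat" and the reference "the cat sat on the mat", all three candidate unigrams match, so precision is 1, recall is 0.5, and the ROUGE-1 F1 is 2/3.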
Baselines
Xsum dataset: We compare our model against baselines implemented by ourselves or others. The lead baseline takes the first sentence as the summary; pointer-generator [11] is a seq2seq model with a pointer-generator network; pointer-generator + coverage [11] adds a coverage mechanism to address the repetition problem; BERTSUMABS [36] applies BERT [28] to text summarization; and T-CONVS2S [37] is a neural abstractive summarization model based entirely on convolutional neural networks.
Gigaword dataset: The lead baseline, which we implemented ourselves, takes the first sentence as the summary; ControlCopying [38] controls over-copying during both the training and decoding stages of a neural summarization model; and ProphetNet [39] optimizes n-step future prediction.
CNN/DailyMail dataset: Pointer-generator + coverage [11] adds the coverage mechanism to the pointer-generator network on this news dataset; ProphetNet [39] predicts future n-grams for sequence-to-sequence pretraining; and Gsum [40] is a general framework for guided neural abstractive summarization.
Results and analysis
Table 2 shows our main results on the Xsum dataset, with the lead baseline at the top and our model at the bottom. We observe that the pointer model and the coverage mechanism are effective at solving the OOV problem and the repetition problem, so we employ the pointer model in our model. We evaluate our models with the ROUGE metrics [16]. Table 2 reports the F1 scores for ROUGE-1, ROUGE-2, and ROUGE-L, which measure the unigram overlap, bigram overlap, and longest common subsequence, respectively, between the generated summary and the reference summary. Our AUF-ABS model achieves state-of-the-art results: its ROUGE-1, ROUGE-2, and ROUGE-L values are 45.22, 23.74, and 41.59, respectively, which far exceed those of the T-CONVS2S model.
ROUGE results on Xsum test set. We report ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L) F1 scores
Table 3 shows the results on the Gigaword dataset. The dataset we use is a compressed version. Compared with ProphetNet, the ROUGE values of our model are slightly lower, which is largely due to the simplified dataset.
ROUGE results on Gigaword test set. We report ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L) F1 scores
Table 4 shows the results on the CNN/DailyMail dataset. Because this is a long document summarization dataset, our language model shows weakness compared to the other methods, but is still competitive.
ROUGE results on CNN/DailyMail test set. We report ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L) F1 scores
Current research focuses on both extractive and abstractive summarization, which are closely related. The datasets for extractive summarization lack labels, which restricts the training process.
From Table 2, we can see that the lead baseline achieves a good result. Although the first sentence contains important information, there is also important information in the middle and at the end of the document.
The core of extractive summarization is to select important sentences and remove unimportant sentences, which can compress the document. The core of abstractive summarization is to select the salient words of the source document and paraphrase them. Some methods combine both methods [15]. To implement end-to-end abstractive summarization, we propose a unified framework that can be used with different datasets of abstractive summarization. We can see the effect of our model from Tables 2 to 4.
How abstractive is our model?
Tables 5–7 show the proportions of novel n-grams of different models. Unlike our model, other models neglect the length of the source document when calculating the proportion. From Table 5, we can find that due to our pointer mechanism, the novel n-gram proportion of our model is less than that of T-CONVS2S. The lower rate of novel n-grams shows that our model copies more words from the source document than T-CONVS2S. In other words, our model has a lower degree of abstraction. However, this does not mean that our model is less effective, as shown by the ROUGE values on these datasets. From Table 5, we can see that the tokens of the source document are much longer and the rate of novel n-grams is much lower, indicating a lower degree of abstraction.
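The proportion of novel n-grams can be computed as in the sketch below. It assumes one common definition, the share of n-gram occurrences in the summary that never appear in the source document; the function names are hypothetical, and the exact counting used for Tables 5–7 may differ.

```python
def ngram_list(tokens, n):
    """All n-grams of a token list, in order, with repeats."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_ratio(source, summary, n):
    """Fraction of summary n-grams that do not occur anywhere in the source."""
    summary_ngrams = ngram_list(summary, n)
    if not summary_ngrams:
        return 0.0  # summary shorter than n tokens
    source_ngrams = set(ngram_list(source, n))
    novel = sum(1 for gram in summary_ngrams if gram not in source_ngrams)
    return novel / len(summary_ngrams)
```

A summary copied verbatim from the source has a ratio of 0 for every n, while a fully paraphrased summary approaches 1, so lower values indicate more copying through the pointer mechanism.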
Proportion of novel n-grams in summaries generated by various models on Xsum test set
Proportion of novel n-grams in summaries generated by various models on the Gigaword test set
Proportion of novel n-grams in summaries generated by various models on the CNN/DailyMail test set
Table 6 shows the proportion of novel n-grams on the simplified Gigaword dataset. The novel 1-gram, 2-gram, and 3-gram rates of our model are higher than those of ORACLE, which means that our model has a higher degree of abstraction. However, the novel 4-gram rate of our model is lower than that of ORACLE, because the titles in the dataset are short.
We also provide the proportion of novel n-grams on the CNN/DailyMail dataset.
From Tables 6 and 7, we can find that our model’s summaries contain a higher rate of novel n-grams than the ORACLE model, indicating a higher degree of abstraction. From Table 5, however, we can find that our model’s summaries contain a lower rate of novel n-grams than the ORACLE model, indicating a lower degree of abstraction: the model copies more tokens from the source document.
Figure 3 shows two examples of summaries on the Xsum dataset. The dataset contains many news articles with accompanying titles. Most words of the summary can be found in the document, so the proportion of novel 1-grams is lower on the Xsum dataset than on the other datasets. We can also find that most words in the summary are in the first sentence of the document. By incorporating the topic information and the pointer mechanism into our model and using the prompt language model, the feature extraction ability and the experimental results on the Xsum dataset are better than those of the other models.

Examples of summaries produced by our model on Xsum dataset (bold denotes novel words).
Figure 4 shows examples of abstractive summaries produced by our model on the CNN/DailyMail dataset, and Fig. 5 shows examples of abstractive summaries produced by our model on the Gigaword dataset. The reference summaries on the CNN/DailyMail dataset contain multiple sentences, while the reference summaries on the other two datasets contain just one sentence. The summary generated by the language model on the CNN/DailyMail dataset is long and contains multiple sentences. As a result, the experimental results of language models on long-text summarization datasets are not better than those of traditional sequence-to-sequence models. We leave this problem to future work. From Fig. 5, we can find that the generated summary still contains out-of-vocabulary tokens, shown as UNK, because the training and test sets of the Gigaword dataset themselves contain UNK tokens. The pointer mechanism in our model reduces the OOV problem but does not eliminate it.

Examples of abstractive summaries produced by our model on CNN/Dailymail dataset.

Examples of summaries produced by our model on Gigaword dataset.
Language models have developed quickly. The prompt language model has been used in all kinds of natural language processing tasks. The prompt language model is also effective in the summarization task. From Tables 2 to 4, we can find that the prompt language model is more effective on the short text summarization task. For the long text summarization task, the prompt language model is much weaker, as shown by the experimental results on the CNN/DailyMail dataset.
Why is the prompt language model more effective on short text summarization datasets? One reason is that current feature-extraction models are restricted by the length of the document, and the ability to extract text features from long documents still has a long way to go. The other reason is that, in long documents, the important information is not located only in the first few sentences.
We introduce a unified framework model based on a prompt language model and a pointer mechanism. Our model is also a masked language model. The new decoder-only model shows strong feature-extraction ability, and with the pointer mechanism it can generate a more appropriate summary. The model was applied to summarization datasets, achieving state-of-the-art results on the Xsum dataset and results comparable with other models on the other two datasets.
Acknowledgements
We thank the anonymous reviewers for their helpful comments. This work was supported by the National Natural Science Foundation of China (No. 61862060).
