On redundancy in multi-document summarization 1

Abstract

In this paper we study how the presence or absence of redundancy on multiple related texts can be used to compute sentence relevance for extractive multi-document summarization. Two types of redundancy can be found: intra-document and inter-document. By experimenting with them, different ideas can be extracted, for example: statements redundant between documents—which can be important by their popularity; statements that are not redundant—which can be important by their novelty; or statements redundant within each document—which can be important by being constantly addressed by a single author. We propose an unsupervised graph-based method that allows to generate summaries based on different strategies of redundancy. We present experiments on two DUC corpora of nine different strategies to extract information depending of how redundancy within a document and in different documents is managed. According to DUC gold standards, we found that a multi-document generic summary should contain the most redundant (popular) information between different sources while avoiding local intra-document redundancy. We implemented a mechanism to enrich sentence rankings with redundancy, improving the evaluation of summaries.

Keywords

Multi-document summarization similarity graphs unsupervised summarization sentence redundancy doc2vec

1 Introduction

As information consumers, we face everyday the problem of handling massive amounts of data from the Internet. Much of this information is redundant, which complicates the assimilation of content in a big group of texts when looking for a specific topic. This redundancy is in part caused by the freedom to generate information without virtually any control on its content, which in general is a good thing; however, tools to filter this redundancy would be of great help to people who need to digest all this information.

Apart from being a problem to deal with, this huge amount of data can be highly valuable; information redundancy has two faces: it is a problem, but it is also useful for discriminating information. Multi-document summarization deals with this affluence of information with methods to automatically extract the most important ideas from multiple sources and present them into a final summary. In some cases, summarization algorithms can use some clues to determine importance, such as in query-focused, comparative or update summarization: in these approaches, a sentence is considered important if it matches a specific query (query-focused), if it is the most discriminative sentence by representing the characteristics of a document group (comparative), or if it shows information different from what was previously known (of update).

However, one may need a general panorama of what texts say, i.e., a generic summary. In this setting, the importance of information is determined only with regard to the content of the input itself. So, in this case what should be considered of importance to the final reader? This is why we propose a different perspective to search for information in multiple documents based on redundancy, giving importance to an idea by its level of popularity. In this way, this massive redundancy would be of great help at assigning relevance to an idea.

For the purpose of evaluating the effect of redundancy on summarization, we developed an unsupervised method to automatically summarize multiple documents, providing the flexibility of extracting either popular or novel information from these documents. What differentiates this work from others is the way we deal with redundancy. We first analyzed all possible cases of redundancy that can arise per document, and across documents, within multiple texts. Afterwards, we developed a mechanism to extract different summaries—on the same group of documents—containing sentences generated by experimenting with these cases of intra-document and inter-document redundancy, such as: ideas redundant per author, ideas redundant by different authors, unique ideas that are not redundant, etc.

Studies have been carried on to find out what approach of redundancy humans prefer, leaving inconclusive results because this depends always on the information needs [23]. While some studies have suggested that people prefer to consult sources of information with low redundancy, others have shown that people ignore the redundancy in the sources or, in some cases, even prefer it. By applying the aforementioned mechanism on the DUC corpora and evaluating with the ROUGE measures2, we aim to study redundancy in the ambit of multi-document summarization. In this way, we can provide a general statement on what makes an idea important, considering its redundancy, so that it should be included in a summary.

The rest of this paper is structured as follows. In Section 2 we present an analysis of related work from the point of view of how these works handle redundancy. Section 3 explains our proposed method in full detail. In Section 4, we present our experiments to determine the influence of redundancy on summarization, using the proposed method. Finally, in Section 5 we draw our conclusions, along with possible lines of future work.

2 State of the art

There have been numerous methods that analyze redundancy when looking for information in multiple sources. For example, back in 1998, Bourbakis et al. [1] proposed a methodology for eliminating redundancy from documents of a database by comparing their paragraphs’ features. Back then, there were no common resources for evaluating these methods, so that now we focus on those works that have used DUC corpora and ROUGE measures; in this way, we can find out how redundancy has influenced the measure of sentences’ relevance in multi-document summarization.

All approaches agree that redundancy should be eliminated from the final summary. Some of them manage redundancy at the phase of sentence selection by selecting the top n non-redundant ranked sentences, i.e., avoiding sentences in the summary if their similarity to the previously selected sentences is greater than a given threshold [5 , 18].

The maximum marginal relevance (MMR) approach [7] searches novelty by penalizing the score of sentences according to their similarity with the already selected sentences for the summary. But in the same way they penalize this redundancy, they benefit similarity with regard to a given query. This method is typically applied during the process of sentence selection for query-focused summarization methods [12, 16]. In this way, the top sentences composing the final summary are important to the query and are not redundant between themselves, i.e., are diverse.

Modifications have been proposed on the MMR technique to change the benefit in the formula. Instead of benefiting its similarity with a given query, they favor other aspects related to the method, while the penalization on redundancy remains the same. For example Qiang et al. [19] made an adjustment to benefit the coverage of the sentence on the closed patterns of the documents. Also, Hennig [10] changed the formula to benefit the already computed score of the sentence according to its similarity with not only the query, but other elements of the text as the title of the cluster or the title of the document.

Furthermore, some other techniques penalize redundancy on the computation of the sentence score diferently from the MMR technique. For example Radev et al. [20] penalize a sentence if it overlaps with highest ranked sentences; on the other hand, Toutanova et al. [25] use a discount factor on the computation of each feature value of the sentence, where the magnitude of the discount or penalization is related to the similarity with the first selected sentences for the summary.

Some other authors, instead of penalizing, compensate the non-redundancy or diversity. For example, Parveen and Strube [17] compensate sentences by not sharing the same entities, while Galanis et al. [9] compensate sentences by not sharing the same bigrams through an optimization method pursuing, among other variables, in order to maximize the quantity of different bigrams in the final summary.

A different approach is to deal with redundancy in previous stages. In graph-based methods this is considered from the construction of the graph. For example, Mihalcea and Tarau [15] establish a maximum similarity threshold. In this way, very similar sentences are not linked with an edge and thereby they do not add importance to a node. Other graph-based approach is presented by Shen and Li [22], where the summary is represented by the minimum dominating set on a graph built by representing nodes as sentences and edges as similarities, hence leaving out the redundant sentences by looking for the minimum set.

Clustering methods also have been successful in reducing redundancy and representing diversity by grouping sentences which are highly similar to each other into one cluster, thus generating several clusters. Once sentences are clustered, sentence selection is performed by selecting a sentence from each cluster. At the end some extra measures have to be executed on the selected sentences to ensure non-redundancy on the final summary [5, 20].

As can be seen, none of these methods judges directly the importance of an idea by its absence or presence of redundancy. The usual management of redundancy is mainly focused on precisely avoiding it on their final summary. This is why this work focuses on studying redundancy handling through a flexible framework that allows us to experiment with different redundancy settings. This framework will be explained in the following section.

3 Proposal

We aim to take advantage of the presence or absence of intra-document and inter-document redundancy as relevant information. Of course, at the end redundancy should be eliminated from the final summary, but this information on sentences is not ignored. This section presents our proposed method for this purpose. The explanation is divided in the three stages of the extractive summarization process: sentence extraction, computing relevance—from redundancy—for each sentence, and the generation of the summary from the most relevant sentences.

3.1 Sentence extraction

Starting from a group of texts to be summarized, an appropriate representation to manipulate them is necessary. For this purpose, we represent them in a graph adopting a traditional method, i.e., by mapping all the sentences of the texts as nodes and the similarity between them as edges, that is, edges are weighted with the similarity value. The graph is complete: every node is connected to all other nodes.

A suitable similarity measure is very important. Following Le and Mikolov [11], we implemented a cosine similarity measure on sentences represented on a vector space model (Paragraph Vector—also known as doc2vec), which has shown to give good results in measuring similarity considering context. This measure has proven to be the best among other semantic models—such as LSA, LDA, or Word2Vec—for automatic document summarization [4].

For breaking a text into sentences, we used the standard script provided by NIST for evaluating summaries since DUC 20033. This script takes points, ellipsis, exclamation and question marks as end-of-sentence markers, taking care of not cutting well-known abbreviations and pronouns. To improve this process, the script first does some pre-processing steps on the text to homologate the different writing styles from every newspaper provider. By using this script, we conform to DUC standards to determine what is a sentence and what is not. In addition, a basic pre-procesing was performed on sentences by converting them to lower case and stemming—in order to improve the overlapping and sharing of context of words—and by removing punctuation marks.

3.2 Relevance search

After extracting sentences into the graph, nodes are weighted to initially determine sentence importance. This latter is going to be useful for decision making at removing sentences. We are adopting the TextRank [14] value as the initial weight of nodes—relevance of sentences. In an undirected graph, node weights are computed as in PageRank: the more connections the node has with other important nodes, the higher score it has. Moreover, the strength of the connection also accounts for giving it importance, i.e., the stronger the similarities of its edges with others, the more relevant the node.

Formally, the TextRank value can be computed as follows: let G = (V, E) be a directed graph with the set of vertices V and the set of edges E, where E is a subset of V × V. For a given vertex V_i, let In (V_i) be the set of vertices that point to it (predecessors), and let Out (V_i) be the set of vertices that vertex V_i points to (successors). The TextRank score of a vertex V_i is defined as follows

$\begin{matrix} WS (V_{i}) \\ = (1 - d) + d \sum_{V_{j} \in In (V_{i})} \frac{w_{ji}}{\sum_{V_{k} \in Out (V_{j})} w_{jk}} WS (V_{j}) \end{matrix}$ (1) where w_ij represents the strength of the connection from node V_i to node V_j, and d is a damping factor representing the probability of jumping from a given vertex to another random vertex in the graph. This factor appears in the context of Web surfing with the PageRank [3] method, where a user clicks on links at random with a probability d, and jumps to a completely new page with probability 1 - d. We adopted the commonly used value for this d factor of 0.85 [3, 14].

Since our graph is not directed, we treated the output and input edges as equal. After computing the initial relevance value for each node, we obtain a graph with weighted edges and weighted nodes, as depicted in Fig. 1. This figure represents all sentences from n documents, where each document has a variable number of sentences (m sentences for document 1, k sentences for document 2, l sentences for document n, etc.)

Fig.1

Graph representation after weighting edges.

Once we have this graph representation, we are able to experiment with redundancy. Redundancy will be exhibited through similarity: if two sentences are highly similar, then these sentences are redundant.

It is possible to determine the relevance of an idea depending on how redundant it is across documents, i.e., inter-document redundancy, or how redundant it is within each document, i.e., intra-document redundancy. For example: one can say that a statement is important if it is redundant across all documents—such as popular information, see Fig. 2a; or that it is important if it is not mentioned across all documents—as the concept of the inverse document frequency, that if it is mentioned by everyone then it is not relevant because everyone knows about it, see Fig. 2b; or that an statement is important if it is redundant only in one document but not on all of them—if only one author thinks it is important maybe it is, as valuable rare information, see Fig. 2c. All these ideas could be valid on different areas such as medicine, news, geography, politics, etc.

Fig.2

Some strategies to define idea importance according to redundancy.

After considering these notions, we determined nine possible strategies given the combination of intra-document and inter-document redundancy. These strategies establish that a sentence is relevant if:

It is redundant per document and redundant across documents.

It is redundant per document and not redundant across documents.

It is redundant per document.

It is not redundant per document and redundant across documents.

It is not redundant per document and not redundant across documents.

It is not redundant per document.

It is redundant across documents.

It is not redundant across documents.

Redundancy is not taken into account to provide relevance.

Strategies 3 and 6 determine the relevance of a sentence based on intra-document redundancy only, inter-document redundancy is irrelevant. Strategies 7 and 8 are based on intra-document redundancy as well, but consider relevance based on inter-document redundancy, without taking into account intra-document redundancy. Strategy 9 is a combination of these two latter ideas, by ignoring intra-document and inter-document redundancy, so that redundancy is eliminated without taking it into account to determine the relevance of the sentence.

These strategies are more clearly defined based on two parameters:

pd: per-document redundancy parameter, indicating how intra-document redundancy (r.) should be managed,

pd = 1 intra-document r. is important,

pd = 0 intra-document r. is irrelevant,

pd = -1 intra-document r. is undesired.

cd: cross-document redundancy parameter, indicating how inter-document redundancy (r.) should be managed,

cd = 1 inter-document r. is important,

cd = 0 inter-document r. is irrelevant,

cd = -1 inter-document r. is undesired.

Having established these strategies, what follows is the merging of similar nodes on the graph, taking care of how the relevance of the nodes is being carried on. To mimic the nature on how documents are processed by human readers, we settled to perform the merging of nodes first by document and then across documents. Starting from the graph with weighted edges and weighted nodes, we present the algorithm for merging, see Algorithm 1.

Algorithm 1

Merging process

Step 1 Choose the most similar two nodes on the graph: N₊ and N_-. Being N₊ the node with the highest weight (the important node) and N_- the other node (the less important node).

Step 2 Merge nodes, taking into account the following aspects:

1. The text of N₊ remains, while other is dropped.

2. All the edges of N_- must be inherited to N₊ by averaging the weight of their correspondent edges.

3. Being R (N₊) the weight of N₊, it is re-computed depending on parameters pd and cd:

if redundancy is important (pd = =1 or cd = =1) then R (N₊) = R (N₊) + R (N_-)

else if redundancy is irrelevant (pd = =0 or cd = =0) then R (N₊) = R (N₊)

else if redundancy is undesired (pd = = -1 or cd = = -1) then R (N₊) = R (N₊) - R (N_-)

Step 3 Repeat step 1 until a similarity threshold [15] is reached.

To depict this process, a single merging between two given nodes is shown in Fig. 3. Let us follow the merging process between these two nodes together with the presented algorithm.

Fig.3

Merging process example of two nodes.

At step 1, the two most similar nodes are detected in Fig. 3a, N₊ is d_ns₂ and N_- is d₂s₂. Next, the merges of these two nodes for step 2 are presented from Fig. 3b to d. For all cases, steps 2.1 and 2.2 of the algorithm are performed in the same way: the text of d_ns₂ remains, while d₂s₂ disappears; before disappearing, all the edges of d₂s₂ are inherited to d_ns₂ by averaging their weights—this is represented as thickness of the edges. It can be seen how before merging there was no edge between nodes d_ns₂ and d₂s₁, which later appears because node d₂s₂ inherited it to d_ns₂ before disappearing, with half of its original thickness.

Now we focus on the three possible cases at step 2.3, where the weight of N₊ is re-computed. Figure 3b shows the resulting relevance of N₊ when redundancy is considered as important. Figure 3c shows the resulting relevance of N₊ when redundancy is taken as irrelevant, and Fig. 3d shows the resulting relevance of N₊ when redundancy is undesired. In this way, the merging process increases, ignores, or penalizes the relevance of the node when it finds a similar sentence, giving the flexibility needed to experiment with the different strategies.

When merging per document, at the first step the two chosen nodes must belong to the same document and the parameter taken into account at step 2.3 is pd. The process finishes when the established intra-document similarity threshold is reached. When merging across documents, at the first step the two chosen nodes must belong to different documents and the parameter taken into account at step 2.3 is cd. The process stops when the established inter-document similarity threshold is reached. Once the algorithm finishes, each sentence will have a computed relevance value according to the selected redundancy strategy.

3.3 Summary generation

Once the assignment of relevance to the sentences has finished, a ranking of sentences will result. Usually at this stage, the most relevant sentences compose the final summary, but some measures are implemented to ensure that they are not redundant between each other.

Given that we removed redundancy together with the computing of relevance, no redundancy is expected in the ranked sentences. Therefore, we can simply pick the top relevant sentences and employ them in the final summary until a maximum length is reached, ensuring non-redundancy.

4 Experiments and results

For our experiments we used the DUC 2004 task 2 and DUC 2007 main task corpora. No data set to train the method was necessary since it is unsupervised. The specifications of these corpora are shown in Table 1.

Table 1
Specifications of the DUC 2004 and DUC 2007 corpora

DUC 2004 task 2 DUC 2007 main task

Type of summaries Generic Query-focused

Number of clusters 50 45

Number of documents per cluster 10 25

Number of reference summaries per cluster 4 4

Maximum length allowed in summaries 665 bytes 250 words

	DUC 2004 task 2	DUC 2007 main task
Type of summaries	Generic	Query-focused
Number of clusters	50	45
Number of documents per cluster	10	25
Number of reference summaries per cluster	4	4
Maximum length allowed in summaries	665 bytes	250 words

By using DUC corpora, we wanted to study which of the previous stated redundancy strategies are used by gold-standard DUC summaries. Since the majority of works involved in summarization compare their resulting summaries against those summaries provided by DUC, this study will lead us to state the general understanding of what kind of redundancy makes an idea important to become part of a summary—according to DUC gold standards.

4.1 Experiment and results with different similarity measures

For computing the doc2vec-based similarity measure, the Paragraph Vector method was used to create a model and generate vectors of the sentences, later they will be compared using a cosine measure.

We adopted the Distributed Memory Model of Paragraph Vectors (PV-DM) of doc2vec. This is a fully connected multi-layer perceptron with three layers: the input, projection and output layer. As input for the first layer, the one-hot encoded vector of the paragraph and the one-hot encoded vectors of words within this paragraph, must be provided—the number of words is specified by a given sliding window. The number of neurons for this input layer depends on the length of the dictionary of words and paragraphs used for training this neural network. Input vectors are then combined by concatenating them at the projection layer. Finally, at the output layer, a hierarchical softmax with a binary Huffman tree is used to normalize output values and optimize the training process. The training process is made by minimizing the error with the stochastic gradient descent, where the gradient is obtained via backpropagation.

This neural network requires big volumes of texts to be trained. For this requirement, the training was done with the dataset from “One Billion Word Language Modeling Benchmark”4 that has almost 1 billion words. Punctuation marks were removed, so the sentences to be represented afterwards in this vector space model are also processed without punctuation marks. Each resulting vector has a dimension of 100 elements (this is configured at the projection layer of the neural network), the chosen window between the predicted word and context words was of 8 words, we ignored all words with a total frequency lower than 5, and we did 10,000 iterations at the inference stage for computing new sentence vectors.

As doc2vec has proven to outperform other simple similarity measures in different tasks [11], we decided to carry out our experiments based on the doc2vec-based similarity measure.

To exemplify the use of the doc2vec-based similarity measures with the DUC corpora, we present a few sample sentences from a cluster of news of DUC 2002, where the covered topic is the opening of the first McDonald’s in different countries. See Table 2. Consider sentences s3 and s4 as a reference to be compared with the other sentences. Sentence s3 and sentence s8 refer to a similar idea, and sentences s4, s6 and s7 seem to refer to the same fact. Similarity values using doc2vec and the superposition similarity measure—which simply measures the overlapping of words between sentences [14]—are shown in Table 3. doc2vec-based values are between -1 and 1, while similarities of the superposition measure start from 0 having no upper limits—the upper limit depends on the length of the sentence. The highest values reported by the doc2vec-based similarity measure correspond to the most similar ideas, while the superposition similarity measure does not perform well when comparing s3 by giving the highest value to an idea that is not similar. Also note that doc2vec performs better at comparing sentences that are similar but not necessarily by sharing the same words—as the case between s3 and s8.

Table 2
Examples from DUC 2002 with similarity measures

Id Sentence

s1 John Onoda, a spokesman at McDonald’s Oak Brook, Ill., headquarters, said it was the first of the chain’s outlets in a communist country.

s2 The restaurant has 350 seats and employs 110 people capable of serving 2500 meals per hour.

s3 Normally dour citizens broke into grins as they caught the infectious cheerful mood from youthful Soviet staffers hired for their ability to smile and work hard.

s4 The world’s largest version of the landmark American fast-food chain rang up 30000 meals on 27 cash registers, breaking the opening-day record for McDonald’s worldwide, officials said.

s5 McDonald’s hamburgers, fries and golden arches came to China on Monday when the fast-food chain opened its first restaurant in a nation famed for its distinctive cuisine.

s6 With seats for 900 and 27 cash registers, it served 30000 people on opening day.

s7 The world’s largest McDonald’s, with 27 cash registers and a seating capacity of 900, brings to Moscow not only hamburgers, french fries and shakes (called ’milk cocktails’ here), but also a living lesson in Western-style marketing.

s8 At training sessions before opening day, cashiers were taught the importance of greeting customers cheerfully, of saying ’please’ and ’thank you’ —all of which promises something distinctly different from the typically surly service at most of Moscow’s dingy state cafes.

Id	Sentence
s1	John Onoda, a spokesman at McDonald’s Oak Brook, Ill., headquarters, said it was the first of the chain’s outlets in a communist country.
s2	The restaurant has 350 seats and employs 110 people capable of serving 2500 meals per hour.
s3	Normally dour citizens broke into grins as they caught the infectious cheerful mood from youthful Soviet staffers hired for their ability to smile and work hard.
s4	The world’s largest version of the landmark American fast-food chain rang up 30000 meals on 27 cash registers, breaking the opening-day record for McDonald’s worldwide, officials said.
s5	McDonald’s hamburgers, fries and golden arches came to China on Monday when the fast-food chain opened its first restaurant in a nation famed for its distinctive cuisine.
s6	With seats for 900 and 27 cash registers, it served 30000 people on opening day.
s7	The world’s largest McDonald’s, with 27 cash registers and a seating capacity of 900, brings to Moscow not only hamburgers, french fries and shakes (called ’milk cocktails’ here), but also a living lesson in Western-style marketing.
s8	At training sessions before opening day, cashiers were taught the importance of greeting customers cheerfully, of saying ’please’ and ’thank you’ —all of which promises something distinctly different from the typically surly service at most of Moscow’s dingy state cafes.

Table 3

Similarity values between sentences used for comparison (rounded to the first decimal)

	s1	s2	s3	s4	s5	s6	s7	s8
Doc2vec-based similarity values
s3	0.0	0.0	1.0	0.0	0.2	0.0	0.0	0.3
s4	0.1	0.1	0.0	1.0	0.1	0.3	0.1	0.1
Superposition similarity values
s3	0.3	0.7	9.1	0.6	1.3	0.7	0.9	0.9
s4	2.0	1.1	0.6	9.3	2.7	2.9	2.9	1.6

4.2 Results with different redundancy strategies

We tested the merging method with all combinations of the parameters described in Section 3.2, yielding nine redundancy strategies. Our method was applied on the graphs built with the clusters of texts from DUC 2004 and DUC 2007 corpora. The resulting summaries were evaluated against summaries provided by DUC with the ROUGE method.

Table 4 presents the results of the experiments performed on the DUC 2004 corpus, while Table 5 presents the results of the experiments performed on the DUC 2007 corpus. Only recall measures are presented for these experiments, and results are ordered by ROUGE-1 in descending order.

Table 4
Results for redundancy strategies for DUC 2004 texts

Strategy Parameters R-1 R-2 SU4

pd cd

7 0 1 0.3775 0.0771 0.1283

4 -1 1 0.376 0.076 0.127

1 1 1 0.370 0.075 0.125

9 0 0 0.359 0.071 0.121

6 -1 0 0.359 0.070 0.119

8 0 -1 0.356 0.067 0.117

5 -1 -1 0.354 0.066 0.116

3 1 0 0.350 0.065 0.113

2 1 -1 0.347 0.063 0.112

Strategy	Parameters	R-1	R-2	SU4
7	0	1	0.3775	0.0771	0.1283
4	-1	1	0.376	0.076	0.127
1	1	1	0.370	0.075	0.125
9	0	0	0.359	0.071	0.121
6	-1	0	0.359	0.070	0.119
8	0	-1	0.356	0.067	0.117
5	-1	-1	0.354	0.066	0.116
3	1	0	0.350	0.065	0.113
2	1	-1	0.347	0.063	0.112

Table 5

Results for redundancy strategies for DUC 2007 texts

Strategy	Parameters		R-1	R-2	SU4
	pd	cd
6	-1	0	0.419	0.100	0.155
4	-1	1	0.419	0.103	0.1581
9	0	0	0.417	0.101	0.156
7	0	1	0.415	0.102	0.156
1	1	1	0.412	0.101	0.155
3	1	0	0.410	0.098	0.152
5	-1	-1	0.410	0.088	0.145
8	0	-1	0.409	0.091	0.147
2	1	-1	0.408	0.092	0.147

Table 4 shows that, for DUC 2004, the strategy that generated the best ROUGE results was number 7, i.e., giving relevance to ideas redundant across documents. The next two best results reinforce the same idea: giving relevance to ideas redundant across documents (cd = 1) is the best strategy. On the other hand, the strategies where an idea is penalized by being redundant across documents (cd = -1) were ranked in the four worst results.

Another interesting pattern to notice is, when the cd parameter is left static, the pd parameter follows the same ranking: first, it is best to ignore intra-document redundancy (pd = 0), a less good strategy is to penalize it (pd = -1), and the worst strategy is to take it into account to provide relevance (pd = 1). A similar case happens when the pd parameter is left static, being the same rank for the cd parameter: the best strategy is to take into account inter-document redundancy (cd = 1), then ignore this redundancy (cd = 0) and being the worst option to penalize it (cd = -1).

The more the inter-document redundancy is considered as favorable, the better results are, whereas intra-document redundancy is irrelevant for the importance of the sentence.

For DUC 2007, Table 5 shows that the best strategy is number 6 or number 4 with a small difference on the scores. Both strategies penalize the relevance of the sentence if it is redundant per document (i.e., intra-document redundancy). Strategy 6 does not take into account inter-document redundancy while strategy 4 does. On the other hand, the worst three strategies, resulted from penalizing inter-document redundancy (cd = -1).

An interesting pattern to observe is that, when the cd parameter is left static: the best option is to penalize intra-document redundancy (pd = -1), then ignore this redundancy (pd = 0) and the worst option is to consider relevance when intra-document redundancy is present (pd = 1).

For the case of DUC 2007, we can say that penalizing redundancy per document yields better results, while statements that are redundant across documents (i.e., inter-document redundancy) are desirable. In other words, statements not biased to a single source are better. We analyzed some clusters of documents where penalizing intra-document redundancy was better and we found that giving importance to this type of redundancy reduces the possibility of taking inter-document redundancy into the final summary, which is more relevant for multi-document summarization.

Finally, the fact that there are strategies that are able to obtain a higher score than strategy 9 (pd = 0, cd = 0), means that considering redundancy to enrich relevance calculation in sentences improves summaries.

4.3 Comparison with existent methods

Despite the goal of this work is not to outperform the best techniques on multi-document summarization, we present a comparison with other techniques to show that this technique, on its best redundancy strategy for the DUC corpora, is able to outperform well-known baselines. We aimed to leave the method as simple as possible in order to attest the effect of changing redundancy strategies only, while avoiding interference from other improvement stages of the summarization method.

Baseline methods were designed with the purpose of having a reference for comparison that should always be outperformed when a new method is proposed. The most well-known baselines are those required by DUC on their competitions: random, lead and coverage.

Random baseline, consists in generating summaries by extracting random sentences to build the final summary. To better represent the performance of this method, the experiment is performed many times and the averaged score of this executions is given as the final score.

Lead baseline, consists in taking the last sentences of the last document in the cluster. In DUC corpora, this is a challenging baseline because the last document is the most recent document on the collection, conveying the most updated information—the latest news.

Coverage baseline, consists in taking the first sentence of the first document, then the first sentence of the second document, and so on, taking the first sentence of each document of the cluster until the maximum length of the summary is reached. This baseline is a challenging one [2], specially on news where the most relevant information is put immediately on the first paragraph to catch the reader’s attention.

We present a comparison between the results of our method with the best redundancy strategy, which was the number 7 (taking ideas redundant across documents) for DUC 2004, and number 4 (taking ideas redundant across documents, but not redundant per document) for DUC 2007—considering its ROUGE-2 and ROUGE-SU4 scores, against baselines and best performing methods. For these comparisons see Tables 6 and 7. We show only the recall measure for ROUGE-1, ROUGE-2 and ROUGE-SU4, because this is the most widely available score for the methods compared.

Table 6
Comparison with baselines and best methods using DUC 2004 corpus

Method R-1 R-2 SU4

PatSum [19] – 0.102 –

wHAASum [6] 0.416 0.095 0.138

SentTopic-MultiRank [26] 0.410 0.099 0.143

Our method (pd = 0, cd = 1) 0.377 0.077 0.128

Coverage 0.344 0.076 0.117

Lead 0.329 0.063 0.105

Random 0.319 0.049 0.096

Method	R-1	R-2	SU4
PatSum [19]	–	0.102	–
wHAASum [6]	0.416	0.095	0.138
SentTopic-MultiRank [26]	0.410	0.099	0.143
Our method (pd = 0, cd = 1)	0.377	0.077	0.128
Coverage	0.344	0.076	0.117
Lead	0.329	0.063	0.105
Random	0.319	0.049	0.096

Table 7

Comparison with baselines and best methods using DUC 2007 corpus

Method	R-1	R-2	SU4
ILP2 [9]	–	0.125	0.176
IIIT Hyderabad—peer 15 [18]	–	0.124	0.177
PLSA-JS [10]	0.458	0.116	0.176
Our method (pd = -1, cd = 1)	0.419	0.103	0.158
Random	0.375	0.069	0.125
Coverage	0.365	0.083	0.128
Lead	0.309	0.060	0.107

4.4 Statistical significance

To verify statistical significance of presented results for all nine strategies, we present in Table 8 for DUC 2004 the results with their confidence interval of 95%, which are provided by the ROUGE script. To decide whether one strategy is better than another, results can be examined by taking the minimum value of the best strategy and comparing it with the average score for every other strategy.

Table 8
ROUGE-1 results on all strategies of redundancy for DUC 2004 with a 95% confidence interval

Strategy Parameters average minimum maximum

pd cd

7 0 1 0.3775 0.3651 0.3897

4 -1 1 0.3763 0.3632 0.3889

1 1 1 0.3709 0.3565 0.3851

9 0 0 0.3594 0.3461 0.3728

6 -1 0 0.3590 0.3469 0.3711

8 0 -1 0.3560 0.3423 0.3691

5 -1 -1 0.3546 0.3403 0.3680

3 1 0 0.3504 0.3357 0.3643

2 1 -1 0.3471 0.3328 0.3603

Strategy	Parameters	average	minimum	maximum
7	0	1	0.3775	0.3651	0.3897
4	-1	1	0.3763	0.3632	0.3889
1	1	1	0.3709	0.3565	0.3851
9	0	0	0.3594	0.3461	0.3728
6	-1	0	0.3590	0.3469	0.3711
8	0	-1	0.3560	0.3423	0.3691
5	-1	-1	0.3546	0.3403	0.3680
3	1	0	0.3504	0.3357	0.3643
2	1	-1	0.3471	0.3328	0.3603

For DUC 2004, all strategies with cd = 1 have a minimum score greater than the average score of every other strategy. Therefore, we can conclude that the strategies where cd = 1 outperform others with a 95% of confidence. Regarding the strategies where the cd flag varies between 0 and -1, there are not conclusive differences, even when the pd flag is varied between all of its possible values: 1, 0, - 1.

5 Conclusions

An unsupervised method to summarize multiple documents was presented. This method provides the flexibility to experiment with all variations of intra-document and inter-document redundancy, allowing to generate summaries with popular or rare—which can be viewed also as novel or diverse—information within a cluster of documents. These summaries can be convenient for some areas where popularity or rarity are useful to provide a general idea of what multiple texts talk about.

Using this method, we presented a study of nine different strategies to extract information depending of how redundancy within a document and in different documents is managed. By studying these strategies on DUC corpora, we could establish a general statement about what makes an idea important—in the ambit of news—to be part of a summary, according to DUC gold standards.

We discovered that summaries should convey ideas redundant across documents, whereas taking ideas redundant per document seems to be inconvenient. The main conclusion we obtained, with a confidence greater than 95%, is that a multi-document generic summary should contain the most redundant—popular—information between different sources. These results lead us to conclude that considering inter-document redundancy is better than ignoring it or penalizing it for generic summaries.

We found also that providing relevance to ideas according to their level of redundancy, does improve the summary. For all methods were a final ranking of sentences results as a process of computing sentence relevance, this method can be implemented as a mechanism to eliminate redundancy and at the same time take it into account to improve the ranking.

Summaries for each corpus are different: summaries from DUC 2004 are generic, while those for DUC 2007 are query-focused. Having obtained different strategies for each corpus is an expected result, since different types of summaries should need different redundancy strategies.

As future work, it is possible to enhance the readability of the final summary by identifying the correct order to present the highest ranked sentences. To find this order, it could be possible to add directed edges on the graph according to the order of appearance of the sentences on each document.

In addition, results of this method can be improved to outperform other methods with several techniques. For example, by applying the proposed merging process on the best performing methods at the stage of redundancy reduction; in this way, the presence of redundancy can be used to improve the value of the sentence’s relevance. It is also possible to try more experiments where better similarity measures are obtained; varying the parameters involved on the neural network of Paragraph Vector; adding more data; more types of pre-processing on the text as stemming, removing punctuation marks or stop words, etc.

Finally, the proposed method can be applied on other corpora apart from DUC, to assess how the same redundancy strategies used to build a good summary for news, vary in different areas. Particularly, evaluation methods without summary references will be explored [13 , 24].

Footnotes

The DUC corpora (http://duc.nist.gov/) are a set of resources generated by the Document Understanding Conferences, organized by the National Institute of Standards and Technology (NIST) of the United States of America from 2001 to 2007.

References

Bourbakis

, Meng

, Wu

, Salerno

and Borek

, Removal of redundancy in documents retrieved from dierent resources, Proceedings of IC-TAI 1998, 10th IEEE International Conference on Tools with Articial Intelligence (1998), 112–119.

Brandow

, Mitze

and Rau

L.F.

, Automatic condensation of electronic publications by sentence selection, Information Processing and Management31(5) (1995), 675–685.

Brin

and Page

, The anatomy of a large-scale hypertextual web search engine, Computer Networks and ISDN Systems30(1-7) (1998), 107–117.

Campr

and Ježek

, Comparing semantic models for evaluating automatic document summarization, Proceedings of TSD 2015, 18th International Conference on Text, Speech, and Dialogue (2015), 252–260.

Canhasi

and Kononenko

, Multi-document summarization via archetypal analysis of the content-graph joint model, Knowledge and Information Systems41(3) (2014), 821–842.

Canhasi

and Kononenko

, Weighted hierarchical archetypal analysis for multi-document summarization, Computer Speech and Language37 (2016), 24–46.

Carbonell

and Goldstein

, The use of MMR, diversity-based reranking for reordering documents and producing summaries, Proceedings of SIGIR 1998, 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1998), 335–336.

Celikyilmaz

and Hakkani-Tur

, A hybrid hierarchical model for multi-document summarization, Proceedings of ACL 2010, 48th Annual Meeting of the Association for Computational Linguistics (2010), 815–824.

Galanis

, Lampouras

and Androutsopoulos

, Extractive multi-document summarization with integer linear programming and support vector regression, in pp, Proceedings of COLING 2012, 24th International Conference on Computational Linguistics (2012), 911–926.

10.

Hennig

, Topic-based multi-document summarization with probabilistic latent semantic analysis, Proceedings of RANLP 2009, 7th conference on Recent Advances in Natural Language Processing (2009), 144–149.

11.

and Mikolov

, Distributed representations of sentences and documents, Proceedings of ICML 2014, 31st International Conference on Machine Learning (2014), 1188–1196.

12.

and Li

, Query-focused multi-document summarization: Combining a topic model with graph-based semi-supervised learning, Proceedings of COLING 2014, 25th International Conference on Computational Linguistics (2014), 1197–1207.

13.

Louis

and Nenkova

, Automatically assessing machine summary content without a gold standard, Computational Linguistics39(2) (2013), 267–300.

14.

Mihalcea

and Tarau

, TextRank: Bringing order into texts, Proceedings of EMNLP 2004, 9th conference on Empirical Methods in Natural Language Processing4(4) (2004), 404–411.

15.

Mihalcea

and Tarau

, A language independent algorithm for single and multiple document summarization, Proceedings of IJCNLP 2005, 2nd International Join Conference on Natural Language Processing (2005), 19–24.

16.

Ouyang

, Li

and Lu

, Applying regression models to query-focused multi-document summarization, Information Processing and Management47(2) (2011), 227–237.

17.

Parveen

and Strube

, Multi-document summarization using bipartite graphs, Proceedings of TextGraphs-9: Graph-based Methods for Natural Language Processing, Workshop at EMNLP 2014 (2014), 15–24.

18.

Pingali

, K

and Varma

, IIIT hyderabad at DUC in 2007, Proceedings of DUC 2007, 7th Document Understanding Conference (2007).

19.

Qiang

J.-P.

, Chen

, Ding

, Xie

and Wu

, Multi-document summarization using closed patterns, Knowledge-Based Systems99 (2016), 28–38.

20.

Radev

D.R.

, Jing

, Styś

and Tam

, Centroid-based summarization of multiple documents, Information Processing and Management40(6) (2004), 919–938.

21.

Saggion

, Torres-Moreno

J.-M.

, Cunha

I.D.

and SanJuan

, Multilingual summarization evaluation without human models, in; Posters, Proceedings of the 23rd International Conference on Computational Linguistics (2010), 1059–1067.

22.

Shen

and Li

, Multi-document summarization via the minimum dominating set, Proceedings of COL-ING 2010, 23rd International Conference on Computational Linguistics 2 (2010), 984–992.

23.

Soll

J.B.

, Intuitive theories of information: Beliefs about the value of redundancy, Cognitive Psychology38(2) (1999), 317–346.

24.

Torres-Moreno

J.-M.

, Saggion

, Cunha

I.D.

, San-Juan

and Velázquez-Morales

, Summary evaluation with and without references, Polibits (2010), 4213–20.

25.

Toutanova

, Brockett

, Gamon

, Jagarla-mudi

, Suzuki

and Vanderwende

, The PYTHY summarization system: Microsoft research at DUC, Proceedings of DUC 2007, 7th Document Understanding Conference (2007).

26.

Yin

, Pei

, Zhang

and Huang

, SentTopic-MultiRank: A novel ranking model for multi-document summarization, in}} International Conference on Computational Linguistics, Proceedings of COLING 2012, 24th (2012), 2977–2992.