Unsupervised extractive multi-document text summarization using a genetic algorithm

Abstract

The task of Extractive Multi-Document Text Summarization (EMDTS) aims at building a short summary with essential information from a collection of documents. In this paper, we propose an EMDTS method using a Genetic Algorithm (GA). The fitness function considers two unsupervised text features: sentence position and coverage. We propose the binary coding representation and the selection, crossover, and mutation operators. We test the proposed method on the DUC01 and DUC02 data sets; for each collection of documents (876 documents in total), two tasks are tested (summary lengths of 200 and 400 words), giving four tasks overall. In addition, we analyze the most frequently used summarization methodologies. Moreover, different heuristics such as topline, baseline, baseline-random, and lead baseline are calculated. In the results, the proposed method improves on state-of-the-art results.

Keywords

Genetic algorithm, heuristics, unsupervised, extractive multi-document text summarization

1 Introduction

The diffusion of the Internet has led to the exchange of digital texts, making it an efficient communication channel. Users can obtain and share information almost instantly; the Internet contains billions of documents and is growing at an exponential rate [22, 31], resulting in an inevitable problem of information overload. This situation affects the efficiency of the human brain: the stress produced by receiving more information than one can understand interferes with decision-making abilities. It causes reading fatigue and loss of time, which affects the productivity of users [15, 24].

Automatic Text Summarization (ATS) can be used to manage the problem of information overload. The general process of summarization consists of rewriting the full text into a short version [34]. ATS is a task of Natural Language Processing (NLP) that consists of selecting the essential units, which could be paragraphs, sentences, parts of sentences, or keywords, from a document or collection of documents using state-of-the-art methods or commercial systems.

ATS methods can be abstractive or extractive. In abstractive text summarization, summaries are composed by fusing and generating new text that describes the essential facts [23]. In extractive text summarization, sentences or other parts of the documents are extracted and concatenated to compose a summary [25, 31].

Depending on the number of documents, summarization techniques can be classified into two tasks: Single-Document Text Summarization (SDTS) and Multi-Document Text Summarization (MDTS). The main goal of MDTS is to allow users to get an overview of the topics and relevant information in a collection of documents within a relatively short time [4, 41]. MDTS has gained interest since the mid-1990s [12], starting with the development of evaluation programs such as the Document Understanding Conferences (DUC) [43] and the Text Analysis Conferences (TAC) [47].

The ATS task relies on combinatorial optimization algorithms to select the most essential parts of the text. Such algorithms search for a solution close to the best one, which is considered an optimized solution [1, 42]. One example is the Genetic Algorithm (GA), a search technique inspired by natural selection and genetic mechanisms [6, 48].

Several approaches to ATS have been developed in the state of the art. Usually, a score is assigned to each sentence of the documents, taking into account specific features that determine the relevance of the sentences [19, 53]. Some features that have been considered for including a sentence in the final summary are: keywords; similarity with the title; length of the sentence; identification of verbs, adverbs, proper names, and font types; and reduction of redundancy. However, among the most used, sentence position and coverage stand out [3, 54].

Consequently, in this paper we propose an EMDTS method using a GA that considers two unsupervised text features: sentence position and coverage. Moreover, we analyze the most frequently used methodologies, which we divide into two groups of works: the methodology without considering all sentences and the methodology considering all sentences. We adopt the second methodology for building the final summary [8, 57].

The paper is organized as follows. Section 2 describes the related work. Section 3 presents the principal methodologies used in MDTS. Section 4 explains the proposed method. Section 5 shows the experimental configuration and results, together with a comparison to other state-of-the-art methods and heuristics. Finally, Section 6 presents the conclusions.

2 Related work

Multi-document summarization has been widely studied. Researchers all over the world are trying different directions to find the methods that provide the best results.

In [57], a multi-document summarization method is proposed that uses sentence-based topic models, with a Bayesian algorithm to estimate the parameters of the model. This method calculates the probability distributions of selecting sentences (term-document and term-sentence associations).

The proposal explicitly models the probability distributions of selecting sentences given topics and provides a principled way for the summarization task. An efficient variational Bayesian algorithm is derived for estimating model parameters.

On the other hand, NMF (Non-Negative Matrix Factorization) is proposed in [44]. This method can improve the quality of document summaries because the inherent semantics of the documents are well reflected by the semantic features calculated by NMF [57].

In [56], BSTM (a sentence-based topic model with a Bayesian algorithm) is proposed; it extends the NMF model and provides a good framework for weighting different terms and documents.

In [28], NeATS (Next Generation Automated Text Summarization) is presented. NeATS is an extraction-based multi-document summarization system. It leverages techniques proven effective in single-document summarization, such as term frequency, sentence position, stigma words, and a simplified version of maximum marginal relevance, to select and filter content. To improve topic coverage and readability, it uses term clustering, a ‘buddy system’ of paired sentences, and explicit time annotation.

In [14], a stochastic graph-based method (LexPageRank) is introduced for computing the relative importance of textual units. It relies on the concept of sentence salience to identify the most important sentences in a document or set of documents. Salience is typically defined in terms of the presence of particular important words or in terms of similarity to a centroid pseudo-sentence.

In [7], multi-document summaries were constructed by utilizing complete sentences from the documents in the collection. Classic clustering techniques were employed in an attempt to partition the set of sentences into disjoint subsets or clusters, each of which contained sentences covering exactly one topic. Clusters are ranked by their similarity with the vector of the term frequencies of all terms appearing in the documents to be summarized.

Finally, in [17, 36], single-document text summarization methods based on a genetic algorithm with the features sentence position and coverage were proposed. Both works hypothesize that the methods can be applied to multi-document text summarization. However, they were only tested for single-document text summarization.

Consequently, we propose an EMDTS method based on a GA, where the fitness function is calculated considering two unsupervised text features: sentence position and coverage. Unsupervised text features may give rise to novel model constructions autonomously emerging from the text. The sentence position feature is based on the hypothesis that the importance of a sentence decreases with its distance from the beginning of the document. Coverage means that the generated summary should cover all subtopics as much as possible; that is, it refers to the extent to which the information provided in the original documents is included in the summary.

3 Multi-document text summarization methodologies

In this paper, we consider the most frequently used methodologies, which we divide into two groups of works: the methodology without considering all sentences [40, 55] and the methodology taking into account all sentences [3, 39]. The first methodology consists of building an individual summary for each document and then constructing the final summary from them. The second methodology consists of joining all documents of a collection into a single document and then building only one final summary. In this section, we describe the two methodologies and explain why we apply the second one in this paper.

3.1 Methodology without considering all sentences

The first methodology uses a so-called “meta” summarization procedure for generating multi-document text summaries [40], described as follows:

Composing single summaries: In the first step, the relevant sentences of each document are detected independently; in other words, essential sentences are detected locally for each document of the collection.

Composing the multi-document summary: In the second step, the single summaries are merged, producing a “meta” document, which is then summarized through the same algorithm used in the previous step or a different one [52, 55].

This methodology is represented in Fig. 1, which shows that the essential sentences of each document in the collection are detected independently.

Fig. 1

Methodology without considering all sentences [52, 55].

This methodology rests on the hypothesis that the final multi-document summaries will be of higher quality since only relevant information is considered for MDTS. Nonetheless, this methodology does not take into account all the sentences of the collection, so it cannot reach the upper bound.

3.2 Methodology considering all the sentences

This methodology consists of the following steps:

Combining all documents from a given collection: In the first step, a new single document is created containing all the documents from a given collection of documents.

Composing a single summary: In the second step, a summary of this new single document is generated.

This methodology is represented in Fig. 2, which shows the merging of all documents from a given collection. That is, in the new single document, the set of all the sentences that the document collection contains is represented as D = {S₁, S₂, …, S_n}, where S_i corresponds to the i-th sentence of the document collection and n is the total number of sentences in this collection. Likewise, a sentence is represented by the set S_i = {t_i1, t_i2, …, t_ik, …, t_io}, where t_ik is the k-th term of the sentence S_i and o is the total number of terms in the sentence [3, 39].

Fig. 2

Methodology considering all the sentences [3, 39].

In this paper, we use the second methodology because we can consider all sentences for the final summary. The most recent state-of-the-art methods also use this methodology.
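The second methodology can be illustrated with a minimal sketch that merges a collection into one list of sentences D and represents each sentence S_i by its terms. The function names (`split_sentences`, `merge_collection`, `sentence_terms`) and the naive punctuation-based sentence splitter are assumptions made for illustration only; they are not the pre-processing used in the paper.

```python
import re

def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def merge_collection(documents):
    """Build D = {S_1, ..., S_n}: the sentences of every document, in document order."""
    merged = []
    for doc in documents:
        merged.extend(split_sentences(doc))
    return merged

def sentence_terms(sentence):
    """Represent a sentence S_i as the list of its lowercase terms t_i1, ..., t_io."""
    return re.findall(r"\w+", sentence.lower())

# Toy collection with two documents (hypothetical text).
docs = [
    "The storm hit the coast. Thousands lost power.",
    "Crews restored power on Monday.",
]
D = merge_collection(docs)
terms = [sentence_terms(s) for s in D]
```

Here `D` holds every sentence of the collection, so a later selection step can choose among all of them rather than among per-document summaries.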

4 Proposed method

4.1 Pre-processing

In this step, the documents of the collection are chronologically ordered; then the original text is adapted to the input format of the GA, in which the text is separated into sentences. Text pre-processing is also applied to the collection of documents. First, the text is divided into words separated by commas; then, some tags are placed in the text to differentiate quantities, e-mail addresses, among others [17, 36].

4.2 Text model

The simplest and most successful form of text modeling is the word sequence of a determinate size n, known as an n-gram [25]. An n-gram is defined as a subsequence of consecutive elements in a given sequence; n-grams have been widely utilized in diverse ATS research because of their easy extraction and because they decrease the loss of context (enhancing the extracted terms at larger n sizes), making them robust for ATS [54]. We tested with n-grams of length 2 [26, 36].
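As a small illustration of the n-gram text model, the following sketch extracts word bigrams from tokenized sentences and counts their frequencies. The helper names `ngrams` and `ngram_frequencies` are hypothetical, and tokenization is assumed to have been done beforehand.

```python
from collections import Counter

def ngrams(terms, n=2):
    """Return the consecutive n-term subsequences of a list of terms."""
    return [tuple(terms[i:i + n]) for i in range(len(terms) - n + 1)]

def ngram_frequencies(sentences, n=2):
    """Count n-grams over a list of tokenized sentences (n-grams do not cross sentences)."""
    counts = Counter()
    for terms in sentences:
        counts.update(ngrams(terms, n))
    return counts

sentences = [["the", "storm", "hit", "the", "coast"], ["the", "storm", "passed"]]
freq = ngram_frequencies(sentences, n=2)
```

With n = 2 this yields the bigram model used in the experiments; a sentence shorter than n simply contributes no n-grams.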

4.3 Genetic algorithm

The basic configuration of GA is defined as follows [11, 13]: the initial population is randomly generated, while the population of other generations is generated from some selection/reproduction procedure. The search process terminates when a termination criterion is met. Otherwise, a new generation will be produced, and the search process continues. The termination criterion can be selected as a maximum number of generations, or the convergence of the genotypes of the individuals. Genetic operators are constructed according to the problem to be solved, so the crossover operator has been applied to the generation of summaries.

Encoding. Binary encoding is used for each individual, where each sentence of the document constitutes a gene. The values 1 and 0 determine whether or not the sentence will appear in the final summary. The initial population is randomly generated [17, 36].
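A minimal sketch of the binary encoding follows. It assumes that a random individual is built by switching genes on in random order while a word budget allows it; the text only states that the initial population is random, so the budget handling and the name `random_individual` are illustrative assumptions.

```python
import random

def random_individual(sentence_lengths, max_words, rng):
    """Binary chromosome: gene i is 1 if sentence i is selected for the summary.

    Assumption: genes are switched on in random order as long as the
    summary stays within max_words.
    """
    genes = [0] * len(sentence_lengths)
    order = list(range(len(sentence_lengths)))
    rng.shuffle(order)
    words = 0
    for i in order:
        if words + sentence_lengths[i] <= max_words:
            genes[i] = 1
            words += sentence_lengths[i]
    return genes

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
lengths = [10, 20, 30, 40]  # words per sentence (toy values)
individual = random_individual(lengths, max_words=50, rng=rng)
```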

Selection Operator. Roulette selection chooses individuals from a population according to their aptitude and is intended to favor stronger individuals (those with a higher value of the fitness function) [36].
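Roulette (fitness-proportionate) selection can be sketched as below. This is the textbook form of the operator, assuming non-negative fitness values; the function name `roulette_select` is hypothetical and the paper's exact implementation may differ.

```python
import random

def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    pick = rng.uniform(0, total)
    cumulative = 0.0
    for individual, fit in zip(population, fitnesses):
        cumulative += fit
        if pick < cumulative:
            return individual
    # Fallback for rounding at the upper end or an all-zero fitness list.
    return population[-1]

rng = random.Random(42)
population = ["a", "b", "c"]
chosen = roulette_select(population, [1.0, 3.0, 6.0], rng)
```

Individuals with a larger share of the total fitness occupy a larger slice of the wheel, so they are chosen more often without excluding weaker ones entirely.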

Crossover Operator. This operator has been used in [17]. It was designed for ATS, where each individual represents a selection of sentences. The crossover process randomly selects genes from the parents, considering only genes with a value of 1, and assigns this value to the new individual. Genes with a value of 1 in both parents are more likely to be chosen. To meet the length condition of the summary, the number of words is counted each time a gene is selected to be part of the new individual [36].
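One possible reading of this crossover is sketched below. It assumes that the active genes of both parents are pooled (so a gene active in both parents appears twice and tends to be drawn earlier) and that genes are added until the word limit is reached. The pooling scheme and the name `crossover` are assumptions for illustration, not the operator as defined in [17].

```python
import random

def crossover(parent_a, parent_b, sentence_lengths, max_words, rng):
    """Build a child from genes that are 1 in at least one parent."""
    # Genes set in both parents appear twice in the pool, so their first
    # occurrence tends to come earlier after shuffling.
    pool = [i for i, g in enumerate(parent_a) if g == 1]
    pool += [i for i, g in enumerate(parent_b) if g == 1]
    rng.shuffle(pool)
    child = [0] * len(parent_a)
    words = 0
    for i in pool:
        # Count words each time a gene is considered, to respect the length limit.
        if child[i] == 0 and words + sentence_lengths[i] <= max_words:
            child[i] = 1
            words += sentence_lengths[i]
    return child

rng = random.Random(1)
child = crossover([1, 0, 1, 0], [1, 1, 0, 0], [10, 20, 30, 40], max_words=35, rng=rng)
```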

Mutation Operator. This operator performs the mutation according to a certain probability, as described in [36, 17].

Stop Condition. The stop condition applied to terminate the GA is the maximum number of generations. For the execution of the GA, the number of words that the summary must have has to be taken into account. In this case, lengths of 200 and 400 words were used.

The number of individuals and the number of generations are automatically calculated by the GA through Equations 1 and 2, respectively. The number of individuals is determined by the number of sentences that the document contains, through the following equation [36]: $\mathit{NumberIndividuals} = \mathit{NumberSentences} \times 2$ (1)

The number of generations is calculated through the following equation: $\mathit{NumberGenerations} = 4 \times 15 \times \mathit{NumberSentences}$ (2)
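Equations 1 and 2 translate directly into code. The sketch below only restates them; the function name `ga_budget` is hypothetical.

```python
def ga_budget(number_sentences):
    """Population size (Equation 1) and number of generations (Equation 2)."""
    number_individuals = number_sentences * 2
    number_generations = 4 * 15 * number_sentences
    return number_individuals, number_generations

# Example: a merged collection with 100 sentences.
individuals, generations = ga_budget(100)
```

For a collection of 100 sentences this gives 200 individuals and 6000 generations, so the search effort grows linearly with the size of the merged document.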

Fitness Function. The fitness function was used in the methods of [17, 36]. It evaluates two features: sentence position and coverage. The main idea is that if all the sentences (see Equation 3) had the same importance, a line could be drawn through the points formed by those coordinates, as shown in Equation 4. $\{X_{1}, X_{2}, X_{3}, \dots, X_{n}\}$ (3) $\{(X_{1}, y), (X_{2}, y), (X_{3}, y), \dots, (X_{n}, y)\}$ (4)

To assign more importance to the first sentences, the first sentence would be given the importance X_n, the second the importance X_{n-1}, and so on.

Since the placement of the line indicates its importance, the midpoint of that line can be used to determine the slope of the line, thus softening the significance of sentences. This makes it possible to know how important a sentence is relative to the following one; for this, the general equation of the slope of a line can be used.

For a text with n sentences, if the sentence i is selected for the summary then its relevance is defined as t (i - x) + x, where x = 1 + (n - 1)/2 and t is the slope to be discovered. To normalize the measurement of the position of the sentence (SentenceImportance), the importance of the first k sentences is calculated, where k is the number of selected sentences. Then the formula to calculate the significance of the first sentences would be as follows:

$\mathit{SentenceImportance} = \frac{\sum_{|c_{i}| = 1}^{n} \left( t (i - x) + x \right)}{\sum_{j = 1}^{k} \left( t (j - x) + x \right)}, \quad x = 1 + \frac{(n - 1)}{2}$ (5)
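The position feature can be sketched as follows, using the relevance t(i - x) + x given in the text for each selected sentence and normalizing by the relevance of the first k sentences. Positions are 1-based, and the name `sentence_importance` is hypothetical. Under the formula as written, earlier sentences score higher only when t is negative; how the reported slope values map onto this sign convention is not spelled out here, so the sketch simply takes t as given.

```python
def sentence_importance(genes, t):
    """Normalized position score of the sentences selected by a binary chromosome."""
    n = len(genes)
    x = 1 + (n - 1) / 2  # midpoint of the positions 1..n
    selected = [i for i, g in enumerate(genes, start=1) if g == 1]
    k = len(selected)
    if k == 0:
        return 0.0
    numerator = sum(t * (i - x) + x for i in selected)
    # Normalizer: relevance of the first k sentences of the text.
    denominator = sum(t * (j - x) + x for j in range(1, k + 1))
    return numerator / denominator

first_two = sentence_importance([1, 1, 0, 0], t=-0.5)  # selects the first k sentences
last_two = sentence_importance([0, 0, 1, 1], t=-0.5)   # selects the last two sentences
```

Selecting exactly the first k sentences gives a score of 1 by construction, which is why this term alone would push the GA toward lead sentences.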

However, this is not the only value that should govern the GA, since the GA would then try to select only the first sentences. It is also necessary to evaluate that the summary contains different ideas, that is, that it is not repetitive, but at the same time contains important words (Precision_Recall). To measure both aspects, the fitness function sums the frequencies of the n-grams that the summary contains; to weigh how significant the obtained n-grams are, the same summation is made considering the original text, in this case using only the most frequent n-grams according to the minimum number of words. These weightings are called Precision and Recall. Precision is defined as the sum of the frequencies of the n-grams considering the original text, expressed as follows: $\sum \text{Original text frequency}$ (6)

Recall is defined as the sum of the frequencies of the different n-grams of the summary: $\sum \text{Frequency Summary}$ (7)

Therefore, the formula for obtaining Precision_Recall is: $\mathit{Precision\_Recall} = \frac{\sum \text{Original text frequency}}{\sum \text{Frequency Summary}}$ (8)

Finally, the value of the fitness function is obtained by multiplying Precision_Recall by the sentence importance and by 1000: $FA = \mathit{Precision\_Recall} \times \mathit{SentenceImportance} \times 1000$ (9)
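The coverage term and the final fitness value can be sketched under one possible reading of Equations 6 to 9. The assumptions are: the numerator sums the original-text frequencies of the distinct summary n-grams that are among the `top_m` most frequent n-grams of the original text, and the denominator sums the n-gram frequencies inside the summary. The parameter `top_m` and both function names are hypothetical.

```python
from collections import Counter

def precision_recall(summary_ngrams, original_ngrams, top_m):
    """Ratio of Equation 8 under the assumptions stated above."""
    original_freq = Counter(original_ngrams)
    frequent = dict(original_freq.most_common(top_m))  # most frequent n-grams only
    summary_freq = Counter(summary_ngrams)
    numerator = sum(frequent.get(g, 0) for g in summary_freq)  # Equation 6
    denominator = sum(summary_freq.values())                   # Equation 7
    return numerator / denominator if denominator else 0.0

def fitness(precision_recall_value, sentence_importance_value):
    """Equation 9: product of both features, scaled by 1000."""
    return precision_recall_value * sentence_importance_value * 1000

original = ["a", "a", "a", "b", "b", "c"]  # stand-ins for n-grams of the original text
summary = ["a", "b"]                        # stand-ins for n-grams of a candidate summary
pr = precision_recall(summary, original, top_m=2)
fa = fitness(pr, 1.0)
```

A summary that repeats the same n-grams raises the denominator without raising the numerator, so repetitive selections are penalized while frequent n-grams of the source are rewarded.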

Fig. 3 shows every step of the proposed method: input, text model, processing with the GA, summary construction, and the final summary.

Fig. 3

Proposed Method [17, 36].

5 Experimentation and results

We test the proposed EMDTS method using the datasets provided in DUC01 and DUC02 [43]. Traditionally, text summarization evaluation involves human judgments of different quality metrics, for example, coherence, conciseness, grammaticality, readability, and content [33]. We use ROUGE (see Footnotes) to evaluate the proposed method, since it is widely applied by DUC for performance evaluation [27]. It measures the performance of a summary by counting the unit overlaps between the candidate summary and a set of reference summaries.
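The idea of counting unit overlaps can be illustrated with a simplified ROUGE-N computation against a single reference. This is only a sketch of the principle: it is not the ROUGE toolkit used in the experiments, it ignores stemming, stop-word handling, and multiple references, and the function name `rouge_n` is hypothetical.

```python
from collections import Counter

def rouge_n(candidate_terms, reference_terms, n=1):
    """Return (precision, recall, F-measure) from clipped n-gram overlap counts."""
    def grams(terms):
        return Counter(tuple(terms[i:i + n]) for i in range(len(terms) - n + 1))

    cand, ref = grams(candidate_terms), grams(reference_terms)
    overlap = sum((cand & ref).values())  # each n-gram counted at most min(cand, ref) times
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f_measure

p, r, f = rouge_n("the cat sat".split(), "the cat ran away".split(), n=1)
```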

5.1 Datasets

The DUC01 and DUC02 datasets are used; they are benchmark data sets of DUC for the empirical evaluation of automatic summarization. Table 1 gives a brief description of the datasets [30].

Table 1
Description of Datasets [30]

Dataset	Features	Description
DUC01	Number of documents	309
	Number of collection of documents	30
	Documents in each cluster	About 10
	Summary length	200, 400 words
DUC02	Number of documents	567
	Number of collection of documents	59
	Documents in each cluster	From 5 to 14
	Summary length	200, 400 words

5.2 Description of the state-of-the-art methods

In this section, we describe the state-of-the-art methods and heuristics that we then compare. We consider the state-of-the-art methods that use the same methodology described in Section 3.2.

WFS-NMF: It extends the document clustering model based on non-negative matrix factorization and provides a useful framework for weighting different terms and documents.

BSTM [57]: Bayesian Sentence-based Topic Models (BSTM) explicitly model the probability distributions of selecting sentences given topics and provide a principled way for the summarization task.

LexPageRank [57]: LexPageRank computes sentence importance based on the concept of centrality in a graph representation of sentences. In this model, a connectivity matrix based on intra-sentence cosine similarity is used as the adjacency matrix of the graph representation of sentences.

NMF [57]: Considers a selection of theoretical and empirical features on a document-sentence matrix, and selects the sentences associated with the highest weights to form summaries.

TE + WF [31]: This method applies recognition of textual entailment as a step prior to the use of word frequency in the summarization process. TE (Textual Entailment) consists of using textual entailment in text summarization, which has been considered a useful approach for obtaining a preliminary summary in which the sentences are not associated with any other sentence of the document. WF (Word Frequency): the sentences that contain the most frequent words of the source document (without stop-words) are considered for the final summary.

NeATS [29]: An extraction-based multi-document summarization system. It leverages techniques proven effective in single-document summarization, such as term frequency, sentence position, stigma words, and a simplified version of Maximum Marginal Relevance, to select and filter content.

CBA [7]: A Clustering Based Approach to Creating Multi-Document Summaries. The specific clustering method used was a combination of hierarchical and non-hierarchical methods (k-means). To determine which sentences should be selected to be included in the summary and the order in which they should appear, clusters were ranked by their similarity (using the cosine similarity measure) to the collection term frequency vector.

Baldwin [5]: This method is based on interesting words and interesting phrases; this approach uses a background corpus TREC to select them.

5.3 Description of heuristics

Topline [46]: It is a heuristic that allows obtaining the maximum value that any state-of-the-art method can achieve due to the lack of concordance between evaluators since it selects sentences considering one or several gold-standard summaries.

Baseline-first [46]: Takes the first sentence of the 1st, 2nd, 3rd, etc. document of the collection, in chronological sequence, until the target summary size is reached.

Baseline-random [46, 51]: It is the state-of-the-art heuristic that randomly selects sentences to present them as an extractive summary to the user.

Baseline-first-document: Takes the first sentences of the 1st document of a document collection until the target summary size is reached.

Lead Baseline [31, 46]: It is a heuristic that takes the first 200 or 400 words of the last document in the collection, where documents are assumed to be chronologically ordered.
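Two of the heuristics above can be sketched in a few lines. The sketch assumes each document is a list of sentence strings, that documents are already in chronological order, that words are whitespace-separated tokens, and that Baseline-first stops at the first sentence that would exceed the word limit; the function names and these stopping details are illustrative assumptions.

```python
def baseline_first(documents, max_words):
    """Take the first sentence of each document in order, then the second, and so on."""
    summary, words = [], 0
    depth = max((len(doc) for doc in documents), default=0)
    for position in range(depth):
        for doc in documents:
            if position < len(doc):
                length = len(doc[position].split())
                if words + length > max_words:
                    return summary  # assumption: stop at the first overflow
                summary.append(doc[position])
                words += length
    return summary

def lead_baseline(documents, max_words):
    """Take the first max_words words of the last (most recent) document."""
    last_doc_words = " ".join(documents[-1]).split()
    return " ".join(last_doc_words[:max_words])

docs = [["a b c.", "d e."], ["f g.", "h i j k."]]
first = baseline_first(docs, max_words=6)
lead = lead_baseline(docs, max_words=3)
```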

5.4 Experimentation results DUC02

In this section, the results that were obtained through the proposed method in DUC02 dataset are presented.

Table 2 shows the parameters that were used to get the results of the tasks of 200 and 400 words.

Table 2
Parameters of the proposed GA

Feature	Description
Selection operator	Roulette
Text representation	Bigrams
Elitism	3
Value of slope	0.72

To determine the performance of the proposed method with respect to state-of-the-art methods and heuristics, the advance was calculated as follows. Randomly choosing sentences (baseline-random) is taken as the reference lower bound, so its advance is set to 0%. The best possible performance, topline, is considered as 100%. Using baseline-random and topline, it is possible to rescale the F-measure results to see the advance relative to the worst and the best results.

The comparison of results with the state-of-the-art methods and heuristics for 200 words, in F-measure of ROUGE-1 and advance, is presented in Table 3. There are 5 unsupervised methods and 1 supervised method, and 5 heuristics were calculated (topline, baseline-first, baseline-first-document, baseline-random, and lead baseline).

Table 3 shows that there is a wide margin between the best selection method (Baseline-first) and the best possible outcome (Topline): the difference is 67.10%, which means that there are still efforts to be made in this task. It is also observed that no state-of-the-art method has managed to surpass the Baseline-first heuristic.
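The advance described above is a linear rescaling between baseline-random (0%) and topline (100%). The sketch below writes it out; the function name `advance` is hypothetical, and the example values are the ROUGE-1 scores reported for DUC02 at 200 words.

```python
def advance(score, baseline_random, topline):
    """Rescale a ROUGE score so that baseline-random maps to 0% and topline to 100%."""
    return 100.0 * (score - baseline_random) / (topline - baseline_random)

# DUC02, 200 words, ROUGE-1: WFS-NMF = 49.900, baseline-random = 38.742, topline = 75.163.
wfs_nmf_advance = advance(49.900, 38.742, 75.163)
```

This evaluates to roughly 30.6%, in line with the value listed for WFS-NMF; a method scoring below baseline-random gets a negative advance, as happens for some heuristics in the tables.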

Table 3

Comparison of results to other methods and heuristics for 200 words (Rouge-1)

Type of Method	Method	Rouge-1	Advance (%)
Unsupervised Methods	WFS-NMF [56]	49.900	30.63%
	BSTM [57]	48.812	27.64%
	Proposed	48.455	26.66%
	LexPageRank [57]	47.963	25.31%
	NMF [57]	44.587	16.04%
Supervised Methods	TE + TS [31]	41.811	8.42%
Heuristics	Topline [46]	75.163	100.00%
	Baseline-first [46]	50.726	32.90%
	Baseline-first-document	40.500	1.75%
	Baseline-random [46]	38.742	0.00%
	Lead baseline	38.195	–1.50%

Table 4 shows the comparison of results with the state-of-the-art methods and heuristics for 200 words, in F-measure of ROUGE-2 and advance. There are 5 unsupervised methods and 1 supervised method, and 5 heuristics were calculated (topline, baseline-first, baseline-first-document, baseline-random, and lead baseline).

Table 4

Comparison of results to other methods and heuristics for 200 words (Rouge-2)

Type of Method	Method	Rouge-2	Advance (%)
Unsupervised Methods	WFS-NMF [56]	25.800	28.55%
	BSTM [57]	24.571	26.39%
	LexPageRank [57]	22.949	23.55%
	Proposed	21.765	21.47%
	NMF [57]	16.280	11.84%
Supervised Methods	TE + TS [31]	13.466	6.91%
Heuristics	Topline [40]	66.512	100.00%
	Baseline-first [40]	36.979	48.17%
	Baseline-first-document	13.648	7.23%
	Lead baseline	11.68	3.77%
	Baseline-random [40]	9.528	0.00%

For the task of 400-word summaries, we did not find state-of-the-art methods or heuristics to compare with. Therefore, we calculated the heuristics (topline, baseline-first, baseline-random, baseline-first-document, and lead baseline).

Table 5 shows the comparison of results to the proposed method and heuristics for 400 words, in F-measure of ROUGE-1 and Advance.

Table 5

Comparison of results to other methods and heuristics for 400 words (Rouge-1)

Type of Method	Method	Rouge-1	Advance (%)
Unsupervised Method	Proposed	56.636	27.85%
Heuristics	Topline [46]	78.836	100.00%
	Baseline-first [46]	58.771	34.79%
	Baseline-random [46]	48.066	0.00%
	Baseline-first-document	44.437	–11.79%
	Lead baseline	42.518	–18.03%

Table 6 shows the comparison of results to the proposed method and heuristics for 400 words, in F-measure of ROUGE-2 and Advance.

Table 6

Comparison of results to other methods and heuristics for 400 words (Rouge-2)

Type of Method	Method	Rouge-2	Advance (%)
Unsupervised Method	Proposed	28.679	26.90%
Heuristics	Topline [46]	63.255	100.00%
	Baseline-first [46]	34.772	39.78%
	Baseline-first-document	16.461	1.07%
	Baseline-random [46]	15.951	0.00%
	Lead baseline	14.221	–3.65%

Tables 5 and 6 show that there is a wide margin between the best selection method (Baseline-first) and the best possible outcome (Topline). The difference is 65.21% for Rouge-1 and 60.22% for Rouge-2. We hope that this experiment serves as a reference for future works.

To unify all the performances obtained from Rouge-1 and Rouge-2 for 200 and 400 words, Table 7 presents them in a unified ranking computed with Equation 10, which has been used in [2, 45].

Table 7

Ranking of the state-of-the-art methods (DUC02)

Method	R _r						Resultant
	1	2	3	4	5	6	Rank
Proposed	2	0	2	0	0	0	3.333
WFS-NMF [56]	2	0	0	0	0	0	2.000
BSTM [57]	0	2	0	0	0	0	1.667
LexPageRank [57]	0	0	1	1	0	0	1.167
NMF [57]	0	0	0	0	2	0	0.667
TE + TS [31]	0	0	0	0	0	2	0.333

$Rank(method) = \sum_{r = 1}^{6} \frac{(6 - r + 1) R_{r}}{6}$ (10)

R_r refers to the number of times that the method appears in the r-th position. The number 6 represents the total number of methods involved in the comparisons.
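The resultant rank of Equation 10 can be computed from the list of position counts R_1, ..., R_m, where m is the number of compared methods. The function name `resultant_rank` is hypothetical; the example rows are taken from Table 7.

```python
def resultant_rank(position_counts):
    """Weighted rank: position r contributes (m - r + 1) * R_r / m, with m methods."""
    m = len(position_counts)
    weighted = sum(
        (m - r + 1) * count for r, count in enumerate(position_counts, start=1)
    )
    return weighted / m

proposed = resultant_rank([2, 0, 2, 0, 0, 0])     # row "Proposed" of Table 7
lexpagerank = resultant_rank([0, 0, 1, 1, 0, 0])  # row "LexPageRank" of Table 7
```

First places receive the full weight m/m = 1, and each lower position loses 1/m, so a method is rewarded both for ranking high and for being ranked in more comparisons.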

5.5 Experimentation results DUC01

In this section, the results that were obtained through the proposed method in DUC01 dataset are presented.

Table 8 shows the parameters that were used to get the results of the tasks of 200 and 400 words.

Table 8
Parameters of the proposed GA

Task	Feature	Description
200 words	Selection operator	Roulette
	Text representation	Bigrams
	Elitism	3
	Value of slope	0.80
400 words	Selection operator	Roulette
	Text representation	Bigrams
	Elitism	3
	Value of slope	0.70

The advance was calculated as explained in Section 5.4. The comparison of results with the state-of-the-art methods and heuristics for 200 words, in F-measure of ROUGE-1 and advance, is presented in Table 9. There are 2 unsupervised methods and 1 supervised method, and 5 heuristics were calculated (topline, baseline-first, baseline-random, baseline-first-document, and lead baseline).

Table 9 shows the results of the state-of-the-art methods and heuristics for the task of 200 words in F-measure of Rouge-1, together with the advance.

Table 9

Comparison of results to other methods and heuristics for 200 words (Rouge-1)

Type of Method	Method	Rouge-1	Advance (%)
Unsupervised Methods	Proposed	40.224	31.50%
	NeATS [29]	37.883	19.54%
	CBA [7]	34.108	0.26%
Supervised Methods	Baldwin [5]	35.890	9.36%
Heuristics	Topline	53.630	100.00%
	Baseline-first	39.280	26.68%
	Baseline-first-document	35.472	7.22%
	Baseline-random	34.057	0.00%
	Lead Baseline	34.009	–0.24%

Table 10 shows the results of the state-of-the-art methods and heuristics for the task of 200 words in F-measure of Rouge-2, together with the advance.

Table 10

Comparison of results to other methods and heuristics for 200 words (Rouge-2)

Type of Method	Method	Rouge-2	Advance (%)
Unsupervised Methods	Proposed	10.306	29.00%
	NeATS [29]	7.674	23.93%
	CBA [7]	5.525	1.63%
Supervised Methods	Baldwin [5]	6.883	9.40%
Heuristics	Topline	22.703	100.00%
	Baseline-first	9.339	23.47%
	Baseline-first-document	7.225	11.36%
	Lead Baseline	6.195	5.46%
	Baseline-random	5.240	0.00%

Tables 9 and 10 show a wide margin between the best method (proposed) and the best possible outcome (Topline). The difference is 68.50% for Rouge-1 and 71.00% for Rouge-2.

Table 11 shows the results of the state-of-the-art methods and heuristics for the task of 400 words in F-measure of Rouge-1, together with the advance.

Table 11

Comparison of results to other methods and heuristics for 400 words (Rouge-1)

Type of Method	Method	Rouge-1	Advance (%)
Unsupervised Methods	Proposed	47.619	29.56%
	NeATS [29]	45.551	18.42%
	CBA [7]	41.259	–4.69%
Supervised Methods	Baldwin [5]	43.812	9.05%
Heuristics	Topline	60.691	100.00%
	Baseline-first	47.198	27.30%
	Baseline-random	42.131	0.00%
	Baseline-first-document	41.161	–5.22%
	Lead Baseline	39.961	–11.69%

Table 12 shows the results of the state-of-the-art methods and heuristics for the task of 400 words in F-measure of Rouge-2, together with the advance.

Table 12

Comparison of results to other methods and heuristics for 400 words (Rouge-2)

Type of Method	Method	Rouge-2	Advance (%)
Unsupervised Methods	Proposed	13.668	26.85%
	NeATS [29]	11.722	16.93%
	CBA [7]	7.546	–4.34%
Supervised Methods	Baldwin [5]	10.613	11.28%
Heuristics	Topline	28.021	100.00%
	Baseline-first	13.885	27.96%
	Baseline-first-document	9.943	7.87%
	Lead Baseline	8.557	0.81%
	Baseline-random	8.398	0.00%

Tables 11 and 12 show that there is a wide margin between the best method (proposed) and the best possible outcome (Topline): the difference is 70.44% for Rouge-1. For Rouge-2, the difference between the best selection method (Baseline-first) and the best possible outcome (Topline) is 72.04%.

To unify all the performances obtained from Rouge-1 and Rouge-2 for 200 and 400 words on DUC01, Table 13 presents them in a unified ranking computed with Equation 11, which has been used in [2, 45].

Table 13

Ranking of the state-of-the-art methods (DUC01)

Method	R _r				Resultant
	1	2	3	4	Rank
Proposed	4	0	0	0	4.000
NeATS [29]	0	4	0	0	3.000
CBA [7]	0	0	0	4	1.000
Baldwin [5]	0	0	4	0	2.000

$Ran (method) = \sum_{r = 1}^{4} \frac{(4 - r + 1) R_{r}}{4}$ (11)

$R_r$ is the number of times the method occupies the $r$-th position. The number 4 is the total number of methods involved in the comparison.
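As a worked check of Equation 11, the sketch below recomputes the resultant rank from the position counts in Table 13. The function name `resultant_rank` and the list layout (index 0 holds the count for position 1) are illustrative choices, and the number of methods is taken to be the length of the list, which equals 4 here.

```python
def resultant_rank(position_counts):
    """Equation 11: sum over r of (n - r + 1) * R_r / n.

    position_counts[r - 1] is R_r, the number of times a method finished
    in position r; n is the number of methods being compared.
    """
    n = len(position_counts)
    weighted = sum(
        (n - r + 1) * count
        for r, count in enumerate(position_counts, start=1)
    )
    return weighted / n


# Position counts (R_1..R_4) copied from Table 13.
TABLE_13 = {
    "Proposed": [4, 0, 0, 0],
    "NeATS": [0, 4, 0, 0],
    "CBA": [0, 0, 0, 4],
    "Baldwin": [0, 0, 4, 0],
}

ranks = {method: resultant_rank(counts) for method, counts in TABLE_13.items()}

for method, rank in ranks.items():
    print(f"{method:10s} {rank:.3f}")
```

With these counts the function reproduces the last column of Table 13: 4.000 for Proposed, 3.000 for NeATS, 1.000 for CBA, and 2.000 for Baldwin.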

6 Conclusions

In this paper, we proposed a method for Extractive Multi-Document Text Summarization (EMDTS) based on a Genetic Algorithm (GA). The fitness function considers two features: sentence position and coverage. We proposed a binary coding representation together with selection, crossover, and mutation operators. Two different tasks were tested for each of the document collections of the DUC01 and DUC02 datasets.

We tested different configurations of the most frequently used methodology for generating unsupervised EMDTS summaries. Moreover, we calculated several heuristics: topline, baseline, baseline-random, and lead baseline. The results obtained provide a point of reference for future research.

For future work, we will use more language-independent features, such as redundancy reduction, sentence length, and similarity with the title [53]. We will also consider other text models, such as sn-grams (syntactic n-grams) [50] and MFSs (maximal frequent sequences) [16, 51].

Footnotes

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) toolkit, version 1.5.5.

References

1. Aguilar, Resolución computacional de un problema de optimización combinatorio hibrido, (2017).
2. Alguliev R.M., Aliguliyev R.M. and Hajirahimova M.S., GenDocSum+MCLR: Generic document summarization based on maximum coverage and less redundancy, Expert Syst Appl 39 (2012), 12460–12473.
3. Alguliev R.M., Aliguliyev R.M. and Isazade N.R., Multiple documents summarization based on evolutionary optimization algorithm, Expert Syst Appl 40 (2013), 1675–1689.
4. Bakkar, Al-Hamad and Bakar, Multi-document Summarizer, in: Springer, Cham, (2018), pp. 461–478.
5. Baldwin and Ross, Baldwin language technology’s DUC summarization system, Proc 1st Doc Underst Conf, New Orleans, LA, (2001).
6. Bean J.C., Genetic Algorithms and Random Keys for Sequencing and Optimization, ORSA J Comput 6 (1994), 154–160.
7. Boros, Kantor P.B. and Neu D.J., A Clustering Based Approach to Creating Multi-Document Summaries, 2001.
8. Cao F.L.S.Z.M. and Ziqiang, Ranking with Recursive Neural Networks and Its Application to Multi-Document Summarization, (2015), 7.
9. Carbonell and Goldstein, The use of MMR, diversity-based reranking for reordering documents and producing summaries, in: SIGIR 98, ACM Press, New York, New York, USA, (1998), pp. 335–336.
10. César Vélez and Alejandro Montoya, Metaheurísticos: Una alternativa para la solución de problemas combinatorios en administración de operaciones, (2007).
11. Coello C.A.C., Introducción a la Computación Evolutiva (Notas de Curso), 2004.
12. Das and Martins A.F.T., A Survey on Automatic Text Summarization, 2007.
13. K.L. and Swamy M.N.S., A Survey on Automatic Text Summarization, 2007.
14. Erkan and Radev D.R., LexRank: Graph-based Lexical Centrality as Salience in Text Summarization, J Artif Intell Res 22 (2004), 457–479.
15. Ferreira, de Souza Cabral, Freitas, Lins R.D., de França Silva, Simske S.J. and Favaro, A multi-document summarization system based on statistics and linguistic treatment, Expert Syst Appl 41 (2014), 5780–5787.
16. García-Hernández R.A., Desarrollo de Algoritmos Para el descubrimiento de patrones secuenciales máximales, Instituto Nacional de Astrofísica, Óptica y Electrónica, 2007.
17. García-Hernández R.A. and Ledeneva, Single Extractive Text Summarization Based on a Genetic Algorithm, LNCS 7914 (2013), 374–383.
18. García-Hernández R.A., Martínez-Trinidad J.F. and Carrasco-Ochoa J.A., A New Algorithm for Fast Discovery of Maximal Sequential Patterns in a Document Collection, (2006), 514–523.
19. Gupta, Kaur, Bajaj and Khanna, Intelligent Systems and Applications, Intell Syst Appl 4 (2019), 39–51.
20. Joshi, Fidalgo, Alegre and Fernández-Robles, SummCoder: An unsupervised framework for extractive text summarization based on deep auto-encoders, Expert Syst Appl 129 (2019), 200–215.
21. Jung, Datta and Segev, Multi-document summarization using evolutionary multi-objective optimization, in: Proc Genet Evol Comput Conf Companion - GECCO ’17, ACM Press, New York, New York, USA, (2017), pp. 31–32.
22. Kaushik and Naithani, A Comprehensive Study of Text Mining Approach, 2016.
23. Kumar Bharti, Sathya Babu and Pradhan, Automatic Keyword Extraction for Text Summarization in Multi-document e-Newspapers Articles, 2017.
24. Ledeneva, García-Hernández, Vazquez F.A., Osorio and de Jesús, Experimenting with Maximal Frequent Sequences for Multi-Document Summarization, 45 (2010), 233–244.
25. Ledeneva Y.N. and García-Hernández R.A., Generación automática de resúmenes - Retos, propuestas y experimentos, (2017).
26. Ledeneva Y.N. and Gelbukh, Automatic Language-Independent Detection of Multiword Descriptions for Text Summarization, Instituto Politécnico Nacional, 2013.
27. Lin C.-Y., ROUGE: A Package for Automatic Evaluation of Summaries, 34 (2011), 1213–1220.
28. Lin C.-Y. and Hovy, From Single to Multi-document Summarization: A Prototype System and its Evaluation, n.d.
29. Lin C.-Y. and Hovy, From single to multi-document summarization, in: Proc 40th Annu Meet Assoc Comput Linguist - ACL ’02, (2002), pp. 457.
30. Lin and Bilmes, Multi-document summarization via budgeted maximization of submodular functions, (n.d.) 9.
31. Lloret, Ferrández, Muñoz and Palomar, Incorporating Textual Entailment Recognition in Single- and Multi-Document Summarization Systems, 2008.
32. Mandal, Singh G.K. and Pal, A Constraints Driven PSO Based Approach for Text Summarization, J Informatics Math Sci 10 (2018), 703–714.
33. Mani, Automatic Summarization, John Benjamins Publishing Company, Amsterdam, 2001.
34. Mani and Bloedorn, Multi-document Summarization by Graph Search and Matching, (1997).
35. Mao, Yang, Huang, Liu and Li, Extractive summarization using supervised and unsupervised learning, Expert Syst Appl 133 (2019), 173–181.
36. Matías M.G.A., Generación Automática De Resúmenes Usando Algoritmos Genéticos, Universidad Autónoma del Estado de México, 2013.
37. Mcdonald, A Study of Global Inference Algorithms in Multi-Document Summarization, 2007.
38. Mendoza, Bonilla, Noguera, Cobos and León, Extractive single-document summarization based on genetic operators and guided local search, 2014.
39. Mendoza, Cobos, León, Lozano, Rodríguez and Herrera-Viedma, A New Memetic Algorithm for Multi-document Summarization Based on CHC Algorithm and Greedy Search, in: Springer, Cham, (2014), pp. 125–138.
40. Mihalcea and Tarau, A Language Independent Algorithm for Single and Multiple Document Summarization, in: Proc IJCNLP 2005, 2nd Int Join Conf Nat Lang Process, (2005), pp. 19–24.
41. Nayeem M.T. and Chali, Extract with Order for Coherent Multi-Document Summarization, (2017).
42. Nguyen M.-T., Nguyen T.-H.-N., Nguyen H.-D. and Nguyen V.-H., Learning to Estimate the Importance of Sentences for Multi-Document Summarization, in: 2018 10th Int Conf Knowl Syst Eng, IEEE, (2018), pp. 31–36.
43. Over and Dang, DUC in context, Inf Process Manag 43 (2007), 1506–1520.
44. Park, Lee J.H., Kim D.H. and Ahn C.M., Multi-document summarization based on cluster using non-negative matrix factorization, in: Lect Notes Comput Sci (Including Subser Lect Notes Artif Intell Lect Notes Bioinformatics), (2007), pp. 761–770.
45. Rojas S.J., Ledeneva and García-Hernández R.A., Calculating the significance of automatic extractive text summarization using a genetic algorithm, J Intell Fuzzy Syst 35 (2018), 293–304.
46. Rojas Simón, Ledeneva and García Hernández R.A., Calculating the Upper Bounds for Multi-Document Summarization using Genetic Algorithms, Comput y Sist 22 (2018).
47. Saggion and Poibeau, Automatic Text Summarization: Past, Present and Future, in: Springer, Berlin, Heidelberg, (2013), pp. 3–21.
48. Sastry, Goldberg and Kendall, Chapter 4 Genetic Algorithms, 2005.
49. Satpute M.N., Dong, Wu and Du D.-Z., Multi-Document Extractive Summarization as a Non-linear Combinatorial Optimization Problem, in: (2019), pp. 295–308.
50. Sidorov, N-gramas sintácticos no-continuos, Polibits (2013), 69–78.
51. Sidorov, Syntactic n-grams in Computational Linguistics, 2019.
52. Stein G.C., Bagga and Wise G.B., Multi-Document Summarization: Methodologies and Evaluations, (2000), 16–18.
53. Vázquez, García-Hernández R.A. and Ledeneva, Sentence features relevance for extractive text summarization using genetic algorithms, J Intell Fuzzy Syst 35 (2018), 353–365.
54. Villatoro-Tello, Villaseñor-Pineda and Montes-y-Gómez, Using Word Sequences for Text Summarization, in: Springer, Berlin, Heidelberg, (2006), pp. 293–300.
55. Villatoro-Tello, Villaseñor-Pineda, Montes-y-Gómez and Pinto-Avendaño, Multi-Document summarization based on locally relevant sentences, in: 8th Mex Int Conf Artif Intell - Proc Spec Sess MICAI 2009, IEEE, (2009), pp. 87–91.
56. Wang, Li and Ding, Weighted Feature Subset Non-Negative Matrix Factorization and its Applications to Document Understanding, IEEE Int Conf Data Min, (2010).
57. Wang, Zhu, Li and Gong, Multi-document summarization using sentence-based topic models, in: ACL AFNLP, (2010), pp. 297.


Table 2

Parameters of the proposed GA

Feature	Description
Selection operator	Roulette
Text representation	Bigrams
Elitism	3
Value of slope	0.72

Table 8

Parameters of the proposed GA

Task	Feature	Description
200 words	Selection operator	Roulette
	Text representation	Bigrams
	Elitism	3
	Value of slope	0.80
400 words	Selection operator	Roulette
	Text representation	Bigrams
	Elitism	3
	Value of slope	0.70