Abstract
The task of Extractive Multi-Document Text Summarization (EMDTS) aims at building a short summary with essential information from a collection of documents. In this paper, we propose an EMDTS method using a Genetic Algorithm (GA). The fitness function considering two unsupervised text features: sentence position and coverage. We propose the binary coding representation, selection, crossover, and mutation operators. We test the proposed method on the DUC01 and DUC02 data set, four different tasks (summary lengths 200 and 400 words), for each of the collections of documents (in total, 876 documents) are tested. Besides, we analyze the most frequently used methodologies to summarization. Moreover, different heuristics such as topline, baseline, baseline-random, and lead baseline are calculated. In the results, the proposed method achieves to improve the state-of-art results.
Introduction
The diffusion of the Internet has led to the exchange of digital texts, becoming an efficient communication channel. The user can obtain and share information almost instantly; the Internet contains billions of documents and is growing at an exponential rate [22, 31], resulting in an inevitable problem of information overload. This situation affects the efficiency of the human brain, the stress produced by receiving more information which can understand, interferes with decision-making abilities. It causes reading fatigue and loss of time, which affects the productivity of users [15, 24].
To manage the problem of information overload can be used Automatic Text Summarization (ATS). The general process of summarization consists of rewriting the full text into a short version [34]. ATS is a task of Natural Language Processing (NLP). ATS consists of selecting the essential units, which could be paragraphs, sentences, part of sentences or keywords from a document or collection of documents using the state-of-the-art methods or commercial systems.
ATS methods can be abstractive or extractive. In the abstractive text summarization, the summaries are composed of fusing and generating new text that describes the essential facts [23]. In the extractive text summarization, the sentences or other parts of documents are extracted and concatenated to compose a summary [25, 31].
Depending on the number of documents, summarization techniques can be classified into two tasks: Single-Document Text Summarization (SDTS) and MDTS. The main goal of the MDTS is to allow the users to have an overview of the topics and relevant information that exists in the collection of documents within relatively a short time [4, 41]. The MDTS has gained interest since the mid-1990s [12], starting with the development of evaluation programs such as Document Understanding Conferences (DUC) [43] and Text Analysis Conferences (TAC) [47].
The task of the ATS relies on combinatorial optimization algorithms to select the most essential parts of the text, which consist of finding a solution close to the best, which is considered an optimized solution [1, 42] such as Genetic Algorithms (GA), which is a search technique inspired by natural selection and genetic mechanisms [6, 48].
In the state-of-the-art, several approaches have been developed in the ATS. Where a score is usually assigned to each sentence of the documents, taking into account specific features that determine the relevance of the sentences [19, 53]. Some features that have been considered to include a sentence in the final summary are: keywords, similarity with the title, length of the sentence, identification of verbs, adverbs, proper names, and font types, reduction of redundancy, without however, among the most used, stand out the sentence position and coverage [3, 54].
In consequence, in this paper is proposed an EMDTS method using a GA, which considered two unsupervised text features: sentences position and coverage. Moreover, we analyze the most frequently used methodologies that we divide into two groups of works: methodology without considering all sentences and methodology with considering all sentences, we consider the second methodology for building the final summary [8, 57].
The organization of the paper is as follows: Section 2 the related work is described. In Section 3 the principal methodologies used in MDTS is presented. Section 4 explains the proposed method. Section 5 shows experimental configuration and results. Moreover, we show the comparison to other state-of-the-art methods and heuristics. Finally, the conclusions are presented in Section 6.
Related work
Multi-document summarization has been widely studied. Researchers all over the world working on multi-document summarization are trying different directions to see methods that provide the best results.
In [57] is proposed Multi-Document Summarization method using Sentence-based Topic Models with Bayesian algorithm to estimate the parameters of the model, this method calculates Probability distributions of selecting sentences (Term- document and term-sentence associations).
The proposal explicitly models the probability distributions of selecting sentences given topics and provides a principled way for the summarization task. An efficient variational Bayesian algorithm is derived for estimating model parameters.
On the other hand, in [44] is proposed NFM (Non-Negative Matrix Factorization), this method can improve the quality of document summaries because the inherent semantics of the documents are well reflected by using the semantic features calculated by NMF [57].
In [56] is proposed BSTM: Sentence-based topic model with Bayesian algorithm, this is extend the NMF model and provide a good framework for weighting different terms and documents.
While in [28] is present NeATS (Next Generation Automated Text Summarization). NeATS is an extraction-based multi-document summarization system. It leverages techniques proved effective in single document summarization such as term frequency, sentence position, stigma words, and a simplified version of maximum marginal relevance to select and filter content. To improve topic coverage and readability, it uses term clustering, a ‘buddy system’ of paired sentences, and explicit time annotation.
In [14] is introduced a stochastic graph-based method (LexPageRank), for computing the relative importance of textual units, relies on the concept of sentence salience to identify the most important sentences in a document or set of documents. Salience is typically defined in terms of the presence of particular important words or in terms of similarity to a centroid pseudo-sentence.
Otherwise in [7], multi-document summaries were constructed by utilizing complete sentences from the documents in the collection. Classic clustering techniques were employed in an attempt to partition the set of sentences into disjoint subsets or clusters, each of which contained sentences covering exactly one topic. Clusters are ranked by their similarity with the vector of the term frequencies of all terms appearing in the documents to be summarized.
Finally, in the research of [17, 36] Single Document Text Summarization methods were proposed base on the Genetic algorithm with the features Sentence position and coverage. Both present the hypothesis that the methods can be implemented into Multi-document text summarization. However, they were only tested for single-document Text Summarization.
In consequence, is proposes an EMDTS method, based on the GA where the fitness function is calculated considering two unsupervised text features: sentence position and coverage. Unsupervised text features may give rise to novel model constructions autonomously emerging from the text. The sentence position is based on the hypothesis that the Sentence importance decreases with its distance from the beginning of the document and coverage means that the summary generated should cover all subtopics as much as possible. This refers to the extent to which the information provided in the original documents is included in the summary.
Multi-document text summarization methodologies
In this paper, we consider the most frequently used methodologies that we divide into two groups of works: methodology without considering all sentences [40, 55] and methodology taking into account all sentences [3, 39]. The first methodology consists in building the individual summary for each document, and then construct the final summary. The second methodology consists in to join all documents of a collection in only one document, and then builds only one final summary. In this section, we describe two methodologies and explain because we apply the second methodology in this paper.
Methodology without considering all sentences
The first methodology uses so-called “meta” summarization procedure for generating multi-document text summaries [40], described as follows: Composing single summary: of documents, the relevant sentences are independently detected; in other words, essential sentences are detected locally, in the first step, for each document of the collection. Composing multiple-document summary: In the second step, each summary is merged, producing “meta” document, and summarized through the same or a different algorithm used in the previous step [52, 55].
This methodology is represented in Fig. 1. It is showed that each document of the collection of documents the essential sentences are independently detected.

This methodology hypothesized that the final multi-document summaries would be of higher quality since only relevant information is considered for MDTS. Nonetheless, this methodology does not take into account all the sentences of the collection, so it cannot reach the upper bound.
This methodology consists of the following steps: Combining all documents from a given collection: In the first step, a new single document is created containing all the documents from a given collection of documents. Composing a single summary: In the second step, a summary of this new single document is generated.
This methodology is represented in Fig. 2. Where is merging all documents from a given collection. That is, in a new single document, the set of all the sentences that the document collection contains is represented as D = {S1, S2, ... ,Sn}, D = { S1, S2, … , S n } , where S corresponds to the i sentence of the document collection and n is the total number of sentences in this collection. Likewise, a sentence is represented by the set S i = { ti1, ti2, . . . , t ik , . . . , t io }, where t ik , is the k - th term of the sentence S i and o is the total number of terms in the sentence [3, 39].

In this paper, we use the second methodology because we can consider all sentences for the final summary. The most recent state-of-the-art methods also use this methodology.
Pre-processing
In this step, the documents of the collection were chronologically ordered, then the original version is adapting to the entry of the format of the GA, where the original text is separated in sentences. Also, the text pre-processing is applied to the collection of documents. Firstly, the text was divided into words separated by commas; then, some tags were placed in the text to be able to differentiate quantities, emails, among others [36, 17].
Text model
The simple and most successful form for text modeling is the word sequences of a determinate n size, which are known as n-grams [25]. An n-gram is defined as a subsequence of consecutive elements in a given sequence; n-grams have been widely utilized in diverse ATS researches because of their easy extraction and because they decrease the loss of context (enhancing the extracted terms at more extended n sizes), making them robust for ATS [54]. We tested whit n-gram of length 2. [26, 36]
Genetic algorithm
The basic configuration of GA is defined as follows [11, 13]: the initial population is randomly generated, while the population of other generations is generated from some selection/reproduction procedure. The search process terminates when a termination criterion is met. Otherwise, a new generation will be produced, and the search process continues. The termination criterion can be selected as a maximum number of generations, or the convergence of the genotypes of the individuals. Genetic operators are constructed according to the problem to be solved, so the crossover operator has been applied to the generation of summaries.
The number of individuals and the number of generations is automatically calculated by the GA through equations 1 and 2, respectively. The number of individuals is determined by the number of sentences that the document contains through of the following equation [36]:
The number of generations is calculated through the following equation:
Fitness Function. The fitness function was used in the method [36, 17]. In this fitness function is evaluated two features, position sentences and, coverage. The main idea is that if all the sentences (see the Equation 3) had the same importance, it is could draw a line with the points that make up those coordinates as it is shown in Equation 4.
The idea for assigning more importance for the first sentences would be considered the first sentence with the importanceX n , the second with the significance of X n - 1.
Since the placement of the line indicates its importance, the midpoint of that line can be used to determine the slope of the line; thus, softening the significance of sentences. This situation would allow us to know how important a sentence is concerning the following, for this can use the general equation of the slope of the line.
For a text with n sentences, if the sentence i is selected for the summary then its relevance is defined as t (i - x) + x, where x = 1 + (n - 1)/2 and t is the slope to be discovered. To normalize the measurement of the position of the sentence (SentenceImportance), the importance of the first k sentences is calculated, where k is the number of selected sentences. Then the formula to calculate the significance of the first sentences would be as follows:
However, it is not the only value by which the GA should be governed since it would try to obtain only the first sentences. It is also necessary to evaluate that the summary has different ideas, that is, it is not repetitive, but at the same time, it has important words (Precision _ Recall). To measure both things the fitness function makes the summation of the frequencies of the n-grams that the summary weighs how significant are the n-grams obtained is the same but considering the original text, in this case only the most frequent n-grams according to the number of minimum words. These weightings Precision and Recall. Precision defines as a sum of the frequencies of the n-grams consider the original text, expressed as follows:
Recalldefines as a sum of the frequencies of the different n-grams of summary:
Therefore, the formula for obtaining Precision-Recall is:
Finally, to obtain the value of the fitness function, the following formula is applied, which is multiplied by 1000.
In the Fig. 3, it showed every step of the proposed method: input, text model, the process whit GA, the build summary, to a final summary.

We test the proposed EMDTS method using the dataset provided in DUC01 and DUC02 [43]. Traditionally, text summarization evaluation involves human judgments of different quality metrics, for example, coherence, conciseness, grammaticality, readability, and content [33]. We use ROUGE 1 to evaluate the proposed method, which is widely applied by DUC for performance evaluation [27]. It measures the performance of a summary by counting the unit overlaps between the candidate summary and a set of reference summaries.
Datasets
DUC02 and DUC01 dataset are used, which are benchmark data set of DUC for automatic summarization evaluation to empirically evaluate the summarization results. Table 1 gives a brief description of Datasets[30].
Description of Datasets [30]
Description of Datasets [30]
In this section, we describe and then compare the described methods and heuristics, in this paper, we consider the state-of-the-art methods that use the same methodology described in section 3.2.
Description heuristics
Experimentation results DUC02
In this section, the results that were obtained through the proposed method in DUC02 dataset are presented.
Table 2 shows the parameters that were used to get the results of the tasks of 200 and 400 words.
Parameters of the proposed GA
Parameters of the proposed GA
To determine the performance of the proposed method concerning state-of-art methods and heuristics, the advance was calculated, as follows: Since any method can be worse than randomly choosing sentences (baseline-random), the advance is recalculated as 0%. The best possible performance, topline, it is considered as 100%. Using baseline-random and topline is possible to recalculate the F-measure results to see an advance compared to the worst and the best results.
The comparison of results to the state-of-art methods and heuristics for 200 words, in F-measure of ROUGE-1 and Advance are presented in Table 4, there are 5 unsupervised and 1 supervised method, and 5 heuristics calculated, (topline, baseline-first, baseline-first-document, baseline-random and lead baseline).
Table 3 shows that exists a wide margin between the best method of selection (Baseline-first), and the best possible outcome of obtaining (Topline), the difference is 67.10%, this means that there are still efforts to be made in this task. Also, it is observed that none method-of-state-art has managed to overcome the heuristic Baseline-first.
Comparison of results to other methods and heuristics for 200 words (Rouge-1)
Table 4 shows the comparison of results to the state-of-art methods and heuristics for 200 words, in F-measure of ROUGE-2 and advance there are 5 unsupervised and 1 supervised method, and 5 heuristics calculated, (topline, baseline-first, baseline-first-document, baseline-random and lead baseline).
Comparison of results to other methods and heuristics for 200 words (Rouge-2)
For the task of 400 words summary length, we did not find the state-of-the-art and heuristics to compare. We calculate heuristics (topline, baseline-first, baseline-random, baseline-first-document, and lead-baseline).
Table 5 shows the comparison of results to the proposed method and heuristics for 400 words, in F-measure of ROUGE-1 and Advance.
Comparison of results to other methods and heuristics for 400 words (Rouge-1)
Table 6 shows the comparison of results to the proposed method and heuristics for 400 words, in F-measure of ROUGE-2 and Advance.
Comparison of results to other methods and heuristics for 400 words (Rouge-2)
Tables 5 and 6 expose that exists a wide margin between the best method of selection (Baseline-first), and the best possible outcome of obtaining (Topline). The difference is 65.21% for Rouge-1. Moreover, the difference is 60.22% for Rouge-2. We hope that this experiment serves as a reference for future works.
To unify all the performances obtained from Rouge-1 and Rouge-2 for 200 and 400 words, Table 7 shows, them in a unified positions,ranking using the equation 10, which has been used in [2, 45]
Ranking of the state-of-the-art methods (DUC02)
R r refers to the number of times that the method affects the r - th position. The number 6 represents the total number of methods involved in the comparisons.
In this section, the results that were obtained through the proposed method in DUC01 dataset are presented.
Table 8 shows the parameters that were used to get the results of the tasks of 200 and 400 words.
Parameters of the proposed GA
Parameters of the proposed GA
The advance was calculated such as explained in section 5.4. The comparison of results to the state-of-art methods and heuristics for 200 words, in F-measure of ROUGE-1 and Advance are presented in Table 10, there are 2 unsupervised and 1 supervised method, and 5 heuristics calculated in the state-of-the-art heuristics (topline, baseline-first, baseline-random, baseline-first-document, and lead-baseline).
In Table 9, we see the results of the state-of-art method and heuristics for task of 200 words in F-measure of Rouge-1, moreover the advance.
Comparison of results to other methods and heuristics for 200 words (Rouge-1)
In Table 10, we see the results of the state-of-art method and heuristics for the task of 200 words in F-measure of Rouge-2, moreover the advance.
Comparison of results to other methods and heuristics for 200 words (Rouge-2)
Tables 9 and 10 expose a wide margin between the best method (proposed), and the best possible outcome of obtaining (Topline). The difference is 68.50% for Rouge-1. Moreover, the difference is 71.00% for Rouge-2.
In Table 11, we see the results of the state-of-art method and heuristics for task of 400 words in F-measure of Rouge-1, also the advance.
Comparison of results to other methods and heuristics for 400 words (Rouge-1)
Table 12 shows the results of the state-of-art method and heuristics for the task of 400 words in F-measure of Rouge-2, further the advance.
Comparison of results to other methods and heuristics for 400 words (Rouge-2)
Tables 11 and 12 presents that exists a wide margin between the best method (proposed), and the best possible outcome of obtaining (Topline). The difference is 70.44% for Rouge-1. Besides, the difference between best selection method (baseline-first), and the best possible outcome of obtaining (Topline) is 72.04% for Rouge-2.
To unify all the performances obtained from Rouge-1 and Rouge-2 for 200 and 400 words, of DUC01 is showed in Table 13, them in a unified ranking, using the equation 11, which has been used in [2, 45].
Ranking of the state-of-the-art methods (DUC01)
R r refers to the number of times that the method affects the r - th position. The number 4 represents the total number of methods involved in the comparisons.
In this paper, we proposed the method for Extractive MDTS based on GA. The fitness function was calculated considered sentence position and coverage. We proposed the binary coding representation, selection, crossover, and mutation operators two different tasks for each of the collections of documents of DUC02 and DUC01 dataset were tested.
We tested different configurations of the most used methodology to generate Unsupervised EMDST summaries. Moreover, different heuristics such as topline, baseline, baseline-random, and lead baseline were calculated. The results obtained provide a point of reference for future research.
For future work, we will use more language-independent features as redundancy reduction, sentence length, and similarity with the title [53]. Also, we will consider other text models like sn-grams (Sintactic ngrams) [50] and MFSs (Maximal frequent sequences) [16, 51].
Footnotes
ROUGE (Recall-Oriented Understudy for Gisting Evaluation), toolkit (version 1.5.5.).
