Abstract
In the last 16 years with the existence of Document Understanding Conference (DUC), several methods have been developed in Automatic Extractive Text Summarization (AETS) that have allowed the continuous improvement of this task. However, no significant analysis has been performed to determine the significance of the AETS methods. In this paper, we present a new method based on a Genetic Algorithm to determine the best sentence combination of DUC01 and DUC02 datasets to rank the newest methods of AETS. Using three heuristics presented in the state-of-the-art, we rank the most recent AETS methods, obtaining upper bounds and recovering lower bounds of the state-of-the-art.
Introduction
Automatic Extractive Text Summarization (AETS) is a task contemplated in Natural Language Processing (NLP) that allows to reduce the textual content of a document or a set of them, by selecting a set of phrases or sentences more representative of the original text obtained from a method or a computational tool, using supervised and unsupervised learning techniques [30, 50].
Among the first advances made in AETS, has been considered in Luhn [17] and Edmunson [16] as the pioneers of Automatic Text Summarization (ATS) and, particularly, AETS. However, the most consistent developments of the ATS were through Document Understanding Conferences (DUC) since 2001 to 2007 organized by the National Institute of Standards and Technology (NIST) [11].
With the existence of the DUC conferences, several methods have been developed that have employed automatic learning techniques [3, 23], text connectivity [2, 8], text representation through use of graphs [36–39], algebraic reduction [19–21] and the use of evolutionary models [25, 34], with the purpose of generation of automatic extractive summaries that best resemble summaries made by humans.
One of the main challenges of AETS is to generate automatic extractive summaries that similar to summaries generated by humans (gold-standard summaries). However, for several domains, the gold-standard summaries are made abstracting summaries by substituting some terms and phrases of the original text. According to Verma and Lee [11, 35], the gold-standard summaries of DUC01 and DUC02 employ approximately 9% of words not found in the original documents [35]. Consequently, the level of maximum similarity will be less than 100%, and even more, if compared from several gold-standard summaries, the upper bounds will be lower for any AETS method.
In several previous works [30, 48] heuristics have been used to compare the performance of the AETS methods. These heuristics are known as Baseline and Baseline-random [30, 47], which allow to establish the minimum performance limits (lower bounds) by which extractive summaries must be generated. However, the AETS methods have not been ranked because the best extractive summaries obtained from Topline heuristic were not known [14].
To know the best extractive abstracts of Topline heuristic, techniques based on exhaustive searches have been used in the state-of-the-art to find the best combinations of sentences that best resemble those used by humans. In some previous works, these summaries are known as Oracle extracts [7]. However, methods based on this technique have been used in short documents, and despite this, large-scale processing techniques (clusters) have been used to evaluate all possible combinations [7, 15], because the increase of sentences of an original text represents an exponential growth in the space of solutions. Therefore, using a method based on exhaustive searches to evaluate all possible combinations is not feasible to use.
In the other hand, the use of several evolutive methods in AETS has represented a viable solution generating extractive summaries of superior performance. These types of techniques include the use of Genetic Algorithms (GA) [30] and Memetic Algorithms (MA) [26]. Therefore, using optimization algorithms, is a viable solution to obtain extractive summaries closest to the best ones represented. In this paper, a GA is used to obtain the combinations of sentences that best resemble selected by humans using the ROUGE-1.5.5 system.
The rest of the paper is organized as follows: Section 2 presents some related works that have used techniques based on exhaustive searches to determine the best extractive abstracts. Section 3 describes the general process of a GA. Section 4 describes the structure and development of the proposed GA. Section 5 shows the GA experimental configuration to determine the highest performance sentence combinations for calculating Topline of DUC01 and DUC02 dataset. In addition to compare the performance of Topline with some methods and heuristics used in the state-of-the-art, a ranking in the performance of AETS methods are showed. Finally, Section 6 shows the conclusions and future work.
Related works
In the last years, with the existence of Document Understanding Conferences, many advances have been made in the development of ATS. However, to know and determine the best extractive summaries, few studies have been carried out, and some of them use techniques based on exhaustive searches to determine the best combination of sentences that best represent the judgments made by humans. Lin and Hovy [5–7] developed a comprehensive search-based method to find the best combinations of a document by taking the first 100±5 and 150±5 words of the DUC01 dataset, and evaluating sentence combinations by co-occurrence of bag-of-words of the ROUGE system.
This work arises due to the idea that the best combination of sentences is substantially better than any other AETS method in the state-of-the-art, allowing to know the upper bounds that any AETS method can achieve [7, 14]. However, the main drawback that affected the performance of this procedure was exponential increase of the search space that implies the number of sentences of each document. For example, if we use a document of 100 sentences and it is inferred that on average each sentence has a length of 20 words, then to find the best extractive summary of 100 words should take the best 5 sentences of the 100 available (
In 2010, Ceilan [15] introduced a same method based on exhaustive searches to find the best sentence combinations. Unlike Lin and Hovy [7], this work was applied to summaries from different domains (literary, scientific, journalistic and legal) using a probability density function from the weights established by the ROUGE system to reduce space search solutions. However, in the experimentation stage, it was necessary to modify ROUGE-1.5.5 Perl-based script to process different combinations of sentences in a cluster of computers to distribute the processing of the documents. In addition, in the news domain it was necessary to divide the original document into several sub-sections to reduce the search and processing time of each document by discriminating the different possible combinations that can be generated.
In 2017, Wang [43] used a new strategy for finding the best combinations of one and multiple documents, using nine sentence reduction heuristics that present a low relation to the gold-standard summary. Subsequently, the remaining sentences are introduced through seven weighting methods to measure the similarity of the candidate sentences in relation to gold-standard summaries. However, the use of several heuristics to determine the best combinations of sentences in different domains and different entries allows the increase of the computational cost to find the best combinations of sentences. In addition, for summaries of a document only a single gold-standard summary was used and in the case of summaries for multiple documents only 533 documents of 567 of the DUC02 dataset were used, generating more biased results.
In this paper we propose the method based on the use of GAs to find the best combinations of sentences that can be generated from the summaries of DUC01 and DUC02 dataset and rank AETS state-of-the-art methods.
Basic genetic algorithm
The GAs [29, 49] is a technique of optimization and iterative, parallel, stochastic search inspired by the principles of natural selection proposed by Darwin in 1859 [4]. The GAs was proposed by John Holland in 1975 as a method that pretends to simulate the actions of nature in a computer to optimize a wide variety of processes [1, 41]. Nowadays, GA is the most widely used evolutive computing method in the optimization problems [41].
A traditional GA is characterized by representing the solution of a problem in individuals, which are represented by variable bit strings and together form a population [24]. GA begins with a population of N pop individuals who share a set of n characteristics for each generation g, where each i-th individual X i is randomly generated as shown in Equation (1).
Each individual X
r
(g) is evaluated from a specific adaptation value (fitness function) to determine the quality of individuals and its proximity to the optimal values of GA [24, 41]. From the value obtained as a fitness function, a selection of individuals is performed, where each pair of parents X
p
(g) and X
m
(g) is chosen to participate in the cross-step forming individuals Y
i
(g), which have combined characteristics of X
p
(g) and X
m
(g). Finally, the new individual Y
i
(g) is introduced to the mutation stage, where partial and minimal modifications are made to generate an individual Z
i
(g). As mentioned by Mendoza, the mutation of individuals is based on a probability P [26], as shown in Equation (2).
The selection, crossing, and mutation of individuals are iterated until they meet a certain termination criterion, these criteria are based on the number of iterations, the convergence of individuals of a gene, and on a fitness function [49]. In summary, the process that conducts a GA is guided in Fig. 1 [24, 25].

In general, the proposed method is integrated of the steps and procedures of the basic GA of Section 3. The GA proposed evaluates several combinations of sentences in an optimized search space, which are candidates in representing the best extractive summary of one or multiple documents.
Solution representation
In the proposed GA, the solution is presented using a coding of individuals considering the order of sentences that can appear in extractive summary. Therefore, each individual X
i
is represented in a vector of n positions [P1, P2, …, P
n
], where each position includes a set of sentences {S1, S2, …, S
n
} of the original document D, and the union of all the sentences will represent the content of the original document, as shown in Equation (3).
For each coding to be considered as an extractive summary, the first sentences are considered from a set of words. For example, if we have a document with n = 10 sentences and we generate an extractive summary of 100 words with an average of 20 words per sentence, then the position vector can use a sequence equivalent to [1–9] indicating that the possible solution begins with sentences 4 and 1, ending with sentence 9, although only the first 5 sentences will be taken into account to comply with first 100 words as a summary.
The fitness function is an important stage for the performance of the GA and is the value by which the quality of the summaries is maximized with the passing of (g + 1) generations. To measure the quality of each summary, F-measure maximization based on the co-occurrence of bag-of-words and bigrams evaluated from ROUGE-1.5.5 system was used [5]. The maximum F-measure value of the individual X k (g) obtained from X i (g) population determine the best combination of sentences found in GA. This maximization is shown in Equation (4)
If the individual X k (g) have the greatest co-occurrence of n-grams from the all generations g of populations X i (g), then it will have the best combination of sentences when obtaining the largest number of retrieved n-grams.
The most common strategy for initializing the population (when g = 0) must be generated with codifications of random real numbers for signature each sentence of the set D = {S1, S2, …, S n } in each position P i . Therefore, the first generation of individuals will be according to Equation 6
The selection is the GA step that allows to take a set of individuals X c from a generation g to obtain the greatest fitness values with the purpose of obtain better individuals in g + 1 generations.
One of the methods of selection most known of GA is the elitism stage, which has the quality to choose a set of individuals of better aptitude in the generation g to pass to the generation g + 1.
According to [26], if we have Pob (g) ={ X1 (g), X2 (g), …, X N pop (g) } as a population of individuals ordered from greater to lesser fitness, then the set of individuals that will be pass to the next generation is (g+ 1) = { X1 (g), X2 (g), …, X e (g) } where E (g + 1) ⊆ Pob (g), e < N pop , and e is a parameter that specifies the number or percentage of individuals to be selected by elitism. However, for the selection of individuals it is required to use at least one other selection operator to maintain N pop individuals for each generation.
To select the remaining individuals from each generation, we propose to generate new offspring from the tournament selection operator by taking several samples of N Tor randomly selected individuals to obtain the best fitness value individual [41], as shown in Equation (7)
For the crossover of individuals, we use the cycle crossover operator (CX). This operator has the capacity to generate new offspring from the genetic coding of each pair of parents, considering their hereditary characteristics [41]. For the CX operator to be started, a starting point must be selected for genetic exchange. Therefore, if we have a pair of parents Xp1 (g) and Xp2 (g) which represent pairs of parents to cross, then we use a randomly generated starting point to exchange information from both parents and generate a new individual Y
i
(g), as shown in Equation (8)
Remembering the Equation (2) of the Section 3.1, the mutation stage takes a set of individuals Y i (g) to generate individuals Z i (g) modifying some features for each generation g. We used the insertion mutation operator to select a pair of genes of the individual Yi,t (g) and Yi,r (g) randomly to insert the gene Yi,t (g) in the gene Yi,r (g) [1], as shown in Equation (9). Therefore, if the random value rand is between the value 0 and P, then the mutation of individuals is performed by insertion operator, otherwise the individual is not modified.
For the replacement of individuals, we propose to integrate the set of individuals generated by elitist selection (E (g + 1)) and the set of individuals Z
i
(g) from the mutation stage, to integrate the population of the next generation X
i
(g + 1), as shown in Equation (10).
The termination criterion used to halt GA iterations is determined by several generations established as the execution parameter.
Experiments and results
In this section, we present the experiments carried out by the proposed GA to calculate Topline of extractive summaries using DUC01 and DUC02 datasets and the performance of some AETS methods and heuristics in the state-of-the-art.
Datasets
The DUC datasetsare the most used by researchers in the AETS of a document and multiple documents highlighting DUC01 and DUC02. To measure the proposed GA performance, we used DUC01 and DUC02. DUC01 and DUC02 are products of workshops organized by the National Institute of Standards and Technology (NIST) for the development of ATS. The documents that make up these collections are based on news articles from some news agencies such as The Financial Times, The Wall Street Journal, Associated Press and others [11, 28].
DUC01 dataset consists of 309 English documents grouped into 30 collections, each collection containing an average of 10 documents based on news articles addressing natural disaster issues, biographical information, and others [26, 27]. Each original document of DUC01 was assigned two gold-standard summaries generated in abstractive form by two humans, containing approximately 100 words. For the ATS of multiple documents, two abstracts were generated for each collection generating 60 abstract summaries that have lengths of 50, 100, 200 and 400 words [27].
DUC02 dataset consists of 567 news articles in English grouped into 59 collections, each collection contains between 5 and 12 documents dealing with topics of technology, food, politics, finance, among others. Like DUC01, this dataset is mainly used for two tasks, the first is to generate summaries of a document, each document had one or twogold-standard summaries that had a minimum length of 100 words. The second task is to generate summaries of multiple documents. For the AETS of multiple documents, one or two abstracts were generated for each collection, generating 118 abstracts/extracts with lengths of 10, 50, 100, 200 and 400 words [28]. Table 1 shows the general data for each dataset.
Datasets main characteristics
Datasets main characteristics
For determine the upper bounds of DUC01 and DUC02, different tests were carried out with some adjustments of parameters with the objective of obtaining the best extractive summaries. Table 2 shows the best tuning parameters applied to GA proposed to calculate the best extractive summaries of a document.
AG parameters to calculate Topline of DUC01 and DUC02 for AETS
AG parameters to calculate Topline of DUC01 and DUC02 for AETS
The fitness value of each solution is obtained from the n-gram specification to be evaluated by the ROUGE system. In this paper, the unit of evaluation based on the co-occurrence of bag-of-words and bigrams (ROUGE-1 and ROUGE-2) was used, to compare the performance of the most novel state-of-the-art methods in relation to set of gold-standard summaries [6].
As mentioned in Section 1, the importance of knowing the best extractive summaries consist in determining Topline from the extractive summaries of one and several documents and reweight the most novels methods in the state-of-the-art. In this section, we present a performance comparison of the state-of-the-art methods and their advances with respect to performance obtained from the Baseline-first, Baseline-random and Topline heuristics. The methods and heuristics involved in this comparison are the following:
DE, FEOM, QCS and MA-Single DocSum methods do not participate in the following comparisons, because the sentence segmentation stage is not performed according to DUC01 and DUC02 workshops.
For comparing and reweigh the performance of the methods previously described with the heuristics of the state-of-the-art, we used the evaluation based on the co-occurrence of bag-of-words and bigrams (ROUGE-1 and ROUGE-2) of the ROUGE system [5, 6] using the function of Equation (11) to establish the performance of each state-of-the-art method respect to the best extractive summaries obtained by the proposed GA.
Tables 3 and 4 shows the average results of ROUGE-1 and ROUGE-2 when calculating the Topline of 309 documents of DUC01 dataset and 567 documents of DUC02 dataset using the GA parameters presented in Table 2. The performance of the state-of-the-art methods are shown in this comparison.
Results of ROUGE-1 and ROUGE-2 methods and heuristics on DUC01
Results of ROUGE-1 and ROUGE-2 methods and heuristics on DUC02
According to the results presented in Tables 3 and 4, Topline performance is substantially distant from other state-of-the-art methods, as mentioned by [7, 43]. For DUC01, Topline obtained a performance equivalent to 59.408 with ROUGE-1 and 33.422 with ROUGE-2, while the best state-of-the-art methods are NetSum obtaining 46.427 with ROUGE-1 and GA-Summarization obtaining 19.762 with ROUGE-2. For DUC02, Topline obtained a performance equivalent to 62.367 with ROUGE-1 and 35.742 with ROUGE-2, while the best state-of-the-art methods are UnitifiedRank obtaining 48.487 with ROUGE-1 and SentenceFeatures obtaining 22.471 with ROUGE-2.
A comparison of the level of advance of the most recent state-of-the-art methods is shown in Tables 5 and 6. To determine this performance, we use the formula (12) based on the premise that the performance of Topline heuristic is 100% and Baseline-random is 0%.
Ranking of state-of-the-art methods and heuristics for DUC01
Ranking of state-of-the-art methods and heuristics for DUC02
The best state-of-the-art method of the Table 5 presents an advance equivalent to 43.12% for ROUGE-1 and 38.39% for ROUGE-2. Therefore, it follows that for the development of the AETS task there is 56.88% for ROUGE-1 and 61.61% for ROUGE-2 to be explored. In the other hand, it is observed that the performance of Baseline-first heuristic is better than several methods of the state-of-the-art in F-measure of ROUGE-1 (33.68%), surpassing to Manifold Ranking (29.67%) and TextRank (19.70%), in F-measure ROUGE-2 (38.11%) is better than NetSum (29.07%), UnitifiedRank (28.84%), CRF (27.41%), SVM (26.01%), Manifold Ranking (24.28%) and TextRank (12.64%) methods.
The best state-of-the-art methods present an advance equivalent to 41.06% for ROUGE-1 and 40.61% for ROUGE-2 (see Table 6). Therefore, it follows that for the development of the AETS task there is a 58.94% for ROUGE-1 and 59.39% for ROUGE-2 to be explored.
In the other hand, it is observed that the performance of Baseline-first heuristic is better to several state-of-the-art methods for ROUGE-1 (36.00%) and ROUGE-2 (39.44%). The performance of the Baseline-first heuristic remains better than several state-of-the-art methods. However, the methods Unitified Rank, Sentence Features and GA-Summarization are better that this heuristic.
The performance of Baseline-random heuristic (0%) was expected to be the lowest in this comparison. However, the methods NetSum (–9.97%), CRF (–11.06%), SVM (–11.31%) and Manifold Ranking (–12.16%) show the lowest performance of ROUGE-2 for DUC02.
In general, the Tables 5 and 6, the new reweighting of the state-of-the-art methods is observed. However, it is possible to determine the level of significance of the best methods of each comparison. To perform this determination, we used the ROUGE-1 and ROUGE-2 reweighting in DUC01 and DUC02 based on Equation (13).
Tables 7 and 8 presents the results obtained to determine the improvement produced by NetSum and GA-Summarization methods with respect to the other state-of-the-art methods in F-measure of ROUGE-1 and ROUGE-2 on DUC01 data respectively. In general, the percent of improvement of NetSum method is in a range of 10–20 percent to some state-of-the-art methods. In the other hand, NetSum method presents an improvement percent greater than Manifold Ranking (45.31%) and TextRank (118.86%) methods (see Table 7). The percent of improvement of GA-Summarization method to other state-of-the-art methods is very distant for ROUGE-2 on DUC01 data. However, the Sentence Features method is close to GA-Summarization with 1.72% (see Table 8).
Improvement percent of NetSum to other methods (ROUGE-1)
Improvement percentage of GA-Summarization to other methods (ROUGE-2)
Table 9 presents the results obtained to determine the percentage of improvement of Unitified Rank method with respect to the other state-of-the-art methods in F-measure of ROUGE-1 on DUC02 data. It is observed that the percent of improvement by Unitified Rank is very close to Sentence Features (0.67%) and GA-Summarizarion (2.22%) methods. Therefore, these improvement percent are not significant. In the other hand, the other methods show a more significant difference exceeding 50%.
Improvement percentage of UnitifiedRank with other methods on DUC01 (ROUGE-1)
Table 10 shows the comparison of the improvement percentage of the Sentence Features method in relation to the other state-of-the-art methods in F-measure ROUGE-2 on DUC02 data. The GA-Summarization method is not significantly distant, obtaining 1.48%. The other methods present percentages greater than 10%. In the other hand, some methods present the label (N/A), because their performance in Table 6 is less than 0% and therefore is not calculable.
Improvement percentage of Sentence Features with other methods on DUC02 (ROUGE-2)
To unify all the performances obtained from ROUGE-1 and ROUGE-2 on DUC01 and DUC02, we propose to use the data from the Tables 5 and 6 to show them in a unified ranking of positions, considering the position of each method of ROUGE-1 and ROUGE-2 on DUC01 and DUC02 datasets.
The results of the unification of these results are shown on the Table 11, using the Equation (14) which has been used by [26, 33].
Ranking of the state-of-the-art methods
Table 11 shows that the performance of GA-Summarization (3.250) and Sentence Features (3.250) methods show the best positions in the method rankings. However, the methods UnitifiedRank (3.125) and NetSum (2.875) show a good performance with the same results.
In the other hand, the methods based on evolutionary approaches show a good tendency to obtain the best levels of performance compared to the machine learning methods. Therefore, these methods represent a viable solution for generating high performance extractive summaries.
The state-of-the-art methods to obtain the upper bounds have been based on exhaustive searches to obtain the best extractive summaries. However, GAs have not been used to obtain extractive summaries from Topline heuristic to reweigh the performance of the AETS methods.
In this paper, some GA operators were used to obtain the best extractive summaries. As a fitness function, it was proposed to use ROUGE-N method of ROUGE-1.5.5 system to evaluate the quality of the generated GA combinations.
In the state-of-the-art, the maximum possible performance value of the AETS was unknown. However, it was possible to approximate the best summaries with the use of GAs, to know the scope of the methods of the AETS.
With the determination of the best sets of sentence combinations, it is possible to introduce them to a supervised machine learning model to improve the quality of extractive summaries.
The best state-of-the-art methods of DUC01 show performance equivalent to 43.12% for ROUGE-1 and 38.39% for ROUGE-2 (reported in Table 5). Therefore, it follows that there is still 56.88% for ROUGE-1 and 61.61% for ROUGE-2 to be explored. The best state-of-the-art method of DUC02 show a performance equivalent to 41.06% for ROUGE-1 and 40.61% for ROUGE-2 (reported in Table 6). Therefore, it follows that there is still a 58.94% of ROUGE-1 and 59.39% of ROUGE-2 to be explored.
With the use of AGs and Topline heuristic, it was possible to reweigh the AETS methods to obtain more objective results and to generate a rank matrix (reported in Table 11), which shows in general the performance of the state-of-the-art methods.
With the new ranking of the state-of-the-art methods, it was possible to determine the percentages of significant improvement among the best state-of-the-art methods.
In Tables 5 and 6, it is observed that the percentage of significance is much close between several methods of the state-of-the-art, so it will be very important to analyze the quality of the summaries generated by means of a Turing test, to demonstrate if the level of achieved performance of extractive summaries is confounded with summaries created by humans.
