Abstract
Rough Sets provide a mathematical tool to handle decision making under uncertainty. One major domain that can be characterized with inherent ambiguity is natural language texts which often leads to uncertainty in understanding the intent and relative importance of a sentence with respect to its context in the whole text. As a consequence, the process of sentence selection for generation of extractive summary can logically be considered as a process of decision making under uncertainty. In this paper we use rough set based techniques to deal with this uncertainty. This paper’s contribution is two-fold. Firstly, this paper proposes a novel Rough Set based uncertainty measure called span and define special Rough subsets of universe called spanning sets. Span is Rough Set based measure for salience of a subset of universe and spanning set is the subset that maximizes the span. This corresponds to the key elements representing a problem and can be used to solve various real-life applications. Secondly, the concepts are applied to determine extracts of text documents. The idea behind the present work is to determine the most suitable subset(s) of the universe of sentences under consideration. An optimization problem is formulated to generate the extract for the text under consideration using the proposed uncertainty measure of span and is solved using Particle Swarm Optimization. The experimental results on DUC2001, DUC2002 single document data sets and Enron Email datasets establish the effectiveness of the proposed technique. There has been substantial work on Rough Sets though considering a stochastic Rough-subset of the universe and determining its aptness as a representative of the universe is still unexplored. The proposed technique is a novel effort to fill this gap.
Keywords
Introduction
Text Summarization [20] condenses a source text into a concise version that preserves the information content while reducing the size of the input text. Typically, summarization is of two types: extractive and abstractive. While abstractive summarization generates new sentences conveying the gist of the input text, an extractive summarization method typically selects important sentences from the original document as they are, and concatenates them to form the summary. Thus, extractive summarization can be viewed as a decision making process, where with respect to each input sentence the system takes a decision regarding possible inclusion of the sentence in the summary.
Summarization systems in recent times have assumed significant importance because of preponderance of text data in electronic form available through the web, various repositories, data centers, among others. Although various techniques have been developed for Text Summarization over the last 60 years or more, no significant work has so far been reported on Text Summarization using Rough Set based techniques.
Rough Set [24, 25] theory is a well-developed mathematical methodology to deal with imperfect data. Rough Set based techniques have been used in different areas of Artificial Intelligence, including Economics and Finance [32], Data Mining [11], Data Classification [12, 27], Information Retrieval [29, 31], Feature Selection [14, 35] among others.
The foundation of Rough Set lies in partitioning a given data set into three classes: positive region, negative region and boundary region with respect to a given knowledge, viz. set of entities represented as attribute-value pairs. While the positive and negative regions clearly demarcate the entities belonging to a specific class or outside it, the boundary region comprises of entities for which such a clear decision cannot be made with certainty. The basic partitioning tool used here is called indiscernibility relation. The underlying intuition is to identify elements that are not discernible from each other in terms of the knowledge R, where R stands for the set of attributes of the elements of U, the universe. The pair (U, R) is typically termed as Information System. In the theory of Rough Set any subset of the universe is represented as a pair of lower approximation and upper approximation using the indiscernibility relation. With respect to classical Rough Sets [24] the indiscernibility relation establishes a collection of equivalence classes that partitions the universe into disjoint collection of elements.
The motivation behind the work is to propose a novel Rough Set based uncertainty measure called span to evaluate the importance of a subset of a given universe. The span of a subset is the weighted average of the proportion of certainly covered concepts, as given by the lower approximation, and proportion of possibly covered concepts, as given by the boundary region of the subset. The subset of the universe that maximizes the proposed measure of span shall be referred to as spanning set. A spanning set is an equivalent representation of the universe satisfying certain user-specific criteria, and will have cardinality smaller than the universe. Several, Rough Set based uncertainty measures, such as membership value, accuracy and roughness have been studied in literature [9, 39] though none of them deals with the spanning capabilities of a subset of universe. The aim of this paper is to apply the proposed concepts to solve the problem of Extractive Text Summarization.
Rough Set based Text Summarization was proposed in literature by Yadav & Chatterjee [34]. The overall scheme proposed therein consists of two major steps. Firstly, determining the key sentences using heuristic function and secondly, selection of sentences using Rough Sets based lower approximation and memberships. We consider this as our baseline system for comparing results. In our experiments with the baseline system we observed the following. Firstly, selecting the candidate set of sentences based entirely on heuristics is not adequate for the task. Secondly, selection from this set based on approximations and membership alone is not suffice for an efficient summarization.
We, in this paper have worked on these drawbacks and improved upon them in the following way. Typically, for extractive summarization of an input text an upper bound is imposed on the size of the target summary in terms of number of words, or number of sentences included in the summary. Hence the task of extractive summarization may be viewed as a problem of generating a subset of the universe of sentences which is the best representation of the universe with a reduced cardinality and without losing on essential information.
The above problem can be viewed as an application of the proposed uncertainty measure i.e. span. The subsets are generated in a dynamic manner and corresponding spanning sets are computed. We have defined the concept of complete span to address the drawback which arises due to sparse indiscernibility relation. The complete span measures the cumulative effect of each attribute of the Information System to produce a spanning set. A spanning set of a collection of sentences under certain constraints represents a set of distinct sentences having maximum information. The extract is computed using spanning set and document similarity. The optimization problem of finding spanning set is solved using Particle Swarm Optimization. Although applied to the domain of extractive text summarization, the novelty of the proposed uncertainty measure is that it can be used to solve various real-life applications as well.
The paper is organized as follows. Section 2 presents the previous work on Text Summarization. Section 3 gives introduction to Rough Sets and the notations used. In Section 4 the proposed concepts of Rough Set based span and spanning set are defined and explained. Section 5 provide the techniques to apply these concepts to compute extractive summary of text documents. Section 6 discusses the results of experiments conducted on single document Text Summarization viz. DUC2001, DUC2002 and Enron datasets. Finally, Section 7 concludes the paper.
Previous works on text summarization
A significant amount of research has been done in the area of Text Summarization over the last 50 years or more. We here discuss some of the important and novel techniques developed and used over the last few decades. In 1950s, due to the lack of powerful computers and difficulty in natural language processing (NLP), early works in this direction [8, 19] focused on the study of text genres using statistical and/or heuristic approaches. Some heuristics that were found to be useful have been word frequency, sentence location, presence of title keyword etc. However, the major contributions towards extractive Text Summarization have been in last two decades or so as mentioned below.
Latent Semantic Analysis (LSA) has been popularly used for Text Summarization [10, 30]. In this approach the term by sentence matrix of the given input text is decomposed into three matrices and subsequently important sentences are extracted depending on the compression degree. However, the major shortcoming of the LSA based techniques is that the matrix computations are expensive in both time and space. As a consequence, better approach has to be developed when a large volume of electronic data is made available and rapid processing of these documents is needed for summarization and corresponding information extraction.
Graph based methods [21, 22] are language independent summarization techniques. Graph based techniques work on ranking algorithm viz. Google’s PageRank algorithm [23]. The method is well established, but it is based on ranking of the sentences of a document. Hence problems may arise when multiple sentences with similar information are present in the text document.
Random indexing (RI) based Text Summarization was first proposed by [4]. Random Indexing differs from traditional Word Space Models in the sense that here instead of conventional word vectors one uses ternary vectors of predetermined size to represent words. Chatterjee & Sahoo [5] have advanced the basic RI based model using convolution in their Modified Random Indexing approach. The benefits of using Random Indexing based techniques are: (i) Here computational cost is much lower as compared to LSA. (ii) The dimensionality is maintained at a fixed number and does not increase with the addition of new sentences in the document. (iii) Experimental results show improved accuracy. However, one disadvantage of this technique is that it uses PageRank based ranking, and consequently is limited by the disadvantages of the graph based methods as mentioned earlier.
Some of the other summarization methods that have used optimization techniques for Text Summarization are the following: Integer Linear Programming [1] method for summarization constructs summaries by solving 0-1 Integer Programing Problem, where either a sentence belongs to a summary or it does not. The linear constraints to the optimization problem are to maximize the importance of the selected sentences and minimize their pairwise similarity. Archetypal Analysis (AA) [2, 6] works on the concept that the input text can be represented by archetypes such that input is represented as convex combination of these archetypes which themselves are convex combination of input. Further, archetypes are utilized to determine the best sentences. However, to determine these archetypes various convex programming problems need to be solved, making summary extraction a difficult task. Evolutionary Algorithms. Genetic Algorithms have been used to create text summaries [37]. The population for Genetic Algorithm is a collection of chromosomes which have genes with binary values of ones and zeros. Optimization is based on maximizing topic relation factor and cohesion. Karwa & Chatterjee [16] have recently used differential evolution for extractive Text Summarization.
Rough sets
Rough Set theory [24, 25] is a well-developed mathematical tool for decision making under uncertainty. Since its inception Rough Set has been used in different areas including classification, feature selection, data mining, knowledge acquisition, information retrieval, decision analysis to mention few. As discussed above, a knowledge representation system is a pair (U, R) where U, the universe, is a nonempty, finite set of objects, and R is a nonempty, finite set of attributes. Every attribute imposes an equivalence relation on U. For any set of attributes P ⊆ R, the indiscernibility relation is defined as:
The equivalence classes of IND(P) are denoted by
Mathematically,
The following section discuss the proposed Rough Set based uncertainty measure viz. span and spanning sets.
Rough set based span, spanning set and weighted rough sets
Span and spanning sets
Let (U, R) be an Information System and X be a subset of U To compute the degree of possibly included and certainly included concepts by a set X ⊆ U, we propose the notion of span. Formally, we define the span for a subset X of U as follows.
Span is defined as the convex combination of degree of certainly covered concepts and possibly covered concepts by the set X. Convex combination is computed for a tradeoff between lower approximation and boundary region. The relative importance of certainly contained and possibly contained concepts are determined by the weights where the weights are application specific entities. We define spanning set of (U, R) as follows.
We shall now discuss some properties of span and spanning set. Before the results an important Lemma is presented.
BN
P
(X) ⊆ BN
Q
(X)
Proposition 1 and Proposition 2 below state that the span of a subset of attributes is larger than the span with respect to the complete attribute subset, subject to some constraints.
Since Q ⊆ P ⊆ R by Lemma 1,
Therefore, δP,X ≤ δQ,X + θ, where
Then since Q ⊆ P ⊆ A,
Hence,
Hence, δP,X ≤ δQ,X
It is evident from these results that a smaller attribute subset always leads to a higher span, subject to satisfying constraints given in Propositions 1 and 2. The above knowledge helps us to identify suitable attributes for computing a spanning set. This motivated us to define the complete-span measure as follows.□
The corresponding spanning sets obtained using complete span shall be referred to as complete spanning set. The complete spanning set of least cardinality is called minimal complete spanning set.
Proposition 3 given below follows from the definition of span.
If X = U, and w1 = 1 then span is maximum and its value is 1.
If X = U, and w1 = 0 then span is minimum and its value is 0.
If w1 = w2 = 0.5, then span is equal to
If BN
P
(X) = U, then span is 1 if w1 = 0 and span is 0 if w2 = 0.
If X =∅, then span is 0 irrespective of values of w1 and w2.
The proofs of (i) to (v) follow from the definitions.
The following proposition proves relation that exists between two different span for a set X.
δP,X′ ≤ δP,X if u1 > w1
δP,X′ ≥ δP,X if u1 < w1
Also, when
Given,
δP,X′ ≤ δP,X
The other inequalities follow similar explanations.□
Proposition 5 provides relation between upper approximations of a spanning set and any other set.
If w2 ≤ w1, then
If w1 ≤ w2 then
If w2 ≤ w1
Similarly, the other inequality follows.□
By definition,
Given that
Hence, it follows that
The accuracy given by
The roughness given by ρ
P
(X) =1 - α
P
(X) satisfies ρ
P
(X) ≤ ρ
P
(Z)
ρ
P
(X) ≤ ρ
P
(Z)
□
In Proposition 6 it is proved that boundary region of spanning set is largest given the condition on lower approximation cardinality. In Proposition 7 it is proved that roughness of a spanning set of given dimensions is highest provided it satisfy some criterion and the accuracy of the decision system formed with the spanning set is least. Hence, spanning set is a Rough Set with high degree of roughness.
Now, we define weighted Rough Sets as follows:
Using weighted Rough Sets each class of
Weighted span measures the spanning capability of a subset using weighted measures of lower approximations and boundary region defined above.
The corresponding spanning sets obtained using complete span shall be referred to as complete weighted spanning set. The complete spanning set of least cardinality is called minimal weighted complete spanning set.
Computing spanning sets
The following example illustrates the concept of computing complete spanning set.
Example on span and spanning set computation
Example on span and spanning set computation
Let R = {x1, x2, x5} , w1 = 0.1 and w2 = 0.9
We compute the span with respect to the three attributes as follows,
The complete span is given by
Now, consider another set Y = {x0, x1,x3} and weights being same as above.
Then,
The span of Y for this problem is less than span of X which means that the set X represent the universe in a better way than the set Y. It is evident from this example that different subsets of U produce different span and complete span.
Table 2 gives the complete spanning sets, minimal complete spanning sets and complete span for the Information System given in Table 1. To compute complete spanning set for the given problem, it requires computing 27 viz. 128 possible subsets of universe and finding the one with maximum complete span. This grows exponentially as size of universe increase.
Minimal Complete Spanning set for Example 1
Section 5 describes application of the concepts developed in this work for dealing with the problem of Text Summarization.
Here, the application of Rough Set based span and spanning set to solve the problem of Extractive Text Summarization is proposed. For the task of extraction, the Information System (U, R) has to be constructed. The universe U consists of all the sentences in the given text. The set of attributes R consists of the features that describe the text.
Determining extract of text using span
The problem of Text Extraction evaluates the sentences for their salience. The extract so created includes all essential information since the span of a text document with more emphasis on boundary region shall cover as much key topics present in the text as possible. This follows from the fact that while lower approximations guarantee that certainly covered topics are included in the summary, a larger boundary region implies higher number of topics of the text will be covered in the extractive summary. Hence spanning set is the subset of all the sentences of the text that includes maximum number of concepts of the text represented as individual features. Further, concepts in different equivalence classes are uniquely distinct this reduces redundancy in the generated extract.
Computing an extract requires evaluating subsets of universe for their span and finding the one with highest value. Evaluating span of all subsets is computationally expensive. Hence, we solve it using optimization problem. In the following section, we describe the optimization problem used to compute the optimal spanning set for the problem of Text Summarization.
Optimization problem
Consider Z to be a subset of U which is the collection of sentences, i.e. Z is spanning set of U. Experiments show that spanning set for an Information System (U, R) may not be unique, hence it is paramount to add additional application specific objective to the optimization problem. For the given problem of extractive summarization, the additional objective function is span with higher similarity with the text document.
Hence, determination of Rough Set based extract for single document summarization is a multi-objective optimization problem given as follows:
maximize (δP,Z, similarity (Z, Document))
subject to,
length(Z) ≤p,
p is the degree of compression.
Here, notations are as discussed in Section 4 and δP,Z is the complete span of all the attributes under consideration.
Proposed solution for computing spanning sets of text documents
Computing a spanning set is a computationally complex problem. In this paper, we propose to use Particle Swarm Optimization [7] to compute a complete spanning set. Particle Swarm Optimization (PSO) is inspired by social behavior of birds and insects. A collection of particles participates to generate an optimal solution to a problem. Each particle has a velocity and a location. The particles move in direction simulated by the guidance of global best (GB) and local best (LB) positions. The velocity vi and position xi of a particle are updated as follows:
Experiments and observations
Experiments
Experiments of the proposed methodology have been conducted on benchmark DUC datasets for Single Document Summarization, viz. DUC2001, DUC2002 (http://www-nlpir.nist.-gov/projects/duc/) and Enron Email datasets [18] viz. Personal Email dataset, Corporate Email Dataset. 105-word summary is computed for both the DUC datasets for evaluations. While 50% summarization was computed for Enron dataset. The pre-processing steps followed are (i) Sentence Segmentation (ii) Removal of stop words (iii) POS tagging and (iv) Stemming.
To evaluate the summarization results, ROUGE package [17] was used. We have used the following ROUGE metrics: (i) ROUGE-1 (1-gram measure), (ii) ROUGE-2 (2-gram measure), (iii) ROUGE-L (Longest common subsequence measure), (iv) ROUGE-SU* (Skip bigram with 1-gram measure). The results were compared with abstracts created by human experts. We have experimented on several Information System and evaluated their effectiveness for Extractive Summarization. The parameters used in evaluations are w1 = 0.1 and w2 = 0.9. The following Rough Set based systems were evaluated: Span: Here the universe is collection of all sentences and the features taken are nouns, adjectives, adverbs and verbs present in the pre-processed text. The Information System is the one-hot representation of the text described by these features. The extract here is a complete spanning set solved using PSO. Span-Similarity: The Information System is the one-hot representation of the text and is created as in (i) above. The extract is a weighted Rough Set based spanning set with binary weights obtained by solving optimization problem given in Section 5.2 using PSO. This system outputs a spanning set with highest similarity with the document. Span-Lexical: Lexical Chains [28] are collection of semantically related words present in the text. Each lexical chain describes the group of similar terms which convey same meaning. It plays a vital role in reducing the dimensionality of the problem. The Information System has top 20% lexical chains as attributes and sentences as the objects of universe. If a sentence i is related to an attribute a the value a(i) is set to one else the value is zero. Once the Information Set is formed, the extract is generated as in Span-Similarity. Span-LSA: Latent Semantic Analysis (LSA) [10] based decomposition is performed on matrix M which is frequency-based representation of the document. The entry M[i, j] of matrix M is equal to frequency of term j in sentence i. The LSA based decomposition of M is M = USVT. V is r x n matrix, r represents the key concepts determined by LSA and n is the number of sentences in text. The matrix VT describe each sentence in terms of each of the key concept described by top lexical chains. The Information System obtained by discretizing the matrix VT obtained by LSA based decomposition. It is evaluated for its use in Span based summarization. The extract is formed by solving optimization problem given in Section 5.2 using PSO. Span-GLOVE: GLOVE [26] vectors are pre-trained vectors which represent words and sentences based on learning huge corpus of textual data. Since a sentence can be represented using GLOVE vectors, experiments are performed to evaluate its effectiveness for Span based summarization. The sentences present in text forms the universe and attributes are the GLOVE features. The Information System is obtained with discretization of GLOVE representation of 100 dimension. The summary generated is spanning set obtained by solving optimization problem given in Section 5.2 using PSO. Baseline: The baseline summarization system developed by Yadav & Chatterjee [34]. Graph: Graph based technique for summarization using PageRank [23] is performed here. LSA: Latent Semantic Analysis based technique for summarization [10]. PSO: Particle Swarm Optimization [7] with fitness function as similarity between document and extract.
The systems are evaluated for comparison of results in terms of ROUGE recall scores in the Tables 3 through 6. Table 3 summarizes the results for DUC2001 dataset. The best results for all the ROUGE parameters evaluated are obtained for Span-Similarity. Table 4 gives the results for DUC2002 dataset. The results for ROUGE-1 is best for Span-Similarity model while ROUGE-L, ROUGE-S and ROUGE-SU are best for Span-Lexical Chain based model. For Enron Personal Email dataset and Enron Corporate Email dataset the best results are obtained for Span similarity which is performing better than all other comparison models. The results presented in Tables 3 to 6 show that the Span-Similarity based methods performed better than all other techniques. It is evident that only spanning set is not enough for efficient summary generation. The reason as mentioned before is the fact that a spanning set is not unique for a given Information System.
Results of Summarization for DUC2001 Single Document Dataset
Results of Summarization for DUC2001 Single Document Dataset
Results of Summarization for DUC2002 Single Document Dataset
Results of Summarization for Enron Personal Single Document Dataset
Results of Summarization for Enron Corporate Single Document Dataset
In this paper, novel Rough Set based uncertainty measure called span is proposed. Further, special Rough subsets of universe called spanning sets and complete spanning sets are defined and their properties are analyzed. There are various practical applications of span and spanning set. Here, the concepts are applied to problem of Single Document Extractive Text Summarization where the task reduces to finding suitable optimal spanning sets. The results are analyzed in terms of ROUGE scores for standard datasets for single document summarization viz. DUC2001, DUC2002, Enron Corporate Email dataset and Enron Personal Email dataset. It follows from our observations that the proposed Rough Set-based Span- Similarity is performing best among all the proposed models and experimented models of Text Summarization. We are working towards the extension of this approach to Multi Document Summarization.
