Abstract
Most extractive multi-document summarization (MDS) methods relies on extraction of content relevant sentences ignoring sentence relationships. In this work, we propose a unified framework for extractive MDS that also considers sentence relationships. We argue that adding a sentence to the summary increases summary score by relevance score of the new sentence plus some additional score which depends on the relationships of new sentence with other summary sentences. The quantification of additional score depends on how coherent the new sentence is with respect to the existing sentences in the summary. Simultaneously, some score is decreased from the summary score due to the redundancy which depends on overlap between new and existing summary sentences. To find the exact solution, sentence extraction problem is modeled as integer linear problem. The sentence relevance score is found using content and surface features of the sentence using topic model and regression framework. To find the relative coherence score, transition probabilities in the entity grid model are used. Redundancy between sentences is found using support vector regression that uses sentence overlapping features. The proposed method is evaluated on DUC datasets over query based multi-document summarization task. DUC 2006 dataset is used as training and development set for tuning parameters. Experimental results produce ROUGE score comparable to the state-of-the-art methods demonstrating the effectiveness of the proposed method.
Introduction
Extractive text summarization systems focus on extracting set of important sentences that optimizes various objectives of text summarization - relevance, coverage and coherence. Relevance is achieved by considering only those sentences (or concepts, bi-grams etc.) that are important or relevant to the query. Coverage of the summary is often achieved by minimizing redundancy. Automatic summaries are often evaluated for coverage using ROUGE [1] that measures the overlap of n-grams between system and reference summaries. There is trade-off between coverage and coherence as maximizing one reduces other. For this reason, most of the work in text summarization are either coverage focused or coherence focused. Coverage focused systems aim to maximize coverage for improved ROUGE score ignoring coherence and coherence focused systems improve coherence not focusing on ROUGE. In this work, we have proposed a summarization method that considers both the coverage and coherence to generate extractive coherent summaries for query focused text summarization.
There has been much work on coherence focused text summarization that uses complex Rhetorical Structure Theory based discourse analysis [2]. But those work which model both the coverage and coherence use topic models or graph based approach for modeling coherence. In [3, 4], coherence of a sentence is found with respect to all sentences in the dataset. Using such coherence measure, final summary may have sentences with high coherence score but may not be locally coherent with other sentences which are included in the summary. In [3], entity graph based on entity grid model [4] is used to find coherence score of sentences. The coherence score of a sentence is defined as out-degree of sentence node in entity graph, which assign coherence score with respect to all sentences in the dataset. In [4], topic based coherence is used for sentences by summing topic specific word probabilities which improves topical coherence but may not improve local coherence in the final summary.
In this paper, coherence score of summary is found using relative coherence score of sentences included in the summary, which guarantees that only coherent sentences will be included in the sentences. Relative coherence score of sentences are found using entity transition probabilities in the entity grid model [5]. Entity grid model has been used for coherence evaluation and improving coherence in coherence focused summarization systems [6]. We use entity grid model to find relative coherence score and integrate with relevance and redundancy score to find summary which are both locally coherent and cover the important concepts.
One of the important goal of summarization is to include only most relevant sentences in the summary. Relevance of a sentence is calculated in different ways in different systems. On the basis of method used for finding relevance the systems can be categorized in frequency based, regression based, topic model based [7]. For finding the relevance score of a sentence, we use an integrated approach that use both regression based and topic model based method. Regression based method use various sentence features to rank sentences and topic model based method use topic distributions to find frequent summary words. Regression based method use certain indicative features like sentence position, length, entities which are not used by topic model based method. Moreover, topic model based approach finds all important topics whereas some important topics may be missed by regression based method. We conduct experiments on the evaluation datasets and observe that on average 15% more relevant (with high ROUGE score) sentences are extracted in top 25 sentences in the integrated model than only in regression based or topic model based approach. We calculate regression score using SVR model and topic score using enriched two-tiered topic model and take weighted average of the two normalized score as final relevance score of the sentence.
Redundancy is considered in existing summarization systems in mainly two ways. First, redundancy is minimized by maximizing coverage of bi-grams or concepts and this approach is used in solution which finds exact solution using some optimization algorithm like integer linear programming. Second approach is used in maximal marginal approach based summarization methods where sentences are added to the summary in a greedy way. Cosine similarity between the sentence and current summary is calculated and is used to decide if this sentence can be added to the summary by comparing the cosine similarity with a fixed threshold similarity value. Both these ways are different from the method used to model the redundancy. We model redundancy in the same way as used to model relevance of sentences unlike other redundancy aware methods which models relevance and redundancy using different methods. We define redundancy of the summary as sum of relative redundancy of sentences included in summary. Relative redundancy of a sentence is calculated using the same method that is used to calculate the relevance of that sentence i.e. weighted average of regression based score and topic model based score.
The contribution of the paper is summarized as: The proposed method model the redundancy term in a novel way by using the same model which is used to predict relevance score unlike other methods which use different models. In the proposed method, both regression model and topic model is used for relevance score prediction for sentences, which improves the extraction of relevant sentences. The proposed method perform comparable with the state of the art summarization methods on DUC datasets using ROUGE-1 recall results.
In the next section, we discuss related work. Section 3 describes the proposed model. In section 4, we present experimental results and discussion. Finally, section 5 describes conclusion and future work.
Related work
Extractive summarization systems works on scoring sentences based on some intermediate representation, which indicates importance of sentences. After scoring, sentences are selected maximizing total importance and/or coherence and minimizing redundancy in the final summary [7].
Coherence modelling in summarization systems
Most summarization systems either does not consider coherence or consider it only after sentence selection in a sentence reorganization task. Better coherent summaries are produced when coherence is also considered during sentence selection. Topical coherence is achieved by ensuring that summary sentences belongs to the same topic. Neural Networks [8] and probabilistic topic models [4, 9] are usually used to capture semantically correlated concepts to identify coherent sentences.
Finer level of coherence is achieved by discourse analysis exploring rhetorical relations among text spans. Christensen et al. [10] builds a system G-FLOW which uses discourse graph considering discourse markers, deverbal noun reference, co-referent mentions to find coherence score which is combined with salience score of the summary. G-FLOW summaries are more cohesive but has less ROUGE score due to not focusing on coverage which is evaluated by ROUGE. Barzilay and Lapata [5] uses entity grid representation to learn discourse. Entity grids are used as feature vectors and machine learning method is used for sentence ordering and summary coherence scoring. In [6], authors uses entity transition probabilities in entity grid to optimize global and local coherence for sequence of summary sentences. They modeled optimized sequence finding as weighted longest path which is solved using integer linear programming. Authors in [11] employs a one-mode projection of bipartite graph of sentences and entities to find local sentence coherence of a sentence as its out-degree.
Modeling importance and redundancy
Peyrard [12] presented a theoretical model for summarization describing relevance, informativeness and redundancy. Like other salience based systems, he did not consider coherence. Informativeness measure is defined for update summarization. He defined redundancy as entropy of the summary S, relevance as cross-entropy between summary S and document set D, informativeness as cross-entropy between summary S and background knowledge K. He argues that KL-divergence between P S and PD/K approximates the importance of sentences considering three measures, where distribution PD/K assigns high probabilities to words occurring in D and not occurring in K. Several works [13, 14] uses KL-divergence for sentence selection because it maximizes coverage and minimizes redundancy [12].
Integer linear programming (ILP) is widely used for sentence selection by optimizing a function of various importance measures. Firstly McDonald [15] proposed the use of ILP for summary sentence selection. In [16], authors optimize importance and redundancy terms using ILP, where importance is inferred from machine learning algorithm and redundancy term is approximated using frequency of common terms in sentences. Gillick [17] optimizes weighted concepts in sentences without considering the redundancy terms because maximizing coverage of concepts automatically minimizes redundant concepts. Authors in [18] uses the bi-grams as concepts, where bi-grams frequency is inferred using regression model. Parveen et al. [11] uses a topic similarity term in ILP formulation for topic based MDS. In another work, Parveen et al. [3] use a bipartite graph of sentences and entities to find relevance score using HITS algorithm and coherence score of a sentence as out-degree in one-mode project of bi-partite graph reflecting global coherence relationships. They did not consider the pairwise sentence relationships for coherence score as used in the proposed work. Wang et al. [6] propose to use ILP for summary sentences sequence optimization that optimizes and importance and coherence score. They consider score of sentence as weighted score of entities present in sentence. They use pairwise sentence coherence score as sum of probabilities of entity transition from one role in first sentence to another role in second sentences using entity-grid model [5].
Topic modeling in summarization
Probabilistic topic models [19] uncover hidden topics, where each topic is a distribution over frequently co-occurring terms. Hierarchical topic models [4, 20] generates a hierarchy of topics in which high level topics provides general concepts in text and low level topics provides specific sub-concepts related to high level topics. Two-tiered topic model (TTM) [4], Enriched TTM (ETTM) [4], Latent Dirichlet Co-clustering [21] uses two levels of topics to capture general concepts in the dataset. In TTM and ETTM, sentences are used as meta-variables defined as distributions over high-level topics which allows to directly find sentences related to general concepts. Document structure has been considered in topic modeling to model the word affinities to different parts of text i.e. phrase, sentences, paragraphs, document etc. In HIERSUM [13], three content distributions are found at three levels- sentence, document and document set. LDCC [21] and SenLDA [20] associates and model a distribution with each paragraph and sentence respectively. The topic distributions are used for sentence scoring and selection for extractive MDS using KL-divergence, average topic probabilities, regression etc.
Proposed model
The goal of an extractive MDS system is to select a set of summary sentences that maximizes the overall summary score. Summary score generally includes relevance score and diversity (or redundancy) score of selected summary sentences. The diversity (or redundancy) term is calculated independent of the relevance term. In both greedy and ILP based approach, the diversity term is modeled as the weighted coverage of concepts (usually bi-grams) and redundancy term is modeled as similarity (E.g. cosine similarity) among selected sentences. Some works have also considered summary coherence, which is also calculated independent of relevance and diversity scores.
In the proposed model, summary score is modeled using individual and relative scores of sentences included in the summary. Relative score of a sentence depends on other sentences already included in the summary. Final score of a sentence may be lower than its individual score because of the redundancy. On the other hand, final score of a sentence may be higher than its individual score because of its high coherence with existing summary sentences. When a new sentence is added to the summary, its final score is considered instead of only its individual score. The individual score of a sentence depends on some individual sentence features like position in the document, length, query matching etc. Whereas, relative score is also affected by sentence relations like redundant and coherent terms.
The summary score Score
sum
in the proposed model is evaluated as follows:
To find the exact relative redundancy score of a sentence with respect to all summary sentences, one should consider overlapping score of all possible c (< Number of summary sentences) sentence combinations according to principle of inclusion and exclusion 1 and overlapping score with exactly one sentence should be subtracted and overlapping scores with exactly two sentences should be added and and so on. Here, we take an approximation which perform good for the summarization task because we observe that the number of summary sentences is small and there are a few sentence combinations of size 3 or more with overlapping common terms. For this reason, we ignore considering all sentence combinations of size 4 or more. For sentence combinations of size 3, we have subtracted some score from sentence combination of size 2 instead of adding some score for sentence combinations of size 3. Constant b (<1) represents this reduction in equation 2.
For finding S I and S R scores, we have used support vector regression model, topic model and a term weighting scheme. For S C score, entity-grid based probability model is used. After having obtained relative sentence scores for each sentence, three summarization methods- TopSentences, MMR based and ILP based are used for generating summary.
Next, we explain how each component of S (s i |sum) is obtained and how summaries are generated using three summary generated methods in the following sub-sections.
The sentence individual score is modeled as sum of its content word importance score. In this work, content word set is defined as set of all words in the document set except stopwords. To find the word importance score, we have used weighted average of two scores- (1) score obtained from a support vector regression (SVR) model [22], (2) score obtained from Enriched Two-level topic model [4]. These scores have been normalized before taking the weighted average. Next, we explain how these two scores are obtained in following paragraphs.
SVR model for relevance score. Support vector regression (SVR) learns a function from training data that is used to predict the continuous label for new data. SVR has earlier been used for predicting sentence score in text summarization task. SVR model takes a set of sentence features and predict the sentence score. We user ε - SVR with RBF kernel, which is able to learn non-linear function.
For generating instances for training SVR model, certain individual features and sentence score are identified for each sentence in the training dataset. We consider Rouge R1 score as training label as it has been used earlier for sentence score regression. We identify one new feature which is never used for sentence score prediction- sentence-query semantic overlap using fuzzy bag of words. We explain only this new feature as other features have earlier been used and are self-explanatory. Earlier work have used WordNet [23] based semantic similarity between query and sentences. Sentence-query semantic overlap captures semantic relationships between sentence and query. It is modeled using fuzzy bag of word (FBOW) model [24], which is defined as follows:
Where t and q are sentence and query context words respectively;
SVR Features for individual sentence relevance score
Once sentences have been ranked using SVR model, we use top 10% sentences in the document to find the word importance score. All the words which appear in the top 10% sentences are assigned score proportional to their frequency. Remaining words are assigned a smoothing score equal to a small fixed value. These word importance scores are biased towards the top 10% sentences found using SVR model.
ETTM model for relevance score. Regression model presented in previous subsection does not guarantee the coverage of important topics of documents. Most of the top sentences may belong to a superior topic leading to poor topical coverage of document set. Enriched two-tiered topic model [4] uses an unsupervised probabilistic approach to model general concepts hidden in documents to generate topically coherent extractive summaries. We use a modified ETTM [25] to find topically frequent general words, which we define as having high probabilities with respect to high level topics in ETTM. In the modified ETTM, textRank scores are used to restrict the set of sentences from which randon variable x is drawn. In the generative process of the ETTM, for each word w ij of sentence s i of the document d, variable x ij is drawn from a binomial distribution. If x ij is 0, it is not related to the query and word w ij is sampled from a background distribution θ. If x ij is sampled as 1, it is query related and word w ij is sampled from three level hierarchy. The variable y ij is sampled uniformly from those sentences containing word w ij . Each sentence y is associated with multinomial distribution θ y over K1 high level topics. A high level topic H is sampled from θ y . Each high level topic h is associated with a multinomial distribution θ h over K2 low level topics. A low level topic T is sampled from θ h . Each low level topic t has a multinomial distribution φ t over W vocabulary words. Finally, word w ij is sampled from φ t . The sampling distributions for query related and unrelated words are shown below for completion. All the counts and samples are same as defined in [4].
We use word probabilities w. r. t. high level topics to find word importance score from ETTM. Let p (w/h) is the probability of word w in high level topic h and H is the number of high level topics. We use BDC weighting scheme [26] to demote domain specific stopwords which may appear in several topics as high probability word. The word score W
ettm
is calculated as
To find the relative reduction in score of a sentence with respect to another sentence, we use weighted average of regression based score and ETTM score of common content words of the two sentences. ETTM reduction score is the sum of ettm scores of common content words. The method for finding regression based reduction score is described in the following paragraph.
SVR model for relative redundancy score. Since Rouge score is used as individual sentence relevance score for training in sub-section 3.1, we use reduction in Rouge score as redundancy score relative to another sentence. ε - SVR with RBF kernel is used for regression. We identify five sentence relation features for training. Ren et. al [27] have also used similar regression framework using common sentence features. They used multilayer perceptron for regression for predicting relative Rouge score of a sentence with respect to another sentence. Instead, we use a SVR based regression model to predict the reduction in Rouge score of a sentence with respect to another sentence. Reduction in Rouge score of sentence s1 with respect to sentence s2 is obtained as follows:
This reduction score is used as label for instances in training data. All the sentence relation features used in SVR model and thier description are listed in Table 2. Note that Reds1|s2 is same as Reds2|s1. We take the average of Reds1|s2 and Reds2|s1 as reduction in Rouge score of s1 with repect to s2.
SVR Features for relative redundancy score
Coherence score is combined with relevance score in [3, 10], but they have used absolute coherence score for each sentence. In [11], coherence score of a sentence is defined as out-degree of sentence node in a graph obtained from entity-grid model. In [10], a discourse graph is used to find absolute sentence coherence graph. Our method is different from these in that we consider relative coherence score of a sentence with respect to another sentence. We use entity-grid model for obtaining relative coherence score but in a different way from [11]. Our method of obtaining relative coherence score is similar to [6], in which authors have used relative coherence gain in sequence optimization for coherence-focused summarization.
In entity-grid model, each entity in the sentence is assigned a state. We use four sates S, O, X, and - for subject, object, other and absent respectively. Two sentences are considered coherent if they have some common entity. Degree of coherence depends on number and type of entities shared by those sentences. Relative coherence score of sentence s1 with respect to s2 is defined as follows:
We use three summarization methods explained below-
TopRank Summarization. In this greedy approach, we assign absolute relevance score to each sentences using the method defined in section 3.1 using weighted average of SVR regression score and ETTM score. Sentences are then ranked according to decreasing relevance score. Top sentences are selected from ranked list iteratively until length limit is reached.
MMR Summarization. This method use maximal Marginal Relevance [28] to reduce redundancy. Both absolute and relative relevance score are used in this greedy method. Firstly, the sentence having highest absolute score is included in the summary. After that, all the remaining sentences s i are assigned new scores as S I (s i ) - b . ∑s j ∈SS R (s i |s j ). Sentence having highest new score is added to the summary. This process is repeated until required summary length is achieved.
ILP based Summarization. Integer Linear Programming (ILP) based method is used to find exact solution. ILP method consider absolute relevance score and relative coherence and redundancy scores in objective function. In the objective function, order of summary sentences is not considered. Coherence of summary is defined as sum of relative coherence score for all pair of summary sentences. The objective function and constraints for ILP formulation is as follows:
L is the summary length limit. l i is the length of sentence i. Binary variable x i indicates that sentence i is included in the summary. Binary variable y ij indicates that both sentences i and j are included in the summary. Summary score is the sum of individual and relative scores of all sentences included in the summary. First constraint states that variable x and y are binary variables. Second constraint restrict the summary length up to maximum limit L. Last three constraint ensures that if y ij is 1 then both x i and y j must be 1.
In this section, we describe how the experimental settings are tuned and results are obtained.
Datasets
For the evaluation of proposed models, we use standard DUC datasets 4 over query focused multi-document summarization task. DUC 2006 dataset is used for SVR training and models are evaluated over DUC 2005 and DUC 2007 datasets. Each document cluster has a query statement which describes the details of the summary to be obtained. Each document cluster has four or nine reference summaries written by human experts.
Experiment settings
For both the SVR models, java library of LIBSVM is used 5 . The parameters C and γ are set using grid based search. For ETTM, Gibbs sampling chain is run for 2000 iterations with first 1000 iterations as burn-in. Samples are collected after every 100 iterations and averaged samples are used for ettm score calculations. All hyper-parameters are set using grid-based approach. For taking the weighted average of regression based score and ettm score, high weight is assigned to the regression based score as for almost all dataset clusters, on average 75% top sentences are fetched in the top 50 sentences set. Regression based score and ettm score are multiplied with 0.65 and 0.35 respectively. The value of constant b in equations 2 and 10 is set to 0.85.
For MMR summarization, we conduct experiments using two variants. In the first variant, called MMR-SVR, only SVR regression scores are used for calculation of both individual and relative sentence scores. In the second variants, called MMR-SVRTM, both SVR regression and ETTM topic model scores are used for calculation of both individual and relative sentence scores.
Evaluation criterion
The DUC2005 and DUC2007 task are to generate the maximum 250 words query focused summary for each document cluster. Standard DUC evaluation metric ROUGE [1] is used, which measure recall over n-gram statistics from a system generated summary against a set of human generated summary. Rouge-1 (recall against unigrams) and Rouge-2 (recall against bigrams) results with stop-words are reported. For the evaluation of coherence, we use the sum of coherence score S C of all summary sentences.
Results and discussion
Rouge-1 and Rouge-2 results for DUC2005 and DUC2007 datasets are shown in Table 3. For other methods, only those results are shown in table which are provided in respective papers. Since ETTM is a probabilistic topic model, the average results of recall over several experiments are reported.
Rouge-1 and Rouge-2 results for DUC2005 and DUC2007 datasets
Rouge-1 and Rouge-2 results for DUC2005 and DUC2007 datasets
Rouge results in Table 3 shows that our approach is comparable with state of the art summarization methods. Our ILP method produces slightly less score than our MMR method because of inclusion of coherence measure in ILP formulation which increase coherence score at the cost of coverage. MultiMR [29] is a graph based manifold ranking method that consider both with-in document and cross-document sentence relationships. NCBsum-A [30] is coverage focused summarization method and produces better rouge scores than our method. We observe that our regression model MMR-SVR perform slightly worse than SVR [31]. Our model MMR-SVRTM perform better than MMR-SVR which shows that integrating the regression model with topic model improves the coverage results.
In this work, we have proposed and implemented an extractive text summarization method that simultaneously model coverage and coherence with some novel approaches. Regression based and topic model based summarization methods are integrated to find the sentence relevance score. Redundancy score is modeled in the same framework which is used for finding relevance score unlike other methods. Local coherence is used in the optimization instead of global coherence. The proposed system achieves comparable results in Rouge evaluation with state of the art methods which model both coverage and coherence. The summary produced are also better in coherence measure.
Our regression based model underperformed slightly in comparison to state of the art. Regression based state of the art achieves 0.4342 rouge-1 recall for DUC2007, whereas our model can achieve only 0.4269 rouge-1 recall. For future work, we plan to tune our regression model to achieve better results. The coherence measure used in this paper does not consider the exact sequence of sentences in the summary. We plan to use regression model based on rouge-2 score as other methods also maximizes rouge-2 score and produce better results for rouge-2 recall. We also plan to include coherence measure based on sentence sequence in ILP formulation.
