Extractive multi-document summarization using relative redundancy and coherence scores

Abstract

Most extractive multi-document summarization (MDS) methods rely on extracting content-relevant sentences while ignoring sentence relationships. In this work, we propose a unified framework for extractive MDS that also considers sentence relationships. We argue that adding a sentence to the summary increases the summary score by the relevance score of the new sentence plus an additional score that depends on the relationships of the new sentence with the other summary sentences. The size of this additional score depends on how coherent the new sentence is with respect to the sentences already in the summary. At the same time, some score is subtracted from the summary score due to redundancy, which depends on the overlap between the new and the existing summary sentences. To find the exact solution, the sentence extraction problem is modeled as an integer linear program. The sentence relevance score is found from content and surface features of the sentence using a topic model and a regression framework. To find the relative coherence score, transition probabilities in the entity grid model are used. Redundancy between sentences is found using support vector regression over sentence overlap features. The proposed method is evaluated on DUC datasets for the query-based multi-document summarization task. The DUC 2006 dataset is used as the training and development set for tuning parameters. Experimental results yield ROUGE scores comparable to state-of-the-art methods, demonstrating the effectiveness of the proposed method.

Keywords

Multi-document summarization, topic model, support vector regression, entity grid, ROUGE

1 Introduction

Extractive text summarization systems focus on extracting a set of important sentences that optimizes the objectives of text summarization: relevance, coverage and coherence. Relevance is achieved by considering only those sentences (or concepts, bi-grams etc.) that are important or relevant to the query. Coverage of the summary is often achieved by minimizing redundancy. Automatic summaries are often evaluated for coverage using ROUGE [1], which measures the overlap of n-grams between system and reference summaries. There is a trade-off between coverage and coherence, as maximizing one reduces the other. For this reason, most work in text summarization is either coverage focused or coherence focused. Coverage-focused systems aim to maximize coverage for an improved ROUGE score while ignoring coherence, and coherence-focused systems improve coherence without focusing on ROUGE. In this work, we propose a summarization method that considers both coverage and coherence to generate coherent extractive summaries for query-focused text summarization.

There has been much work on coherence-focused text summarization that uses complex discourse analysis based on Rhetorical Structure Theory [2]. Work that models both coverage and coherence, however, uses topic models or graph-based approaches for modeling coherence. In [3, 4], the coherence of a sentence is found with respect to all sentences in the dataset. With such a coherence measure, the final summary may contain sentences that have a high coherence score but are not locally coherent with the other sentences included in the summary. In [3], an entity graph based on the entity grid model [5] is used to find the coherence score of sentences. The coherence score of a sentence is defined as the out-degree of the sentence node in the entity graph, which assigns a coherence score with respect to all sentences in the dataset. In [4], topic-based coherence is computed for sentences by summing topic-specific word probabilities, which improves topical coherence but may not improve local coherence in the final summary.

In this paper, the coherence score of a summary is found using the relative coherence scores of the sentences included in the summary, which guarantees that only coherent sentences are included in the summary. The relative coherence scores of sentences are found using entity transition probabilities in the entity grid model [5]. The entity grid model has been used for coherence evaluation and for improving coherence in coherence-focused summarization systems [6]. We use the entity grid model to find the relative coherence score and integrate it with the relevance and redundancy scores to find summaries that are both locally coherent and cover the important concepts.

One of the important goals of summarization is to include only the most relevant sentences in the summary. Relevance of a sentence is calculated in different ways in different systems. Based on the method used for finding relevance, systems can be categorized as frequency based, regression based or topic model based [7]. To find the relevance score of a sentence, we use an integrated approach that uses both regression-based and topic-model-based methods. The regression-based method uses various sentence features to rank sentences, and the topic-model-based method uses topic distributions to find frequent summary words. The regression-based method uses certain indicative features, such as sentence position, length and entities, that are not used by the topic-model-based method. Moreover, the topic-model-based approach finds all important topics, whereas some important topics may be missed by the regression-based method. We conduct experiments on the evaluation datasets and observe that, on average, 15% more relevant sentences (those with high ROUGE score) are extracted among the top 25 sentences by the integrated model than by the regression-based or the topic-model-based approach alone. We calculate the regression score using an SVR model and the topic score using the enriched two-tiered topic model, and take the weighted average of the two normalized scores as the final relevance score of the sentence.

Existing summarization systems handle redundancy in mainly two ways. In the first, redundancy is minimized by maximizing coverage of bi-grams or concepts; this approach is used in methods that find an exact solution using an optimization algorithm such as integer linear programming. The second is used in summarization methods based on maximal marginal relevance, where sentences are added to the summary greedily: the cosine similarity between a sentence and the current summary is calculated and compared with a fixed threshold to decide whether the sentence can be added. Both of these differ from the way we model redundancy. Unlike other redundancy-aware methods, which model relevance and redundancy using different methods, we model redundancy in the same way as the relevance of sentences. We define the redundancy of the summary as the sum of the relative redundancy of the sentences included in the summary. The relative redundancy of a sentence is calculated using the same method that is used to calculate its relevance, i.e. a weighted average of a regression-based score and a topic-model-based score.

The contributions of the paper are summarized as follows:

The proposed method models the redundancy term in a novel way, using the same model that is used to predict the relevance score, unlike other methods, which use different models.

In the proposed method, both a regression model and a topic model are used to predict the relevance score of sentences, which improves the extraction of relevant sentences.

The proposed method performs comparably with state-of-the-art summarization methods on DUC datasets in terms of ROUGE-1 recall.

In the next section, we discuss related work. Section 3 describes the proposed model. In Section 4, we present experimental results and discussion. Finally, Section 5 presents the conclusion and future work.

2 Related work

Extractive summarization systems work by scoring sentences based on some intermediate representation, which indicates the importance of sentences. After scoring, sentences are selected to maximize total importance and/or coherence and to minimize redundancy in the final summary [7].

2.1 Coherence modelling in summarization systems

Most summarization systems either do not consider coherence or consider it only after sentence selection, in a sentence reorganization task. More coherent summaries are produced when coherence is also considered during sentence selection. Topical coherence is achieved by ensuring that summary sentences belong to the same topic. Neural networks [8] and probabilistic topic models [4, 9] are usually used to capture semantically correlated concepts in order to identify coherent sentences.

A finer level of coherence is achieved by discourse analysis that explores rhetorical relations among text spans. Christensen et al. [10] build a system, G-FLOW, which uses a discourse graph built from discourse markers, deverbal noun references and co-referent mentions to find a coherence score, which is combined with the salience score of the summary. G-FLOW summaries are more cohesive but have lower ROUGE scores because the system does not focus on coverage, which is what ROUGE evaluates. Barzilay and Lapata [5] use an entity grid representation to learn discourse. Entity grids are used as feature vectors, and a machine learning method is used for sentence ordering and summary coherence scoring. In [6], the authors use entity transition probabilities in the entity grid to optimize global and local coherence for a sequence of summary sentences. They model the search for an optimized sequence as a weighted longest path problem, which is solved using integer linear programming. The authors of [11] employ a one-mode projection of a bipartite graph of sentences and entities to find the local coherence of a sentence as its out-degree.

2.2 Modeling importance and redundancy

Peyrard [12] presented a theoretical model for summarization describing relevance, informativeness and redundancy. Like other salience-based systems, it does not consider coherence. The informativeness measure is defined for update summarization. He defined redundancy as the entropy of the summary S, relevance as the cross-entropy between summary S and document set D, and informativeness as the cross-entropy between summary S and background knowledge K. He argues that the KL-divergence between P_S and P_D/K approximates the importance of sentences with respect to the three measures, where the distribution P_D/K assigns high probabilities to words occurring in D and not occurring in K. Several works [13, 14] use KL-divergence for sentence selection because it maximizes coverage and minimizes redundancy [12].

Integer linear programming (ILP) is widely used for sentence selection by optimizing a function of various importance measures. McDonald [15] first proposed the use of ILP for summary sentence selection. In [16], the authors optimize importance and redundancy terms using ILP, where importance is inferred by a machine learning algorithm and the redundancy term is approximated using the frequency of common terms in sentences. Gillick [17] optimizes weighted concepts in sentences without an explicit redundancy term, because maximizing the coverage of concepts automatically minimizes redundant concepts. The authors of [18] use bi-grams as concepts, where bi-gram frequency is inferred using a regression model. Parveen et al. [11] use a topic similarity term in an ILP formulation for topic-based MDS. In another work, Parveen et al. [3] use a bipartite graph of sentences and entities to find the relevance score using the HITS algorithm, and define the coherence score of a sentence as its out-degree in the one-mode projection of the bipartite graph, reflecting global coherence relationships. They do not consider pairwise sentence relationships for the coherence score, as the proposed work does. Wang et al. [6] propose to use ILP for optimizing the sequence of summary sentences with respect to both importance and coherence scores. They define the score of a sentence as the weighted score of the entities present in it. They define the pairwise sentence coherence score as the sum of the probabilities of entity transitions from one role in the first sentence to another role in the second sentence, using the entity-grid model [5].

2.3 Topic modeling in summarization

Probabilistic topic models [19] uncover hidden topics, where each topic is a distribution over frequently co-occurring terms. Hierarchical topic models [4, 20] generate a hierarchy of topics in which high-level topics provide general concepts in the text and low-level topics provide specific sub-concepts related to the high-level topics. The two-tiered topic model (TTM) [4], enriched TTM (ETTM) [4] and Latent Dirichlet Co-clustering (LDCC) [21] use two levels of topics to capture general concepts in the dataset. In TTM and ETTM, sentences are used as meta-variables defined as distributions over high-level topics, which makes it possible to directly find sentences related to general concepts. Document structure has been considered in topic modeling to model word affinities to different units of text, i.e. phrases, sentences, paragraphs, documents etc. In HIERSUM [13], three content distributions are found at three levels: sentence, document and document set. LDCC [21] and SenLDA [20] associate and model a distribution with each paragraph and sentence, respectively. The topic distributions are used for sentence scoring and selection in extractive MDS using KL-divergence, average topic probabilities, regression etc.

3 Proposed model

The goal of an extractive MDS system is to select a set of summary sentences that maximizes the overall summary score. The summary score generally includes the relevance score and the diversity (or redundancy) score of the selected summary sentences. The diversity (or redundancy) term is calculated independently of the relevance term. In both greedy and ILP-based approaches, the diversity term is modeled as the weighted coverage of concepts (usually bi-grams), and the redundancy term is modeled as similarity (e.g. cosine similarity) among the selected sentences. Some works have also considered summary coherence, which is likewise calculated independently of the relevance and diversity scores.

In the proposed model, the summary score is modeled using the individual and relative scores of the sentences included in the summary. The relative score of a sentence depends on the other sentences already included in the summary. The final score of a sentence may be lower than its individual score because of redundancy. On the other hand, it may be higher than its individual score because of high coherence with the existing summary sentences. When a new sentence is added to the summary, its final score is considered instead of only its individual score. The individual score of a sentence depends on individual sentence features such as position in the document, length and query matching, whereas the relative score is also affected by sentence relations, namely the redundancy and coherence terms.

The summary score Score_sum in the proposed model is evaluated as follows: ${Score}_{sum} = \sum_{i} S (s_{i} | sum)$ (1)

$S (s_{i} | sum) = S_{I} (s_{i}) + \sum_{j \neq i} S_{C} (s_{i} | s_{j}) - b \cdot \sum_{j \neq i} S_{R} (s_{i} | s_{j})$ (2) where S_I, S_C and S_R are the individual relevance score, the relative coherence score and the relative redundancy score of a sentence, respectively; sum is the set of summary sentences, and i, j are indices of sentences included in sum. b (<1) is a constant that scales the relative redundancy score.
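As a minimal sketch of Eqs. 1 and 2, the following Python code scores a candidate summary from precomputed scores. The function names, the list-of-lists layout for the pairwise scores and the toy values are illustrative assumptions, not part of the original system; the default b = 0.85 is the value reported in the experiment settings.

```python
def sentence_score(i, summary, s_ind, s_coh, s_red, b=0.85):
    """Eq. 2: individual score + coherence gain - b * redundancy.

    s_ind[i]    : individual relevance score S_I(s_i)
    s_coh[i][j] : relative coherence score S_C(s_i | s_j)
    s_red[i][j] : relative redundancy score S_R(s_i | s_j)
    (hypothetical data layout chosen for this sketch)
    """
    others = [j for j in summary if j != i]
    coherence = sum(s_coh[i][j] for j in others)
    redundancy = sum(s_red[i][j] for j in others)
    return s_ind[i] + coherence - b * redundancy


def summary_score(summary, s_ind, s_coh, s_red, b=0.85):
    """Eq. 1: sum of the final scores of all sentences in the summary."""
    return sum(sentence_score(i, summary, s_ind, s_coh, s_red, b)
               for i in summary)


# Toy example with three sentences (made-up numbers).
s_ind = [0.5, 0.4, 0.3]
s_coh = [[0.0, 0.2, 0.0], [0.2, 0.0, 0.1], [0.0, 0.1, 0.0]]
s_red = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(summary_score([0, 1], s_ind, s_coh, s_red, b=0.5))
```

With these toy values, each of the two selected sentences gains 0.2 from coherence and loses 0.5 × 0.1 from redundancy.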

To find the exact relative redundancy score of a sentence with respect to all summary sentences, one should consider the overlap score of all possible sentence combinations of size c (less than the number of summary sentences) according to the principle of inclusion and exclusion: the overlap score with exactly one sentence should be subtracted, the overlap score with exactly two sentences should be added, and so on. Here, we use an approximation that performs well for the summarization task, because we observe that the number of summary sentences is small and that only a few sentence combinations of size 3 or more have overlapping common terms. For this reason, we ignore all sentence combinations of size 4 or more. For sentence combinations of size 3, instead of adding a score for each such combination, we subtract some score from the combinations of size 2. The constant b (<1) in Eq. 2 represents this reduction.

For finding the S_I and S_R scores, we use a support vector regression model, a topic model and a term weighting scheme. For the S_C score, an entity-grid-based probability model is used. After obtaining the relative sentence scores for each sentence, three summarization methods (TopRank, MMR based and ILP based) are used for generating summaries.

In the following sub-sections, we explain how each component of S (s_i|sum) is obtained and how summaries are generated using the three summary generation methods.

3.1 Individual sentence relevance score S_I

The individual score of a sentence is modeled as the sum of the importance scores of its content words. In this work, the content word set is defined as the set of all words in the document set except stopwords. To find the word importance score, we use a weighted average of two scores: (1) the score obtained from a support vector regression (SVR) model [22], and (2) the score obtained from the enriched two-tiered topic model (ETTM) [4]. These scores are normalized before taking the weighted average. The following paragraphs explain how the two scores are obtained.
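The combination step can be sketched as follows. The weights 0.65 and 0.35 are the values reported in the experiment settings; the choice of min-max normalization, the decision to count each distinct content word once, and all function names are assumptions made for this illustration, since the normalization scheme is not specified.

```python
def min_max(scores):
    """Rescale a word -> score mapping to [0, 1] (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {w: 0.0 for w in scores}
    return {w: (v - lo) / (hi - lo) for w, v in scores.items()}


def combine_word_scores(svr_scores, ettm_scores, w_svr=0.65, w_ettm=0.35):
    """Weighted average of the normalized SVR-based and ETTM-based scores."""
    svr_n, ettm_n = min_max(svr_scores), min_max(ettm_scores)
    words = set(svr_n) | set(ettm_n)
    return {w: w_svr * svr_n.get(w, 0.0) + w_ettm * ettm_n.get(w, 0.0)
            for w in words}


def sentence_relevance(tokens, word_score, stopwords):
    """S_I: sum of importance scores over the sentence's content words.

    Each distinct content word is counted once (an assumption).
    """
    content = {t for t in tokens if t not in stopwords}
    return sum(word_score.get(t, 0.0) for t in content)


# Toy usage with made-up scores.
scores = combine_word_scores({"oil": 3.0, "price": 1.0},
                             {"oil": 0.1, "price": 0.2})
print(sentence_relevance(["the", "oil", "price"], scores, {"the"}))
```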

SVR model for relevance score. Support vector regression (SVR) learns a function from training data that is used to predict a continuous label for new data. SVR has earlier been used for predicting sentence scores in the text summarization task. The SVR model takes a set of sentence features and predicts the sentence score. We use ε-SVR with an RBF kernel, which is able to learn a non-linear function.

To generate instances for training the SVR model, a set of individual features and a sentence score are computed for each sentence in the training dataset. We use the Rouge R1 score as the training label, as it has been used earlier for sentence score regression. We introduce one feature that has not previously been used for sentence score prediction: sentence-query semantic overlap using a fuzzy bag of words. We explain only this new feature, as the other features have been used earlier and are self-explanatory. Earlier work has used WordNet [23] based semantic similarity between query and sentences. Sentence-query semantic overlap captures semantic relationships between the sentence and the query. It is modeled using the fuzzy bag-of-words (FBOW) model [24] and is defined as follows: $QSO = \sum_{t} \sum_{q} {Sim}_{t, q}^{fbow} / | Q |$ (3)

${Sim}_{t, q}^{fbow} = \begin{cases} Cos (W [t], W [q]) & \text{if } Cos (W [t], W [q]) > 0 \\ 0 & \text{otherwise} \end{cases}$ (4)

where t and q are sentence and query context words, respectively; ${Sim}_{t, q}^{fbow}$ is the fuzzy membership function for t and q; and W[t] and W[q] denote the word embeddings of terms t and q. We use the word2vec embeddings provided by Google, which are trained on the Google News corpus of 100 billion words and have a dimensionality of 300. Before extracting features, all document and query sentences are stemmed using the Porter stemming algorithm, and stop-words are removed using a stop-word list. All the sentence features used in the SVR model are listed in Table 1.

Table 1

SVR Features for individual sentence relevance score

Feature	Formulation	Description
Length	|s|/maxlength	maxlength is maximum length of any sentence in dataset.
Position	$1 - \sqrt{(p - 1) / | S |}$	p is position of sentence and |S| is number of sentences in document d.
Document Frequency	∑_t DF_t/|s|	DF_t is document frequency for context word t. |s| is sentence length.
Term Frequency	∑_t TF_t * log(1 + N/n)/|s|	t is context word in the sentence. TF_t is global term frequency of t. n is number of documents having term t and N is total number of documents.
Query overlap	|q_c|/|Q|	|q_c| is number of common terms between query Q and sentence. |Q| is length of query.
Query semantic overlap	$\sum_{t} \sum_{q} {Sim}_{t, q}^{fbow} / | Q |$	t and q are sentence and query context terms respectively. Sim^fbow is semantic similarity between t and q using fbow model.
Title overlap	|t_c|/|T|	|t_c| is number of common terms between document title T and sentence. |T| is length of document title.
Named Entities	$| ne | / N_{ne}^{\max}$	|ne| is number of named entities in sentence. $N_{ne}^{\max}$ is maximum number of named entities in any sentence of dataset.
Stop-word ratio	|sw|/|s|	|sw| is number of stop-words in sentence s.
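The query semantic overlap feature of Eqs. 3 and 4 can be sketched in plain Python as below. The embedding lookup is represented by an ordinary dictionary `emb` with toy 2-dimensional vectors (the paper uses 300-dimensional word2vec vectors), and treating out-of-vocabulary words as having zero similarity is an assumption of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def fbow_sim(t, q, emb):
    """Eq. 4: cosine similarity clipped at zero.

    Words missing from `emb` get similarity 0.0 (assumption).
    """
    if t not in emb or q not in emb:
        return 0.0
    return max(0.0, cosine(emb[t], emb[q]))


def query_semantic_overlap(sentence_terms, query_terms, emb):
    """Eq. 3: sum of fuzzy similarities over all (t, q) pairs, divided by |Q|."""
    if not query_terms:
        return 0.0
    total = sum(fbow_sim(t, q, emb)
                for t in sentence_terms for q in query_terms)
    return total / len(query_terms)


# Toy embeddings (hypothetical values for illustration only).
emb = {"car": [1.0, 0.0], "auto": [1.0, 0.0],
       "tree": [-1.0, 0.0], "road": [0.0, 1.0]}
print(query_semantic_overlap(["car", "tree"], ["auto", "road"], emb))
```

In the toy example only the pair (car, auto) contributes, because negative and zero cosines are clipped to 0.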

Once sentences have been ranked using the SVR model, we use the top 10% of sentences in the document to find the word importance scores. All words that appear in the top 10% of sentences are assigned a score proportional to their frequency. The remaining words are assigned a smoothing score equal to a small fixed value. These word importance scores are therefore biased towards the top 10% of sentences found by the SVR model.
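A minimal sketch of this word scoring step is given below. It assumes that sentences arrive already sorted by predicted SVR score, that "proportional to frequency" means relative frequency within the top sentences, and that the smoothing value is 1e-4; the smoothing value and the function name are hypothetical.

```python
import math
from collections import Counter


def svr_word_importance(ranked_sentences, vocabulary,
                        top_fraction=0.10, smoothing=1e-4):
    """Score words by their relative frequency in the top-ranked sentences.

    ranked_sentences : list of token lists, highest SVR score first
    vocabulary       : iterable of all content words to be scored
    Words absent from the top sentences receive the fixed smoothing score.
    """
    k = max(1, math.ceil(top_fraction * len(ranked_sentences)))
    counts = Counter(t for sent in ranked_sentences[:k] for t in sent)
    total = sum(counts.values())
    return {w: (counts[w] / total if counts[w] else smoothing)
            for w in vocabulary}


# Toy usage: 10 ranked sentences, so only the first one is in the top 10%.
ranked = [["oil", "price", "oil"]] + [["gas"]] * 9
print(svr_word_importance(ranked, {"oil", "price", "gas"}))
```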

ETTM model for relevance score. The regression model presented above does not guarantee coverage of the important topics of the documents. Most of the top sentences may belong to a single dominant topic, leading to poor topical coverage of the document set. The enriched two-tiered topic model [4] uses an unsupervised probabilistic approach to model general concepts hidden in documents in order to generate topically coherent extractive summaries. We use a modified ETTM [25] to find topically frequent general words, which we define as words having high probabilities with respect to the high-level topics in ETTM. In the modified ETTM, TextRank scores are used to restrict the set of sentences from which the random variable x is drawn. In the generative process of the ETTM, for each word w_ij of sentence s_i of document d, a variable x_ij is drawn from a binomial distribution. If x_ij is 0, the word is not related to the query and w_ij is sampled from a background distribution θ. If x_ij is 1, the word is query related and w_ij is sampled from a three-level hierarchy. The variable y_ij is sampled uniformly from those sentences containing word w_ij. Each sentence y is associated with a multinomial distribution θ_y over K₁ high-level topics, and a high-level topic H is sampled from θ_y. Each high-level topic h is associated with a multinomial distribution θ_h over K₂ low-level topics, and a low-level topic T is sampled from θ_h. Each low-level topic t has a multinomial distribution φ_t over W vocabulary words. Finally, word w_ij is sampled from φ_t. The sampling distributions for query-related and unrelated words are shown below for completeness. All counts and samples are the same as defined in [4].

$\begin{aligned} p (H_{k_{1}}, T_{k_{2}}, x = 1 | .) & \propto \frac{n_{d}^{k_{1}} + \alpha_{h}}{n^{d} + \sum_{h} \alpha_{h}} \times \frac{n_{d}^{k_{1} k_{2}} + \alpha_{t}}{n_{h}^{d} + \sum_{t} \alpha_{t}} \times \frac{n_{x}^{k_{1} k_{2}} + \eta}{n^{k_{1} k_{2}} + 2 \eta} \times \frac{n_{k_{1} k_{2} x}^{w} + \beta_{w}}{n_{k_{1} k_{2} x} + \sum_{w'} \beta_{w'}} \\ p (x = 0 | .) & \propto \frac{n_{k_{1} k_{2}}^{x} + \eta}{n_{k_{1} k_{2}} + 2 \eta} \times \frac{\alpha_{w} + n^{w}}{n + \sum_{w'} \alpha_{w'}} \end{aligned}$ (5)

We use word probabilities with respect to high-level topics to find the word importance score from ETTM. Let p (w | h) be the probability of word w in high-level topic h, and let H be the number of high-level topics. We use the BDC weighting scheme [26] to demote domain-specific stopwords, which may appear as high-probability words in several topics. The word score W_ettm is calculated as $W_{ettm} = 1 + \frac{\sum_{h = 1}^{H} \frac{p (w | h)}{\sum_{i = 1}^{H} p (w | i)} \times \log \frac{p (w | h)}{\sum_{i = 1}^{H} p (w | i)}}{\log H}$ (6)
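Eq. 6 can be checked with a short Python sketch. The function takes the probabilities of one word under each high-level topic; terms with zero probability are skipped (using the convention 0 · log 0 = 0), and returning 0.0 when H < 2 or when the word has no probability mass is an assumption made here to avoid division by zero.

```python
import math


def ettm_word_score(topic_probs):
    """Eq. 6: BDC-style score of one word from its per-topic probabilities.

    topic_probs : list of the word's probabilities, one per high-level topic
    Returns a value in [0, 1]: near 0 for words spread evenly over topics
    (e.g. domain-specific stopwords), 1 for words concentrated in one topic.
    """
    num_topics = len(topic_probs)
    total = sum(topic_probs)
    if num_topics < 2 or total == 0.0:
        return 0.0  # degenerate cases, handled by assumption
    neg_entropy = 0.0
    for p in topic_probs:
        f = p / total           # share of the word's mass in this topic
        if f > 0.0:
            neg_entropy += f * math.log(f)
    return 1.0 + neg_entropy / math.log(num_topics)


print(ettm_word_score([0.25, 0.25, 0.25, 0.25]))  # evenly spread word
print(ettm_word_score([0.3, 0.0, 0.0, 0.0]))      # topic-specific word
```

The two calls illustrate the intended effect: a word with equal probability in every topic scores approximately 0, while a word that occurs in a single topic scores 1.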

3.2 Relative redundancy score

To find the relative reduction in the score of a sentence with respect to another sentence, we use a weighted average of the regression-based score and the ETTM score of the content words common to the two sentences. The ETTM reduction score is the sum of the ETTM scores of the common content words. The method for finding the regression-based reduction score is described in the following paragraph.

SVR model for relative redundancy score. Since the Rouge score is used as the individual sentence relevance score for training in sub-section 3.1, we use the reduction in Rouge score as the redundancy score relative to another sentence. ε-SVR with an RBF kernel is used for regression. We identify five sentence relation features for training. Ren et al. [27] have used a similar regression framework with common sentence features; they used a multilayer perceptron to predict the relative Rouge score of a sentence with respect to another sentence. Instead, we use an SVR-based regression model to predict the reduction in the Rouge score of a sentence with respect to another sentence. The reduction in Rouge score of sentence s1 with respect to sentence s2 is obtained as follows: ${Red}_{s1 | s2} = Rouge (s1) + Rouge (s2) - Rouge (s1, s2)$ (7)

This reduction score is used as the label for instances in the training data. All the sentence relation features used in the SVR model and their descriptions are listed in Table 2. Note that, by Eq. 7, Red_s1|s2 is the same as Red_s2|s1. We take the average of the values obtained for Red_s1|s2 and Red_s2|s1 as the reduction in Rouge score of s1 with respect to s2.
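The training label of Eq. 7 can be illustrated with a simplified unigram-recall stand-in for Rouge-1. This is not the official ROUGE toolkit: it uses a single reference, no stemming and no stop-word handling, and it reads Rouge (s1, s2) as the score of the two sentences taken together, which is an assumption of this sketch.

```python
from collections import Counter


def rouge1_recall(candidate_tokens, reference_tokens):
    """Simplified Rouge-1 recall: clipped unigram overlap / reference length."""
    ref = Counter(reference_tokens)
    if not ref:
        return 0.0
    cand = Counter(candidate_tokens)
    overlap = sum(min(count, cand[t]) for t, count in ref.items())
    return overlap / sum(ref.values())


def rouge_reduction(s1, s2, reference):
    """Eq. 7: Rouge(s1) + Rouge(s2) - Rouge(s1, s2)."""
    joint = rouge1_recall(s1 + s2, reference)
    return (rouge1_recall(s1, reference)
            + rouge1_recall(s2, reference)
            - joint)


# Toy example: both sentences cover "oil", so that credit is counted twice
# when the sentences are scored separately.
reference = ["oil", "price", "rise", "today"]
s1 = ["oil", "price"]
s2 = ["oil", "rise"]
print(rouge_reduction(s1, s2, reference))
```

Here each sentence alone recalls 2 of 4 reference unigrams, the pair recalls 3 of 4, and the reduction equals the recall contributed by the shared word.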

Table 2

SVR Features for relative redundancy score

Feature	Formulation	Description
Overlap Precision	|s1 ∩ s2|/|s1|	|s1 ∩ s2| is number of context words common to both s1 and s2. |s1| is length of sentence s1.
Overlap Recall	|s1 ∩ s2|/|s2|	|s2| is length of sentence s2.
Overlap Document Frequency	∑_{t∈s1∩s2} DF_t/|s1 ∩ s2|	DF_t is document frequency for context word t. |s1 ∩ s2| is number of common context words.
Overlap Term Frequency	∑_{t∈s1∩s2} TF_t/|s1 ∩ s2|	TF_t is global term frequency of t.
Overlap Cosine Similarity	Cos (TF (s1) , TF (s2))	TF (s) is term frequency vector for sentence s.
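The five features of Table 2 can be computed as in the sketch below. Sentences are given as lists of context words; `df` and `tf` are hypothetical dictionaries holding document frequency and global term frequency. Measuring sentence length in tokens and returning 0.0 when there is no overlap are assumptions of this illustration.

```python
import math
from collections import Counter


def overlap_features(s1, s2, df, tf):
    """Return the five Table 2 features for the sentence pair (s1, s2)."""
    common = set(s1) & set(s2)
    n_common = len(common)

    precision = n_common / len(s1) if s1 else 0.0
    recall = n_common / len(s2) if s2 else 0.0
    avg_df = sum(df.get(t, 0) for t in common) / n_common if n_common else 0.0
    avg_tf = sum(tf.get(t, 0) for t in common) / n_common if n_common else 0.0

    # Cosine similarity between the term-frequency vectors of the sentences.
    v1, v2 = Counter(s1), Counter(s2)
    dot = sum(v1[t] * v2[t] for t in common)
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    cos = dot / norm if norm else 0.0

    return {"overlap_precision": precision, "overlap_recall": recall,
            "overlap_df": avg_df, "overlap_tf": avg_tf, "overlap_cosine": cos}


# Toy usage with made-up corpus statistics.
feats = overlap_features(["oil", "price", "rise"], ["oil", "price"],
                         df={"oil": 4, "price": 2}, tf={"oil": 10, "price": 6})
print(feats)
```

Note that precision and recall are not symmetric in s1 and s2, so the feature vector for (s1, s2) generally differs from the one for (s2, s1).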

3.3 Relative coherence score

Coherence score is combined with relevance score in [3, 10], but these works use an absolute coherence score for each sentence. In [11], the coherence score of a sentence is defined as the out-degree of the sentence node in a graph obtained from the entity-grid model. In [10], a discourse graph is used to find an absolute sentence coherence score. Our method differs from these in that we consider the relative coherence score of a sentence with respect to another sentence. We use the entity-grid model to obtain the relative coherence score, but in a different way from [11]. Our method of obtaining the relative coherence score is similar to that of [6], in which the authors use relative coherence gain in sequence optimization for coherence-focused summarization.

In the entity-grid model, each entity in a sentence is assigned a state. We use four states, S, O, X and -, for subject, object, other and absent, respectively. Two sentences are considered coherent if they have some entity in common. The degree of coherence depends on the number and type of entities shared by the sentences. The relative coherence score of sentence s_i with respect to s_j is defined as follows: $S_{C} (s_{i} | s_{j}) = \sum_{e \in s_{i} \cap s_{j}} p_{e} ({es}_{i}, {es}_{j})$ (8) where e is an entity shared between s_i and s_j. The transition probability p_e is calculated as follows: $p_{e} ({es}_{i}, {es}_{j}) = \frac{n_{{es}_{i}, {es}_{j}}^{e}}{n - k}$ (9) where $n_{{es}_{i}, {es}_{j}}^{e}$ is the number of times entity e appeared in sentence s_i with entity state es_i and in sentence s_j with entity state es_j, n is the total number of sentences and k is the number of documents in the document set. The coherence score is therefore higher if several entities are shared between the sentences and their transition probabilities are high. Usually, the transition probability from subject state to object state is higher than other transition probabilities.
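One possible reading of Eqs. 8 and 9 is sketched below. It assumes that transition counts are collected over adjacent sentence pairs in the source documents (there are n - k such pairs for n sentences in k documents) and that each sentence is already annotated as a mapping from entity to role; the parser that produces these roles is outside the scope of the sketch, and all names are illustrative.

```python
from collections import Counter


def transition_probs(documents):
    """Eq. 9: per-entity role-transition probabilities.

    documents : list of documents, each a list of sentences, each sentence a
                dict mapping entity -> role in {'S', 'O', 'X'}
    Returns {(entity, role_in_first, role_in_second): probability}.
    """
    n = sum(len(doc) for doc in documents)   # total number of sentences
    k = len(documents)                       # number of documents
    pairs = n - k                            # adjacent sentence pairs
    counts = Counter()
    for doc in documents:
        for first, second in zip(doc, doc[1:]):
            for e in set(first) & set(second):
                counts[(e, first[e], second[e])] += 1
    if pairs <= 0:
        return {}
    return {key: c / pairs for key, c in counts.items()}


def relative_coherence(s_i, s_j, probs):
    """Eq. 8: sum of transition probabilities over shared entities."""
    return sum(probs.get((e, s_i[e], s_j[e]), 0.0)
               for e in set(s_i) & set(s_j))


# Toy document set with hand-assigned roles.
docs = [[{"oil": "S"}, {"oil": "O", "opec": "S"}, {"opec": "S"}],
        [{"oil": "S"}, {"oil": "O"}]]
probs = transition_probs(docs)
print(relative_coherence({"oil": "S", "opec": "S"},
                         {"oil": "O", "opec": "S"}, probs))
```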

3.4 Summarization methods

We use the three summarization methods explained below.

TopRank Summarization. In this greedy approach, we assign an absolute relevance score to each sentence using the method defined in Section 3.1, i.e. the weighted average of the SVR regression score and the ETTM score. Sentences are then ranked by decreasing relevance score. Top sentences are selected from the ranked list iteratively until the length limit is reached.

MMR Summarization. This method uses Maximal Marginal Relevance [28] to reduce redundancy. Both absolute and relative scores are used in this greedy method. First, the sentence with the highest absolute score is included in the summary. After that, each remaining sentence s_i is assigned a new score S_I (s_i) - b . ∑_{s_j∈S}S_R (s_i|s_j), where S is the set of sentences already in the summary. The sentence with the highest new score is added to the summary. This process is repeated until the required summary length is reached.
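The greedy MMR procedure can be sketched as follows. Scores are assumed to be precomputed; skipping sentences that would exceed the length limit and breaking ties by the lowest sentence index are assumptions of this sketch, as is the function name.

```python
def mmr_summary(lengths, s_ind, s_red, max_len, b=0.85):
    """Greedy selection with score S_I(s_i) - b * sum_j S_R(s_i | s_j).

    lengths[i]  : length of sentence i (e.g. in words)
    s_ind[i]    : individual relevance score S_I(s_i)
    s_red[i][j] : relative redundancy score S_R(s_i | s_j)
    Returns the indices of the selected sentences in selection order.
    """
    selected, used = [], 0
    candidates = set(range(len(s_ind)))
    while candidates:
        # Only sentences that still fit in the remaining length budget.
        fitting = [i for i in sorted(candidates)
                   if used + lengths[i] <= max_len]
        if not fitting:
            break
        # With an empty summary this reduces to the highest absolute score.
        best = max(fitting,
                   key=lambda i: s_ind[i] - b * sum(s_red[i][j]
                                                    for j in selected))
        selected.append(best)
        used += lengths[best]
        candidates.remove(best)
    return selected


# Toy example: sentence 1 is relevant but highly redundant with sentence 0.
s_ind = [0.9, 0.85, 0.5]
s_red = [[0.0, 0.8, 0.0], [0.8, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(mmr_summary([10, 10, 10], s_ind, s_red, max_len=20))
```

In the toy example the second pick is sentence 2 rather than sentence 1, because the redundancy penalty lowers the score of sentence 1 below that of sentence 2.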

ILP based Summarization. An Integer Linear Programming (ILP) based method is used to find the exact solution. The ILP method considers the absolute relevance score and the relative coherence and redundancy scores in the objective function. The order of summary sentences is not considered in the objective function. Coherence of the summary is defined as the sum of relative coherence scores over all pairs of summary sentences. The objective function and constraints of the ILP formulation are as follows:

$\begin{aligned} \text{Maximize } {Score}_{sum} = & \sum_{i = 1}^{N} x_{i} \cdot S_{I} (s_{i}) + \sum_{i = 1}^{N} \sum_{j \neq i} y_{ij} \cdot \left( S_{C} (s_{i} | s_{j}) - b \cdot S_{R} (s_{i} | s_{j}) \right) \\ \text{s.t. } & x_{i}, y_{ij} \in \{0, 1\} \quad \forall i, j \\ & \sum_{i} x_{i} \cdot l_{i} \leq L \\ & y_{ij} - x_{i} \leq 0 \\ & y_{ij} - x_{j} \leq 0 \\ & x_{i} + x_{j} - y_{ij} \leq 1 \end{aligned}$ (10)

L is the summary length limit and l_i is the length of sentence i. The binary variable x_i indicates whether sentence i is included in the summary. The binary variable y_ij indicates whether both sentences i and j are included in the summary. The summary score is the sum of the individual and relative scores of all sentences included in the summary. The first constraint states that x and y are binary variables. The second constraint restricts the summary length to the maximum limit L. The last three constraints ensure that y_ij is 1 if and only if both x_i and x_j are 1.
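To make the objective of Eq. 10 concrete, the sketch below finds the exact optimum by exhaustive enumeration of sentence subsets instead of calling an ILP solver. This brute-force search is a stand-in that is only feasible for a handful of sentences; it is not the solver used in the paper. Enumerating subsets sets y_ij = x_i · x_j implicitly, which is exactly what the last three constraints enforce.

```python
from itertools import combinations


def exact_summary(lengths, s_ind, s_coh, s_red, max_len, b=0.85):
    """Maximize the Eq. 10 objective by enumerating all feasible subsets.

    Returns (selected indices, objective value). The empty summary with
    score 0.0 is the baseline. Exponential in the number of sentences.
    """
    n = len(s_ind)
    best_set, best_score = (), 0.0
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            # Length constraint: sum_i x_i * l_i <= L
            if sum(lengths[i] for i in subset) > max_len:
                continue
            score = sum(s_ind[i] for i in subset)
            # Pairwise terms are active only when both sentences are chosen.
            score += sum(s_coh[i][j] - b * s_red[i][j]
                         for i in subset for j in subset if i != j)
            if score > best_score:
                best_set, best_score = subset, score
    return list(best_set), best_score


# Toy example (made-up scores): sentences 0 and 1 are redundant,
# sentences 0 and 2 are mildly coherent.
s_ind = [0.9, 0.85, 0.5]
s_coh = [[0.0, 0.0, 0.1], [0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
s_red = [[0.0, 0.8, 0.0], [0.8, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(exact_summary([10, 10, 10], s_ind, s_coh, s_red, max_len=20))
```

For realistic cluster sizes the same objective and constraints would be passed to an ILP solver rather than enumerated.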

4 Experimental settings and results

In this section, we describe how the experimental settings are tuned and results are obtained.

4.1 Datasets

For the evaluation of the proposed models, we use the standard DUC datasets for the query-focused multi-document summarization task. The DUC 2006 dataset is used for SVR training, and the models are evaluated on the DUC 2005 and DUC 2007 datasets. Each document cluster has a query statement that describes the details of the summary to be obtained. Each document cluster has four or nine reference summaries written by human experts.

4.2 Experiment settings

For both SVR models, the Java library of LIBSVM is used. The parameters C and γ are set using grid search. For ETTM, the Gibbs sampling chain is run for 2000 iterations, with the first 1000 iterations as burn-in. Samples are collected after every 100 iterations, and the averaged samples are used for ETTM score calculations. All hyper-parameters are set using a grid-based approach. When taking the weighted average of the regression-based score and the ETTM score, a higher weight is assigned to the regression-based score because, for almost all dataset clusters, on average 75% of the top sentences are retrieved within the top 50 sentences it ranks. The regression-based score and the ETTM score are multiplied by 0.65 and 0.35, respectively. The value of the constant b in Eqs. 2 and 10 is set to 0.85.

For MMR summarization, we conduct experiments using two variants. In the first variant, called MMR-SVR, only SVR regression scores are used to calculate both individual and relative sentence scores. In the second variant, called MMR-SVRTM, both SVR regression and ETTM topic model scores are used to calculate both individual and relative sentence scores.

4.3 Evaluation criterion

The DUC 2005 and DUC 2007 tasks are to generate a query-focused summary of at most 250 words for each document cluster. The standard DUC evaluation metric ROUGE [1] is used, which measures recall over n-gram statistics of a system-generated summary against a set of human-generated summaries. Rouge-1 (recall against unigrams) and Rouge-2 (recall against bigrams) results with stop-words are reported. For the evaluation of coherence, we use the sum of the coherence scores S_C of all summary sentences.

4.4 Results and discussion

Rouge-1 and Rouge-2 results for the DUC2005 and DUC2007 datasets are shown in Table 3. For other methods, only the results provided in the respective papers are shown in the table. Since ETTM is a probabilistic topic model, recall results averaged over several runs are reported.

Table 3
Rouge-1 and Rouge-2 results for DUC2005 and DUC2007 datasets

Method	DUC 2005 Rouge-1	DUC 2005 Rouge-2	DUC 2007 Rouge-1	DUC 2007 Rouge-2
TopRank	0.3680	0.7514	0.4061	0.1012
MMR-SVR	0.3730	0.7573	0.4269	0.1109
MMR-SVRTM	0.3795	0.7653	0.4363	0.1092
ILP	0.3704	0.7539	0.4302	0.1065
Asli et al. [4]	-	-	0.447	0.104
MMR [28]	0.3479	0.0601	0.3798	0.0692
MultiMR [29]	0.37183	0.0683	0.42041	0.10302
Daraksha et al. [11]	-	0.0797	-	0.1092
SVR [31]	0.3849	0.0757	0.4342	0.1110
NCBsum-A [30]	0.3886	0.0787	0.4289	0.1113

The Rouge results in Table 3 show that our approach is comparable with state-of-the-art summarization methods. Our ILP method produces slightly lower scores than our MMR method because the ILP formulation includes a coherence measure, which increases the coherence score at the cost of coverage. MultiMR [29] is a graph-based manifold ranking method that considers both within-document and cross-document sentence relationships. NCBsum-A [30] is a coverage-focused summarization method and produces better Rouge scores than our method. We observe that our regression model MMR-SVR performs slightly worse than SVR [31]. Our model MMR-SVRTM performs better than MMR-SVR, which shows that integrating the regression model with the topic model improves the coverage results.

5 Conclusion and future work

In this work, we have proposed and implemented an extractive text summarization method that simultaneously models coverage and coherence with some novel approaches. Regression-based and topic-model-based summarization methods are integrated to find the sentence relevance score. Unlike in other methods, the redundancy score is modeled in the same framework that is used for finding the relevance score. Local coherence is used in the optimization instead of global coherence. In Rouge evaluation, the proposed system achieves results comparable with state-of-the-art methods that model both coverage and coherence. The summaries produced are also better in terms of the coherence measure.

Our regression-based model underperformed slightly in comparison to the state of the art. The regression-based state of the art achieves 0.4342 Rouge-1 recall on DUC2007, whereas our model achieves only 0.4269 Rouge-1 recall. For future work, we plan to tune our regression model to achieve better results. We also plan to use a regression model based on the Rouge-2 score, as other methods that maximize the Rouge-2 score produce better results for Rouge-2 recall. The coherence measure used in this paper does not consider the exact sequence of sentences in the summary, so we plan to include a coherence measure based on sentence sequence in the ILP formulation.

Footnotes

DUC Past Data:

References

1. Lin C.-Y., Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.
2. Taboada and Mann W.C., Rhetorical structure theory: Looking back and moving ahead, Discourse Studies 8(3) (2006), 423–459.
3. Parveen and Strube, Integrating importance, non-redundancy and coherence in graph-based extractive summarization. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
4. Celikyilmaz and Hakkani-Tür, Discovery of topically coherent sentences for extractive summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 491–499. Association for Computational Linguistics, 2011.
5. Barzilay and Lapata, Modeling local coherence: An entity-based approach, Computational Linguistics 34(1) (2008), 1–34.
6. Wang, Nishino, Hirao, Sudoh and Nagata, Exploring text links for coherent multi-document summarization. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 213–223, 2016.
7. Nenkova and McKeown, A survey of text summarization techniques. In Mining text data, pages 43–76. Springer, 2012.
8. Logeswaran, Lee and Radev, Sentence ordering and coherence modeling using recurrent neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
9. Yang, et al., A novel contextual topic model for multi-document summarization, Expert Systems with Applications 42(3) (2015), 1340–1352.
10. Christensen, Soderland, Etzioni, et al., Towards coherent multi-document summarization. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1163–1173, 2013.
11. Parveen and Strube, Multi-document summarization using bipartite graphs. In Proceedings of TextGraphs-9: the Workshop on Graph-based Methods for Natural Language Processing, pages 15–24, 2014.
12. Peyrard, A simple theoretical model of importance for summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1059–1073, 2019.
13. Haghighi and Vanderwende, Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 362–370. Association for Computational Linguistics, 2009.
14. Celikyilmaz and Hakkani-Tur, A hybrid hierarchical model for multi-document summarization. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 815–824. Association for Computational Linguistics, 2010.
15. McDonald, A study of global inference algorithms in multi-document summarization. In European Conference on Information Retrieval, pages 557–564. Springer, 2007.
16. Peyrard and Eckle-Kohler, Optimizing an approximation of rouge-a problem-reduction approach to extractive multi-document summarization. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1825–1836, 2016.
17. Gillick and Favre, A scalable global model for summarization. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing, pages 10–18. Association for Computational Linguistics, 2009.
18. , Qian and Liu, Using supervised bigram-based ILP for extractive summarization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1004–1013, 2013.
19. Blei D.M., Probabilistic topic models, Communications of the ACM 55(4) (2012), 77–84.
20. Balikas, Amini M.-R. and Clausel, On a topic model for sentences. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 921–924. ACM, 2016.
21. Shafiei M.M. and Milios E.E., Latent dirichlet coclustering. In Sixth International Conference on Data Mining (ICDM'06), pages 542–551. IEEE, 2006.
22. Galanis, Lampouras and Androutsopoulos, Extractive multi-document summarization with integer linear programming and support vector regression. In Proceedings of COLING 2012, pages 911–926, 2012.
23. Miller G.A., Wordnet: a lexical database for English, Communications of the ACM 38(11) (1995), 39–41.
24. Zhao and Mao, Fuzzy bag-of-words model for document representation, IEEE Transactions on Fuzzy Systems 26(2) (2017), 794–804.
25. Akhtar, Beg M.M.S. and Javed, Textrank enhanced topic model for query focussed text summarization. In 2019 Twelfth International Conference on Contemporary Computing (IC3), pages 1–6. IEEE, 2019.
26. Wang, Cai, Leung H.-F., Cai and Min, Entropy-based term weighting schemes for text categorization in vsm. In 2015 IEEE 27th International Conference on Tools with Artificial Intelligence (ICTAI), pages 325–332. IEEE, 2015.
27. Ren, Wei, Zhumin, Jun and Zhou, A redundancy-aware sentence regression framework for extractive summarization. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 33–43, 2016.
28. Carbonell J.G. and Goldstein, The use of mmr, diversity-based reranking for reordering documents and producing summaries. In SIGIR98 (1998), 335–336.
29. Wan and Xiao, Graph-based multi-modality learning for topic-focused multi-document summarization. In Twenty-First International Joint Conference on Artificial Intelligence, 2009.
30. , Shen Y.-D., Du and Xiong C.-Y., Exploiting novelty, coverage and balance for topic-focused multidocument summarization. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 1765–1768. ACM, 2010.
31. Ouyang, Li and Lu, Applying regression models to query-focused multi-document summarization, Information Processing & Management 47(2) (2011), 227–237.
