Query-focused multi-document text summarization using fuzzy inference

Abstract

The present paper proposes a fuzzy inference system for query-focused multi-document text summarization (MTS). The overall scheme is based on the Mamdani inferencing scheme, in which a fuzzy rule base is designed to infer the decision variable from a set of antecedent variables. The antecedent variables chosen for the task are drawn from linguistic and positional heuristics, and from the similarity of the documents with the user-defined query. The decision variable is the rank of the sentences as decided by the rules. The final summary is generated by solving an Integer Linear Programming problem. For abstraction, coreference resolution is applied to the input sentences in the pre-processing step. Although the system is designed on the basis of a small set of antecedent variables, the results are very promising.

Keywords

Query-focused text summarization, Mamdani fuzzy inference, text similarity, fuzzy ranking, integer linear programming

1 Introduction

Multi-document text summarization (MTS) has assumed huge importance in modern times due to the exponential growth of electronic text in different domains. The task of MTS is to create a gist of the contents present over several documents [1]. The task becomes even more challenging when the summary is to be made with respect to some specific user query. In the present work, the query contains a title and a description composed of one to four sentences as given in DUC 2006 dataset. The MTS system in such a case needs to produce a summary that is not generic, rather one that is pertinent from the query perspective. It is quite natural that the task involves dealing with ambiguities with respect to selection of the appropriate content from a collection of documents. This is because similar sentences may exist in more than one document. Straightforward selection of all of them based on some important keywords may make the summary repetitive [2].

Fig. 1

The Proposed Approach.

In order to deal with this uncertainty, we propose a fuzzy rule-based inference system. In particular, the proposed scheme uses the Mamdani inferencing scheme [3] with Mean of Maxima defuzzification to derive crisp ranks for the sentences drawn from the whole collection of documents. The antecedents of the rules are constructed from heuristics that have been used for summarization for several decades. Additionally, the similarity of the document with the user query is taken into account. Constrained Integer Linear Programming (ILP) is used to maximize an objective function designed on the basis of the fuzzy rank of the sentences, the TextRank [4] of the sentences and keyphrase scores. The summaries thus generated are evaluated and compared with past works using different ROUGE scores [5]. Furthermore, other automatic metrics based on lexical overlap as well as vector cosine similarity are reported. Coreference resolution [6] is employed as a pre-processing step for the generation of abstractive summaries in addition to extractive ones. Ablation tests have been conducted to examine the importance of individual components and antecedent variables. The overall architecture of the system is given in Figure 1. The present work reports an initial version of the system, which uses only five antecedent variables. Even with this small set of variables, the proposed system performs better than the baseline systems. We plan to extend the scheme by considering additional antecedent variables.

The paper is organised as follows. Section 2 presents related works on existing fuzzy logic based summarization techniques. The proposed summarization scheme is described in Section 3. Section 4 contains experimental details and description of evaluation metrics. Section 5 presents results of different experiments. Section 6 concludes the paper.

2 Related works on fuzzy summarization

Automatic Text summarization (ATS) has a long history of more than 60 years now [7]. However, application of fuzzy decision making in ATS is relatively recent. The first fuzzy logic based ATS system was developed by Kiani and Akbarzadeh [8]. Their system performed multi-document extractive summarization using a combination of Genetic Algorithm and Genetic Programming for optimization of the rule base and fuzzy sets. Nonstructural features, such as number of title words in the sentence, position of the sentence in the paragraph among others were used as input to the inference system. The fitness function used in their work was aimed at maximizing the presence of important words, such as title words, thematic words, emphasize words, in the summary and minimizing overlap of words between summary sentences. Their system achieved an average of 0.728 F1-score on a test set containing three document sets.

Witte and Bergler [9, 10] proposed a fuzzy clustering algorithm for generation of a context-sensitive summary from multiple input documents. In their work, topic clusters are generated using Noun Phrase (NP) chunks of the sentence and fuzzy coreference chains. For generation of generic summary the sentences belonging to the topic clusters having large size and spanning multiple documents were selected. Context-sensitive summary is generated by selecting sentences for the topic cluster which have high overlap with the given context.

Suanmali et al. [11] used features, such as sentence length, sentence location, number of words occurring in title, term frequency, number of proper nouns and number of numerical data for summarization of single input documents. A fuzzy inference engine with hand-crafted rules and Gaussian membership function has been used for sentence ranking. Kyoomarsi et al. [12] performed extraction of summary sentences from a single input document using fuzzy logic and WordNet. Their system incorporated nine input variables and four output variables for ranking sentences.

Megala et al. [13] performed summarization using fuzzy logic and artificial neural network. The neural network is used to identify important features and fuse common features. Megala et al. [14] also combined fuzzy logic and Conditional Random Field (CRF) for generation of extractive summaries. Important sentences are extracted based on their sentence features using fuzzy rules and membership functions. CRF is used to segment sentences based on the rhetorical roles present in the source document.

Jafari et al. [15] performed single-document summarization of 50 research articles using semantic and syntactic features of the text along with combination of these two aspects. The feature set included 13 features, such as TF-IDF, sentence length, sentence location, semantic similarity, word order similarity among others. In their work fuzzy logic is used to measure the degree of importance and correlation between sentences of the input documents.

Abbasi-ghalehtaki et al. [16] performed single-document ATS using Cellular Learning Automata (CLA), some evolutionary algorithm and fuzzy logic. CLA is used to optimize similarity between summary sentences in order to reduce redundancy in the summary. Importance of each sentence is taken as a linear combination of features, such as word features, similarity measure, position, length of the sentence.

Patel et al. [17] developed a sentence scoring technique using fuzzy logic for generation of extractive summaries in a multi-document setting with low redundancy. Similarity between sentences is calculated as the cosine similarity between TF-IDF vectors. Sentences are scored using fuzzy logic based on word features, e.g. title word, thematic word, proper noun words and sentence features, such as position and length. Top scoring sentences are added to the summary after deleting redundant sentences. The system achieved a ROUGE-2 score of 0.155 on DUC-2004 dataset.

All the above-mentioned systems except Witte and Bergler [10] used a fuzzy logic system for sentence scoring. Most of them produce generic extractive summaries and the features used as input to the logic system only considered the properties of the sentence.

The aim of the present work is to generate query-focused multi-document summaries by incorporating query relevance in the features used for ranking the sentences. The proposed system performs both extractive and abstractive ATS. For generation of abstractive summaries, coreference resolution is applied to the input sentences in the pre-processing step. This step replaces pronouns in the source sentences with their coreferent noun phrases. In a multi-document setting, the presence of similar sentences in the source documents poses a problem of redundancy in the summary. To address this issue, similar sentences are grouped into clusters and at most one sentence per cluster is selected for the final summary. While the fuzzy inference system ranks the sentences based on query relevance, keyphrases, position and length, a mutual-similarity-based ranking of sentences is also performed using the TextRank algorithm [4, 18]. Summary sentences are selected by solving an Integer Linear Program [19].

3 The proposed approach

Given a set of similar text documents and a relevant query, the proposed algorithm uses a fuzzy inference system along with an ILP-based sentence selection system for summary generation. The overall scheme works as follows:

Ranking of source sentences using a Mamdani fuzzy inference scheme

Summary generation using Integer Linear Programming to maximize an objective function of the fuzzy rank of the sentences, the TextRank of the sentences and the keyphrase scores.

Calculation of similarity between sentences or phrases and extraction of keyphrases along with their respective scores are used in various steps of the algorithm. These are described in Sections 3.1 and 3.2, respectively. The fuzzy sentence ranking algorithm is presented in Section 3.3.

3.1 Text similarity

The semantic similarity between two sentences or phrases is measured as the cosine similarity between their embeddings. Let $\vec{S_{1}}$ and $\vec{S_{2}}$ denote the vectors representing two sentences S₁ and S₂, respectively. The semantic similarity between S₁ and S₂ is measured as given in Equation 1.

$Sim(S_{1}, S_{2}) = \frac{\vec{S_{1}} \cdot \vec{S_{2}}}{\lVert \vec{S_{1}} \rVert \, \lVert \vec{S_{2}} \rVert}$ (1)
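As an illustration of Equation 1, the following minimal sketch computes the cosine similarity of two vectors in plain Python. The three-dimensional toy vectors stand in for real sentence embeddings, and returning 0 for a zero vector is an assumption of this sketch rather than something stated in the text.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (Equation 1)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as unrelated
    return dot / (norm_u * norm_v)


# Toy vectors standing in for sentence embeddings (hypothetical values).
s1 = [1.0, 2.0, 0.0]
s2 = [2.0, 4.0, 0.0]  # same direction as s1
s3 = [0.0, 0.0, 5.0]  # orthogonal to s1

print(cosine_similarity(s1, s2))
print(cosine_similarity(s1, s3))
```

Parallel vectors give a similarity close to 1 and orthogonal vectors give 0, which is the behaviour the ranking and clustering steps rely on.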

Table 1

Correlation between cosine similarity and relatedness score

Technique	Pre-trained Model	Embedding Dimension	Correlation
Sentence-BERT	en_stsb_roberta_large	1024	0.7923
Sentence-BERT	en_stsb_bert_large	1024	0.7888
Universal Sentence Encoder	en_use_lg	512	0.7689
Avg. Word Embedding	Word2Vec	300	0.6236
Avg. Word Embedding	GloVe	300	0.5535

Fig. 2

Fuzzy inference system.

Text embeddings can be obtained by averaging individual word embeddings, such as pre-trained GloVe [20] or Word2Vec [21] vectors, a technique commonly used in the literature. Alternatively, a fixed-size vector representation of the text can be obtained by processing the entire text with a Transformer-based Sentence-BERT [22] or Universal Sentence Encoder [23] architecture. In order to determine the best text vectorization scheme for measuring semantic similarity between texts, the SICK dataset [24] has been used. The SICK dataset contains 9840 pairs of sentences annotated with a relatedness score between 1 and 5, where a score of 1 indicates that the pair of sentences is least related (or unrelated) and 5 indicates that the sentences are extremely related. To assess the different vectorization schemes, the correlation between the cosine similarity of the respective sentence vectors and the relatedness score is studied. For all the embeddings, the maximum possible dimension has been considered. The results are presented in Table 1. Since vectors extracted using the pre-trained model en_stsb_roberta_large have the highest correlation between cosine similarity and relatedness score (0.7923), this model has been used in the present work for vectorization of sentences and phrases.
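The comparison of vectorization schemes can be sketched as follows. The text does not state which correlation coefficient is used, so the Pearson coefficient below is an assumption, and the similarity and relatedness lists are hypothetical toy values rather than SICK data.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical cosine similarities for four sentence pairs under one
# embedding scheme, and the corresponding human relatedness scores (1 to 5).
cosine_sims = [0.15, 0.40, 0.70, 0.95]
relatedness = [1.2, 2.5, 3.9, 4.8]

print(pearson(cosine_sims, relatedness))
```

The scheme whose cosine similarities yield the highest coefficient would then be selected, as is done with Table 1.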

3.2 Keyphrase extraction

Keyphrases are the words or phrases that represent important topics mentioned in a given text. These can be one to ten words long [4]. A sentence containing multiple important (in terms of score) keyphrases is a stronger candidate for the summary. From a given set of sentences, keyphrases along with their respective scores (weights) are extracted using Rapid Automatic Keyword Extraction (RAKE) [25]. The score of each keyphrase is the sum of the individual word scores calculated using a graph of word co-occurrences.
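The scoring idea behind RAKE can be sketched in a few lines. This is a simplified illustration and not the implementation used in the paper: the stop-word list is a tiny hypothetical one, candidate phrases are obtained by splitting at stop words and punctuation, and each word is scored as its degree divided by its frequency, as in the usual description of RAKE.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Tiny hypothetical stop-word list; a real system would use a full one.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "of", "or", "the",
             "to", "used", "what"}


def candidate_phrases(text: str) -> List[Tuple[str, ...]]:
    """Split text into runs of non-stop-words, breaking at punctuation."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current: List[str] = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text: str) -> Dict[str, float]:
    """Score each candidate phrase as the sum of its word scores."""
    phrases = candidate_phrases(text)
    freq: Dict[str, int] = defaultdict(int)
    degree: Dict[str, int] = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, including itself
    word_score = {w: degree[w] / freq[w] for w in freq}
    return {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}


scores = rake_scores(
    "What methods or products are used to control pests "
    "for organic gardens or farms?"
)
for phrase, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(phrase, score)
```

On this query sentence the two-word phrases "control pests" and "organic gardens" receive higher scores than single words such as "methods", which agrees with the ordering of keyphrases listed for this query in Table 2.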

3.3 Fuzzy ranking

The source sentences are ranked using a Mamdani Fuzzy Inference System [3] as described in Fig. 2. The proposed inference system incorporates five antecedent variables along with manually defined rules to derive a crisp value for the consequent variable Rank. The antecedent variables are named QTitleSim, QDescriptionSim, KeyScore, RelPos and Length. The crisp inputs corresponding to the antecedent variables are fuzzified using Triangular Fuzzy Numbers as discussed in Section 3.3.2; the heuristics involved in the creation of rules are described in Section 3.3.3. For defuzzification of the consequent variable, Mean of Maxima is employed in the present work. Experiments with the centroid (center of gravity) defuzzification strategy yielded inferior results.

3.3.1 Antecedent variables

Let {S₁, S₂, …, S_n} denote the set of source sentences, QT denote the query title, and {QD₁, …, QD_m} denote the m sentences present in the query description. Let SK = {sk₁, …, sk_r} denote the set of keyphrases extracted from the source sentences and QK = {qk₁, …, qk_t} denote the set of keyphrases extracted from the query description sentences. Table 2 illustrates these terms with a couple of examples taken from DUC 2006.

Table 2
Examples of Queries from DUC-2006 dataset

Query Title	Query Description	Query Keyphrases (QK)
El Nino and La Nina weather condition	Describe the causes and effects of the El Nino and La Nina weather condition. What programs and scientific techniques are in effect to better predict and cope with the conditions?	la nina weather condition, scientific techniques, el nino, programs, conditions, causes
Organic methods of pest control	What methods or products are used to control pests for organic gardens or farms? Include information on methods of controlling such pests as insects or fungus which do not involve the use of chemical pesticides and are accepted by organizations which certify organic produce for the marketplace.	certify organic produce, organic gardens, chemical pesticides, control pests, products, organizations, methods, marketplace, insects, fungus, farms

For each source sentence S_i, crisp values for the five antecedent variables are calculated as follows:

QTitleSim (T_i): Represents the similarity between the source sentence and the query title. $T_{i} = Sim(S_{i}, QT), \quad \forall i \in \{1, \dots, n\}$ (2)

QDescriptionSim (D_i): Represents the maximum of the similarities between the source sentence and each query description sentence. $D_{i} = \max \{ Sim(S_{i}, QD_{1}), \dots, Sim(S_{i}, QD_{m}) \}, \quad \forall i \in \{1, \dots, n\}$ (3)

KeyScore (K_i): Represents the sum of scores of the keyphrases present in the source sentence. To incorporate query relevance in keyphrases, a subset of SK is derived on the basis of similarity with the query keyphrases. Each keyphrase sk_j in SK is assigned a query similarity score calculated as $\max \{ Sim(sk_{j}, qk_{1}), \dots, Sim(sk_{j}, qk_{t}) \}, \quad \forall j \in \{1, \dots, r\}$. The top N keyphrases by query similarity score are collected in a set called SK_N. The KeyScore of S_i is calculated as the sum of the weights of the keyphrases belonging to SK_N which are also present in S_i. $K_{i} = \sum_{sk_{j} \in SK_{N} \land sk_{j} \in S_{i}} weight(sk_{j})$ (4)

RelPos (P_i): Represents the relative position of the source sentence in the respective source document. Suppose S_i belongs to a document D in which the total number of sentences is M_D and S_i is at the m-th position; then $P_{i} = \frac{m}{M_{D}}$ (5)

Length (L_i): Represents the length of the sentence in terms of number of words present in the sentence.

T_i, D_i and K_i are normalised using min-max scaling such that the final values belong to [0,1].
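The computation of the crisp antecedent values can be sketched as below. The helper names are hypothetical, the similarity values are toy numbers, keyphrase presence is tested by simple substring matching (an assumption), and mapping a constant list to zeros under min-max scaling is a choice made only for this sketch.

```python
from typing import Dict, List


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values linearly so that they lie in [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # assumption: constant input maps to 0
    return [(v - lo) / (hi - lo) for v in values]


def description_similarity(sims_to_description: List[float]) -> float:
    """Equation 3: maximum similarity over the query description sentences."""
    return max(sims_to_description)


def key_score(sentence: str, top_keyphrases: Dict[str, float]) -> float:
    """Equation 4: sum of weights of the top-N keyphrases found in the sentence."""
    text = sentence.lower()
    return sum(w for phrase, w in top_keyphrases.items() if phrase in text)


def relative_position(m: int, total_sentences: int) -> float:
    """Equation 5: position m of the sentence divided by the document length."""
    return m / total_sentences


# Toy inputs (hypothetical values).
T = min_max_scale([0.25, 0.5, 0.75])
D_raw = description_similarity([0.31, 0.62])
K_raw = key_score(
    "Farmers control pests in organic gardens.",
    {"control pests": 4.0, "organic gardens": 4.0, "fungus": 1.0},
)
P = relative_position(3, 12)

print(T, D_raw, K_raw, P)
```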

3.3.2 Fuzzification

In this section the fuzzy sets corresponding to the antecedent and consequent variables are defined. In the present work, a Triangular Membership Function (TMF) is used for fuzzification of the crisp variable values described in Section 3.3.1. A TMF is characterized by three parameters a, b and c such that a and c denote the feet of the triangle and b denotes the peak.

$TMF(x; a, b, c) = \begin{cases} l(x), & x \in [a, b) \\ 1, & x = b \\ r(x), & x \in (b, c] \\ 0, & \text{otherwise} \end{cases}$

where $l(x) = \begin{cases} \frac{x - a}{b - a}, & a \neq b \\ 0, & \text{otherwise} \end{cases}$ and $r(x) = \begin{cases} \frac{c - x}{c - b}, & b \neq c \\ 0, & \text{otherwise} \end{cases}$

The linguistic terms for the fuzzy variables QTitleSim, QDescriptionSim, KeyScore, RelPos, Length and Rank along with their respective parameters are described in Table 3. It can be observed that the parameters for the antecedent variables are not static: they are derived from the median (or, for Length, the maximum) of the values computed for the input document set. This enables the inference system to adjust the linguistic terms to the distribution of antecedent variable values in each input document set. The linguistic terms Average and Middle have a triangular membership function, while Low, High, Begin, End, Short and Long have a straight-line (semi-triangular) membership function.
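A direct transcription of the TMF definition is sketched below. The function name `tmf` and the parameter values in the example are illustrative choices, not taken from the paper.

```python
def tmf(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function with feet a, c and peak b."""
    if x == b:
        return 1.0
    if a <= x < b:  # rising edge; this branch implies a != b
        return (x - a) / (b - a)
    if b < x <= c:  # falling edge; this branch implies b != c
        return (c - x) / (c - b)
    return 0.0


# A semi-triangular "Low" term (a = b = 0) and a triangular "Average" term,
# both built around a hypothetical median of 0.4.
print(tmf(0.0, 0.0, 0.0, 0.4))  # peak of Low
print(tmf(0.2, 0.0, 0.0, 0.4))  # halfway down the falling edge of Low
print(tmf(0.7, 0.0, 0.4, 1.0))  # falling edge of Average
print(tmf(1.5, 0.0, 0.4, 1.0))  # outside the support
```

When a equals b (or b equals c) the rising (or falling) edge is never evaluated, which yields the semi-triangular shapes used for the terms at either end of each variable's range.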

Table 3
Fuzzy Variable Modelling

Variable	Linguistic Terms	a	b	c
QTitleSim	Low	0	0	$\underset{i \in \{1, \dots, n\}}{\mathrm{Median}} \{T_{i}\}$
QTitleSim	Average	0	$\underset{i \in \{1, \dots, n\}}{\mathrm{Median}} \{T_{i}\}$	1
QTitleSim	High	$\underset{i \in \{1, \dots, n\}}{\mathrm{Median}} \{T_{i}\}$	1	1
QDescriptionSim	Low	0	0	$\underset{i \in \{1, \dots, n\}}{\mathrm{Median}} \{D_{i}\}$
QDescriptionSim	Average	0	$\underset{i \in \{1, \dots, n\}}{\mathrm{Median}} \{D_{i}\}$	1
QDescriptionSim	High	$\underset{i \in \{1, \dots, n\}}{\mathrm{Median}} \{D_{i}\}$	1	1
KeyScore	Low	0	0	$\underset{i \in \{1, \dots, n\}}{\mathrm{Median}} \{K_{i}\}$
KeyScore	Average	0	$\underset{i \in \{1, \dots, n\}}{\mathrm{Median}} \{K_{i}\}$	1
KeyScore	High	$\underset{i \in \{1, \dots, n\}}{\mathrm{Median}} \{K_{i}\}$	1	1
RelPos	Begin	0	0	$\underset{i \in \{1, \dots, n\}}{\mathrm{Median}} \{P_{i}\}$
RelPos	Middle	0	$\underset{i \in \{1, \dots, n\}}{\mathrm{Median}} \{P_{i}\}$	1
RelPos	End	$\underset{i \in \{1, \dots, n\}}{\mathrm{Median}} \{P_{i}\}$	1	1
Length	Short	0	0	$\underset{i \in \{1, \dots, n\}}{\mathrm{Max}} \{L_{i}\}$
Length	Long	0	$\underset{i \in \{1, \dots, n\}}{\mathrm{Max}} \{L_{i}\}$	$\underset{i \in \{1, \dots, n\}}{\mathrm{Max}} \{L_{i}\}$
Rank	Low	0	0	0.5
Rank	Average	0	0.5	1
Rank	High	0.5	1	1
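The data-dependent parameters of Table 3 can be derived as in the following sketch. The function names are hypothetical and the input lists are toy values; only the use of the median (and, for Length, the maximum) follows the table.

```python
from statistics import median
from typing import Dict, Sequence, Tuple

Triangle = Tuple[float, float, float]  # (a, b, c)


def median_terms(
    values: Sequence[float],
    names: Tuple[str, str, str] = ("Low", "Average", "High"),
) -> Dict[str, Triangle]:
    """Three terms on [0, 1] whose break point is the median of the data."""
    m = median(values)
    low, mid, high = names
    return {low: (0.0, 0.0, m), mid: (0.0, m, 1.0), high: (m, 1.0, 1.0)}


def length_terms(lengths: Sequence[int]) -> Dict[str, Triangle]:
    """Short and Long terms spanning 0 to the longest sentence length."""
    longest = float(max(lengths))
    return {"Short": (0.0, 0.0, longest), "Long": (0.0, longest, longest)}


# Toy values for one document set (hypothetical).
title_terms = median_terms([0.1, 0.3, 0.9])
position_terms = median_terms([0.2, 0.5, 1.0], names=("Begin", "Middle", "End"))
sentence_length_terms = length_terms([8, 21, 35])

print(title_terms)
print(position_terms)
print(sentence_length_terms)
```

Because the break point is recomputed for every document set, a similarity that counts as High in a set of loosely related documents may only count as Average in a set that is closely related to the query.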

3.3.3 Rule base

The process of designing rules is an important part of any fuzzy inference engine. In the present work, if-then rules have been built manually using human knowledge. A set of 67 rules¹ was finalised by two experts. The rules are designed giving high priority to QTitleSim, QDescriptionSim and KeyScore. For the feature RelPos, the linguistic terms Begin and End are given higher preference to account for the fact that sentences appearing at the beginning and end of a document contain salient information [26, 27]. Shorter sentences having higher KeyScore are preferred. The rules are of the form:

IF [QTitleSim is $T$ ] AND [QDescriptionSim is $D$ ] AND [KeyScore is $K$ ] AND [RelPos is $P$ ] AND [Length is $L$ ] THEN [Rank is $R$ ]

Table 4
Examples of fuzzy rules

$T$	$D$	$K$	$P$	$L$	$R$
High	High	High	Begin	-	High
High	High	High	End	-	High
High	High	High	Middle	Short	High
High	High	High	Middle	Long	Average
Average	High	High	End	Short	High
Average	High	High	Middle	Short	Average
Low	Average	Average	-	-	Low
Low	Average	Low	-	-	Low

A few examples of the rules used in the present work are given in Table 4. For illustration, the first row of the table, which places no condition on Length, is read as

IF [QTitleSim is High] AND [QDescriptionSim is High] AND [KeyScore is High] AND [RelPos is Begin] THEN [Rank is High]
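The way such rules produce a crisp rank can be illustrated with a small Mamdani-style sketch in plain Python (the paper itself uses Scikit-Fuzzy). Several points are assumptions of this sketch and are not stated in the text: AND is taken as the minimum, each rule clips its consequent at the firing strength, rule outputs are aggregated with the maximum, the universe of Rank is discretized into 101 points, and the medians used for the antecedent terms are hypothetical. The three rules follow rows of Table 4.

```python
from typing import Dict, List, Tuple

Triangle = Tuple[float, float, float]


def tmf(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function with feet a, c and peak b."""
    if x == b:
        return 1.0
    if a <= x < b:
        return (x - a) / (b - a)
    if b < x <= c:
        return (c - x) / (c - b)
    return 0.0


# Antecedent terms built around hypothetical medians (0.4, 0.4, 0.3, 0.5).
TERMS: Dict[str, Dict[str, Triangle]] = {
    "QTitleSim": {"Low": (0, 0, 0.4), "Average": (0, 0.4, 1), "High": (0.4, 1, 1)},
    "QDescriptionSim": {"Low": (0, 0, 0.4), "Average": (0, 0.4, 1), "High": (0.4, 1, 1)},
    "KeyScore": {"Low": (0, 0, 0.3), "Average": (0, 0.3, 1), "High": (0.3, 1, 1)},
    "RelPos": {"Begin": (0, 0, 0.5), "Middle": (0, 0.5, 1), "End": (0.5, 1, 1)},
}
RANK_TERMS: Dict[str, Triangle] = {
    "Low": (0, 0, 0.5), "Average": (0, 0.5, 1), "High": (0.5, 1, 1),
}
# (antecedent terms, consequent term); unspecified variables are omitted.
RULES: List[Tuple[Dict[str, str], str]] = [
    ({"QTitleSim": "High", "QDescriptionSim": "High",
      "KeyScore": "High", "RelPos": "Begin"}, "High"),
    ({"QTitleSim": "Low", "QDescriptionSim": "Average", "KeyScore": "Average"}, "Low"),
    ({"QTitleSim": "Low", "QDescriptionSim": "Average", "KeyScore": "Low"}, "Low"),
]


def infer_rank(inputs: Dict[str, float], steps: int = 100) -> float:
    """Mamdani inference followed by Mean of Maxima defuzzification."""
    universe = [k / steps for k in range(steps + 1)]
    aggregated = [0.0] * len(universe)
    for antecedent, consequent in RULES:
        # Firing strength: minimum membership over the rule's conditions.
        strength = min(
            tmf(inputs[var], *TERMS[var][term]) for var, term in antecedent.items()
        )
        a, b, c = RANK_TERMS[consequent]
        for k, y in enumerate(universe):
            clipped = min(strength, tmf(y, a, b, c))
            aggregated[k] = max(aggregated[k], clipped)
    peak = max(aggregated)
    if peak == 0.0:
        return 0.0  # assumption: no rule fired, so the lowest rank is returned
    maxima = [y for y, mu in zip(universe, aggregated) if mu >= peak - 1e-12]
    return sum(maxima) / len(maxima)


rank = infer_rank(
    {"QTitleSim": 0.9, "QDescriptionSim": 0.8, "KeyScore": 0.7, "RelPos": 0.05}
)
print(rank)
```

For this input only the first rule fires, so the aggregated output is the High term clipped at the firing strength, and Mean of Maxima returns the midpoint of the resulting plateau.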

3.4 TextRank

The Fuzzy Ranking step measures the importance of a sentence based on its relevance to the query, presence of keyphrases, position and length. The importance of a sentence may also be analyzed using TextRank [18]. It is a graph-based ranking algorithm in which vertices represent sentences and weighted edges connect sentences according to a similarity metric. TextRank determines the importance of each vertex based on global information extracted recursively from the entire graph. The rank of each vertex is initialized as 1 and updated as described in Equation 6 until convergence.

$TextRank(S_{i}) = (1 - d) + d \cdot \sum_{S_{j} \in N(S_{i})} \frac{Sim(S_{i}, S_{j})}{\sum_{S_{k} \in N(S_{j})} Sim(S_{j}, S_{k})} TextRank(S_{j})$ (6)

Here, 0 ≤ d ≤ 1 is the damping factor, which is set to 0.85 as suggested in [28], and N(S_i) denotes the set of neighbours of S_i. In the original TextRank [18], only lexical overlap is used to determine similarity between sentences. In the present work, the semantic similarity between sentences is measured using the cosine similarity of their vectors as described in Section 3.1.
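The iterative update of Equation 6 can be sketched as follows. It is assumed here that the neighbours of a sentence are all other sentences with positive similarity and that the similarity matrix is symmetric; the convergence tolerance and the toy matrix are hypothetical.

```python
from typing import List


def text_rank(
    sim: List[List[float]], d: float = 0.85, tol: float = 1e-8, max_iter: int = 500
) -> List[float]:
    """Iterate Equation 6 on a similarity matrix until the ranks stabilise."""
    n = len(sim)
    # Denominator of Equation 6: total edge weight leaving each vertex.
    out_weight = [
        sum(sim[j][k] for k in range(n) if k != j and sim[j][k] > 0)
        for j in range(n)
    ]
    rank = [1.0] * n
    for _ in range(max_iter):
        new_rank = []
        for i in range(n):
            total = 0.0
            for j in range(n):
                if j != i and sim[i][j] > 0 and out_weight[j] > 0:
                    total += sim[i][j] / out_weight[j] * rank[j]
            new_rank.append((1 - d) + d * total)
        converged = max(abs(a - b) for a, b in zip(rank, new_rank)) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Toy similarity matrix: sentences 0 and 1 are close, sentence 2 is an outlier.
similarity = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
ranks = text_rank(similarity)
print(ranks)
```

Sentences that are strongly connected to many others accumulate rank, while the weakly connected sentence 2 ends up with the lowest value.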

3.5 Sentence clustering

In a multi-document setting many similar sentences conveying the same information may be present in the source text. Selection of multiple similar sentences in the summary leads to redundancy. This step is aimed at grouping similar sentences. Selection of at most one sentence from each cluster helps to decrease redundancy in the summary. Moreover, selection of sentences from diverse clusters increases the information coverage in the summary.

Hierarchical Agglomerative Clustering [29] with the complete linkage criterion is used for clustering. The clusters are created incrementally, with each individual sentence initially considered as a cluster. Similar clusters are merged using a bottom-up approach. In complete linkage, the similarity between two clusters is measured as the minimum similarity between the sentences of the two clusters. Clusters with similarity above a threshold λ are merged together. Similarity between sentences is measured as discussed in Section 3.1.
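A naive version of this clustering step is sketched below. Merging the most similar pair first is an assumption of the sketch, the similarity matrix is a toy example, and the quadratic search is written for clarity rather than efficiency.

```python
from typing import List


def complete_linkage_clusters(
    sim: List[List[float]], threshold: float
) -> List[List[int]]:
    """Agglomerative clustering that merges clusters while their
    complete-linkage similarity (minimum pairwise similarity) exceeds
    the threshold."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best_pair, best_sim = None, threshold
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = min(sim[i][j] for i in clusters[a] for j in clusters[b])
                if link > best_sim:
                    best_pair, best_sim = (a, b), link
        if best_pair is None:
            break  # no remaining pair is similar enough to merge
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy similarity matrix with two groups of near-duplicate sentences.
similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
print(complete_linkage_clusters(similarity, threshold=0.5))
```

With a threshold of 0.5 the two near-duplicate pairs are merged and the resulting clusters are kept apart, so that at most one sentence from each pair can later enter the summary.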

3.6 Summary generation

The summary is generated by selecting the most important sentences relevant to the query while avoiding redundancy or selection of similar sentences. The sentence selection problem is formulated as a concept-based Integer Linear Programming (ILP) problem [19]. Let s_i ∈ {0, 1} be a variable indicating the presence of sentence S_i in the summary. Let tr_i be the TextRank of S_i as obtained using the method explained in Section 3.4 and fr_i be the query-relevant fuzzy ranking of S_i as discussed in Section 3.3. Let sk_j ∈ {0, 1} be a variable indicating the presence of keyphrase sk_j ∈ SK_N in the summary. Let pr_ij indicate the presence of keyphrase sk_j in the sentence S_i, i.e.

$pr_{ij} = \begin{cases} 1, & sk_{j} \in S_{i} \\ 0, & \text{otherwise} \end{cases}$ (7)

The summary is obtained by solving the following ILP problem:

$\max \sum_{i} tr_{i} \cdot s_{i} + \sum_{i} fr_{i} \cdot s_{i} + \sum_{j} weight(sk_{j}) \cdot sk_{j}$ (8)

subject to:

$s_{i} \cdot (l - L_{i}) \leq 0 \quad \forall i$ (9)

$\sum_{i} s_{i} \cdot L_{i} \leq W$ (10)

$\sum_{S_{i} \in C} s_{i} \leq 1 \quad \text{for every cluster } C$ (11)

$pr_{ij} \cdot s_{i} \leq sk_{j} \quad \forall i, j$ (12)

$\sum_{i} pr_{ij} \cdot s_{i} \geq sk_{j} \quad \forall j$ (13)

Equation 9 ensures that the sentences selected for the summary contain at least l words, and Equation 10 limits the summary length to W words. Here, W is set to 250. Equation 11 prevents redundancy in the summary by allowing at most one sentence per cluster, where the clusters of similar sentences are obtained as explained in Section 3.5. The constraints in Equations 12 and 13 ensure compatibility between the selection of keyphrases and sentences: a keyphrase is marked as present if and only if at least one selected sentence contains it. The sentences selected using the above scheme are arranged on the basis of their relative position in the corresponding input document for generation of the summary.
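For a handful of sentences the optimisation problem of Equations 8 to 13 can be solved by exhaustive enumeration, which makes the role of each constraint easy to see. The sketch below is a stand-in for an ILP solver and is feasible only for tiny inputs; the data structure, the function name and all numeric values are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def select_sentences(
    sentences: List[dict],
    clusters: List[List[int]],
    keyphrase_weights: Dict[str, float],
    min_len: int,
    max_words: int,
) -> Tuple[Tuple[int, ...], float]:
    """Enumerate all subsets and return the feasible one with the best objective."""
    cluster_of = {i: c for c, members in enumerate(clusters) for i in members}
    best_subset: Tuple[int, ...] = ()
    best_value = 0.0
    for size in range(1, len(sentences) + 1):
        for subset in combinations(range(len(sentences)), size):
            chosen = [sentences[i] for i in subset]
            if any(s["length"] < min_len for s in chosen):
                continue  # Equation 9: every sentence has at least l words
            if sum(s["length"] for s in chosen) > max_words:
                continue  # Equation 10: summary length limit W
            used = [cluster_of[i] for i in subset]
            if len(used) != len(set(used)):
                continue  # Equation 11: at most one sentence per cluster
            # Equations 12 and 13: a keyphrase counts exactly when some
            # selected sentence contains it.
            covered = set().union(*(s["keyphrases"] for s in chosen))
            value = sum(s["tr"] + s["fr"] for s in chosen) + sum(
                keyphrase_weights.get(k, 0.0) for k in covered
            )
            if value > best_value:
                best_subset, best_value = subset, value
    return best_subset, best_value


# Toy instance (hypothetical ranks, lengths and keyphrase weights).
toy_sentences = [
    {"length": 12, "tr": 1.2, "fr": 0.9, "keyphrases": {"el nino"}},
    {"length": 11, "tr": 1.1, "fr": 0.8, "keyphrases": {"el nino"}},
    {"length": 15, "tr": 0.9, "fr": 0.7, "keyphrases": {"scientific techniques"}},
    {"length": 5, "tr": 1.5, "fr": 0.9, "keyphrases": {"programs"}},
]
toy_clusters = [[0, 1], [2], [3]]
toy_weights = {"el nino": 4.0, "scientific techniques": 4.0, "programs": 1.0}

subset, value = select_sentences(
    toy_sentences, toy_clusters, toy_weights, min_len=10, max_words=30
)
print(subset, value)
```

In this toy instance sentence 3 is excluded for being too short and sentences 0 and 1 cannot both be chosen because they share a cluster, so the best summary combines the higher-ranked of the two with sentence 2. Note also that a keyphrase contributes its weight only once, however many selected sentences contain it.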

3.7 Coreference resolution

In the present work, the sentences of the input source are ranked individually without taking their context into consideration. This leads to poor ranks for sentences having many coreferent terms. For illustration, consider the sentence "It outlaws smoking in public and the workplace." The presence of the pronoun "it" makes this sentence unsuitable for inclusion in the summary in spite of the important information it carries. A pre-processing step of coreference resolution [6], i.e. replacement of pronouns by their coreferent heads, enables each sentence to communicate its relevant context by itself. The coreference-resolved sentence for the above example is "A drastic anti-smoking bill outlaws smoking in public and the workplace." The summary thus generated by the proposed summarization method is of abstractive form because the source sentences are modified using coreference resolution.

4 Experimental details

Experiments were conducted on the DUC-2006 dataset. It contains 50 document sets along with four manually written summaries. Every document set contains a user-specified query consisting of a title and one or more description sentences. The data statistics are presented in Table 5.

Table 5
Data statistics for DUC 2006 dataset

Number of Articles per Document Set	25
Required Summary Length	250 words
Range of Number of Sentences per Document Set	[192, 1315]
Average Number of Sentences per Document Set	728.62
Range of Number of Query Description Sentences	[1, 4]
Average Number of Query Description Sentences	2.1

Sentence segmentation and tokenization are performed using spaCy². The fuzzy ranking system has been implemented using Scikit-Fuzzy³. Coreference resolution is performed using AllenNLP⁴.

4.1 Evaluation metrics

Different automatic metrics based on lexical overlap as well as vector similarity have been used to evaluate the output summaries. The metrics used in the present work are the following:

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [5] is a popular metric for evaluation of summaries. It computes the n-gram overlap between the output summary and the reference summaries. In the present work, ROUGE-n for n = 1, 2 and ROUGE-SU4 have been used for evaluating the summaries. ROUGE-SU4 counts unigrams together with skip-bigrams, i.e. pairs of words in sentence order separated by at most four intervening words. These metrics have been shown to have high correlation with human evaluation [30].
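The recall form of ROUGE-n against a single reference can be sketched as follows. This is a simplified illustration and not the official ROUGE toolkit: whitespace tokenization, lower-casing and the example sentences are assumptions, and no stemming or multi-reference handling is performed.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate: str, reference: str, n: int) -> float:
    """Fraction of reference n-grams that also occur in the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clipped counts: an n-gram is credited at most as often as it
    # appears in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


candidate = "the bill outlaws smoking in public"
reference = "the bill bans smoking in public places"
print(rouge_n_recall(candidate, reference, 1))
print(rouge_n_recall(candidate, reference, 2))
```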

BLEU Score (Bilingual Evaluation Understudy Score) [31] is a precision-based metric which is most commonly used for evaluation of machine translation systems. However, BLEU has demonstrated a high correlation with human judgment for paraphrase generation [32] and summarization [33] as well. In the present work n-grams up to n = 4 have been considered for calculation of BLEU.

METEOR (Metric for Evaluation of Translation with Explicit ORdering) [34] creates an explicit alignment between the system output and the reference sentences. The alignment is based on exact token matching, followed by WordNet [35] synonyms, stemmed tokens, and paraphrases. Given a set of alignments, the METEOR score is the harmonic mean of precision and recall between the generated and the reference sentence.

CIDEr (Consensus-based Image Description Evaluation) [36] computes n-gram co-occurrences (for n up to 4) between the reference and the generated sentences. It down-weighs the frequently occurring n-grams, and calculates the average cosine similarity between the n-grams of the reference and generated texts.

VECS (Vector Extrema Cosine Similarity) [37] also computes the cosine similarity between the reference and generated sentences. However, in VECS the sentence embeddings are calculated by max-pooling the individual word embeddings [38]. Pre-trained GloVe embeddings⁵ have been used as the word embeddings.

BERTScore [39] similarity is calculated as a weighted sum of cosine similarities between words of the sentences of the generated and the reference summaries. It correlates well with human judgment for evaluating different language generation tasks [40]. Let the system-generated sentence be S = s₁s₂ … s_n and the reference be R = r₁r₂ … r_m; then precision and recall for BERTScore are calculated as shown in Equation 14. Here, $\vec{s_{i}}$ and $\vec{r_{j}}$ denote the contextualized BERT⁶ embeddings of the words s_i and r_j, respectively. The F1-score (F1_BERT) is calculated as the harmonic mean of precision (P_BERT) and recall (R_BERT).

$\begin{matrix} P_{BERT} & = \frac{1}{n} \sum_{i = 1}^{n} max_{r_{j} \in R} {\vec{r_{j}}}^{T} \vec{s_{i}}; \\ R_{BERT} & = \frac{1}{m} \sum_{j = 1}^{m} max_{s_{i} \in S} {\vec{r_{j}}}^{T} \vec{s_{i}} \end{matrix}$ (14)
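The greedy matching of Equation 14 can be illustrated with toy vectors in place of contextual BERT embeddings. In this sketch the vectors are normalised to unit length so that the inner product equals the cosine similarity; the two-dimensional vectors are hypothetical, all vectors are assumed to be non-zero, and no importance weighting is applied.

```python
import math
from typing import List, Sequence, Tuple


def _normalise(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def bert_score(
    system_vecs: List[Sequence[float]], reference_vecs: List[Sequence[float]]
) -> Tuple[float, float, float]:
    """Precision, recall and F1 following Equation 14."""
    S = [_normalise(v) for v in system_vecs]
    R = [_normalise(v) for v in reference_vecs]
    # Each system token is matched to its most similar reference token ...
    precision = sum(max(_dot(r, s) for r in R) for s in S) / len(S)
    # ... and each reference token to its most similar system token.
    recall = sum(max(_dot(r, s) for s in S) for r in R) / len(R)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy token embeddings (hypothetical values).
system_tokens = [[1.0, 0.0], [0.0, 1.0]]
reference_tokens = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

p_bert, r_bert, f1_bert = bert_score(system_tokens, reference_tokens)
print(p_bert, r_bert, f1_bert)
```

Here every system token has an exact counterpart in the reference, so precision is 1, whereas the reference token [1, 1] has no exact counterpart and lowers the recall.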

5 Results and discussion

The mean recall values of ROUGE-1, ROUGE-2 and ROUGE-SU4 are reported in Table 6 for the proposed system along with the NIST Baseline⁷ and the ERSS system [9]. Experiments were conducted for different values of the parameters l ∈ {0, 2, 4, 6, 8, …, 20}, N ∈ {100, 500, 1000, 1500, all} and λ ∈ {0.1, 0.2, …, 0.9}. The optimal parameter values were selected on the basis of the ROUGE score. The NIST Baseline corresponds to selecting the lead sentences of the input document for the summary, and ERSS generates query-focused multi-document summaries using fuzzy coreference cluster graphs. It can be observed that both the extractive and the abstractive summarization strategies outperform the NIST baseline on all three metrics; for example, ROUGE-2 rises from 0.0495 to 0.0722 (extractive) and 0.0732 (abstractive). The proposed system also outperforms the ERSS system in terms of ROUGE-2 and ROUGE-SU4, while its ROUGE-1 scores (0.3558 and 0.3531) are slightly below that of ERSS (0.3690).

Table 6
Experimental results

System	l	N	λ	R-1	R-2	R-SU4
NIST Baseline	-	-	-	0.3022	0.0495	0.0979
ERSS [9]	-	-	-	0.3690	0.0648	0.1239
Proposed (Abstractive)	10	500	0.5	0.3531	0.0732	0.1299
Proposed (Extractive)	10	500	0.5	0.3558	0.0722	0.1313

Table 7

Analysis of individual components for generation of extractive summary

Ranking	Selection	l	N	λ	R-1	R-2	R-SU4
TextRank	TopRank	8	-	-	0.2850	0.0456	0.0928
FuzzyRank	TopRank	0	100	-	0.3154	0.0587	0.1104
TextRank	One per Cluster	14	-	0.5	0.3231	0.0482	0.1073
FuzzyRank	One per Cluster	10	500	0.5	0.3425	0.0681	0.1237
TextRank	ILP	12	500	0.6	0.3376	0.0552	0.1142
FuzzyRank	ILP	10	500	0.5	0.3489	0.0659	0.1235

Table 8

Analysis of individual components for generation of abstractive summary

Ranking	Selection	l	N	λ	R-1	R-2	R-SU4
TextRank	TopRank	0	-	-	0.2682	0.0401	0.0869
FuzzyRank	TopRank	0	500	-	0.2994	0.0555	0.1045
TextRank	One per Cluster	12	-	0.4	0.3307	0.0578	0.1129
FuzzyRank	One per Cluster	10	500	0.5	0.3415	0.0671	0.1202
TextRank	ILP	12	500	0.5	0.3438	0.0608	0.1187
FuzzyRank	ILP	10	500	0.5	0.3470	0.0653	0.1230

The efficacy of the individual components of the proposed system is also studied, and the results for extractive and abstractive summarization are presented in Tables 7 and 8, respectively. In these tables, the TopRank selection strategy refers to selecting the top-ranking sentences for the summary such that the total number of words is less than 250. Similarly, One per Cluster refers to selecting sentences such that the sum of the ranks of the summary sentences is maximized while at most one sentence is selected per sentence cluster. ILP refers to the sentence selection process using Integer Linear Programming as discussed in Section 3.6. Here, the objective function contains only one sentence ranking term. The summaries generated using FuzzyRank outperform those generated using TextRank across all selection strategies. ILP sentence selection also emerged superior to the other selection strategies. The scores for extraction are marginally higher than those for abstraction for all the summarization methods.
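The TopRank selection strategy can be sketched as a simple greedy procedure. The sketch below is illustrative only: the function name, the whitespace-based word count, and the choice to skip a sentence that does not fit and continue with lower-ranked ones are assumptions, since the text specifies only the ranking order and the 250-word limit.

```python
def top_rank_select(sentences, max_words=250):
    """sentences: list of (text, rank) pairs; a higher rank is better."""
    selected, total = [], 0
    for text, rank in sorted(sentences, key=lambda pair: pair[1], reverse=True):
        n_words = len(text.split())
        if total + n_words < max_words:  # keep the summary under the word limit
            selected.append(text)
            total += n_words
    return selected

# Toy input with a 6-word limit: the 4-word sentence no longer fits.
toy = [("a b c", 0.9), ("d e f g", 0.5), ("h i", 0.7)]
summary = top_rank_select(toy, max_words=6)
```

The One per Cluster and ILP strategies differ in that they solve a constrained optimization problem over the ranks instead of selecting greedily.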

Table 9

Ablation results for analyzing the importance of individual antecedent variables for generation of extractive summary

Variables	R-1	R-2	R-SU4	BLEU	METEOR	CIDEr	VECS	P_BERT	R_BERT	F1_BERT
All variables	0.3558	0.0722	0.1313	6.9449	0.2690	0.6593	0.5145	0.5952	0.6086	0.6003
w/o QTitleSim	0.3399	0.0554	0.1146	6.6248	0.2548	0.5915	0.4114	0.5791	0.6018	0.5885
w/o QDescriptionSim	0.3471	0.0589	0.1187	7.6103	0.2581	0.6282	0.4008	0.5829	0.6033	0.5912
w/o KeyScore	0.3414	0.0586	0.1166	6.4037	0.2586	0.6136	0.4337	0.5859	0.6033	0.5930
w/o RelPos	0.3527	0.0651	0.1241	5.9937	0.2671	0.6435	0.5134	0.5920	0.6068	0.5978
w/o Length	0.3519	0.0639	0.1228	6.3830	0.2672	0.6441	0.5086	0.5937	0.6086	0.5996

Table 10

Ablation results for analyzing the importance of individual antecedent variables for generation of abstractive summary

Variables	R-1	R-2	R-SU4	BLEU	METEOR	CIDEr	VECS	P_BERT	R_BERT	F1_BERT
All variables	0.3531	0.0732	0.1299	6.4463	0.2642	0.6027	0.5396	0.5977	0.6065	0.6004
w/o QTitleSim	0.3351	0.0545	0.1131	6.0095	0.2479	0.5689	0.4753	0.5752	0.5960	0.5839
w/o QDescriptionSim	0.3397	0.0581	0.1157	7.0457	0.2526	0.5515	0.4755	0.5791	0.5977	0.5864
w/o KeyScore	0.3453	0.0602	0.1195	6.3041	0.2574	0.5740	0.4953	0.5861	0.6010	0.5917
w/o RelPos	0.3511	0.0691	0.1212	6.1143	0.2602	0.5947	0.5376	0.5953	0.6052	0.5987
w/o Length	0.3505	0.0672	0.1252	6.0079	0.2616	0.5994	0.5224	0.5966	0.6061	0.5997

Table 11

Summary generated for document set D0623E

Query Title: Anti-Smoking Laws

Query Description: Describe anti-smoking laws passed or rejected world-wide which prohibit smoking in public places or work places. Include any arguments used for or against such laws.

Extractive Summary

He is skeptical that any anti-smoking bill would discourage underage smoking unless it hikes tobacco prices. The regulation she was referring to, which took effect today, bans smoking in public places. The government regularly announces anti-smoking drives, but it owns the biggest cigarette company, too. FRANCE: Smoke-filled cafes are still pretty much the norm. Smoking is prohibited in subways, trains, airplanes, hospitals and public places. Canada has stiffened anti-smoking regulations in recent years. In the United States, cigarette makers are required to include warning labels on packages. Many smokers have given up their long-term habit since the current anti-smoking regulations went into effect in Beijing and Shanghai, where many cigarette stalls reported low sales. The proposed fines would range from 100,000-300,000 lire (about 50-150 you). The ban stems from California’s indoor workplace smoking prohibition that started in 1994. Some people also said that they would quit smoking because of the new law. The seminar was held to mark the International Anti-Smoking Day. Jordan has enacted law banning smoking in public places, but many people ignore. Opponents say the bill infringes individual and free-trade rights. Bars and restaurants with designated smoking areas also would escape the ban. The planned law bans all tobacco advertising and sponsorship. A survey shows that over 90 percent of smokers smoke in public places. You can stub out your cigarette on the floor like those before you. In government buildings, smoking is prohibited only in ones that deal directly with the public.

Abstractive Summary

A drastic anti-smoking bill bans all tobacco advertising. A health committee approved minor amendments to lighten the anti-smoking bill, such as granting permission for tobacco logos to be displayed on clothing. McCain is skeptical that any anti-smoking bill would discourage underage smoking unless any anti-smoking bill hikes tobacco prices. The government regularly announces anti-smoking drives, but the government owns the biggest cigarette company, too. Canada has stiffened anti-smoking regulations in recent years. Smoking is prohibited in subways, trains, airplanes, hospitals and public places. In the United States, cigarette makers are required to include warning labels on packages. Three provinces in north China’s Shanxi Province also enacted the smoking bans. Hefty fines would range from 100,000-300,000 lire (about 50-150 you). The local legislative bodies passed regulations, and city governments published orders, to prohibit tobacco use in public. A seminar was held to mark the International Anti-Smoking Day. Jordan, where one in four smokes has enacted law banning smoking in public places, but many people ignore. Even ads for non-tobacco products would not be allowed to show people smoking cigarettes. Bars and restaurants with designated smoking areas also would escape the bill, said committee member Ruth Rabinowitz. All public areas, air conditioned shopping centres and shops, offices, schools, private clubs, taxis, buses and public toilets in Singapore are designated smoke-free. In other public places, such as stores and public transport, smoking is banned under law but some people light up anyway. In government buildings, smoking is prohibited only in ones that deal directly with the public.

In order to analyze the importance of the individual antecedent variables, an ablation test was performed. Summaries were generated using the proposed method by removing one antecedent variable at a time. The values of the different evaluation metrics for extractive and abstractive summaries are reported in Tables 9 and 10, respectively. It can be observed that for all the evaluation metrics except BLEU, removal of each antecedent variable decreases the system score. This indicates that each variable makes a positive contribution to the final performance of the summarizer. Removal of the query-relevant features, viz. QTitleSim, QDescriptionSim and KeyScore, leads to a larger decrease in score than removal of RelPos and Length. QTitleSim emerged as the most important variable for both extractive and abstractive summarization, as its removal led to the highest decrease in scores. It can also be noticed that with respect to the ROUGE scores and CIDEr, the importance of KeyScore is higher than that of QDescriptionSim for extractive summarization, while the opposite holds for abstractive summarization. For the BLEU metric, the highest value is obtained upon removal of QDescriptionSim. The sub-optimality of the BLEU score, which is a precision-based metric, may be attributed to the fact that the parameter values l, N, and λ have been selected on the basis of the highest ROUGE scores, which are based on recall values. However, removal of any of the other variables decreases the BLEU score of the system. Furthermore, it can be observed that for the metrics based on lexical overlap, viz. ROUGE, BLEU, METEOR and CIDEr, abstractive summaries have lower values than extractive summaries. In contrast, with respect to the vector similarity metrics VECS and BERTScore, abstractive summaries have higher values than the corresponding extractive ones. An example of the summaries generated using the proposed system is presented in Table 11. It can be observed that the extractive summary contains sentences with unresolved coreferences.
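The ranking of the antecedent variables by importance can be reproduced directly from the reported scores. The following sketch uses the ROUGE-2 column of Table 9 (extractive summaries); the variable names `full`, `without`, `drops` and `ranking` are illustrative, and the same computation applies to any other column.

```python
# ROUGE-2 of the extractive summaries, taken from Table 9.
full = 0.0722  # all antecedent variables present
without = {
    "QTitleSim": 0.0554,
    "QDescriptionSim": 0.0589,
    "KeyScore": 0.0586,
    "RelPos": 0.0651,
    "Length": 0.0639,
}

# Decrease in ROUGE-2 caused by removing each variable.
drops = {name: round(full - score, 4) for name, score in without.items()}

# Variables ordered from the largest to the smallest decrease.
ranking = sorted(drops, key=drops.get, reverse=True)
```

For this column the three query-relevant features occupy the top three positions, with QTitleSim first, which agrees with the discussion above.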

6 Conclusion

The present work proposes a query-focused multi-document summarization scheme. Mamdani fuzzy inference with manually defined rules has been applied to rank the sentences on the basis of a small pool of five features. The features represent properties of the sentence, such as query relevance, position and length. The summary is generated by solving a constrained ILP problem. The objective function aims at selecting a subset of sentences that maximizes the fuzzy-logic-based rank, the TextRank score and the presence of important keyphrases in the summary. A constraint of the ILP problem allows the selection of at most one sentence per cluster of similar sentences, thereby reducing redundancy in the summary. Coreference resolution of the input sentences is performed for the generation of abstractive summaries. The results of automatic evaluation of the generated summaries in terms of ROUGE are particularly promising. The proposed system outperforms the baseline systems, and fuzzy ranking also outperforms the graph-based TextRank scheme for the different sentence selection methods. Extractive summaries have higher scores than abstractive summaries, which indicates that replacement of pronouns increases the informativeness of a sentence at the cost of higher length and repetition of words. In future, we would like to perform selective replacement of pronouns along with other abstractive techniques, such as lexical paraphrasing of complex words, sentence compression and fusion. We would also like to extend the current set of features used in the fuzzy ranking scheme to incorporate other aspects of the document, such as cohesion, readability and discourse structure.

Footnotes

Acknowledgements

Raksha Agarwal acknowledges Council of Scientific and Industrial Research (CSIR), India for supporting the research under Grant number: SPM-06/086(0267)/2018-EMR-I.

Available at

allennlp-public-models/coref-spanbert-large-2020.02.27.tar.gz

glove.6B.300d

bert-base-uncased

References

1. Mani and Maybury M.T., Advances in Automatic Text Summarization, MIT Press; (1999).

2. Goldstein, Mittal V.O., Carbonell J.G. and Kantrowitz, Multidocument summarization by sentence extraction. In: NAACL-ANLP 2000 Workshop: Automatic Summarization; (2000).

3. Mamdani E.H. and Assilian, An experiment in linguistic synthesis with a fuzzy logic controller, International Journal of Man-Machine Studies 7(1) (1975), 1–13.

4. Mihalcea and Tarau, Textrank: Bringing order into text. In: Proceedings of the 2004 conference on empirical methods in natural language processing; (2004), 404–411.

5. Lin C.Y., Looking for a few good metrics: Automatic summarization evaluation-how many samples are enough? In: NTCIR; (2004).

6. Lee, He and Zettlemoyer, Higher-order coreference resolution with coarse-to-fine inference, arXiv preprint arXiv:1804.05392. (2018).

7. Luhn H.P., The automatic creation of literature abstracts, IBM Journal of Research and Development 2(2) (1958), 159–165.

8. Kiani-B and Akbarzadeh-T, Automatic text summarization using hybrid fuzzy ga-gp, In: 2006 IEEE International Conference on Fuzzy Systems. IEEE; (2006), 977–983.

9. Witte, Krestel and Bergler, Context-based multi-document summarization using fuzzy coreference cluster graphs, In: Proceedings of Document Understanding Workshop (DUC), New York City, NY, USA, June; (2006), 8–9.

10. Witte and Bergler, Fuzzy clustering for topic analysis and summarization of document collections, In: Conference of the Canadian Society for Computational Studies of Intelligence. Springer; (2007), 476–488.

11. Suanmali, Salim and Binwahlan M.S., Fuzzy logic based method for improving text summarization, arXiv preprint arXiv:0906.4690. (2009).

12. Kyoomarsi, Khosravi, Eslami and Davoudi, Extraction-based text summarization using fuzzy analysis, Iranian Journal of Fuzzy Systems 7(3) (2010), 15–32.

13. Megala S.S., Kavitha and Marimuthu, Enriching text summarization using fuzzy logic, International Journal of Computer Science and Information Technologies 5(1) (2014), 863–867.

14. Megala, Kavitha and Marimuthu, Text summarization system using fuzzy logic and conditional random field algorithm, Int J Comput Sci Inf Technol 1(5) (2015), 863–867.

15. Jafari, Wang, Qin, Gheisari, Shahabi and Tao, Automatic text summarization using fuzzy inference, In: 2016 22nd International Conference on Automation and Computing (ICAC); (2016), 256–260.

16. Abbasi-ghalehtaki, Khotanlou and Esmaeilpour, Fuzzy evolutionary cellular learning automata model for text summarization, Swarm and Evolutionary Computation 30 (2016), 11–26.

17. Patel, Shah and Chhinkaniwala, Fuzzy logic based multi document summarization with improved sentence scoring and redundancy removal technique, Expert Systems with Applications 134 (2019), 167–177.

18. Mihalcea, Graph-based ranking algorithms for sentence extraction, applied to text summarization, In: Proceedings of the ACL interactive poster and demonstration sessions; (2004), 170–173.

19. Gillick and Favre, A scalable global model for summarization. In: Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing; (2009), 10–18.

20. Pennington, Socher and Manning C.D., Glove: Global vectors for word representation, In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP); (2014), 1532–1543.

21. Mikolov, Sutskever, Chen, Corrado G.S. and Dean, Distributed representations of words and phrases and their compositionality, In: Advances in neural information processing systems; (2013), 3111–3119.

22. Reimers and Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, arXiv preprint arXiv:1908.10084. (2019).

23. Cer, Yang, Sy, Hua, Limtiaco, St John, et al., Universal Sentence Encoder for English, In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Brussels, Belgium: Association for Computational Linguistics; (2018), 169–174. Available from: https://aclanthology.org/D18-2029

24. Marelli, Menini, Baroni, Bentivogli, Bernardi, Zamparelli, et al., A SICK cure for the evaluation of compositional distributional semantic models. In: LREC. Reykjavik; (2014), 216–223.

25. Rose, Engel, Cramer and Cowley, Automatic keyword extraction from individual documents, Text Mining: Applications and Theory 1 (2010), 1–20.

26. Villa-Monte, Lanzarini, Corvi and Bariviera A.F., Document summarization using a structural metrics based representation, Journal of Intelligent & Fuzzy Systems 38(5) (2020), 5579–5588.

27. Mendoza G.A.M., Ledeneva and Garcia-Hernandez R.A., Determining the importance of sentence position for automatic text summarization, Journal of Intelligent & Fuzzy Systems 39(2) (2020), 2421–2431.

28. Brin and Page, The anatomy of a large-scale hypertextual web search engine, Computer Networks and ISDN Systems 30(1–7) (1998), 107–117.

29. Murtagh and Legendre, Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion? Journal of Classification 31(3) (2014), 274–295.

30. Steinberger and Ježek, Evaluation measures for text summarization, Computing and Informatics 28(2) (2009), 251–275.

31. Papineni, Roukos, Ward and Zhu W.J., BLEU: A Method for Automatic Evaluation of Machine Translation, In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. USA: Association for Computational Linguistics; (2002), 311–318. Available from: https://doi.org/10.3115/1073083.1073135

32. Wubben, van den Bosch and Krahmer, Paraphrase Generation as Monolingual Translation: Data and Evaluation. In: Proceedings of the 6th International Natural Language Generation Conference; (2010), 203–207. Available from: https://aclanthology.org/W10-4223

33. Graham, Re-evaluating Automatic Summarization with BLEU and 192 Shades of ROUGE. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics; (2015), 128–137. Available from: https://aclanthology.org/D15-1013

34. Banerjee and Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, In: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization; (2005), 65–72.

35. Miller G.A., WordNet: a lexical database for English, Communications of the ACM 38(11) (1995), 39–41.

36. Vedantam, Lawrence Zitnick and Parikh, Cider: Consensus-based image description evaluation, In: Proceedings of the IEEE conference on computer vision and pattern recognition; (2015), 4566–4575.

37. Forgues, Pineau, Larcheveque J.M. and Tremblay, Bootstrapping dialog systems with word embeddings, In: Nips, modern machine learning and natural language processing workshop 2 (2014).

38. Sharma, Asri L.E., Schulz and Zumer, Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation; (2017).

39. Zhang, Kishore, Wu, Weinberger K.Q. and Artzi, BERTScore: Evaluating Text Generation with BERT, In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, (2020). OpenReview.net; 2020. Available from: https://openreview.net/forum?id=SkeHuCVFDr

40.

, Lei