An unsupervised semantic sentence ranking scheme for text documents

Abstract

This paper presents Semantic SentenceRank (SSR), an unsupervised scheme for automatically ranking sentences in a single document according to their relative importance. In particular, SSR extracts essential words and phrases from a text document, and uses semantic measures to construct, respectively, a semantic phrase graph over phrases and words, and a semantic sentence graph over sentences. It applies two variants of article-structure-biased PageRank to score phrases and words on the first graph and sentences on the second graph. It then combines these scores to generate the final score for each sentence. Finally, SSR solves a multi-objective optimization problem for ranking sentences based on their final scores and topic diversity through semantic subtopic clustering. An implementation of SSR that runs in quadratic time is presented, and it outperforms, on the SummBank benchmarks, each individual judge’s ranking and compares favorably with the combined ranking of all judges.

Keywords

Sentence ranking, phrase-word embedding, word mover’s distance, semantic subtopic clustering, article-structure-biased PageRank

1. Introduction

Ranking sentences in a single document according to their relative importance plays a central role in various applications, including summary extraction from a given document (e.g., see [1]), structured-overview generation over a large corpus of documents [2], and layered reading for fast comprehension of a given document. The last application enables the reader to read a layer of the most important sentences first, then subsequent layers of the next most important sentences, until the entire document is read.

There are supervised and unsupervised methods for ranking sentences. Most unsupervised methods use easy-to-compute counting features, such as TF-IDF (term frequency-inverse document frequency) [3] and co-occurrences of words [4]. Such approaches are domain and language independent, and are often preferred over other approaches. Not using semantic information in the context, these methods may only produce suboptimal sentence ranking. Other unsupervised algorithms that use semantic information include Semantic Role Labelling [5], WordNet [6], and Named Entity Recognition [7]. These methods, unfortunately, impose a common limitation of language dependence. In other words, using these algorithms for a given language requires software tools to provide the underlying semantic information for the language, which may not be available.

Supervised methods require labeled data to train models. Modern supervised methods that learn feature representations automatically using a deep neural network model rely on a significant amount of labeled texts (e.g., see [8]). The lack of training data is a major obstacle when developing a supervised sentence-ranking algorithm. The SummBank dataset [9] for evaluating sentence-ranking algorithms, for example, contains only 200 news articles, which is far from sufficient to train a deep neural network model. Other larger datasets sufficient to train neural networks for summarization, such as CNN/DailyMail, are unsuitable for training sentence-ranking models, because they only provide a few human-written highlights as a summary for each document.

There are no datasets available at this time for other languages that are suitable for training sentence-ranking models. For example, for the Chinese language, the LSCTS [10] dataset consists of news articles and an average of 1 to 2 sentences written by human annotators as a summary for each article; and the NLPCC 2017 dataset also consists of news articles with a summary of up to 45 Chinese characters written by human annotators. These datasets cannot be used to train sentence-ranking models. The lack of training data for a particular language has hindered the adaptation of a good supervised model for one language to different languages.

These concerns suggest a direction of investigating unsupervised sentence-ranking algorithms using semantic features that can be computed readily for a given language. Word-embedding representations [11] and Word Mover’s Distance (WMD) [12], for example, are semantic features of this kind. A good word-embedding representation provides useful semantic and syntactic information. Requiring only a large amount of unlabeled texts, it is straightforward to compute word embedding representations for any language using unlabeled out-of-band data such as Wikipedia dumps of the underlying language. WMD uses word embedding representations to measure semantic distance between two sentences, which can be used to measure their semantic similarity.

This paper presents an unsupervised semantic sentence-ranking scheme called Semantic SentenceRank (SSR) using semantic features at the word, phrase, and sentence levels. SSR uses phrase and word embedding representations and co-occurrences to construct a semantic phrase-word graph, scores words and phrases using a variant of article-structure-biased PageRank, adjusts scores using Softplus elevation, and computes a normalized score for each sentence. SSR then constructs a semantic sentence graph using WMD, scores sentences using a variant of article-structure-biased PageRank, combines these sentence scores to generate the final sentence scores, computes semantic subtopic clustering of sentences, and ranks sentences by solving a multi-objective 0–1 knapsack problem that maximizes the final scores of selected sentences and the diversity of subtopic coverage.

The major contributions of this paper are as follows:

Flexibility: SSR is an unsupervised scheme using semantic features that can be computed readily for any language.

Efficiency: SSR runs in quadratic time when using AutoPhrase [13] to extract phrases, Affinity Propagation [14] to generate semantic subtopic clusters for sentences, and a greedy algorithm to approximate the multi-objective 0–1 knapsack problem.

Accuracy: Running on the DUC-02 dataset [15], the aforementioned implementation of SSR outperforms all previous algorithms under the ROUGE measures. More significantly, on the SummBank dataset [9], SSR outperforms each individual judge’s ranking and compares favorably with the combined sentence ranking of all judges.

The rest of the paper is organized as follows: Section 2 provides a brief overview of related work. SSR is presented in Section 3. Detailed descriptions of the major components of SSR are presented in Sections 4–7. Implementations and evaluation results are presented in Section 8. Conclusions and final remarks are presented in Section 9.

2. Related work

Extractive summarization algorithms that can specify the number of sentences in a summary can be used to rank sentences, and vice versa.

2.1 Supervised methods

Supervised summarization methods can be categorized into two categories: sentence labeling and sentence scoring.

Supervised sentence-labeling methods assign a binary label to each sentence $S$ to indicate whether the summary to be produced should include $S$ , where the number of sentences with a “yes” label is determined by the training data. These methods, while being extractive, cannot be used to rank sentences, for they have no control over the number of “yes” labels to be produced. Early sentence-labeling algorithms assign a label to each sentence independently, using models trained on handcrafted features [16] such as thematic words and uppercase words. A sequential Hidden Markov Model [17] was devised to account for local relations between sentences using three handcrafted features. Recently, deep recurrent-neural-network models [18, 8, 19] were used to derive a meaningful representation of a document and carry out sequence labeling based on labels of previous sentences. Trained with cross-entropy loss, however, these models are redundancy-prone and tend to generate verbose summaries [20]. Supervised sentence-labeling methods also have the following two downsides. First, labeled datasets that are large enough to train a model may be hard to come by. Second, labeled datasets required to train a model that exist for one language may not exist for a different language, making it difficult to adapt a model to different languages.

Supervised sentence-scoring methods depend on the underlying similarity measure of a sentence in a document to a benchmark summary. For example, CNN-W2V [21] is a model that computes a ROUGE score for each sentence in a document using the corresponding summaries in a labeled dataset as references, and uses such scores as the ground truth to train a convolutional-neural-network model to score sentences independently of input documents. This means that the same sentence appearing in different documents always has the same score. This is problematic, for the same sentence in different documents is unlikely to be equally important. A more reasonable approach is to score sentences via global optimization by taking previously scored sentences and their scores into consideration. Refresh [20], for example, is a recent model in this direction. It scores a sentence using previously scored sentences and the summaries of the underlying document in a labeled dataset, where sentence scores are used as a reward function in the model.

No matter what the underlying method is, it is necessary to have a large labeled dataset to train a supervised neural network model. This necessity remains a major obstacle, for such a dataset may not exist for a given language.

2.2 Unsupervised methods

Unsupervised summarization methods exploit relations between words, as related words “promote” each other. For example, TextRank [4] and LexRank [22] each model a document as a sentence graph based on word relations, but they use only syntactic features. PageRank [23] is used to score words.

Other methods incorporate additional information to achieve higher accuracy. UniformLink [24], for example, constructs a sentence graph on a set of similar documents, where a sentence is scored based on both the in-document score and the cross-document score. URank [25], on the other hand, uses a unified graph-based framework to study both single-document and multi-document summarization.

The quality of a summary may be improved using max-margin methods [26] or integer-linear programming (ILP) [27, 28]. Among the previous algorithms, $\textit{CP}_{3}$ [29] offers the highest ROUGE-1, ROUGE-2, and ROUGE-SU4 scores over DUC-02. It uses a bipartite graph to represent a document, and a different algorithm, Hyperlink-Induced Topic Search (HITS) [30], is used to score sentences. $\textit{CP}_{3}$ treats the summarization problem as an ILP problem, which maximizes the sentence importance, non-redundancy, and coherence simultaneously. However, since solving ILP is NP-hard, obtaining an exact solution to an ILP problem is intractable.

It is worth noting that a recent unsupervised method [31], while not related to the problem studied in this paper, may provide new ideas. On a collection of consumer reviews on a particular product, the method generates an abstractive sentence as the main review point, and ranks sentences by the number of descendants in a discourse tree rooted at the main review point. It would be interesting to investigate whether this method can be modified to rank sentences of a given document according to their relative importance.

Early unsupervised methods have two common downsides: (1) They do not promote diversity. This is because the importance of sentences is based only on sentence scores, and so high-scoring sentences representing the same subtopic may all be included in a summary, leaving no room to include sentences with lower scores but with different subtopics. (2) They do not use semantic features.

3. Semantic SentenceRank

To overcome the downsides of the existing methods (supervised or unsupervised), a better sentence-ranking algorithm should incorporate semantic features and topic diversity, and it should be unsupervised.

Figure 1.

Major components of (a) SSR and (b) SPR/SWR.

Let $D$ denote a document consisting of $n$ sentences indexed as $S_{1},S_{2},\ldots,S_{n}$ in the order they appear, each with a length $l_{i}$ , along with a maximum length capacity $L$ , where $l_{i}$ is the number of characters contained in $S_{i}$ . Let $F_{s}(S_{i})$ and $F_{d}(D)$ denote a semantic sentence scoring function and a diversity coverage measure, respectively. Then the semantic sentence-ranking problem is modeled as follows:

$\displaystyle\text{maximize }\sum_{i=1}^{n}F_{s}(S_{i})x_{i}\text{ and }F_{d}(D),$

$\displaystyle\text{subject to }\sum_{i=1}^{n}l_{i}x_{i}\leqslant L\text{ and }x_{i}\in\{0,1\},$

where $x_{i}$ is a 0–1 variable such that $x_{i}=1$ if sentence $S_{i}$ is selected, and 0 otherwise. By setting $L$ appropriately from small to large, one can obtain from solving the optimization problem the first sentence, then the second, then the third, and so on until all sentences are ranked. Unfortunately, this problem is NP-hard and so an approximation approach is needed.

SSR computes $F_{s}$ by combining salience scores at three levels: words, phrases, and sentences. In particular, it first constructs a semantic phrase-word graph (SPG) on phrases and words, and a semantic sentence graph (SSG) on sentences. It then computes $F_{d}$ using semantic subtopic clustering. Finally, SSR uses an approximation algorithm based on $F_{s}$ and $F_{d}$ to rank sentences. Figure 1a depicts the data flow diagram for the major components of SSR.

3.1 Sub-models

SSR contains two sub-models: one at the word level known as Semantic WordRank (SWR) [32], and one at the phrase-word level referred to as Semantic PhraseRank (SPR). In other words, SPR is SSR excluding the semantic sentence graph and ASB PageRank-2. Both SWR and SPR follow the same data flow diagram (see Fig. 1b), except that SWR does not consider phrase-level similarities. Both are faster than SSR, and perform well on selecting top-ranked sentences, which is sufficient for certain applications.

3.2 Phrase and word embedding

A searchable dataset of phrase and word embedding representations is calculated independently of SSR. Such an embedding dataset may be available for free download for some languages. If unavailable, it is straightforward to compute word and phrase embedding over an unlabeled Wikipedia dump using a standard method. To extract phrases, a linear-time unsupervised algorithm such as AutoPhrase [13] may be used. Recalculations of phrase and word embedding may be carried out once in a while on a larger Wikipedia dump.

3.3 Preprocessing

The preprocessing component computes, on a given document $D$ , the phrases contained in $D$ using a phrase extractor, and the set of essential words contained in $D$ that, excluding these phrases, pass a part-of-speech (POS) filter, a stop-word filter, and a stemmer for reducing inflected words to the word stem. It removes all non-essential words. In what follows, unless otherwise stated, when words are mentioned, they are essential words.

Descriptions of the remaining components of SSR are presented in Sections 4–7.

4. Semantic graph representations

4.1 Semantic word graph

In addition to considering co-occurrence between words as in TextRank [4], the semantic word graph (SWG) for the underlying document adds embedding similarity of words to enhance connectivity of the graph. Adding semantic similarity is vital for processing analytic languages, such as Chinese, which seldom use inflections. When stemming is not applicable, semantic similarity serves as an alternative to represent the relations between words with similar meanings, allowing them to share the importance when computing PageRank scores for these nodes.

Let $G=(V,E)$ be a weighted graph of words in document $D$ . Two words in $V$ are connected if either they co-occur within a window of $\Delta_{\text{SWG}}$ successive words in the document (e.g. $\Delta_{\text{SWG}}=2$ ), or the cosine similarity of their embedding representations exceeds a threshold value $\delta_{\text{SWG}}$ (e.g. $\delta_{\text{SWG}}=0.6$ ).

Let $u$ and $v$ be two adjacent nodes. For each edge $(u,v)$ , if only one type of connection exists, then treat the weight of the other type as 0. Assign the co-occurrence count of $u$ and $v$ as the initial weight to the co-occurrence connection and the cosine similarity value as the initial weight to the semantic connection. Normalize the initial weights of co-occurrence connections; namely, divide the initial co-occurrence weight by the total initial co-occurrence weight. Normalize the initial weights of semantic connections; namely, divide the initial semantic weight by the total initial semantic weight. Let $w_{c}(u,v)$ and $w_{s}(u,v)$ denote, respectively, the normalized weight for the co-occurrence connection and the semantic connection of $u$ and $v$ . Finally, assign $w(u,v)=w_{c}(u,v)+w_{s}(u,v)$ as the weight to the edge $(u,v)$ .
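The construction above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the paper: the function names `cosine` and `build_swg`, the toy three-dimensional embeddings, and the reading of "total initial weight" as the sum over all edges of the same type are assumptions made here for concreteness.

```python
import math
from collections import defaultdict
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_swg(words, emb, window=2, delta=0.6):
    """Return {frozenset({u, v}): w(u, v)} for a sequence of essential words.

    window plays the role of Delta_SWG and delta the role of delta_SWG.
    """
    # Co-occurrence connections: count pairs inside the sliding window.
    cooc = defaultdict(int)
    for i, u in enumerate(words):
        for v in words[i + 1:i + window]:
            if u != v:
                cooc[frozenset((u, v))] += 1
    # Semantic connections: cosine similarity above the threshold.
    sem = {}
    for u, v in combinations(sorted(set(words)), 2):
        s = cosine(emb[u], emb[v])
        if s > delta:
            sem[frozenset((u, v))] = s
    # Normalize each type by its total (assumed: total over all edges),
    # treat a missing connection type as weight 0, and add the two.
    total_c, total_s = sum(cooc.values()), sum(sem.values())
    edges = {}
    for e in set(cooc) | set(sem):
        w_c = cooc.get(e, 0) / total_c if total_c else 0.0
        w_s = sem.get(e, 0.0) / total_s if total_s else 0.0
        edges[e] = w_c + w_s
    return edges


# Toy data (hypothetical): "cat" and "feline" never co-occur but are
# close in embedding space, so they are linked by a semantic edge only.
words = ["cat", "sat", "mat", "feline", "sat"]
emb = {
    "cat": [1.0, 0.0, 0.0],
    "feline": [0.9, 0.1, 0.0],
    "sat": [0.0, 1.0, 0.0],
    "mat": [0.0, 0.0, 1.0],
}
edges = build_swg(words, emb)
```

In this toy document the semantic edge lets "cat" and "feline" share importance even though stemming cannot relate them, which is the role the semantic connection is meant to play.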

4.2 Semantic phrase graph

In a SWG, words in a phrase (e.g., names, scientific terms, and general entity names) would have high co-occurrence counts if the phrase appears multiple times in the document. The meaning of a word inside a phrase may be different from that outside the phrase. For example, in the sentence “There is an apple on top of her Apple computer”, the word “apple” appears both outside and inside the phrase “Apple computer”, with different meanings. Thus, a high-quality phrase extractor is desired when building a phrase graph.

Given a document $D$ , SSR applies a phrase extractor to segment phrases in $D$ . Let $P$ denote the set of phrases and $W$ the set of words in $D$ after phrases are removed. If a phrase $p\in P$ of the given document does not appear in the database of phrases, then remove $p$ from $P$ and add the words $w\in p$ to $W$ (note that the probability of this happening is small if the database of phrase embeddings is constructed using the same phrase extractor).

A semantic phrase-word graph (SPG) is a weighted graph $(V,E)$ with $V=P\cup W$ such that two nodes are connected if either they co-occur in a small sliding window of $\Delta_{\text{SPG}}$ consecutive words and phrases or the cosine similarity of their embedding representations is greater than a threshold value $\delta_{\text{SPG}}$ .

4.3 Semantic sentence graph

A semantic sentence graph (SSG) of a document $D$ is a weighted graph with sentences in $D$ being its nodes, where two sentences $S_{i}$ and $S_{j}$ are connected if either they contain a common word or phrase, or the WMD of $S_{i}$ and $S_{j}$ is below a certain value, which may be determined by how large a percentage of sentences should be connected. The weight of an edge $(S_{i},S_{j})$ is determined as follows:

Let $p_{ij}$ denote the number of phrases contained in both $S_{i}$ and $S_{j}$ . After removing common phrases, let $v_{ij}$ denote the number of words contained in both $S_{i}$ and $S_{j}$ . Let $|S_{i}|$ and $|S_{j}|$ denote, respectively, the number of words contained in $S_{i}$ and $S_{j}$ . Let

$\displaystyle w_{c}(i,j)=\frac{p_{ij}+v_{ij}}{\log_{10}|S_{i}|+\log_{10}|S_{j}|}.$

Define a different similarity measure of $S_{i}$ and $S_{j}$ by

$\displaystyle\text{sim}_{P}(S_{i},S_{j})=\frac{1}{1+\text{WMD}(S_{i},S_{j})}.$ (1)

Sort the similarity scores given by Eq. (1) in descending order. Select the sentence pairs whose similarity scores are in the top $\Gamma\%$ (e.g., $\Gamma=30$ ). Add these pairs to the graph as semantic edges, with weights being the corresponding similarity scores.

Normalize the co-occurrence edge weight $w_{c}(i,j)$ ; namely, divide $w_{c}(i,j)$ by $\sum_{i\neq j}w_{c}(i,j)$ . Normalize the semantic edge weight; namely, divide the similarity given by Eq. (1) by the total similarity weight of all semantic edges. Sum up the two normalized weights to be the final edge weight $w_{ij}$ .
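As a rough illustration of the edge-weight computation for the semantic sentence graph, the following Python sketch takes tokenized sentences and a precomputed WMD matrix. Several points are assumptions made for the sketch and are not taken from the paper: the function name `ssg_weights`, treating each phrase as a single token so that $p_{ij}+v_{ij}$ is the number of shared tokens, using the token count as $|S_{i}|$, skipping pairs whose denominator is 0, and rounding when selecting the top $\Gamma\%$ of pairs.

```python
import math
from itertools import combinations


def ssg_weights(sentences, wmd, gamma=30):
    """Return {(i, j): w_ij} with i < j for the semantic sentence graph.

    sentences: list of token lists (phrases assumed to be single tokens).
    wmd: symmetric matrix of Word Mover's Distances, assumed precomputed.
    """
    n = len(sentences)
    pairs = list(combinations(range(n), 2))

    # Co-occurrence weight w_c(i, j) for pairs sharing a word or phrase.
    w_c = {}
    for i, j in pairs:
        shared = len(set(sentences[i]) & set(sentences[j]))
        denom = math.log10(len(sentences[i])) + math.log10(len(sentences[j]))
        if shared and denom > 0:  # assumed guard for one-token sentences
            w_c[(i, j)] = shared / denom

    # Semantic weight from Eq. (1); keep only the top gamma% of pairs.
    sim = {(i, j): 1.0 / (1.0 + wmd[i][j]) for i, j in pairs}
    keep = max(1, round(len(pairs) * gamma / 100))
    top = sorted(pairs, key=lambda p: sim[p], reverse=True)[:keep]
    w_s = {p: sim[p] for p in top}

    # Normalize each type by its total and add the two normalized weights.
    total_c, total_s = sum(w_c.values()), sum(w_s.values())
    return {
        p: (w_c.get(p, 0.0) / total_c if total_c else 0.0)
        + (w_s.get(p, 0.0) / total_s if total_s else 0.0)
        for p in set(w_c) | set(w_s)
    }


# Toy data (hypothetical): two sentences about baking, two about markets.
sentences = [
    ["apple", "pie", "recipe"],
    ["apple", "tart", "recipe"],
    ["stock", "market", "news"],
    ["market", "crash", "report"],
]
wmd = [
    [0.0, 0.5, 3.0, 3.0],
    [0.5, 0.0, 3.0, 3.0],
    [3.0, 3.0, 0.0, 1.0],
    [3.0, 3.0, 1.0, 0.0],
]
weights = ssg_weights(sentences, wmd)
```

Because each weight type is normalized to sum to 1 before the two are added, neither the co-occurrence counts nor the WMD-based similarities dominate the final edge weights merely because of their scale.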

5. Sentence scoring

5.1 Article-structure-biased PageRank-1

Article structures define how information is presented. For example, the typical structure of news articles is an inverted pyramid [33], where critical information is presented at the beginning, followed by additional information with less important details. In academic writing, the structure of an article would look like an hourglass, which includes an additional conclusion piece at the end of the article. Thus, sentence locations in an article according to the underlying structure also play a role in ranking sentences. SWR uses a position-biased PageRank algorithm [34].

Directly applying PageRank, one can compute a score $W(v_{i})$ of a node $v_{i}\in G$ by iterating the following equation until converging:

$\displaystyle W(v_{i})=d\bigg(\sum_{v_{j}\in Adj(v_{i})}\frac{w_{ji}}{\sum_{v_{k}\in Adj(v_{j})}w_{jk}}W(v_{j})\bigg)+(1-d),$ (2)

where $Adj(v_{i})$ denotes the set of nodes adjacent to $v_{i}$ , $d\in(0,1)$ is a damping factor, and $w_{ji}$ is the weight of the edge between node $j$ and node $i$ . The value of $d$ is set to 0.85 as in the original PageRank paper [35] and the TextRank paper [4]. The intuition behind this equation is that the importance of a node $v_{i}$ is affected by the scores of its adjacent nodes and the probability of $1-d$ for jumping from a random node to node $v_{i}$ .

Equation (2) is an unbiased PageRank, where each word is assumed equally likely to start from. In article-structure-biased (ASB) PageRank, each word $v_{i}$ is biased with a probability $P(v_{i})$ according to the underlying article structure. For example, in the inverted pyramid structure, a higher probability is assigned to a word that appears closer to the beginning of the article.

Rank the importance of sentence locations from the most important to the least important based on the underlying article structure (Note: This is not the sentence ranking to be computed). Let $\text{LS}_{i}(w)$ denote the location score of $w\in S_{i}$ , where $S_{i}$ is the $i$ -th sentence.

The probability for node $v_{i}$ can now be computed by

$\displaystyle P(v_{i})=\frac{\sum_{k:v_{i}\in S_{k}}\text{LS}_{k}(v_{i})}{\sum_{j,k:v_{j}\in S_{k}}\text{LS}_{k}(v_{j})}.$

Note that the above computation is at the sentence level, which can be easily adapted to the word level by ranking words instead of sentences.
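The bias probability can be computed directly from the formula for $P(v_{i})$. The sketch below is illustrative only. The function name `location_bias` is hypothetical, and the default location score $\text{LS}_{k}(w)=1/k$ (the reciprocal of the sentence index, mimicking an inverted pyramid) is an assumption chosen for the example; any other location score derived from the article structure can be passed in instead.

```python
from collections import defaultdict


def location_bias(sentences, location_score=lambda k: 1.0 / k):
    """Return {word: P(word)} from sentence-level location scores.

    sentences: list of lists of essential words, in document order.
    location_score: maps the 1-based sentence index k to LS_k (assumed 1/k).
    """
    raw = defaultdict(float)
    for k, sentence in enumerate(sentences, start=1):
        # Each word collects the location score of every sentence it is in.
        for word in set(sentence):
            raw[word] += location_score(k)
    total = sum(raw.values())
    return {word: score / total for word, score in raw.items()}


# Toy document (hypothetical) with three short sentences.
sentences = [["storm", "coast"], ["storm", "damage"], ["repair", "damage"]]
bias = location_bias(sentences)
```

Here "storm" occurs in the first two sentences and therefore receives the largest bias, while "repair" occurs only in the last sentence and receives the smallest. The values sum to 1, so they can be used as a jump distribution.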

The ASB PageRank score $W^{\prime}(v_{i})$ for node $v_{i}$ is computed as follows:

$\displaystyle W^{\prime}(v_{i})=d\bigg(\sum_{v_{j}\in Adj(v_{i})}\frac{w_{ji}}{\sum_{v_{k}\in Adj(v_{j})}w_{jk}}W^{\prime}(v_{j})\bigg)+(1-d)P(v_{i}).$ (3)

Computation starts with an arbitrary initial value for each node, and iterates the computation of Eq. (3) until it converges.
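The iteration of Eq. (3) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `asb_pagerank`, the nested-dictionary graph representation, the initial value of 1.0 for every node, and the stopping rule (largest change below `tol`) are choices made here and are not prescribed by the text.

```python
def asb_pagerank(weights, bias, d=0.85, tol=1e-8, max_iter=200):
    """Iterate Eq. (3) on an undirected weighted graph.

    weights: {u: {v: w_uv}} with w_uv == w_vu (symmetric adjacency).
    bias: {u: P(u)}, the article-structure bias of each node.
    """
    # Total edge weight at each node: the denominator inside the sum.
    out_sum = {u: sum(nbrs.values()) for u, nbrs in weights.items()}
    score = {u: 1.0 for u in weights}  # arbitrary initial value
    for _ in range(max_iter):
        new = {}
        for i, nbrs in weights.items():
            spread = sum(
                w_ji / out_sum[j] * score[j]
                for j, w_ji in nbrs.items()
                if out_sum[j] > 0
            )
            new[i] = d * spread + (1 - d) * bias.get(i, 0.0)
        delta = max(abs(new[u] - score[u]) for u in weights)
        score = new
        if delta < tol:
            break
    return score


# Toy path graph a - b - c (hypothetical).
graph = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}}
uniform = asb_pagerank(graph, {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3})
biased = asb_pagerank(graph, {"a": 0.6, "b": 0.2, "c": 0.2})
```

With a uniform bias the two end nodes receive the same score. Shifting the bias toward "a" raises its score above that of "c" although both have the same neighborhood, which is the effect the article-structure bias is intended to produce.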

$W^{\prime}(v_{i})$ , referred to as the salient score of $v_{i}$ , represents the importance of $v_{i}$ relative to the other words in the document.

5.2 Softplus adjustment

Let $S$ be a sentence. To score $S$ , one may simply sum up the salient score of each word contained in $S$ and normalize it by $|S|$ (the number of essential words contained in $S$ ). Normalization ensures that longer sentences and shorter sentences are comparable (otherwise, longer sentences may have larger scores just because they have more words). Namely, let

$\displaystyle\text{sal}(S)=\frac{1}{|S|}\sum_{v_{i}\in S}W^{\prime}(v_{i}).$

This way of scoring, however, has a drawback. To see this, suppose that $S_{1}$ and $S_{2}$ are two sentences that contain about the same number of words and have similar scores under this method. Suppose further that the word scores in $S_{1}$ follow the Pareto Principle, namely, a few words have very high scores and the rest have very low scores close to 0, while the word scores in $S_{2}$ are roughly uniform and much smaller than the few high scores in $S_{1}$ . Then the few words in $S_{1}$ with very high scores would make $S_{1}$ appear more important than $S_{2}$ . Using direct averaging of salient word scores, however, it is possible to end up with the opposite outcome.

Using the Softplus function $sp(x)=\ln(1+e^{x})$ helps overcome this drawback [36]. Commonly used as an activation function in neural networks, $sp(x)$ offers a significant elevation of $x$ when $x$ is a small positive number. If $x$ is large, then $sp(x)\approx x$ .

Apply the Softplus function to each word, and sum up the elevated values to be the salient score of $S$ , denoted by $\text{sal}_{sp}(S)$ . Namely,

$\displaystyle\text{sal}_{sp}(S)=\frac{1}{|S|}\sum_{v_{i}\in S}\ln\big(1+e^{W^{\prime}(v_{i})}\big).$ (4)

To illustrate this using a numerical example, assume that $S_{1}$ and $S_{2}$ each consists of 5 words, with original scores ( $W^{\prime}$ ) and Softplus scores ( $sp^{\prime}=sp\circ W^{\prime}$ ) given in the following table (Table 1).

Table 1

Numerical examples with $W^{\prime}$ and $sp^{\prime}$ scores

$S_{1}$	$v_{11}$	$v_{12}$	$v_{13}$	$v_{14}$	$v_{15}$	Sal
$W^{\prime}$	2.60	2.20	2.10	0.30	0.20	1.480
$sp^{\prime}$	2.67	2.31	2.22	0.85	0.80	1.768
$S_{2}$	$v_{21}$	$v_{22}$	$v_{23}$	$v_{24}$	$v_{25}$	Sal
$W^{\prime}$	1.60	1.50	1.50	1.50	1.40	1.500
$sp^{\prime}$	1.78	1.70	1.70	1.70	1.62	1.702

Sentence $S_{1}$ is more important than $S_{2}$ because it contains three words of much higher $W^{\prime}$ -scores than those of $S_{2}$ . However, $\text{sal}(S_{1})=1.48<\text{sal}(S_{2})=1.5$ and so $S_{2}$ will be selected. After using Softplus, $\text{sal}_{sp}(S_{1})=1.768>\text{sal}_{sp}(S_{2})=1.702$ , and so $S_{1}$ is selected as it should be. Experiments (see Section 8.7) indicate that using the Softplus elevation does improve ranking accuracy in practice.
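The numbers in Table 1 can be reproduced with a short Python sketch that implements the plain average and Eq. (4). The function names are chosen here for illustration; the word scores are those listed in the table.

```python
import math


def softplus(x):
    """sp(x) = ln(1 + e^x)."""
    return math.log1p(math.exp(x))


def sal(word_scores):
    """Plain normalized sum of the salient word scores."""
    return sum(word_scores) / len(word_scores)


def sal_sp(word_scores):
    """Softplus-adjusted sentence score, following Eq. (4)."""
    return sum(softplus(x) for x in word_scores) / len(word_scores)


# Word scores W' of the two sentences in Table 1.
s1 = [2.60, 2.20, 2.10, 0.30, 0.20]
s2 = [1.60, 1.50, 1.50, 1.50, 1.40]

print(f"sal:    S1 = {sal(s1):.3f}, S2 = {sal(s2):.3f}")
print(f"sal_sp: S1 = {sal_sp(s1):.3f}, S2 = {sal_sp(s2):.3f}")
```

The plain average ranks $S_{2}$ above $S_{1}$, whereas the Softplus-adjusted score reverses the order. The computed values agree with Table 1 up to rounding in the last digit.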

5.3 Article-structure-biased PageRank-2

For the semantic sentence graph, SSR uses a modified ASB PageRank algorithm to score sentences, because ASB PageRank-1, which is suitable for words, may not be suitable for sentences. To see this, assume that a document has the inverted pyramid structure. Using the reciprocal of the location index of a sentence as its location score would then put too much weight on the first few sentences and too little weight on the subsequent sentences. Clearly, this Pareto phenomenon is not practical. Instead, let $\text{LS}(S_{i})$ denote the location score of sentence $S_{i}$ (recall that the subscript $i$ is the location index of the sentence). Normalize $\text{LS}(S_{i})$ to obtain $P(S_{i})$ ; the modified ASB PageRank algorithm then scores $S_{i}$ analogously to Eq. (3) as follows:

$\displaystyle W^{\prime}(S_{i})=d\bigg(\sum_{S_{j}\in Adj(S_{i})}\frac{w_{ji}}{\sum_{S_{k}\in Adj(S_{j})}w_{jk}}W^{\prime}(S_{j})\bigg)+(1-d)P(S_{i}).$

5.4 Combined sentence scoring

The sentence scoring function $F_{s}$ is defined as follows: For any given sentence $S$ contained in $D$ ,

$\displaystyle F_{s}(S)=\frac{1}{2}(\text{sal}_{sp}(S)+W^{\prime}(S)).$ (5)

6. Semantic subtopic clustering

Selecting a sentence based only on sentence scores would result in a poor diversity of topic coverage, as multiple sentences of the same subtopic could have higher scores than sentences of different subtopics. To avoid this drawback, a sentence subtopic clustering method is needed.

Clustering algorithms depend on a chosen similarity measure for the underlying objects to be clustered. Clustering may be carried out based on thematic similarity measures or semantic similarity measures. Thematic clustering groups sentences of the same context into the same cluster. For example, under thematic similarity measures, sentences that contain the following words may be grouped into the same cluster: frog, pond, green, or tree [37]. TextTiling [38], for example, is a thematic clustering algorithm. First used in text summarization [36], TextTiling groups several consecutive paragraphs into the same cluster by finding thematic shifts between consecutive paragraphs. Unfortunately, it often fails to generate multiple clusters on short articles or when there are no clear thematic shifts between consecutive paragraphs.

Clustering methods based on TF-IDF over the BOW (bag-of-words) representations are in general unsuitable for measuring document distances or similarities due to frequent near-orthogonality [12, 39].

Semantic clustering, on the other hand, groups sentences that have similar meanings or convey similar information into the same cluster. For example, under semantic similarity measures, sentences about eyes, noses, and ears may be grouped into the same cluster. Semantic clustering is based on semantic measures between sentences. WMD, in particular, can be used to define semantic measures.

Efficiency, accuracy, and ease of implementation are the criteria for choosing a clustering algorithm. When choosing a clustering algorithm to implement SSR, it is imperative to choose one that can also be easily modified to use semantic measures. Moreover, the complexity of the algorithm should not exceed quadratic time, as higher complexity may make interactive applications (such as the layered-reading tool at http://www.dooyeed.com) unacceptably slow in practice. Note that pairwise comparisons of objects alone already impose a quadratic-time lower bound on a clustering algorithm.

Spectral clustering [40] and affinity propagation [14] both operate on pairwise similarities between data points. The underlying similarity measure, commonly based on Euclidean distance, can easily be replaced with a semantic measure such as WMD, and both algorithms can be carried out in quadratic time. Any clustering method that possesses these properties may also serve as a candidate. Other context-aware similarity measures such as the one described in [41] may be explored. The topic diversity function $F_{d}$ can be represented using such a semantic subtopic clustering algorithm.

6.1 Semantic spectral clustering

Spectral clustering [40] uses eigenvalues of a similarity matrix (also known as an affinity matrix) to reduce dimension before clustering a given set of data points into $K$ clusters, where $K$ is a preset positive integer. Spectral clustering can handle data points that do not satisfy convexity.

Treat each sentence as a data point. Let WMD $(S_{i},S_{j})$ denote the Word Mover’s Distance between two sentences $S_{i}$ and $S_{j}$ . The similarity matrix is an $n\times n$ matrix, where $n$ is the total number of sentences in a document, and the entry $s_{ij}$ of the matrix corresponds to a similarity measure between two sentences $S_{i}$ and $S_{j}$ defined in Eq. (6). Under the WMD metric, a smaller value between two sentences means that they are more similar, while a larger value means that they are less similar. This can be transformed to a similarity metric using the RBF kernel as follows:

$\displaystyle\text{sim}_{G}(S_{i},S_{j})=e^{-\gamma\cdot\text{WMD}(S_{i},S_{j})^{2}},$ (6)

where $\gamma$ may be set to 1. Spectral clustering then uses k-means to generate clusters over the eigenvectors corresponding to the $K$ smallest eigenvalues.

The number of clusters $K$ is related to $n$ . Empirical studies suggest that 30% of the original text size would be the best size for a summary to contain almost all significant points contained in a single document. In other words, extracting about $0.3n$ sentences appropriately would cover almost all key points in the original document. On the other hand, to avoid having too many clusters that could deteriorate performance, it is necessary to set an upper bound $C$ . For typical news articles, for example, an upper bound $C=8$ would be appropriate. Thus, let

$\displaystyle K=\min\{\lfloor 0.3n\rfloor,C\}.$
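The two inputs to semantic spectral clustering, the similarity matrix of Eq. (6) and the number of clusters $K$, can be computed as in the sketch below. The function names and the toy distance matrix are hypothetical, and the WMD values are assumed to be precomputed. The eigen-decomposition and k-means steps are not shown.

```python
import math


def rbf_similarity(wmd, gamma=1.0):
    """Similarity matrix of Eq. (6): s_ij = exp(-gamma * WMD(S_i, S_j)^2)."""
    n = len(wmd)
    return [
        [math.exp(-gamma * wmd[i][j] ** 2) for j in range(n)]
        for i in range(n)
    ]


def num_clusters(n, cap=8):
    """K = min(floor(0.3 * n), C), with C = cap.

    Integer arithmetic (3 * n // 10) avoids floating-point rounding
    when taking the floor. Note that K is 0 for n < 4, a case that
    the formula itself does not address.
    """
    return min(3 * n // 10, cap)


# Toy WMD matrix for two sentences (hypothetical values).
similarity = rbf_similarity([[0.0, 1.0], [1.0, 0.0]])
```

For example, a 10-sentence article yields $K=3$, while a 40-sentence article is capped at $K=8$ when $C=8$.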

Let $n_{i}$ denote the number of non-repeated essential words contained in sentence $S_{i}$ , which is bounded above by a constant $M$ (e.g., $M=30$ ; if on a rare occasion $n_{i}$ exceeds this bound, one can split the sentence into natural clauses). The time complexity of computing $\text{WMD}(S_{i},S_{j})$ is $O(M^{3})=O(1)$ . Hence, $\text{sim}_{G}(S_{i},S_{j})$ defined in Eq. (6) and $\text{sim}_{P}(S_{i},S_{j})$ defined in Eq. (1) can both be computed in $O(1)$ time.

Thus, computing the similarity matrix incurs $O(n^{2})$ time. Using the implicitly restarted Lanczos method [42], finding $K$ extremal (largest or smallest) eigenvalues and the corresponding eigenvectors of an $n\times n$ symmetric real matrix can be done in $O(Kn^{2})$ time. There are a number of heuristic algorithms to approximate k-means that run in $O(Kn^{2})$ time [43]. Since $K$ is bounded above by the constant $C$, semantic spectral clustering can be carried out in quadratic time.

6.2 Semantic affinity propagation

Recall that spectral clustering must fix the number of clusters before clustering. This could be problematic in practice. Affinity propagation (AP) clustering [14] overcomes this problem. It is an exemplar-based clustering algorithm like k-means [44] and k-medoids [45], except that AP does not need to preset the number of clusters.

Let $S_{1},S_{2},\ldots,S_{n}$ be the sentences to be clustered under the similarity measure $\text{sim}_{P}(S_{i},S_{j})$ defined in Eq. (1), which runs in constant time. Each sentence $S_{i}$ is a potential exemplar and let $\text{sim}_{P}(S_{i},S_{i})=m$, where $m$ is the median of $\text{sim}_{P}(S_{i},S_{j})$ for all $i,j\in\{1,2,\ldots,n\}$ with $i\not=j$. AP proceeds by updating two $n\times n$ matrices $\bm{R}=(r_{ij})$ (the responsibility matrix) and $\bm{A}=(a_{ij})$ (the availability matrix) as follows until they converge for all $i$ and $j$:

Initially, set $r_{ij}\leftarrow 0$ and $a_{ij}\leftarrow 0$ .

Set $r_{ij}\leftarrow\text{sim}_{P}(S_{i},S_{j})-b_{ij}$ , where

$\displaystyle b_{ij}=\max_{j^{\prime}\not=j}\{\text{sim}_{P}(S_{i},S_{j^{\prime}})+a_{ij^{\prime}}\}.$

If $i\not=j$ , then set

$\displaystyle a_{ij}\leftarrow\min\Big\{0,\;r_{jj}+\sum_{i^{\prime}\not\in\{i,j\}}\max\{0,r_{i^{\prime}j}\}\Big\}.$

Otherwise, set

$\displaystyle a_{ij}\leftarrow\sum_{i^{\prime}\not=j}\max\{0,r_{i^{\prime}j}\}.$

If $r_{ii}+a_{ii}>0$, then $S_{i}$ is selected as an exemplar. A sentence $S_{j}$ that is not an exemplar belongs to the cluster of the exemplar $S_{i}$ with which it has the largest similarity among all exemplars. Semantic AP runs in $O(n^{2})$ time.
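The update rules above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the SSR implementation: it uses the standard form of the availability update, in which the sum sits inside the $\min$; it adds a damping factor and a fixed number of iterations (both common in practice but not mentioned in the text); and it substitutes negative squared distances between points on a line for $\text{sim}_{P}$. All names are hypothetical.

```python
from statistics import median

def affinity_propagation(sim, damping=0.5, iterations=200):
    """Cluster items given an n x n similarity matrix (n >= 2)."""
    n = len(sim)
    # Self-similarity: median of the off-diagonal similarities, as in the text.
    pref = median(sim[i][j] for i in range(n) for j in range(n) if i != j)
    s = [[pref if i == j else sim[i][j] for j in range(n)] for i in range(n)]
    r = [[0.0] * n for _ in range(n)]  # responsibilities
    a = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iterations):
        # Responsibility update: r_ij <- s_ij - max_{j' != j} (s_ij' + a_ij').
        for i in range(n):
            row = [a[i][k] + s[i][k] for k in range(n)]
            best = max(range(n), key=row.__getitem__)
            second = max(row[k] for k in range(n) if k != best)
            for k in range(n):
                b = second if k == best else row[best]
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - b)
        # Availability update (damping is an assumption, used to avoid oscillation).
        for k in range(n):
            pos = sum(max(0.0, r[i][k]) for i in range(n) if i != k)
            for i in range(n):
                if i == k:
                    new = pos
                else:
                    new = min(0.0, r[k][k] + pos - max(0.0, r[i][k]))
                a[i][k] = damping * a[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
    if not exemplars:  # no exemplar emerged within the iteration budget
        return [], [-1] * n
    labels = [i if i in exemplars else max(exemplars, key=lambda k: s[i][k])
              for i in range(n)]
    return exemplars, labels

# Toy stand-in for sim_P: negative squared distance between points on a line.
points = [0.0, 1.0, 2.5, 8.0, 9.0, 10.5]
sim = [[-(p - q) ** 2 for q in points] for p in points]
exemplars, labels = affinity_propagation(sim)
```

Each iteration touches every matrix entry a constant number of times, so with a bounded number of iterations the sketch stays within the quadratic bound stated above.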

7. Sentence selection and ranking

An approximation algorithm is needed to cope with the NP-hardness of the multi-objective 0–1 knapsack problem. There are various techniques for tackling multi-objective optimization problems such as those described in [46, 47, 48, 49, 50, 51]. Other recent optimization techniques for solving integrated computer-aided engineering problems include [52, 53, 54, 55]. A good approximation algorithm should balance accuracy and efficiency. The following greedy approximation in a round-robin style is used in the current implementation of SSR.

7.1 Round-robin selection

Each sentence $S_{i}$ is now associated with four values: (1) sentence index $i$ , (2) salient score $F_{s}(S_{i})$ computed by Eq. (5), (3) sentence length $l_{i}$ , and (4) cluster index $j$ of the cluster $S_{i}$ belongs to. Select sentences greedily in a round robin fashion and rank them as follows:

Let ${\cal S}$ denote the set of selected sentences. Initially, ${\cal S}\leftarrow\emptyset$ .

For each sentence $S_{i}$, compute the value per unit length to obtain a unit score $s^{\prime}_{i}=s_{i}/l_{i}$, where $s_{i}=F_{s}(S_{i})$ is the salient score of $S_{i}$.

For each cluster $c_{j}$ , sort the sentences contained in it in descending order according to their unit scores.

While there are still sentences that have not been selected, do the following:

Sort the remaining clusters in descending order according to the highest unit score contained in a cluster. For example, if the highest unit score in cluster $c_{i}$ is smaller than the highest unit score in cluster $c_{j}$ , then $c_{j}$ comes before $c_{i}$ in the sorted clusters.

Select the sentence from the remaining sentences with the highest unit score, one from each cluster in the order of sorted clusters, and add it to $\cal S$ . That is,

$\displaystyle{\cal S}\leftarrow{\cal S}\cup\{S_{i_{1}},S_{i_{2}},\ldots,S_{i_{k}}\},$

where $S_{i_{1}},\ldots,S_{i_{k}}$ are the selected sentences and $k$ is the number of remaining clusters that are nonempty.

Remove the selected sentences from their corresponding clusters.

Rank sentences according to the order they are selected.
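The round-robin procedure above can be sketched as follows. The tuple layout `(index, score, length, cluster)`, the function name, and the toy values are illustrative assumptions; ties are broken by the original order of the input.

```python
def round_robin_rank(sentences):
    """Rank sentences given tuples (index, salient score, length, cluster id)."""
    clusters = {}
    for index, score, length, cluster in sentences:
        unit = score / length  # unit score: value per unit length
        clusters.setdefault(cluster, []).append((unit, index))
    # Within each cluster, sort by unit score in descending order.
    for members in clusters.values():
        members.sort(key=lambda m: -m[0])
    ranking = []
    while clusters:
        # Order the remaining clusters by their current highest unit score.
        order = sorted(clusters, key=lambda c: -clusters[c][0][0])
        for c in order:
            _, index = clusters[c].pop(0)  # best remaining sentence of cluster c
            ranking.append(index)
            if not clusters[c]:
                del clusters[c]  # drop clusters that have become empty
    return ranking

# Hypothetical input: (index, score, length, cluster).
toy = [(1, 0.9, 30, 0), (2, 0.8, 20, 1), (3, 0.5, 25, 0),
       (4, 0.6, 10, 1), (5, 0.2, 20, 2)]
ranking = round_robin_rank(toy)  # [4, 1, 5, 2, 3]
```

In the first round the clusters are visited in the order 1, 0, 2 (best unit scores 0.06, 0.03, 0.01), which selects sentences 4, 1, and 5; the second round selects sentences 2 and 3 from the two clusters that remain.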

8. Implementations and evaluations

8.1 Embedding database and parameters

The word-embedding database uses pre-computed word embedding representations with subword information [56], which can handle out-of-vocabulary words and generate better word embeddings for rare words.

The phrase-embedding database uses AutoPhrase [13] on an English Wikipedia dump and the CNN/DailyMail dataset to extract phrases and words. fastText [56] is then used to compute embedding representations for these phrases and words with respect to the same datasets. AutoPhrase is an unsupervised phrase extractor that supports any language as long as a general knowledge base and a pre-trained POS-tagger (recall that POS stands for part-of-speech) in that language are available. Wikipedia is typically used as a general knowledge base and pre-trained POS-taggers are widely available for different languages.

The sliding-window size for computing co-occurrence of words for SWG is set to $\Delta_{\text{SWG}}=2$, and the sliding-window size for computing co-occurrence of words for SPG is set to $\Delta_{\text{SPG}}=3$. As noted for the construction of the TextRank word graph on co-occurrence [4]: “A larger window does not seem to help – on the contrary, the larger the window, the lower the precision, probably explained by the fact that a relation between words that are further apart is not strong enough to define a connection in the text graph.” Setting a window size of 2 to capture co-occurrence for words was recommended. Because most phrases consist of two words, the window size to capture co-occurrence of phrases and words should be just larger than 2; hence setting the window size to 3 for SPG is reasonable. Note that setting window sizes slightly larger may slightly degrade the precision.
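A minimal sketch of window-based co-occurrence, assuming that a window of size $\Delta$ covers $\Delta$ consecutive essential words, so that two distinct words co-occur when their positions differ by less than $\Delta$. This reading of the window, the function name `cooccurrence_edges`, and the token list are assumptions made for illustration.

```python
def cooccurrence_edges(tokens, window):
    """Return undirected co-occurrence edges between distinct tokens."""
    edges = set()
    for i, w in enumerate(tokens):
        # Pair w with the tokens that follow it inside the same window.
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[j] != w:
                edges.add(frozenset((w, tokens[j])))
    return edges

tokens = ["semantic", "sentence", "ranking", "scheme"]
swg_edges = cooccurrence_edges(tokens, window=2)  # adjacent words only
spg_edges = cooccurrence_edges(tokens, window=3)  # also words one position apart
```

With a window of 2 only neighbouring words are linked, while a window of 3 adds links that skip one word, which is why a larger window yields a denser graph.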

A cosine similarity value of word or phrase embeddings that is larger than 0.6 is deemed sufficient to indicate that two words or phrases are semantically similar, which ensures that the underlying semantic graph is sufficiently connected but not too dense. Because there are more words than phrases, setting the threshold value of semantic similarity to 0.65 for words and to 0.6 for phrases is reasonable. Note that setting the threshold values slightly larger will only slightly affect the precision.
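The thresholding described above can be sketched as follows, assuming embeddings are plain lists of floats. The two-dimensional vectors and the function names are illustrative; real embeddings come from the embedding database and have many more dimensions.

```python
import math

WORD_THRESHOLD = 0.65    # threshold for words, from the text
PHRASE_THRESHOLD = 0.6   # threshold for phrases, from the text

def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def has_semantic_edge(u, v, threshold):
    """Add a semantic edge only when the similarity exceeds the threshold."""
    return cosine(u, v) > threshold

# Toy embeddings (illustrative only).
close = has_semantic_edge([1.0, 0.0], [0.8, 0.6], WORD_THRESHOLD)  # cosine is about 0.8
far = has_semantic_edge([1.0, 0.0], [0.3, 0.9], WORD_THRESHOLD)    # cosine is about 0.32
```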

For the semantic sentence graph, setting the percentage $\Gamma\%=30\%$ provides sufficient connectivity.

8.2 Implementations and complexity analysis

The implementation of SPG uses AutoPhrase to extract phrases. A straightforward table lookup of the embedding database provides the embedding representations of words and phrases needed to construct SPG.

Since the datasets used to evaluate sentence-ranking algorithms consist of news articles, which in general have the inverted pyramid structure, ASB PageRank-1 is implemented with the following location score for word $w$ in the $i$-th sentence $S_{i}$:

$\displaystyle\text{LS}(w|S_{i})=\frac{1}{i}.$

Likewise, to implement ASB PageRank-2, the following location score for sentence $S_{i}$ is used:

$\displaystyle\text{LS}(S_{i})=\frac{1}{\log_{10}(1+i)}.$

Note that location scores may be defined using different functions. For example, the structure of an academic research paper may in general have the hourglass structure.5
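The two location scores for news articles can be written directly as functions of the 1-based sentence position $i$. The function names are illustrative.

```python
import math

def word_location_score(i):
    """LS(w | S_i) = 1 / i for a word in the i-th sentence (i >= 1)."""
    return 1.0 / i

def sentence_location_score(i):
    """LS(S_i) = 1 / log10(1 + i) for the i-th sentence (i >= 1)."""
    return 1.0 / math.log10(1 + i)

# Scores for the first five sentence positions.
word_scores = [word_location_score(i) for i in range(1, 6)]
sentence_scores = [sentence_location_score(i) for i in range(1, 6)]
```

Both functions decrease with $i$, which favours content near the top of the article; the logarithm makes the sentence score decay more slowly than the word score.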

Finally, semantic spectral clustering is used to implement SWR and SPR, while semantic affinity propagation is used to implement SSR.

Under the said implementations, SWR, SPR, and SSR on a given document $D$ all run in $O(|D|^{2})$ time, where $|D|$ is the number of essential words contained in $D$ . This can be shown as follows:

The preprocessing of extracting phrases using AutoPhrase can be done in $O(|D|)$ time, as can extracting words.

The embedding database may be implemented as a dictionary, and so looking up a word $w$ takes $O(1)$ time. Looking up a phrase is similar. Thus, retrieving embedding representations of all $m$ different words and phrases contained in $D$ takes $O(m)$ time. Constructing an SPG takes $O(m^{2})$ time (constructing an SWG is similar), and constructing an SSG takes $O(n^{2})$ time, where $n$ is the number of sentences. Both $m$ and $n$ are less than $|D|$.

The running time of both ASB PageRank algorithms is $O(\ell|D|^{2})$, where $\ell$ is the number of iterations, which tends to be small in practice, and so can be considered a constant. (One may also fix a reasonable number of iterations and use whatever values are returned at the end of the last iteration. This is sufficient in practice.)

The Softplus adjustment can be done in $O(|D|)$ time.

Semantic spectral clustering runs in $O(|D|^{2})$ time (see Section 6.1).

Semantic affinity propagation runs in $O(n^{2})$ time (see Section 6.2), which is within $O(|D|^{2})$ because $n<|D|$.

To shorten the running time, a linear-time relaxed version of Word Mover’s Distance [57] is used in the experiments and evaluations.

8.3 Datasets for evaluation

Most datasets for evaluating summarization algorithms consist of one or more human-written summaries for each text document. These summaries either have a fixed number of words or a fixed number of sentences. For example, each summary in DUC-02 consists of about 100 words or less. Some datasets, such as CNN/Daily Mail, may only contain summaries of one to three human-written sentences.

To evaluate a sentence-ranking algorithm using such datasets, one may use an appropriate number of sentences of the highest rank it produces to match the size of the underlying summary and compare their ROUGE scores. This approach, however, cannot be used to establish the accuracy of sentence ranking for all sentences in a document.

It is customary to use the DUC-02 benchmarks to evaluate the effect of summarization algorithms, including extractive summarization, even though DUC-02 benchmarks are abstractive summaries. In particular, DUC-02 contains a total of 567 news articles with an average of 25 sentences per document. Each article has at least two abstractive summaries written by human annotators, and each summary consists of at most 100 words.

The SummBank dataset [9] is the best dataset there is at this time for evaluating sentence-ranking algorithms. It provides benchmarks produced by three human judges. The judges annotated 200 news articles written in English with an average of 20 sentences per document, resulting in, for each article, three sets of sentence rankings, one by each judge. Ranking scores of sentences can be derived from their rankings. In addition, SummBank also provides, for each article, a combined sentence ranking of all judges, where the combined ranking score of each sentence is the average of the ranking scores of all three judges.

8.4 The ROUGE measures

ROUGE [58] is a widely used metric to evaluate the accuracy of summaries. ROUGE-n is an n-gram recall between the automatic summary and a set of references, while ROUGE-SU4 evaluates an algorithm-generated summary using skip-bigram and unigram co-occurrence statistics, allowing at most four intervening unigrams when forming skip-bigrams.
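To make the n-gram recall concrete, the following is a simplified sketch of ROUGE-n against a single reference, with whitespace tokenization and clipped n-gram counts. It is not the official ROUGE toolkit and omits its options; the function names and example sentences are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also occur in the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match by the number of times the n-gram occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
r1 = rouge_n_recall(candidate, reference, 1)  # 5 of 6 reference unigrams matched
r2 = rouge_n_recall(candidate, reference, 2)  # 3 of 5 reference bigrams matched
```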

An ultimate objective for any machine ranking method would be to achieve the highest possible ROUGE measures against the rankings of all judges, which serve as references. In particular, in the SummBank dataset, if the corresponding mean ROUGE scores of a machine ranking and the combined ranking of all judges against the three reference rankings are comparable, then the machine ranking is deemed as good as the combined wisdom of all three human judges.

8.5 Comparisons on DUC-02

On the DUC-02 dataset, SWR, SPR, and SSR each extract sentences of the highest ranks with a total length bounded by 100 words. The results under ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-SU4 (R-SU4) against the DUC-02 benchmarks are shown in Table 2. The corresponding scores published in $\textit{CP}_{3}$ [29] and a number of major algorithms before it are also presented. In the table, the highest scores are shown in italic, and UA means that the value is unavailable in the corresponding publication. Note that $\textit{CP}_{3}$ outperforms all algorithms before it under these ROUGE measures.

Table 2
Comparison results (%) on DUC-02

Methods	R-1	R-2	R-SU4
SSR	49.3	25.1	26.5
SPR	49.2	25.0	26.3
SWR	49.2	24.7	26.1
$\textit{CP}_{3}$	49.0	24.7	25.8
CNN-W2V	48.6	22.0	UA
$E_{\text{Coh.}}$	48.5	23.0	25.3
URank	48.5	21.5	UA
$T_{\text{Coh.}}$	48.1	24.3	24.2
TextRank	47.1	19.5	21.7
ULink ( $k=$ 10)	47.1	20.1	UA

The following results can be seen against the DUC-02 benchmarks:

SSR outperforms SPR under every measure.

SPR outperforms SWR under ROUGE-2 and ROUGE-SU4, and has the same ROUGE-1 score as SWR.

SWR outperforms $\textit{CP}_{3}$ under ROUGE-1 and ROUGE-SU4, and has the same ROUGE-2 score as $\textit{CP}_{3}$ .

Figure 2.

ROUGE-1 (%) comparisons of individual judges and the combined ranking with TextRank, SWR, SPR, and SSR over the SummBank benchmarks: (a) Judge 1 against Judges 2 and 3; (b) Judge 2 against Judges 1 and 3; (c) Judge 3 against Judges 1 and 2; (d) Combined ranking against all judges.

8.6 Comparisons on SummBank

Comparisons are made with each individual judge’s ranking and the combined ranking of sentences of all judges. The combined ranking of sentences on a given document is obtained as follows: First derive individual judges’ ranking scores for each sentence contained in the document, then average individual ranking scores as the combined ranking score of the sentence.

To compare with each judge’s ranking of sentences, for Judge $i$ ( $i=1,2,3$ ), the evaluation uses the other two judges’ rankings of sentences as references.

To compare with the combined ranking of sentences by all judges, the sentence rankings of all individual judges are used as references.
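The averaging step behind the combined ranking can be sketched as follows. How ranking scores are derived from a judge's ranking is not specified here, so the sketch assumes, purely for illustration, that a sentence at position $p$ (1-based) among $n$ sentences receives the score $n-p+1$; the function names and the toy rankings are likewise hypothetical.

```python
def ranking_scores(ranking):
    """Hypothetical score derivation: position p of n gets score n - p + 1."""
    n = len(ranking)
    return {sentence: n - position for position, sentence in enumerate(ranking)}

def combined_ranking(judge_rankings):
    """Average the judges' ranking scores and sort sentences by the average."""
    per_judge = [ranking_scores(r) for r in judge_rankings]
    sentences = judge_rankings[0]
    average = {s: sum(scores[s] for scores in per_judge) / len(per_judge)
               for s in sentences}
    # Sort by descending average score; break ties by sentence index.
    return sorted(sentences, key=lambda s: (-average[s], s))

# Three toy rankings of four sentences, most important first.
judges = [[2, 0, 1, 3], [0, 2, 1, 3], [2, 1, 0, 3]]
combined = combined_ranking(judges)  # [2, 0, 1, 3]
```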

Table 3 depicts the ROUGE-1, ROUGE-2, and ROUGE-SU4 scores of different methods on selections of top 5% of sentences.

Table 3
ROUGE (%) comparison results on SummBank with the 5% constraint on sentence selections

Methods	R-1	R-2	R-SU4
Judge 1	38.42	28.20	26.61
SSR	61.34	51.61	50.54
SPR	60.51	51.20	50.12
SWR	59.42	50.69	49.89
TextRank	49.52	38.58	37.76
Judge 2	30.79	19.95	19.38
SSR	50.03	39.52	38.31
SPR	48.81	38.09	37.62
SWR	47.51	37.68	36.18
TextRank	43.10	32.66	31.42
Judge 3	35.74	25.86	24.66
SSR	54.22	44.39	43.91
SPR	53.83	43.44	42.87
SWR	53.01	43.38	42.11
TextRank	45.81	34.32	33.08
Combined	51.60	43.50	41.50
SSR	51.66	43.32	41.32
SPR	51.14	42.36	40.53
SWR	50.92	41.60	40.31
TextRank	44.66	33.63	32.56

The following results are evident:

Under all categories, SSR outperforms each judge by a significant margin and also outperforms SPR, which outperforms SWR, and SWR significantly outperforms TextRank.

SSR slightly outperforms the combined ranking of all judges under ROUGE-1, and is slightly below but very close to the combined ranking under ROUGE-2 and ROUGE-SU4; from Table 3, the absolute differences are 0.06, 0.18, and 0.18 percentage points, respectively. Moreover, SPR is slightly below SSR, SWR is slightly below SPR, and TextRank is substantially below SWR.

A full range of comparisons under ROUGE-1 with individual judges and the combined ranking of all judges is given in Fig. 2, where the percentage indicates the portion of sentences selected according to their ranks by the underlying methods. When 90% or more of the sentences are selected, all methods are comparable.

To demonstrate the robustness of an algorithm, it is customary to also compare ROUGE-2 and ROUGE-SU4 scores against the corresponding references. An algorithm is robust if it compares consistently against the references under different ROUGE measures. Full-range comparisons under the ROUGE-2 and ROUGE-SU4 measures (see Table 4) show similar trends to those under ROUGE-1, indicating that SSR, SPR, and SWR are robust. Table 4 shows the comparison results from 10% top-ranked sentences to 90%, with an increment of 10% each time.

It follows from Table 4 that, under the ROUGE-1, ROUGE-2, and ROUGE-SU4 measures, comparison results similar to those on the 5% top-ranked sentences discussed for Table 3 hold true on the entire spectrum. More specifically,

SSR is better than SPR and is comparable with the combined ranking of all judges at all percentage levels with slightly smaller scores.

Table 4

Full-range comparisons of SSR, SPR, SWR, and TextRank with individual judges and the combined ranking of all judges over the SummBank benchmarks, where shaded numbers in a group of comparisons for an individual judge are the highest scores under the corresponding measures, while in the group of comparisons for combined ranking, the shaded numbers are the highest and the second highest scores

	10%			20%			30%			40%			50%			60%			70%			80%			90%
	R-1	R-2	R-SU4	R-1	R-2	R-SU4	R-1	R-2	R-SU4	R-1	R-2	R-SU4	R-1	R-2	R-SU4	R-1	R-2	R-SU4	R-1	R-2	R-SU4	R-1	R-2	R-SU4	R-1	R-2	R-SU4
Judge 1	45.74	34.91	33.32	58.39	50.18	47.05	64.14	57.11	55.81	69.87	62.90	61.19	73.82	68.41	66.73	79.55	75.47	73.86	86.54	83.77	82.48	90.64	89.05	88.09	96.05	95.53	95.12
SSR	63.93	54.72	54.44	69.82	61.69	59.89	74.03	67.81	65.81	79.95	73.48	72.22	83.31	78.79	76.77	86.03	83.09	80.42	90.66	88.19	86.47	93.01	91.63	90.31	96.19	95.64	95.20
SPR	64.44	54.49	53.83	68.69	60.70	59.56	72.78	66.02	64.67	78.30	72.18	70.73	81.86	76.90	75.37	85.23	81.68	79.96	89.44	86.92	85.36	92.11	90.65	89.46	96.02	95.48	95.01
SWR	64.02	54.87	52.78	67.44	59.38	58.39	70.35	64.47	63.04	75.30	69.27	67.85	79.52	74.64	73.09	82.87	79.17	77.44	87.58	85.06	83.60	90.82	89.24	88.13	95.98	95.44	94.91
TextRank	51.93	41.25	40.27	57.43	48.88	47.96	65.56	58.28	56.79	71.06	65.13	63.58	74.65	69.75	68.11	80.23	76.78	75.03	85.63	83.36	81.93	89.86	88.49	87.37	95.71	95.24	94.75
Judge 2	34.01	23.89	23.10	47.52	38.59	37.54	58.94	51.50	49.95	69.00	62.51	60.76	74.35	68.99	67.18	80.73	76.63	74.89	85.61	82.98	81.70	89.89	88.36	87.42	95.84	95.31	94.89
SSR	55.68	45.23	44.34	64.70	55.89	54.39	69.78	63.18	62.31	78.97	72.45	71.33	83.02	77.99	76.60	87.14	83.38	81.78	90.46	87.81	86.54	93.71	92.14	91.06	97.25	96.66	96.28
SPR	55.08	44.76	43.61	63.49	54.44	53.22	68.21	61.98	60.84	77.52	71.16	69.68	81.44	76.46	74.90	85.72	82.13	80.54	88.81	86.23	84.62	90.93	89.43	88.17	96.51	95.94	95.44
SWR	54.53	44.76	44.17	63.17	53.98	52.82	67.53	61.45	60.14	76.05	69.67	68.19	80.93	75.90	74.34	84.88	81.09	79.42	88.11	85.39	83.90	91.66	90.02	88.84	96.22	95.65	95.11
TextRank	48.88	38.66	37.39	59.15	49.92	48.77	65.92	59.09	57.82	71.75	65.59	63.97	75.13	70.20	68.54	80.76	77.30	75.63	86.93	84.58	83.06	90.92	89.56	88.38	96.02	95.54	95.02
Judge 3	43.28	33.41	32.20	56.36	47.30	46.14	65.75	58.44	57.28	72.76	66.34	64.69	77.50	72.25	70.53	82.52	78.62	77.01	86.20	83.43	82.11	89.95	88.26	87.26	95.94	95.39	94.95
SSR	59.51	49.56	48.01	66.32	57.65	55.31	74.13	68.78	67.16	79.61	73.28	72.14	82.74	77.53	76.31	86.46	82.78	80.75	90.83	88.26	86.75	93.12	91.65	90.43	95.39	94.87	94.07
SPR	59.02	48.52	47.50	64.76	55.94	54.82	73.03	68.21	67.38	77.29	70.94	69.32	80.73	75.69	74.05	84.38	80.80	79.04	88.71	86.26	84.75	91.89	90.49	89.34	96.17	95.64	95.19
SWR	57.15	46.92	46.54	63.45	54.41	53.36	72.69	67.65	66.67	75.11	68.76	67.20	79.05	74.15	72.49	82.35	78.62	76.93	88.02	85.51	84.08	91.64	90.11	89.01	95.58	95.05	94.53
TextRank	50.77	39.74	38.33	58.63	49.76	48.61	64.75	57.32	55.70	70.39	64.19	62.46	73.78	68.69	66.89	79.00	75.48	73.64	85.53	83.30	81.86	90.00	88.70	87.57	95.92	95.45	94.96
Combined	57.53	50.03	47.92	69.08	63.14	61.83	75.37	70.42	68.91	82.03	78.15	76.68	84.85	81.73	80.30	88.34	85.90	84.60	91.28	89.55	88.56	94.02	92.59	91.83	96.78	96.30	95.81
SSR	57.63	50.18	47.86	68.53	61.24	59.54	74.92	69.83	68.29	80.40	76.22	73.89	84.21	79.89	77.91	87.51	83.85	82.08	91.23	88.67	86.73	93.79	92.73	92.03	96.41	95.86	95.33
SPR	57.02	48.97	46.81	68.19	60.04	58.47	74.04	68.76	67.07	78.71	74.45	70.94	82.29	77.61	75.74	85.52	81.95	80.27	89.33	86.82	85.25	92.70	91.12	89.99	96.28	95.83	95.31
SWR	56.82	48.18	47.03	67.83	59.12	58.06	73.58	68.09	66.33	76.48	71.74	70.39	80.21	75.29	73.70	84.88	81.15	79.45	88.21	86.17	85.19	92.36	90.91	89.71	95.69	95.31	95.00
TextRank	49.22	38.49	37.26	58.40	49.54	48.48	65.44	58.03	56.51	71.16	65.08	63.45	74.55	69.58	67.89	80.01	76.54	74.79	86.04	83.76	82.30	90.26	88.92	87.77	95.88	95.33	94.86

Table 5

ROUGE (%) comparison with different features removed

	10%			40%			70%
Methods	R-1	R-2	R-SU4	R-1	R-2	R-SU4	R-1	R-2	R-SU4
SWR	56.82	48.18	47.03	76.48	71.74	70.39	88.21	86.17	85.19
SWR_NSE	55.02	46.08	45.15	71.98	66.76	65.42	85.94	83.18	81.83
SWR_NAS	51.74	40.87	40.13	73.52	68.26	67.12	87.82	84.95	84.12
SWR_NSC	56.36	47.34	46.44	75.68	70.35	69.03	87.54	84.49	83.54
SWR_NSP	56.77	48.12	46.97	76.39	71.59	70.25	88.14	86.04	85.07
TextRank	49.22	36.07	35.29	71.16	65.47	63.94	86.04	83.52	82.29

SPR is better than SWR and narrows the gap between SWR and the combined ranking.

SWR is significantly better than TextRank.

SWR is substantially better than each individual judge’s ranking.

SWR is comparable with the combined ranking on top-ranked sentences (up to 30%), but incurs a moderate gap on lower-ranked sentences between 30% and 90%.

TextRank is substantially worse than the combined ranking of all judges. On the other hand, TextRank is better than individual judges on sentences of higher ranks but worse on sentences of lower ranks.

Comparison with Judge 1: TextRank is better on top ranked sentences (up to 20%) and worse on the rest of the sentences.

Comparison with Judge 2: TextRank is much better on top ranked sentences (up to 30%) and slightly better on the rest of the sentences.

Comparison with Judge 3: TextRank is better or comparable on top ranked sentences (up to 30%), worse on lower ranked sentences from 30% to 70%, and comparable on the rest of the sentences.

TextRank is based only on co-occurrences of words for computing sentence scores, considering neither semantic information, nor subtopic diversity, nor the structure of the underlying document. Incorporating these three features is expected to significantly improve the accuracy of sentence ranking, and the experimental results confirm this. Incorporating semantics of words is better than not using them, incorporating semantics of phrases and words is better than using semantics of words alone, and adding semantics of sentences is better still.

It will be shown next how word semantics, article structure, Softplus function adjustment, and subtopic clustering each contribute to the improvement of sentence ranking.

8.7 Significance of each feature

It is interesting to understand how each of the features of semantic edges, Softplus function adjustment, ASB PageRank, and subtopic clustering actually contributes to the improvement of sentence rankings. To answer this question it suffices to evaluate the basic model SWR by removing a feature from it one at a time. Let SWR_NSE, SWR_NAS, SWR_NSC, and SWR_NSP denote, respectively, the variant of SWR without semantic edges, article-structure information, subtopic clustering, and Softplus adjustment.

Table 5 shows the results obtained from evaluations over SummBank on selections of 10%, 40%, and 70% of sentences. TextRank is included as a baseline. The numbers in italic are the most severe drops, indicating that the corresponding features are the most critical. The following results are drawn:

With each of the features removed, the corresponding ROUGE scores drop, indicating that each feature has contributed to the improvement.

When the percentage of selecting sentences is smaller, removing article structure results in much larger drops, indicating that article structures are more significant on top-ranked sentences.

When the percentage becomes larger, semantic edges would become increasingly more critical.

While the Softplus function adjustment does improve ROUGE scores, it is not as significant as the other features.

Each SWR variant outperforms TextRank under all ROUGE measures, except that at the 70% level, when semantic edges are removed, SWR_NSE is slightly below TextRank.
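The drops discussed above can be recomputed from Table 5. The sketch below copies the ROUGE-1 values of SWR and its four variants from the table and reports, for each selection level, which removed feature causes the largest drop; the dictionary layout and the function names are illustrative.

```python
# ROUGE-1 (%) from Table 5 at the 10%, 40%, and 70% selection levels.
rouge1 = {
    "SWR":     {10: 56.82, 40: 76.48, 70: 88.21},
    "SWR_NSE": {10: 55.02, 40: 71.98, 70: 85.94},  # no semantic edges
    "SWR_NAS": {10: 51.74, 40: 73.52, 70: 87.82},  # no article structure
    "SWR_NSC": {10: 56.36, 40: 75.68, 70: 87.54},  # no subtopic clustering
    "SWR_NSP": {10: 56.77, 40: 76.39, 70: 88.14},  # no Softplus adjustment
}

def drops(level):
    """Drop in ROUGE-1 of each variant relative to the full SWR."""
    base = rouge1["SWR"][level]
    return {name: round(base - scores[level], 2)
            for name, scores in rouge1.items() if name != "SWR"}

def most_critical(level):
    """Variant whose removed feature causes the largest drop."""
    d = drops(level)
    return max(d, key=d.get)

critical = {level: most_critical(level) for level in (10, 40, 70)}
```

At the 10% level the largest ROUGE-1 drop comes from removing article structure (5.08 points), while at 40% and 70% it comes from removing semantic edges (4.50 and 2.27 points), in line with the observations above.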

9. Conclusions and final remarks

SSR is an efficient and accurate scheme for ranking sentences of a given document. In particular, an implementation of SSR presented in this paper runs in quadratic time, and outperforms, on the SummBank benchmarks, each individual human judge’s ranking under standard ROUGE measures and compares well with the combined ranking of all judges. Moreover, extracting sentences of the highest ranks with an appropriate number of sentences as an extractive summary achieves the state-of-the-art results over the DUC-02 benchmarks under standard ROUGE measures.

SSR is an unsupervised scheme that does not rely on language-specific features or deep linguistic computations, and so it is readily adaptable across languages. What it needs is a reasonable corpus of digital documents available in the adapted language (such as Wikipedia dumps) for extracting phrases and training word and phrase embedding representations. However, the similarity threshold values presented in this paper for constructing a semantic word graph and a semantic phrase graph for a given document, while appropriate for the English language and not sensitive to a small change, may need to be adjusted for other languages. Better threshold values can be determined using a small amount of labeled data. When no labeled data is available, an expected number of synonyms may be used to determine appropriate thresholds [59].

The following directions may be explored for further improvement of accuracy, in addition to what has been mentioned in the earlier sections:

Investigate other embedding methods such as ELMo [60], LEAR [61], Poincaré embeddings [62], hierarchical embedding [63], and spherical embedding [64].

Investigate unsupervised sentence representations such as skip-thought vectors [65] and BERT [66].

Fine-tune location scoring for better representing article structures.

Explore new approximation algorithms for the NP-hard multi-objective knapsack problem. For example, instead of using a round-robin approximation that treats all clusters equally, a weighted round-robin strategy that treats each cluster according to the subtopic distribution in the given document may help improve accuracy.

Finally, readers should keep in mind that employing new methods, while possibly improving accuracy, may also degrade efficiency. Whether a trade-off is acceptable depends on the underlying applications.

Footnotes

A prototype is available at http://www.dooyeed.com.

Available at https://github.com/abisee/cnn-dailymail.

Available at http://tcci.ccf.org.cn/conference/2017/taskdata.php.

See, for example, discussions in https://www.unbc.ca/sites/default/files/assets/academic_success_centre/writing_support/hourglass.pdf.

Acknowledgments

This work was supported in part by Eola Solutions. The authors thank Wenjing Yang, Yicheng Sun, and Changfeng Yu of UMass Lowell for writing an API to run the AutoPhrase code available on GitHub to carry out SSR evaluations, and Liqun (Catherine) Shao of Microsoft for an interesting discussion on Softplus function adjustment. They are grateful to Cheng Zhang at UMass Lowell for implementing dooyeed.com, which uses SSR to rank sentences for a given document. The second author would also like to thank Jiawei Han of the University of Illinois at Urbana-Champaign and Jesse Wang of the University of Rochester for inspiring conversations.

References

Das

Martins

. A survey on automatic text summarization. Literature Survey for the Language and Statistics II Course at CMU. 2007; 4(57): 192-195.

Wang

Zhang

Yang

Shao

Wang

. An effective scheme for generating an overview report over a very large corpus of documents. in: Proceedings of the 19th ACM Symposium on Document Engineering (DocEng 2019); 2019.

Neto

Santos

Kaestner

Alexandre

Santos

, et al. Document clustering and text summarization. 2000.

Mihalcea

Tarau

. Textrank: Bringing order into text. in: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing; 2004.

Bhartiya

Singh

. A semantic approach to summarization. arXiv preprint arXiv:14061203. 2014.

Bellare

Sarma

Loiwal

Mehta

Ramakrishnan

, et al. Generic text summarization using WordNet. in: Proceedings of Internationational Conference on Language Resources and Evaluation; 2004.

Yuan

. Extractive summarization using inter-and intra-event relevance. in: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. 2006; 369-376.

Nallapati

Zhai

Zhou

. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. in: Proceedings of the 31st AAAI Conference on Artificial Intelligence. 2017; 3075-3081.

Radev

, et al. SummBank 1.0 LDC2003T16. Web Download. 2003; Philadelphia: Linguistic Data Consortium.

10.

Chen

Zhu

. Lcsts: A large scale chinese short text summarization dataset. arXiv Preprint arXiv:150605865. 2015.

11.

Mikolov

, Yih Wt Zweig

. Linguistic regularities in continuous space word representations. in: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2013; 746-751.

12.

Kusner

Sun

Kolkin

Weinberger

. From word embeddings to document distances. in: Proceedings of the 32nd International Conference on Machine Learning. 2015; 957-966.

13.

Shang

Liu

Jiang

Ren

Voss

Han

. Automated phrase mining from massive text corpora. IEEE Transactions on Knowledge and Data Engineering. 2018; 30(10): 1825-1837.

14.

Dueck

. Affinity propagation: Clustering data by passing messages. University of Toronto. 2009.

15.

DUC. Document understanding conference. 2014; https://www-nlpir.nist.gov/projects/duc/intro.html.

16.

Kupiec

Pedersen

Chen

. A trainable document summarizer. in: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 1995; 68-73.

17.

Conroy

O’eary

. Text summarization via hidden markov models. in: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2001; 406-407.

18.

Cheng

Lapata

. Neural summarization by extracting sentences and words. arXiv Preprint arXiv:160307252. 2016.

19.

Narayan

Papasarantopoulos

Cohen

Lapata

. Neural extractive summarization with side information. arXiv Preprint arXiv:170404530. 2017.

20.

Narayan

Cohen

Lapata

. Ranking sentences for extractive summarization with reinforcement learning. arXiv Preprint arXiv:180208636. 2018.

21.

Zhang

Pratama

. Extractive document summarization based on convolutional neural networks. in: Proceedings of the 42nd Annual Conference of the IEEE Industrial Electronics Society. IEEE. 2016; 918-922.

22.

Erkan

Radev

. Lexrank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research. 2004; 22: 457-479.

23.

Brin

Page

. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems. 1998; 30(1-7): 107-117.

24.

Wan

Xiao

. Exploiting neighborhood knowledge for single document summarization and keyphrase extraction. ACM Transactions on Information Systems. 2010; 28(2).

25.

Wan

. Towards a unified approach to simultaneous single-document and multi-document summarizations. in: Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics. 2010; 1137-1145.

26.

Zhou

Xue

Zha

. Enhancing diversity, coverage and balance for summarization through structure learning. in: Proceedings of the 18th International Conference on World Wide Web. 2009; 71-80.

27.

Parveen, Strube. Integrating importance, non-redundancy and coherence in graph-based extractive summarization. in: Proceedings of the 24th International Joint Conference on Artificial Intelligence. 2015; 1298-1304.

28.

Parveen, Ramsl, Strube. Topical coherence for graph-based extractive summarization. in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015; 1949-1954.

29.

Parveen, Mesgar, Strube. Generating coherent summaries of scientific articles using coherence patterns. in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016; 772-783.

30.

Kleinberg. Authoritative sources in a hyperlinked environment. in: Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms. ACM and SIAM. 1998; 668-677.

31.

Isonuma, Mori, Sakata. Unsupervised neural single-document summarization of reviews via learning latent discourse structure and its ranking. in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. ACL. 2019; 2142-2152.

32.

Zhang, Wang. Semantic WordRank: Generating finer single-document summarizations. in: Lecture Notes in Computer Science, Proceedings of the 19th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2018). Springer. 2018; 398-409.

33.

Pöttker. News and its communicative quality: The inverted pyramid – when and why did it appear? Journalism Studies. 2003; 4(4): 501-511.

34.

Florescu, Caragea. A position-biased PageRank algorithm for keyphrase extraction. in: Proceedings of the 31st AAAI Conference on Artificial Intelligence. 2017; 4923-4924.

35.

Brin, Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems. 1998; 30(1-7): 107-117.

36.

Shao, Zhang, Jia, Wang. Efficient and effective single-document summarizations and a word-embedding measurement of quality. in: Proceedings of the 9th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (Volume 1), Funchal, Madeira, Portugal, November 1-3, 2017. 2017; 114-122.

37.

Tinkham. The effects of semantic and thematic clustering on the learning of second language vocabulary. Second Language Research. 1997; 13(2): 138-163.

38.

Hearst. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics. 1997; 23(1): 33-64.

39.

Greene, Cunningham. Practical solutions to the problem of diagonal dominance in kernel document clustering. in: Proceedings of the 23rd International Conference on Machine Learning. 2006; 377-384.

40.

Von Luxburg. A tutorial on spectral clustering. Statistics and Computing. 2007; 17(4): 395-416.

41.

Besbes, Zghal. Personalized and context-aware retrieval based on fuzzy ontology profiling. Integrated Computer-Aided Engineering. 2016; 24.

42.

Lehoucq, Sorensen. Implicitly restarted Lanczos method. in: Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide (Zhaojun Bai et al.). SIAM.

43.

Fränti, Sieranoja. How much can k-means be improved by using better initialization and repeats? Pattern Recognition. 2019; 93: 95-112.

44.

Hartigan, Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society Series C (Applied Statistics). 1979; 28(1): 100-108.

45.

Kaufman, Rousseeuw. Finding groups in data: An introduction to cluster analysis. vol. 344. John Wiley & Sons. 2009.

46.

Miettinen. Nonlinear multiobjective optimization. Kluwer Academic Publishers. 1999.

47.

Branke, Deb, Miettinen, Słowiński R. Multiobjective optimization: Interactive and evolutionary approaches. Springer. 2008.

48.

Cheng, Zhang, Caraffini, Neri. Multicriteria adaptive differential evolution for global numerical optimization. Integrated Computer-Aided Engineering. 2015; 22: 103-117.

49.

Rostami, Neri, Epitropakis. Progressive preference articulation for decision making in multi-objective optimisation problems. Integrated Computer-Aided Engineering. 2017; 24(4): 315-335.

50.

Liu, Wang, Fan, Wei, Tong. A convergence-diversity balanced fitness evaluation mechanism for decomposition-based many-objective optimization algorithm. Integrated Computer-Aided Engineering. 2019; 26(2): 159-184.

51.

Zhang, Rong, Neri, Pérez-Jiménez MJ. An optimization spiking neural P system for approximately solving combinatorial optimization problems. International Journal of Neural Systems. 2014; 24(5): 1440006.

52.

Rostami, Neri. Covariance matrix adaptation Pareto archived evolution strategy with hypervolume-sorted adaptive grid algorithm. Integrated Computer-Aided Engineering. 2016; 23(4): 313-329.

53.

Pan, Tian, Zhang. A region division based diversity maintaining approach for many-objective optimization. Integrated Computer-Aided Engineering. 2017; 24(3): 279-296.

54.

Liu, Wang, Liu. A two phase hybrid algorithm with a new decomposition method for large scale optimization. Integrated Computer-Aided Engineering. 2018; 25(4): 349-367.

55.

Yan, Zhang, Xie. An optimizer ensemble algorithm and its application to image registration. Integrated Computer-Aided Engineering. 2019; 26(4): 311-327.

56.

Bojanowski, Grave, Joulin, Mikolov. Enriching word vectors with subword information. arXiv Preprint arXiv:1607.04606. 2016.

57.

Atasu, Parnell, Dünner C, Sifalakis, Pozidis, Vasileiadis, et al. Linear-complexity relaxed word mover’s distance with GPU acceleration. in: Proceedings of 2017 IEEE International Conference on Big Data. IEEE. 2017; 889-896.

58.

Lin. ROUGE: A package for automatic evaluation of summaries. in: Proceedings of the Workshop on Text Summarization Branches Out. Post-Conference Workshop of ACL 2004; 2004.

59.

Rekabsaz, Lupu, Hanbury. Exploration of a threshold for similarity based on uncertainty in word embedding. in: European Conference on Information Retrieval. Springer. 2017; 396-409.

60.

Peters, Neumann, Iyyer, Gardner, Clark, Lee, et al. Deep contextualized word representations. arXiv Preprint arXiv:1802.05365. 2018.

61.

Vulić, Mrkšić. Specialising word vectors for lexical entailment. arXiv Preprint arXiv:1710.06371. 2017.

62.

Nickel, Kiela. Poincaré embeddings for learning hierarchical representations. in: Proceedings of the 31st Annual Conference on Advances in Neural Information Processing Systems. 2017; 6338-6347.

63.

Zhang, Chen, Croft. Learning a hierarchical embedding model for personalized product search. in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM. 2017; 645-654.

64.

Batmanghelich, Saeedi, Narasimhan, Gershman. Nonparametric spherical topic modeling with word embeddings. in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. NIH Public Access. 2016; 537.

65.

Kiros, Zhu, Salakhutdinov, Zemel, Urtasun, Torralba, et al. Skip-thought vectors. in: Proceedings of the 29th Annual Conference on Advances in Neural Information Processing Systems. 2015; 3294-3302.

66.

Devlin, Chang, Lee, Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv Preprint arXiv:1810.04805. 2018.
