Abstract
This paper presents Semantic SentenceRank (SSR), an unsupervised scheme for automatically ranking sentences in a single document according to their relative importance. In particular, SSR extracts essential words and phrases from a text document, and uses semantic measures to construct, respectively, a semantic phrase graph over phrases and words, and a semantic sentence graph over sentences. It applies two variants of article-structure-biased PageRank to score phrases and words on the first graph and sentences on the second graph. It then combines these scores to generate the final score for each sentence. Finally, SSR solves a multi-objective optimization problem for ranking sentences based on their final scores and topic diversity through semantic subtopic clustering. An implementation of SSR that runs in quadratic time is presented, and it outperforms, on the SummBank benchmarks, each individual judge’s ranking and compares favorably with the combined ranking of all judges.
Keywords
Introduction
Ranking sentences in a single document according to their relative importance plays a central role in various applications, including summary extraction from a given document (e.g., see [1]), structured-overview generation over a large corpus of documents [2], and layered reading for fast comprehension of a given document. The last application1 enables the reader to read a layer of the most important sentences first, then subsequent layers of next important sentences until the entire document is read.
There are supervised and unsupervised methods for ranking sentences. Most unsupervised methods use easy-to-compute counting features, such as TF-IDF (term frequency-inverse document frequency) [3] and co-occurrences of words [4]. Such approaches are domain and language independent, and are often preferred over other approaches. Not using semantic information in the context, these methods may only produce suboptimal sentence ranking. Other unsupervised algorithms that use semantic information include Semantic Role Labelling [5], WordNet [6], and Named Entity Recognition [7]. These methods, unfortunately, impose a common limitation of language dependence. In other words, using these algorithms for a given language requires software tools to provide the underlying semantic information for the language, which may not be available.
Supervised methods require labeled data to train models. Modern supervised methods that learn feature representations automatically using a deep neural network model rely on a significant amount of labeled texts (e.g., see [8]). Lacking training data is a major obstacle when developing a supervised sentence-ranking algorithm. The SummBank dataset [9] for evaluating sentence-ranking algorithms, for example, contains only 200 news articles, which is far from sufficient to train a deep neural network model. Other larger datasets sufficient to train neural networks for summarization, such as CNN/DailyMail,2 are unsuitable to train models for ranking sentences, because they only provide a few human-written highlights as a summary for each document.
There are no datasets available at this time for other languages that are suitable for training sentence-ranking models. For example, for the Chinese language, the LSCTS [10]dataset consists of news articles and an average of 1 to 2 sentences written by human annotators as a summary for each article; and the NLPCC 2017 dataset3 also consists of news articles with a summary of upto 45 Chinese characters written by human annotators. These datasets cannot be used to train sentence-ranking models. Lacking training data for a particular language has hindered adaption of a good supervised model for one language to different languages.
These concerns suggest a direction of investigating unsupervised sentence-ranking algorithms using semantic features that can be computed readily for a given language. Word-embedding representations [11] and Word Mover’s Distance (WMD) [12], for example, are semantic features of this kind. A good word-embedding representation provides useful semantic and syntactic information. Requiring only a large amount of unlabeled texts, it is straightforward to compute word embedding representations for any language using unlabeled out-of-band data such as Wikipedia dumps of the underlying language. WMD uses word embedding representations to measure semantic distance between two sentences, which can be used to measure their semantic similarity.
This paper presents an unsupervised semantic senten-ce-ranking scheme called Semantic SentenceRank (SSR) using semantic features at the word, phrase, and sentence levels. SSR uses phrase and word embedding representations and co-occurrences to construct a semantic phrase-word graph, scores words and phrases using a variant of article-structure-biased PageRank, adjusts scores using Solfplus elevation, and computes a normalized score for each sentence. SSR then constructs a semantic sentence graph using WMD, scores sentences using a variant of article-structure-based PageRank, combines these sentence scores to generate the final sentence scores, computes semantic subtopic clustering of sentences, and ranks sentences by solving a multi-objective 0–1 knapsack problem that maximizes the final scores of selected sentences and the diversity of subtopic coverage.
The major contributions of this paper are as follows:
Flexibility: SSR is an unsupervised scheme using semantic features that can be computed readily for any language. Efficiency: SSR runs in quadratic time when using AutoPhrase [13] to extract phrases, Affinity Propagation [14] to generate semantic subtopic clusters for sentences, and a greedy algorithm to approximate the multi-objective 0–1 knapsack problem. Accuracy: Running on the DUC-02 dataset [15], the aforementioned implementation of SSR outperforms all previous algorithms under the ROUGE measures. More significantly, on the SummBank dataset [9], SSR outperforms each individual judge’s ranking and compares favorably with the combined sentence ranking of all judges.
The rest of the paper is organized as follows: Section 2 provides a brief overview of related work. SSR is presented in Section 3. Detail descriptions of the major components of SSR are presented in Sections 4–7. Implementations and evaluation results are presented in Section 8. Conclusions and final remarks are presented in Section 9.
Related work
Extractive summarization algorithms that can specify the number of sentences in a summary can be used to rank sentences, and vice versa.
Supervised methods
Supervised summarization methods can be categorized into two categories: sentence labeling and sentence scoring.
Supervised sentence-labeling methods assign a binary label to each sentence
Supervised sentence-scoring methods depend on the underlying similarity measure of a sentence in a document to a benchmark summary. For example, CNN-W2V [21] is a model that computes a ROUGE score for each sentence in a document using the corresponding summaries in a labeled data as references, and uses such scores as the ground truth to train a convolutional-neural-network model to score sentences independently of input documents. This means that the same sentence appearing in different documents always has the same score. This is problematic, for the same sentence in different documents is unlikely to be equally important. A more reasonable approach is to score sentences via global optimization by taking previously scored sentences and their scores into consideration. Refresh [20], for example, is a recent model in this direction. It scores a sentence using previously scored sentences and the summaries of the underlying document in a labeled dataset, where sentence scores are used as a reward function in the model.
No matter what the underlying method is, it is necessary to have a large labeled dataset to train a supervised neural network model. This necessity remains a major obstacle, for such a dataset may not exist for a given language.
Unsupervised methods
Unsupervised summarization methods exploit relations between words, as related words “promote” each other. For example, TextRank [4] and LexRank [22] each models a document as a sentence graph based on word relations, but they use only syntactic features. PageRank [23] is used to score words.
Other methods incorporate additional information for achieving a higher accuracy. UniformLink [24], for example, constructs a sentence graph on a set of similar documents, where a sentence is scored based on both of the in-document score and cross-document score. URank [25], on the other hand, uses a unified graph-based framework to study both single-document and multi-document summarization.
The quality of a summary may be improved using max-margin methods [26] or integer-linear programming (ILP) [27, 28]. Among the previous algorithms,
It is worth noting that a recent unsupervisedmethod [31], while not related to the problem studied in this paper, may provide new ideas. On a collection of consumer reviews on a particular product, the method generates an abstractive sentence as the main review point, and ranks sentences by the number of descendants in a discourse tree rooted on the main review point. It would be interesting to investigate if this method can be modified to rank sentences of a given document according to their relative importance.
Early unsupervised methods have two common downsides: (1) They don’t promote diversity. This is because the importance of sentences are based only on sentence scores, and so sentences of high scores representing the same subtopic may all be included in a summary, leaving no room to include sentences with lower scores but with different subtopics. (2) These methods do not used semantic features.
Semantic SentenceRank
To overcome the downsides of the existing methods (supervised or unsupervised), a better sentence-ranking algorithm should incorporate semantic features and topic diversity, and it should be unsupervised.
Major components of (a) SSR and (b) SPR/SWR.
Let
where
SSR computes
SSR contains two sub-models: one at the word level known as Semantic WordRank (SWR) [32], and one at the phrase-word level referred to as Semantic PhraseRank (SPR). In other words, SPR is SSR excluding semantic sentence graph and ABS-biased PageRank-2. Both SWR and SPR follow the same data follow diagram (see Fig. 1b), except that SWR does not consider phrase-level similarities. Both are faster than SSR, and perform well on selecting top-ranked sentences, which is sufficient for certain applications.
Phrase and word embedding
A searchable dataset of phrase and word embedding representations is calculated independently of SSR. Such an embedding dataset may be available for free download for some languages. If unavailable, it is straightforward to compute word and phrase embedding over an unlabeled Wikipedia dump using a standard method. To extract phrases, a linear-time unsupervised algorithm such as AutoPhrase [13] may be used. Recalculations of phrase and word embedding may be carried out once in a while on a larger Wikipedia dump.
Preprocessing
The preprocessing component computes, on a given document
Descriptions of the remaining components of SSR are presented in Sections 4–7.
Semantic graph representations
Semantic word graph
In addition to considering co-occurrence between words as in TextRank [4], the semantic word graph (SWG) for the underlying document adds embedding similarity of words to enhance connectivity of the graph. Adding semantic similarity is vital for processing analytic languages, such as Chinese, which seldom use inflections. When stemming is not applicable, semantic similarity serves as an alternative to represent the relations between words with similar meanings, allowing them to share the importance when computing PageRank scores for these nodes.
Let
Let
Semantic phrase graph
In a SWG, words in a phrase (e.g., names, scientific terms, and general entity names) would have high co-occurrence counts if the phrase appears multiple times in the document. The meaning of a word inside a phrase may be different from that outside the phrase. For example, in the sentence “There is an apple on top of her Apple computer”, the word “Apple” appears outside and inside the phrase of “Apple computer”, which has different meanings. Thus, a high-quality phrase extractor is desired when building a phrase graph.
Given a document
A semantic phrase-word graph (SPG) is a weighted graph
Semantic sentence graph
A semantic sentence graph (SSG) of a document
Let
Define a different similarity measure of
Sort the similarity scores given by Eq. (1) in descending order. Select a Normalize the co-occurrence edge weight
Article-structure-biased PageRank-1
Article structures define how information is presented. For example, the typical structure of news articles is an inverted pyramid [33], where critical information is presented at the beginning, followed by additional information with less important details. In academic writing, the structure of an article would look like an hourglass4, which includes an additional conclusion piece at the end of the article. Thus, sentence locations in an article according to the underlying structure also plays a role in ranking sentences. SWR uses a position-biased PageRank algorithm [34].
Directly applying PageRank, one can compute a score
where
Equation (2) is an unbiased PageRank, where each word is assumed equally likely to start from. In article-structure-biased (ASB) PageRank, each word
Rank the importance of sentence locations from the most important to the least important based on the underlying article structure (Note: This is not the sentence ranking to be computed). Let
The probability for node
Note that the above computation is at the sentence level, which can be easily adapted to the word level by ranking words instead of sentences.
The ASB PageRank score
Computation starts with an arbitrary initial value for each node, and iterates the computation of Eq. (3) until it converges.
Let
This way of scoring, however, has a drawback. To see this, suppose that
Using the Softplus function
Apply the Softplus function to each word, and sum up the elevated values to be the salient score of
To illustrate this using a numerical example, assume that
Numerical examples with
Sentence
For the semantic sentence graph, SSR uses a modified ASB PageRank algorithm to score sentences, as ASB PageRank-1 suitable for words may not be suitable for sentences. To see this, assume that a document has the inverted pyramid structure, then using the reciprocal of the location index of a sentence as its location score will result in putting too much weight on the first few sentences and too little weight on the subsequent sentences. Clearly, this Pareto phenomenon is not practical. Instead, let
Combined sentence scoring
The sentence scoring function
Selecting a sentence based only on sentence scores would result in a poor diversity of topic coverage, as multiple sentences of the same subtopic could have higher scores than sentences of different subtopics. To avoid this drawback, a sentence subtopic clustering method is needed.
Clustering algorithms depend on a chosen similarity measure for the underlying objects to be clustered. Clustering may be carried out based on thematic similarity measures or semantic similarity measures. Thematic clustering groups sentences of the same context into the same cluster. For example, under thematic similarity measures, sentences that contain the following words may be grouped into the same cluster: frog, pond, green, or tree [37]. TextTiling [38], for example, is a thematic clustering algorithm. First used in text summarization [36], TextTiling groups several consecutive paragraphs into the same cluster by finding thematic shifts between consecutive paragraphs. Unfortunately, it often fails to generate multiple clusters on short articles or when there are no clear thematic shifts between consecutive paragraphs.
Clustering methods based on TF-IDF over the BOW (bag-of-words) representations are in general unsuitable for measuring document distances or similarities due to frequent near-orthogonality [12, 39].
Semantic clustering, on the other hand, groups sentences that have similar meanings or convey similar information into the same cluster. For example, under semantic similarity measures, sentences about eyes, noses, and ears may be group into the same cluster. Semantic clustering is based on semantic measures between sentences. WMD, in particular, can be used to define semantic measures.
Efficiency, accuracy, and easy implementation are criteria to choose a clustering algorithm. When choosing a clustering algorithm to implement SSR, it is imperative to choose one that can also be easily modified to use semantic measures. Moreover, the complexity of the algorithm should not exceed quadratic time, as higher complexity may cause interactive applications (such as the layered-reading tool at
Spectral clustering [40] and affinity propagation [14] are both based on k-means using Euclidean distance as the underlying similarity measure, which can easily be replaced with a semantic measure such as WMD, and they can be carried out in quadratic time. Any clustering method that possesses these properties may also serve as candidates. Other context-aware similarity measures such as the one described in [41] may be explored. The topic diversity function
Semantic spectral clustering
Spectral clustering [40] uses eigenvalues of a similarity matrix (aka. affinity matrix) to reduce dimension before clustering a given set of data points into
Treat each sentence as a data point. Let WMD
where
The number of clusters
Let
Thus, computing the similarity matrix incurs
Recall that spectral clustering must fix a number of clusters before clustering. This could be problematic in practice. Affinity propagation (AP) clustering [14] overcomes this problem. It is an exemplar-based clustering algorithm such as k-means [44] and k-medoids [45] except that AP does not need to preset the number of clusters.
Let
Initially, set Set
If
Otherwise, set
If
Sentence selection and ranking
An approximation algorithm is needed to cope with the NP-hardness of the multi-objective 0–1 knapsack problem. There are various techniques for tackling multi-objective optimization problems such as those described in [46, 47, 48, 49, 50, 51]. Other recent optimization techniques on solving integrated computer-aided engineering problems include [52, 53, 54, 55]. A good approximation algorithm should balance between accuracy and efficiency. The following greedy approximation in a round-robin style is used in the current implementation of SSR.
Round-robin selection
Each sentence
Let For each sentence For each cluster While there are still sentences that have not been selected, do the followings:
Sort the remaining clusters in descending order according to the highest unit score contained in a cluster. For example, if the highest unit score in cluster Select the sentence from the remaining sentences with the highest unit score, one from each cluster in the order of sorted clusters, and add it to
where Remove the selected sentences from their corresponding clusters. Rank sentences according to the order they are selected.
Implementations and evaluations
Embedding database and parameters
The word-embedding database uses a pre-computed word embedding representations with subword information [56], which can handle out-of-vocabulary words and generate better word embedding for rare words.
The phrase-embedding database uses Auto-Phrase [13] on an English Wikipedia dump and the CNN/DailyMail dataset to extract phrases and words. fastText [56] is then used to compute embedding representations for these phrases and words with respect to the same datasets. AutoPhrase is an unsupervised phrase extractor that supports any language as long as a general knowledge base and a pre-trained POS-tagger (recall that POS stands for part-of-speech) in that language are available. Wikipedia is typically used as a general knowledge base and pre-trained POS-taggers are widely available for different languages.
The sliding-window size for computing co-occurr-ence of words for SWG is set to
A cosine similarity value of word or phrase embedding that is larger than 0.6 is deemed sufficient to indicate two words are semantically similar, which ensures that the underlying semantic graph has sufficient connectivity, but not too dense. Because there are more words than phrases, setting the threshold value of semantic similarity for words to 0.65 and for phrases to be 0.6 is reasonable. Note that setting the threshold values slightly larger will only slightly affect the precision.
For the semantic sentence graph, setting the percentage
Implementations and complexity analysis
The implementation of SPG uses AutoPhrase to extract phrases. A straightforward table lookup of the embedding database provides the embedding representations of words and phrases needed to construct SPG.
Since the datasets to be used to evaluate sentence-ranking algorithms are news articles and news articles in general have the inverted pyramid structure, to implement ABS-biased PageRank-1, the following location score for word
Likewise, to implement ABS-biased PageRank-2, the following location score for sentence
Note that location scores may be defined using different functions. For example, the structure of an academic research paper may in general have the hourglass structure.5
Finally, semantic special clustering is used to implement SWR and SPR, while semantic affinity propagation is used to implement SSR.
Under the said implementations, SWR, SPR, and SSR on a given document
The preprocessing of extracting phrases using AutoPhrase can be done in The embedding database may be implemented as a dictionary, and so looking-up a word The running time of both ABS-biased PageRank algorithms is The Softplus adjustment can be done in Semantic spectral clustering runs in Semantic affinity propagation runs in
To shortern the running time, a linear-time relaxed version of Word Mover’s Distance [57] is used in the experiments and evaluations.
Datasets for evaluation
Most datasets for evaluating summarization algorithms consist of one or more human-written summaries for each text document. These summaries either have a fixed number of words or a fixed number of sentences. For example, each summary in DUC-02 consists of about 100 words or less. Some datasets, such as CNN/Daily Mail, may only contain summaries of one to three human-written sentences.
To evaluate a sentence-ranking algorithm using such datasets, one may use an appropriate number of sentences of the highest rank it produces to match the size of the underlying summary and compare their ROUGE scores. This approach, however, cannot be used to establish the accuracy of sentence ranking for all sentences in a document.
It is customary to use the DUC-02 benchmarks to evaluate the effect of summarization algorithms, including extractive summarization, even though DUC-02 benchmarks are abstractive summaries. In particular, DUC-02 contains a total of 567 news articles with an average of 25 sentences per document. Each article has at least two abstractive summaries written by human annotators, and each summary consists of at most 100 words.
The SummBank dataset [9] is the best dataset there is at this time for evaluating sentence-ranking algorithms. It provides benchmarks produced by three human judges. The judges annotated 200 news articles written in English with an average of 20 sentences per document, resulting in, for each article, three sets of sentence rankings, one by each judge. Ranking scores of sentences can be derived from their rankings. In addition, SummBank also provides, for each article, a set of combined sentence ranking of all judges, where the combined ranking of each sentence is the average ranking scores of all three judges.
The ROUGE measures
ROUGE [58] is a widely used metric to evaluate accuracy of summaries. ROUGE-n is an n-gram recall between the automatic summary and a set of references, where ROUGE-SU4 evaluates an algorithm-generated summary using skip-bigram and unigram co-occurrence statistics, allowing at most four intervening unigrams when forming skip-bigrams.
An ultimate objective for any machine ranking method would be to achieve the highest possible ROUGE measures against rankings of all judges, which are served as references. In particular, in the SummBank dataset, if the corresponding mean ROUGE scores of a machine ranking and the combined ranking of all judges against the three reference rankings are comparable, then the machine ranking is deemed as good as the combined wisdom of all three human judges.
Comparisons on DUC-02
On the DUC-02 dataset, SWR, SPR, and SSR extract, respectively, sentences of the highest ranks with a total length bounded by 100 words. The results under ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-SU4 (R-SU4) against the DUC-02 benchmarks are shown in Table 2. The corresponding scores published in
Comparison results (%) on DUC-02
Comparison results (%) on DUC-02
The following results can be seen against the DUC-02 benchmarks:
SSR outperforms SPR under every measure. SPR outperforms SWR under ROUGE-2 and ROUGE-SU4, and has the same ROUGE-1 score as SWR. SWR outperforms
ROUGE-1% comparisons of individual judges and combined ranking with TextRank, SWR, SPR, and SSR over the summBank benchmarks: (a) Judge 1 against Judges 2 and 3; (b) Judge 2 against Judges 1 and 3; (c) Judge 3 against Judges 1 and 3; (d) Combined ranking against all judges.
Comparisons are made with each individual judge’s ranking and the combined ranking of sentences of all judges. The combined ranking of sentences on a given document is obtained as follows: First derive individual judges’ ranking scores for each sentence contained in the document, then average individual ranking scores as the combined ranking score of the sentence.
To compare with each judge’s ranking of sentences, for Judge To compare with the combined ranking of sentences by all judges, the sentence rankings of all individual judges are used as references.
Table 3 depicts the ROUGE-1, ROUGE-2, and ROUGE-SU4 scores of different methods on selections of top 5% of sentences.
ROUGE (%) comparisons results on SummBank with the 5% constraint on sentence selections
ROUGE (%) comparisons results on SummBank with the 5% constraint on sentence selections
The following results are evident:
Under all categories, SSR outperforms each judge by a significant margin and also outperforms SPR, which outperforms SWR, and SWR significantly outperforms TextRank. SSR slightly outperforms the combined ranking of all judges under ROUGE-1, and is slightly below but very close to the combined ranking under ROUGE-2 and ROUGE-SU4, with the percentage differences being, respectively, 0.029%, 0.104%, and 0.109%. Moreover, SPR is slightly below SSR, SWR is slightly below SPR, and TextRank is substantially below SWR.
A full range of comparisons under ROUGE-1 with individual judges and the combined ranking of all judges are given at Fig. 2, where the percentage indicates that a portion of sentences are selected according to their ranks by the underlying methods. Thus, when 90% or more sentences are selected, all methods are comparable.
To demonstrate the robustness of an algorithm, it is customary to also compare ROUGE-2 and ROUGE-SU4 scores against the corresponding references. An algorithm is robust if it compares consistently against the references under different ROUGE measures. Full-range comparisons under the ROUGE-2 and ROUGE-SU4 measures (see Table 4) show similar trends to those under ROUGE-1, indicating that SSR, SPR, and SWR are robust. Table 4 shows the comparison results from 10% top-ranked sentences to 90%, with an increment of 10% each time.
It follows from Table 4 that, under the ROUGE-1, ROUGE-2, and ROUGE-SU4 measures, comparison results similar to those on the 5% top-ranked sentences discussed on Table 3 hold true on the entire spectrum. More specifically,
SSR is better than SPR and is comparable with the combined ranking of all judges at all percentage levels with slightly smaller scores. Full-range comparisons of SSR, SPR, SWR, and textRank with individual judges and the combined ranking of all judges over the summBank benchmarks, where shaded numbers in a group of comparisons for an individual judge are the highest scores under the corresponding measures, while in the group of comparisons for combined ranking, the shaded numbers are the highest and the second highest scores
ROUGE (%) comparison with different features removed
SPR is better than SWR and narrows the gap between SWR and the combined ranking.
SWR is significantly better than TextRank.
SWR is substantially better than each individual judge’s ranking. SWR is compatible on top ranked sentences (up to 30%), but incurs a moderate gap on lower ranked sentences between 30% and 90%.
TextRank is substantially worse than the combined ranking of all judges. On the other hand, TextRank is better than individual judges on sentences of higher ranks but worse on sentences of lower ranks.
Comparison with Judge 1: TextRank is better on top ranked sentences (up to 20%) and worse on the rest of the sentences. Comparison with Judge 2: TextRank is much better on top ranked sentences (up to 30%) and slightly better on the rest of the sentences. Comparison with Judge 3: TextRank is better or comparable on top ranked sentences (up to 30%), worse on lower ranked sentences from 30% to 70%, and comparable on the rest of the sentences.
TextRank is based only on co-occurrences of words for computing sentence scores, considering neither semantic information, nor subtopic diversity, nor structure of the underlying document. Incorporating these three features is expected to significantly improve the accuracy of sentence ranking, and the experiment results confirm that it is true. While incorporating semantics of words is better than without them, incorporating semantics of phrases and words is better than just using semantics of words, and adding semantics of sentences is even better.
It will be shown next how word semantics, article structure, Softplus function adjustment, and subtopic clustering each contribute to the improvement of sentence ranking.
It is interesting to understand how each of the features of semantic edges, Softplus function adjustment, ASB PageRank, and subtopic clustering actually contributes to the improvement of sentence rankings. To answer this question it suffices to evaluate the basic model SWR by removing a feature from it one at a time. Let SWR_NSE, SWR_NAS, SWR_NSC, and SWR_NSP denote, respectively, the variant of SWR without semantic edges, article-structure information, subtopic clustering, and Softplus adjustment.
Table 5 is the results obtained from evaluations over SummBank on selections of 10%, 40%, and 70% of sentences. TextRank is included as a baseline. The numbers in italic are the most severe drops, indicating that the corresponding features are the most critical. The following results are drawn:
With each of the features removed, the corresponding ROUGE scores drop, indicating that each feature has contributed to the improvement. When the percentage of selecting sentences is smaller, removing article structure results in much larger drops, indicating that article structures are more significant on top-ranked sentences. When the percentage becomes larger, semantic edges would become increasingly more critical. While the Softplus function adjustment does improve ROUGE scores, it is not as significant as the other features. Each SWR variant outperforms TextRank under all ROUGE measures, except that at the 70% level, when semantic edges are removed, SWR_NSE is lightly below TextRank.
Conclusions and final remarks
SSR is an efficient and accurate scheme for ranking sentences of a given document. In particular, an implementation of SSR presented in this paper runs in quadratic time, and outperforms, on the SummBank benchmarks, each individual human judge’s ranking under standard ROUGE measures and compares well with the combined ranking of all judges. Moreover, extracting sentences of the highest ranks with an appropriate number of sentences as an extractive summary achieves the state-of-the-art results over the DUC-02 benchmarks under standard ROUGE measures.
SSR is an unsupervised scheme that does not rely on language-specific features or deep linguistic computations, and so it is readily adaptable across languages. What it needs is a reasonable corpus of digital documents available in the adapted language (such as Wikipedia dumps) for extracting phrases and training word and phrase embedding representations. However, the similarity threshold values presented in this paper for constructing a semantic word graph and a semantic phrase graph for a given document, while appropriate for the English language and not sensitive to a small change, may need to be adjusted for other languages. Better threshold values can be determined using a small amount of labeled data. When no labeled data is available, an expected number of synonyms may be used to determine appropriate thresholds [59].
The following directions may be explored for further improvement of accuracy, in addition to what has been mentioned in the earlier sections:
Investigate other embedding methods such as ELMo [60], LEAR [61], Poincaré embed-dings [62], hierarchical embedding [63], and spherical embedding [64]. Investigate unsupervised sentence representations such as skip-thought vectors [65] and BERT [66]. Fine-tune location scoring for better representing article structures. Explore new approximation algorithm for the NP-hard multi-objective knapsack problem. For example, instead of using round-robin approximation that treats each cluster equally likely, a weighted round-robin strategy that treats each cluster according to the subtopic distribution in the given document may help improve accuracy.
Finally, readers should keep in mind that employing new methods, while possibly improving accuracy, may also degrade efficiency. Whether a trade-off is acceptable depends on the underlying applications.
Footnotes
A prototype is available at
Available at
Available at
See, for example, discussions in
Acknowledgments
This work was supported in part by Eola Solutions. The authors thank Wenjing Yang, Yicheng Sun, and Changfeng Yu of UMass Lowell for writing an API to run the AutoPhrase code available at Github to carry out SSR evaluations, and to Liqun (Catherine) Shao of Microsoft for an interesting discussion on Softplus function adjustment. They are grateful to Cheng Zhang at UMass Lowell for implementing dooyeed.com that uses SSR to rank sentences for a given document. The second author would also like to thank Jiawei Han of the University of Illinois at Urbana-Champaign and Jesse Wang of the University of Rochester for inspiring conversations.
