Abstract
The present work proposes an application of Soft Rough Set and its span for unsupervised keyword extraction. In recent times Soft Rough Sets are being applied in various domains, though none of its applications are in the area of keyword extraction. On the other hand, the concept of Rough Set based span has been developed for improved efficiency in the domain of extractive text summarization. In this work we amalgamate these two techniques, called Soft Rough Set based Span (SRS), to provide an effective solution for keyword extraction from texts. The universe for Soft Rough Set is taken to be a collection of words from the input texts. SRS provides an ideal platform for identifying the set of keywords from the input text which cannot always be defined clearly and unambiguously. The proposed technique uses greedy algorithm for computing spanning sets. The experimental results suggest that extraction of keywords using the proposed scheme gives consistent results across different domains. Also, it has been found to be more efficient in comparison with several existing unsupervised techniques.
Introduction
Dealing with uncertainties of various nature at regular and consistent level is a usual feature of many real-life applications. As a consequence, study and analysis of these uncertainties, and their careful handling are of significant importance for any such domain. Rough Set [1] and their extensions viz. Soft Rough Set [2] have already been used in literature for dealing with uncertainties in various applications. In this respect we have developed the concept of Soft Rough Set based Span (SRS), and demonstrated its efficacy for keyword extraction on news headlines belonging to different domains.
Several keyword extraction techniques have been developed over the years. One being the supervised approach in which the keyword selection process is guided, and most discerning terms are provided in output. Another method is using an unsupervised approach. These algorithms work without any kind of user supervision or training data, in terms of class of input data. The output generated is based on the text input(s) being provided to the algorithm. One such technique is TF-IDF [3]. The relevance score of a word is calculated by multiplying two metrics: how many times a word appears in a document (TF), and the inverse document frequency (IDF) stating the number of occurrences of the word across a set of documents. Other examples of unsupervised keyword extraction algorithms are Textrank [4], RAKE [5]. Since Textrank was first released it has been applied to tasks, such as summarization [6, 7], credibility assessment [8], opinion mining [9, 10] and others. Textrank uses a graph based model to extract keywords, while RAKE extracts keyphrases specifically for document summarization. In Table 1 we provide a comparison of some basic properties of the proposed technique with other existing techniques. The above-mentioned works of extracting keywords involve inherent uncertainty in selecting keywords against many competing (multi)words based on their importance/frequency in the document. Rough set based decision making provides a suitable alternative to deal with the inherent uncertainty related to keyword extraction. However, use of Rough Set would need the universe of objects to be partitioned into respective equivalence classes. This motivates us to propose the use of Soft Rough Set which instead of defining equivalence classes provides a function to map the universe of objects into different equivalence classes based on a set of attributes.
Comparison of different unsupervised keyword extraction techniques
Comparison of different unsupervised keyword extraction techniques
Rough Set theory is a comprehensive and useful mathematical tool to deal with imperfect data with uncertainty. Rough Sets are defined on an Information System (U, P), where U is the universe of a set of objects and P is the set of attributes. In Rough Set theory, the universe is divided into three sets, Lower Approximation, Upper Approximation and Boundary Region with respect to a given subset X of U. The Lower Approximation of a Rough Set, also known as Positive Region, consists of elements of U which will definitely belong to the set. Boundary Region consists of elements which may belong to a classification category, although not always guaranteed. The Upper Approximation is the set of objects of U that are in both Lower Approximation and Boundary Region.
Rough Set based uncertainty handling techniques have been used in various areas of Artificial Intelligence, including Data Mining [11], Data Classification [12], Information Retrieval [13], Feature Selection [14] among others.
The proposed Soft Rough Set Span (SRS Span) based technique uses the concept of span as defined in [15]. We have extended the notion of span from the domain of Rough Set to Soft Rough Set for extracting the important keywords from text using an unsupervised approach. A greedy algorithm is proposed for computing the span for extraction of top k keywords, where k is an input to be given by a user.
The paper is organized as follows. Section 2 presents the formulation of Soft Rough Set and defines Soft Rough Set based Span. Section 3 proposes a greedy algorithm for spanning set computation. In Section 4 the proposed technique is compared with existing baseline techniques. Moreover, the consistency of our proposed technique across various domains of news headlines is examined. Section 5 concludes the paper.
Soft Set [16] over a universe U and a set of parameters E is a pair (F, A) where A is a subset of E and F provides a mapping from the set A to the power set of U. For each e ɛA, F(e) is a subset of U. In a way each F(e) determines a set of objects that are similar to each other with respect to the attribute e.
Let F(e1) = {c2, c4, c8}, F(e2) = {c1, c2}, F(e3) = {c3, c4, c5}, F(e4) = {c1, c3, c6} and F(e5) = {c2, c7}. Thus, Soft Set (F, E) defines the following equivalence classes: (F, E) = {expensive cars = {c2, c4, c8}, modern cars = {c1, c2}, cheap cars = {c3, c4, c5}, high mileage cars = {c1, c3, c6}, low mileage cars = {c2, c7}}. Below we give some necessary definitions relevant to this work.
In Rough Set theory a subset X⊆U is represented by a pair of crisp sets, namely, Lower Approximation and Upper Approximation, which are mathematically defined as follows: Lower Approximation:
Upper Approximation:
The set difference between Upper Approximation and Lower Approximation is known as Boundary Region, given as: Boundary Region:
δP,X is referred as the span score, and the set X which maximizes the span score for a particular value of u is said to be the spanning set. Hence, the set X depends on the parameter value u. The span of a Rough Set computes the efficiency of a subset X overlapping with the equivalence classes of the universe.
Soft Rough Set which merges the concepts of Soft Set and Rough Set is defined below.
Boundary Region of Soft Rough Set is defined in a similar way as in Rough Set, Soft A-Boundary Region = SBNDA (X) = SUpperA (X) - SLowerA (X)
The concept of SRS Span is proposed in the following definition.
In the above equation the parameter u is user-defined. The equation of SRS Span is linear in u.
The computation of spanning set and span score for SRS Span is explained in Illustration 2 using the scenario given in Illustration 1.
Considering set X, we have the following: SLowerP (X) = F (e3) = { c3, c4, c5 } SUpperp (X) = F (e1) ∪ F (e2) ∪ F (e3) ∪ F (e4) = { c1, c2, c3, c4, c5, c6, c8 } . SBNDP (X) = SUpperp (X) - SLowerP (X) ={ c1, c2, c6, c8 } |SLowerP (X) |=3and|SBNDP (X) | = 4
Similarly, for sets Y and Z the SRS Span is computed as shown in Table 2:
SRS Span for sets Y and Z
SRS Span for sets Y and Z
Figure 1 shows the varying values of δP,X, δP,Y and δP,Z for u ∈ [0, 1]. In both Fig. 1 and Table 3, red, blue and green denote values corresponding to δP,X, δP,Y and δP,Z respectively. The spanning set varies with the choice of u as tabulated in Table 3.

Plot of δP,X, δP,Y and δP,Z for u ∈ [0, 1].
For set Z since the Lower Approximation is ø. A value of u < 0.33 gives the maximum span score among the rest. Section 3 describes the application of SRS Span for extracting keywords.
In this section we propose a greedy algorithm for selection of the set of keywords. The algorithm computes the set of keywords in an iterative way starting from the empty set. In each iteration the next candidate word is tested for possible inclusion in the set of keywords by checking whether its inclusion increases the span of the keyword set. The selection of the next candidate word can be done in two ways: Forward Selection: It starts from the first available word, and selects the next candidate word in a sequential way. Forward Selection has two inherent drawbacks: It is certain to include the very first available word, the search may not last till the end of the document if the desired number of keywords are already enlisted. Random Selection: In this strategy in each iteration the next candidate word is chosen randomly from the remaining available words.
Hence in this work we have followed the Random Selection strategy. Figure 2 provides the details of the Random Selection algorithm.

Soft Rough Set based span for keyword extraction.
We start randomly with X = {Meetings}. Next we compute the Lower and Upper Approximations.
Case 1: Consider the value of u = 0.3. 1st iteration: SLowerP (X) =ø, SUpperp (X) = F(Work), as F(Work) ∩X≠ ø. SBNDP (X) = F (Work) = {Meetings, E - mail} |SLowerP (X) |=0 and |SBNDP (X) |=2
In the next iteration suppose the word chosen randomly be ‘E-mail’. Hence for the second iteration X = {Meetings, E-mail}.
Therefore, SLowerP (X) = F(Work), SUpperp (X) = F(Work), as F(Work) ∩X≠ ø.
Hence, SBNDP (X) =ø, and
As span score decreases ‘E-mail’ gets dropped from set X. Hence for the third iteration one more word is chosen randomly and let that be ‘Gym’. Hence the new X = {Meetings, Gym}.
Now SLowerP (X) =ø, and SUpperp (X) ={Soccer, Meetings, Walk, Gym, E-mail}
Hence, SBNDP (X) ={Soccer, Meetings, Walk, Gym, E-mail} and
As the span score increases and set X consists of two elements the algorithm stops. The top two keywords is X = {Meetings, Gym}.
Case 2: Here we consider value of u = 0.8.
1st iteration: X = {Meetings},
2nd iteration: X = {Meetings, E-mail},
As span score increases E-mail is not dropped from X in this case. The top two keywords therefore are:
Case 1 and Case 2 give different sets of top two keywords depending on the choice of u. In Case 2 both the keywords are from the same cluster due to higher weightage to the elements in the Lower Approximation. Hence for the future experiments we work with small value of u = 0.1 as having most words from the same cluster will make the selection biased.
The task of keyword extraction is performed on a dataset [17] of one million Australian news headlines given by the Australian Broadcasting Corporation (ABC). The reason for choosing this dataset is that given a document containing a large number of news headlines, extracting keywords would help the user get an idea of the exhaustive set of domains covered in the headlines.
The experiment is conducted for varying number of clusters from 120 to 170 with a step size of 10. For each cluster size the algorithm is run for 10 iterations selecting 5000 random news headlines each time. In each iteration top 50, 75 and 100 keywords are extracted for each of the clusters. The pre-processing step involves removing duplicate headlines from the dataset, deleting stop words using python NLTK [18] package, lemmatization using spaCy [19] and building the Word2Vec model of dimension 100. The window size is kept at 5 as the length of headlines after preprocessing does not exceed five. The user defined parameter is set to u = 0.1. The SRS Span Score for a cluster size k is the average of all the SRS Span scores over 10 iterations. The quality of keywords in each iteration is evaluated by comparing the extracted keywords with 400 keywords selected by a human expert after going through the ABC news database. The similarity score of a word in the extracted keyword set is taken to be the highest similarity score when compared with each of the 400 words suggested by the human expert. Therefore, for each iteration the similarity score is computed as the average of the similarity scores of all the words in the extracted keyword set. The final similarity score is the average taken over 10 iterations. The respective SRS Span score and similarity scores for varying cluster sizes are presented in Table 3.
Ordering of SRS Span score with respect to u
Ordering of SRS Span score with respect to u
The similarity metrics used in this work are as follows: Word Embedding based similarity. Word vectors are computed from Word2Vec [20]. The cosine distance between the vectors of the extracted keywords and the manually selected keywords is computed to estimate the similarity between the two collection of keywords. WordNet based semantic similarity [21]. In WordNet, each concept is represented by synonym sets, known as synsets, which have a common meaning. A synset consists of English noun, verbs, adjectives and adverbs. A synset is represented by a 3-part name of the form: word.pos.nn, where, word is the specific word to which the synset belongs, pos is the part of speech of the synset, and nn is the number associated with the specific synset
WordNet similarity measure is used to find the similarity score of two words. The shorter is the path between two synsets the higher is their similarity value. The score lies between 0 and 1, where 1 means absolute similarity, and 0 means no similarity. The average similarity score between each pair of words between extracted keywords and manually marked keywords is taken to get the final score. WordNet similarity is calculated using NLTK and spaCy package of python. Figure 3 displays the Word Cloud generated for Top 100 keywords using 150 clusters. Similarly, Figs. 4, 5 and 6 display the Word Cloud for Top 100 keywords using RAKE, Textrank and TF-IDF.

Word Cloud of Top 100 keywords using SRS Span on ABC dataset.

Word Cloud of Top 100 keywords using RAKE on ABC dataset.

Word Cloud of Top 100 keywords using Textrank on ABC dataset.

Word Cloud of Top 100 keywords using TF-IDF on ABC dataset.
The similarity scores in Table 4 illustrate the SRS Span score and the similarity scores of the proposed algorithm. The experiment is stopped after 170 clusters as there is a significant drop in the SRS Span score. On comparison it is seen that the span score decreases with the increase in the size of clusters. This is intuitive as the number of words in the Boundary Region of Soft Rough Set decrease due to less number of words in each cluster. It is observed that with the increase in the SRS Span score, there is an increase in the similarity score with respect to both the similarity metrics used. This is evident from Table 5 as the correlation between the SRS Span scores and the respective similarity scores is high.
SRS Span score and similarity scores with varying clusters
Correlation coefficient between SRS Span score and respective similarity scores
Next we compare SRS Span with other unsupervised keyword extraction techniques, namely, TF-IDF, RAKE and Textrank. Metrics such as Precision, Recall and F1-score and Word2Vec and WordNet similarity scores for the Top 100 keywords with the manually extracted keywords is used for comparison. F1-score is defined as:
With respect to our experiments Precision is defined as how many of the extracted keywords are correct, while Recall means how many of the manually assigned keywords are present in the list of keywords obtained by the algorithm.
In Table 6 the efficiency of the proposed method is compared with the existing techniques: TF-IDF, RAKE and Textrank. The SRS Span model with 150 clusters is considered for comparison. As SRS Span extracts relevant semantic keywords due to clustering. Hence the similarity scores are higher compared to the existing techniques. To check consistency, the algorithm is run on a News Category Dataset [22] for five different domains, namely, Politics, Business, Crime, Entertainment and Sports. Table 7 presents the scores when the extracted keywords are compared with the manually extracted words for this dataset.
Comparison with other unsupervised baseline techniques
Experimental results across various domains of headlines
The cluster size for experiments in Table 7 also has been domains. The values indicate that the technique gives consistent scores across various domains. This demonstrates the efficacy of the algorithm across different domains. The values in bold indicate the highest score achieved among all the experiments conducted across domains.
In this paper we present a novel concept of SRS Span based keyword extraction. This is applied for the problem of keyword extraction from text inputs which we have chosen as news headlines from Australian Broadcasting Corporation. The results have been found to be very encouraging across different domains of news headlines.
Unlike supervised techniques which require a training corpus, this approach is unsupervised. However, to the merit of the proposed algorithm, it draws features from the text on its own, owing to the use of word embeddings. Additionally, the choice of the parameter value u provides flexibility to the user to apply this technique according to the problem requirements.
Presently, we conducted experiments with news headlines. In future we plan to extend the application to other types of short texts, such as tweets, and also to longer form of documents, such as news articles. As part of further research, a comparison between other word embedding would help gain more insight about this technique. Currently we have used K-Means clustering which is hard clustering. In future, we want to extend it using Fuzzy C-Means Clustering and lexical chains for improved performance.
