Soft Rough Set based span for unsupervised keyword extraction

Abstract

The present work proposes an application of Soft Rough Set and its span for unsupervised keyword extraction. In recent times Soft Rough Sets have been applied in various domains, though none of these applications is in the area of keyword extraction. On the other hand, the concept of Rough Set based span has been developed for improved efficiency in the domain of extractive text summarization. In this work we amalgamate these two techniques into what we call Soft Rough Set based Span (SRS), to provide an effective solution for keyword extraction from texts. The universe for the Soft Rough Set is taken to be a collection of words from the input texts. SRS provides an ideal platform for identifying the set of keywords from the input text, which cannot always be defined clearly and unambiguously. The proposed technique uses a greedy algorithm for computing spanning sets. The experimental results suggest that extraction of keywords using the proposed scheme gives consistent results across different domains. Also, it has been found to be more efficient in comparison with several existing unsupervised techniques.

Keywords

Keyword extraction, Rough Set, Soft Rough Set, Rough Set based Span, natural language processing

1 Introduction

Dealing with uncertainties of various kinds on a regular basis is a common feature of many real-life applications. Consequently, the study, analysis and careful handling of these uncertainties are of significant importance in any such domain. Rough Sets [1] and their extensions, such as Soft Rough Sets [2], have already been used in the literature for dealing with uncertainties in various applications. In this work we develop the concept of Soft Rough Set based Span (SRS), and demonstrate its efficacy for keyword extraction on news headlines belonging to different domains.

Several keyword extraction techniques have been developed over the years. One family is the supervised approach, in which the keyword selection process is guided by labelled training data and the most discerning terms are returned. The other family is the unsupervised approach. These algorithms work without any kind of user supervision or training data, such as class labels for the input data, and the output depends only on the text input(s) provided to the algorithm. One such technique is TF-IDF [3]. The relevance score of a word is calculated by multiplying two metrics: how many times the word appears in a document (TF), and the inverse document frequency (IDF), which decreases as the word occurs in more documents of the collection. Other examples of unsupervised keyword extraction algorithms are Textrank [4] and RAKE [5]. Since Textrank was first released it has been applied to tasks such as summarization [6, 7], credibility assessment [8], opinion mining [9, 10] and others. Textrank uses a graph based model to extract keywords, while RAKE extracts keyphrases specifically for document summarization. In Table 1 we provide a comparison of some basic properties of the proposed technique with other existing techniques. The above-mentioned techniques involve inherent uncertainty in selecting keywords from among many competing (multi)words based on their importance/frequency in the document. Rough Set based decision making provides a suitable alternative for dealing with this uncertainty. However, the use of Rough Set would need the universe of objects to be partitioned into equivalence classes. This motivates us to propose the use of Soft Rough Set, which instead of requiring equivalence classes provides a function that maps each attribute in a set of attributes to a subset of the universe of objects.
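As a minimal sketch of the TF-IDF score described above, the following Python fragment uses one common variant (raw term count multiplied by log(N / document frequency)). The function name `tf_idf` and the three toy tokenized headlines are illustrative assumptions and are not taken from the experiments in this paper.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score every word of every tokenized document.

    Assumed variant: TF is the raw count of the word in the document and
    IDF is log(N / df), where df is the number of documents containing it.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word once per document
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return scores

# Hypothetical toy headlines, already tokenized.
docs = [
    ["rain", "floods", "town"],
    ["rain", "delays", "match"],
    ["council", "approves", "town", "plan"],
]
scores = tf_idf(docs)
# "floods" occurs in one headline only, so it outscores "rain" (two headlines).
print(scores[0]["floods"] > scores[0]["rain"])  # True
```

A word that occurs in every document receives a score of zero under this variant, which is why frequent but uninformative words are suppressed.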

Table 1
Comparison of different unsupervised keyword extraction techniques

Method	Pros	Cons
TF-IDF	• Easy to compute • Term-weighting improves the quality	• Based on the bag-of-words (BoW) model, therefore it does not capture the semantics of the text • Assumes independence of index terms
RAKE	• Can be applied to any text without prior training • Domain independent • Works on each document very fast and the complexity is low	• Works on a single document at a time • Cannot extract semantically meaningful words
Textrank	• Based on Google PageRank • Finds keywords that appear “central” • Domain independent	• Extracts a few keywords, ignoring their semantic relevance • Works on a single document at a time
SRS based span	• Domain independent • Clustering of semantically close keywords helps generate a set of keywords covering the document	• Due to clustering and Rough Set approximations, computational speed is on the lower side

Rough Set theory is a comprehensive and useful mathematical tool for dealing with imperfect and uncertain data. Rough Sets are defined on an Information System (U, P), where U is the universe of objects and P is the set of attributes. In Rough Set theory, a given subset X of U is described by three sets: the Lower Approximation, the Upper Approximation and the Boundary Region. The Lower Approximation, also known as the Positive Region, consists of the elements of U which definitely belong to X. The Boundary Region consists of elements which may belong to X, although this is not guaranteed. The Upper Approximation is the union of the Lower Approximation and the Boundary Region.

Rough Set based uncertainty handling techniques have been used in various areas of Artificial Intelligence, including Data Mining [11], Data Classification [12], Information Retrieval [13], Feature Selection [14] among others.

The proposed Soft Rough Set Span (SRS Span) based technique uses the concept of span as defined in [15]. We have extended the notion of span from the domain of Rough Set to Soft Rough Set for extracting the important keywords from text using an unsupervised approach. A greedy algorithm is proposed for computing the span for extraction of top k keywords, where k is an input to be given by a user.

The paper is organized as follows. Section 2 presents the formulation of Soft Rough Set and defines Soft Rough Set based Span. Section 3 proposes a greedy algorithm for spanning set computation. In Section 4 the proposed technique is compared with existing baseline techniques. Moreover, the consistency of our proposed technique across various domains of news headlines is examined. Section 5 concludes the paper.

2 Soft Set, Rough Set and Soft Rough Set based Span

A Soft Set [16] over a universe U and a set of parameters E is a pair (F, A), where A is a subset of E and F is a mapping from the set A to the power set of U. For each e ∈ A, F(e) is a subset of U. In a way, each F(e) determines a set of objects that are similar to each other with respect to the attribute e.

Illustration 1: Let U = {c₁, c₂, c₃, c₄, c₅, c₆, c₇, c₈} be a set of cars, and let E = {e₁, e₂, e₃, e₄, e₅} be a set of parameters where e₁ = ‘expensive’, e₂ = ‘modern’, e₃ = ‘cheap’, e₄ = ‘high mileage’ and e₅ = ‘low mileage’. A Soft Set (F, E) where F is the “choice of car” is defined in the following way:

Let F(e₁) = {c₂, c₄, c₈}, F(e₂) = {c₁, c₂}, F(e₃) = {c₃, c₄, c₅}, F(e₄) = {c₁, c₃, c₆} and F(e₅) = {c₂, c₇}. Thus, Soft Set (F, E) defines the following equivalence classes: (F, E) = {expensive cars = {c₂, c₄, c₈}, modern cars = {c₁, c₂}, cheap cars = {c₃, c₄, c₅}, high mileage cars = {c₁, c₃, c₆}, low mileage cars = {c₂, c₇}}. Below we give some necessary definitions relevant to this work.
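As an illustrative sketch, the Soft Set of Illustration 1 can be stored as a plain Python dictionary from parameters to subsets of U. The helper `parameters_of` is a hypothetical name introduced here only to invert the mapping for a single car.

```python
# Soft Set (F, E) from Illustration 1: each parameter maps to a subset of U.
U = {f"c{i}" for i in range(1, 9)}
F = {
    "expensive": {"c2", "c4", "c8"},
    "modern": {"c1", "c2"},
    "cheap": {"c3", "c4", "c5"},
    "high mileage": {"c1", "c3", "c6"},
    "low mileage": {"c2", "c7"},
}

def parameters_of(obj, soft_set):
    """Return, in sorted order, the parameters e for which obj is in F(e)."""
    return sorted(e for e, members in soft_set.items() if obj in members)

# Every F(e) must be a subset of the universe U.
assert all(members <= U for members in F.values())

print(parameters_of("c2", F))  # ['expensive', 'low mileage', 'modern']
print(parameters_of("c6", F))  # ['high mileage']
```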

Rough Set: Let U be the universe under consideration, and P be a set of attributes describing U. The pair (U, P) is called an Information System. The attribute set P induces an equivalence (indiscernibility) relation on U, under which two objects are related when they agree on every attribute in P: $\{(x, y) \in U \times U \mid \forall a \in P,\ a (x) = a (y)\}$. The set of equivalence classes of U so obtained is denoted by $\frac{U}{P}$.

In Rough Set theory a subset X⊆U is represented by a pair of crisp sets, namely, Lower Approximation and Upper Approximation, which are mathematically defined as follows:

Lower Approximation: ${Lower}_{P} (X) = \bigcup \{Y \in \frac{U}{P} : Y \subseteq X\}$

Upper Approximation: ${Upper}_{P} (X) = \bigcup \{Y \in \frac{U}{P} : Y \cap X \neq \emptyset\}$

The set difference between the Upper Approximation and the Lower Approximation is known as the Boundary Region, given as:

Boundary Region: ${BND}_{P} (X) = {Upper}_{P} (X) - {Lower}_{P} (X)$
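The three definitions above translate directly into set operations. The sketch below is a minimal illustration; the function name `approximations` and the six-element partition are hypothetical and serve only to make the definitions concrete.

```python
def approximations(partition, X):
    """Return (lower, upper, boundary) of X for a partition U/P.

    A block contributes to the lower approximation when it lies entirely
    inside X, and to the upper approximation when it intersects X.
    """
    lower, upper = set(), set()
    for block in partition:
        if block <= X:
            lower |= block
        if block & X:
            upper |= block
    return lower, upper, upper - lower

# Hypothetical partition of U = {1, ..., 6} into equivalence classes.
partition = [{1, 2}, {3}, {4, 5}, {6}]
X = {1, 2, 3, 4}

lower, upper, boundary = approximations(partition, X)
print(sorted(lower))     # [1, 2, 3]
print(sorted(upper))     # [1, 2, 3, 4, 5]
print(sorted(boundary))  # [4, 5]
```

Here the block {4, 5} only partly overlaps X, so it is the sole contributor to the Boundary Region.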

Rough Set based Span: The span of a subset X of U, with respect to a set of attributes P is defined as:

$δ_{P, X} = u * \frac{| {Lower}_{P} (X) |}{| U |} + (1 - u) * \frac{| {BND}_{P} (X) |}{| U |}, \text{ for } u \in [0, 1].$

δ_P,X is referred to as the span score, and the set X which maximizes the span score for a particular value of u is said to be the spanning set. Hence, the spanning set depends on the parameter value u. The span of a Rough Set measures how effectively a subset X overlaps with the equivalence classes of the universe.

Soft Rough Set which merges the concepts of Soft Set and Rough Set is defined below.

Soft Rough Set: Let (F, A) be a Soft Set over U. The Soft Rough Lower Approximation and Soft Rough Upper Approximation of X with respect to A are defined as follows:

${SLower}_{A} (X) = \bigcup_{a \in A} \{F (a) : F (a) \subseteq X\}$

${SUpper}_{A} (X) = \bigcup_{a \in A} \{F (a) : F (a) \cap X \neq \emptyset\}$

Boundary Region of Soft Rough Set is defined in a similar way as in Rough Set,

Soft A-Boundary Region = SBND_A (X) = SUpper_A (X) - SLower_A (X)

The concept of SRS Span is proposed in the following definition.

Soft Rough Set based Span: Let S = (F, A) be a Soft Set over U, where A ⊆ E and E is the set of parameters. The SRS Span with respect to the attribute set P for a subset X ⊆ U is defined as:

$δ_{P, X} = u * \frac{| {SLower}_{P} (X) |}{| U |} + (1 - u) * \frac{| {SBND}_{P} (X) |}{| U |}, \text{ for } u \in [0, 1].$

In the above equation the parameter u is user-defined. The equation of SRS Span is linear in u.

The computation of spanning set and span score for SRS Span is explained in Illustration 2 using the scenario given in Illustration 1.

Illustration 2: Consider sets X, Y and Z having elements {c₁, c₃, c₄, c₅}, {c₂, c₆, c₇} and {c₄, c₅, c₇} respectively. The SRS Span for X is computed as follows:

SLower_P (X) = F (e₃) = {c₃, c₄, c₅}

SUpper_P (X) = F (e₁) ∪ F (e₂) ∪ F (e₃) ∪ F (e₄) = {c₁, c₂, c₃, c₄, c₅, c₆, c₈}

SBND_P (X) = SUpper_P (X) - SLower_P (X) = {c₁, c₂, c₆, c₈}

|SLower_P (X)| = 3 and |SBND_P (X)| = 4

$δ_{P, X} = u * \frac{| {SLower}_{P} (X) |}{| U |} + (1 - u) * \frac{| {SBND}_{P} (X) |}{| U |}$

$δ_{P, X} = u * \frac{3}{8} + (1 - u) * \frac{4}{8} = \frac{4 - u}{8}$

Similarly, for sets Y and Z the SRS Span is computed as shown in Table 2:

Table 2
SRS Span for sets Y and Z

Y	Z
SLower_P (Y) = {c₂, c₇}	SLower_P (Z) = ø
SUpper_P (Y) = {c₁, c₂, c₃, c₄, c₆, c₇, c₈}	SUpper_P (Z) = {c₂, c₃, c₄, c₅, c₇, c₈}
SBND_P (Y) = {c₁, c₃, c₄, c₆, c₈}	SBND_P (Z) = {c₂, c₃, c₄, c₅, c₇, c₈}
$δ_{P, Y} = u * \frac{2}{8} + (1 - u) * \frac{5}{8} = \frac{5 - 3 u}{8}$	$δ_{P, Z} = u * \frac{0}{8} + (1 - u) * \frac{6}{8} = \frac{6 - 6 u}{8}$
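The computations of Illustration 2 can be checked with a short script. This is a sketch under the definitions of Section 2; the helper names `soft_rough` and `srs_span` are introduced here for illustration, and exact fractions are used so that the span at a sample value u = 1/10 can be compared with the closed forms (4 - u)/8, (5 - 3u)/8 and (6 - 6u)/8.

```python
from fractions import Fraction

U = {f"c{i}" for i in range(1, 9)}
F = {
    "e1": {"c2", "c4", "c8"},
    "e2": {"c1", "c2"},
    "e3": {"c3", "c4", "c5"},
    "e4": {"c1", "c3", "c6"},
    "e5": {"c2", "c7"},
}

def soft_rough(F, X):
    """Return (SLower, SBND) of X with respect to the Soft Set F."""
    lower, upper = set(), set()
    for members in F.values():
        if members <= X:
            lower |= members
        if members & X:
            upper |= members
    return lower, upper - lower

def srs_span(F, X, U, u):
    """SRS Span: u * |SLower| / |U| + (1 - u) * |SBND| / |U|."""
    lower, boundary = soft_rough(F, X)
    return (u * Fraction(len(lower), len(U))
            + (1 - u) * Fraction(len(boundary), len(U)))

sets = {
    "X": {"c1", "c3", "c4", "c5"},
    "Y": {"c2", "c6", "c7"},
    "Z": {"c4", "c5", "c7"},
}
u = Fraction(1, 10)
for name, S in sets.items():
    lower, boundary = soft_rough(F, S)
    print(name, len(lower), len(boundary), srs_span(F, S, U, u))
# X 3 4 39/80
# Y 2 5 47/80
# Z 0 6 27/40
```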

Figure 1 shows the varying values of δ_P,X, δ_P,Y and δ_P,Z for u ∈ [0, 1]. In both Fig. 1 and Table 3, red, blue and green denote values corresponding to δ_P,X, δ_P,Y and δ_P,Z respectively. The spanning set varies with the choice of u as tabulated in Table 3.

Fig. 1

Plot of δ_P,X, δ_P,Y and δ_P,Z for u ∈ [0, 1].

For set Z the Lower Approximation is ø, so its span score depends only on the Boundary Region. For u < 0.33, Z gives the maximum span score among the three sets. Section 3 describes the application of SRS Span for extracting keywords.
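Because each span score is linear in u, the ordering of the three sets can change only where two of the lines cross. The following sketch, which takes the coefficients |SLower|/|U| and |SBND|/|U| of Illustration 2 as given, computes those crossing points and the ordering at one sample value of u inside each interval; the function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations

# (|SLower| / |U|, |SBND| / |U|) for the sets of Illustration 2.
coeffs = {
    "X": (Fraction(3, 8), Fraction(4, 8)),
    "Y": (Fraction(2, 8), Fraction(5, 8)),
    "Z": (Fraction(0, 8), Fraction(6, 8)),
}

def span(name, u):
    a, b = coeffs[name]
    return u * a + (1 - u) * b

def crossover(p, q):
    """Value of u at which the span lines of p and q intersect.

    Assumes the two lines are not parallel (true for the sets used here).
    """
    (a1, b1), (a2, b2) = coeffs[p], coeffs[q]
    return (b2 - b1) / ((a1 - b1) - (a2 - b2))

def ranking(u):
    """Set names ordered from highest to lowest span score at u."""
    return sorted(coeffs, key=lambda name: span(name, u), reverse=True)

for p, q in combinations(coeffs, 2):
    print(p, q, crossover(p, q))
# X Y 1/2
# X Z 2/5
# Y Z 1/3

for u in (Fraction(1, 10), Fraction(7, 20), Fraction(9, 20), Fraction(3, 4)):
    print(u, ranking(u))
# 1/10 ['Z', 'Y', 'X']
# 7/20 ['Y', 'Z', 'X']
# 9/20 ['Y', 'X', 'Z']
# 3/4 ['X', 'Y', 'Z']
```

The crossing points 1/3, 2/5 and 1/2 are the interval boundaries 0.33, 0.4 and 0.5 at which the ordering of the span scores changes.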

3 SRS Span based keyword extraction

In this section we propose a greedy algorithm for selection of the set of keywords. The algorithm computes the set of keywords in an iterative way starting from the empty set. In each iteration the next candidate word is tested for possible inclusion in the set of keywords by checking whether its inclusion increases the span of the keyword set. The selection of the next candidate word can be done in two ways:

Forward Selection: It starts from the first available word, and selects the next candidate word in a sequential way. Forward Selection has two inherent drawbacks:

(1) It is certain to include the very first available word, and

(2) the search may not last till the end of the document if the desired number of keywords has already been selected.

Random Selection: In this strategy in each iteration the next candidate word is chosen randomly from the remaining available words.

Hence in this work we have followed the Random Selection strategy. Figure 2 provides the details of the Random Selection algorithm.

Fig. 2
Soft Rough Set based span for keyword extraction.
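The Random Selection procedure described above can be sketched in a few lines of Python. This is an illustrative reading of the textual description, not a reproduction of Fig. 2: the function names, the optional `order` argument (which fixes the sequence of candidate words so that a run is repeatable) and the `seed` argument are assumptions introduced here. The toy clusters are those of the worked example in this section.

```python
import random

def srs_span(clusters, X, universe_size, u):
    """SRS Span of the word set X for clusters given as {name: set of words}."""
    lower, upper = set(), set()
    for members in clusters.values():
        if members <= X:
            lower |= members
        if members & X:
            upper |= members
    boundary = upper - lower
    return (u * len(lower) + (1 - u) * len(boundary)) / universe_size

def greedy_keywords(clusters, k, u, order=None, seed=None):
    """Greedily grow a keyword set of at most k words.

    A candidate word is kept only if adding it increases the span score.
    If `order` is None the candidates are visited in random order.
    """
    universe = set().union(*clusters.values())
    if order is None:
        order = sorted(universe)
        random.Random(seed).shuffle(order)
    selected, best = set(), 0.0
    for word in order:
        if len(selected) == k:
            break
        score = srs_span(clusters, selected | {word}, len(universe), u)
        if score > best:
            selected.add(word)
            best = score
    return selected, best

clusters = {
    "Work": {"E-mail", "Meetings"},
    "Exercise": {"Soccer", "Walk", "Gym"},
    "Meal": {"Breakfast", "Lunch", "Dinner"},
}
# Fixed candidate order: Meetings, then E-mail, then Gym, then the rest.
order = ["Meetings", "E-mail", "Gym", "Lunch", "Soccer", "Walk",
         "Breakfast", "Dinner"]

low_u_keywords, _ = greedy_keywords(clusters, k=2, u=0.3, order=order)
high_u_keywords, _ = greedy_keywords(clusters, k=2, u=0.8, order=order)
print(sorted(low_u_keywords))   # ['Gym', 'Meetings']
print(sorted(high_u_keywords))  # ['E-mail', 'Meetings']
```

With a small u the sketch prefers words that open up new clusters, whereas a large u rewards completing a cluster that is already partly covered.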

Example: Let the universe of words be U = {Soccer, Meetings, Walk, Breakfast, Lunch, Gym, Dinner, E-mail}. Let the number of clusters be three, i.e., the attribute set P comprises Work, Exercise and Meal. Our objective is to extract the top two keywords. Let the objects in the clusters be the following: F(Work) = {E-mail, Meetings}, F(Exercise) = {Soccer, Walk, Gym} and F(Meal) = {Breakfast, Lunch, Dinner}.

We start randomly with X = {Meetings}. Next we compute the Lower and Upper Approximations.

Case 1: Consider the value of u = 0.3.

1st iteration: SLower_P (X) = ø and SUpper_P (X) = F(Work), as F(Work) ∩ X ≠ ø.

SBND_P (X) = F(Work) = {Meetings, E-mail}

|SLower_P (X)| = 0 and |SBND_P (X)| = 2

$δ_{P, X} = u * \frac{| {SLower}_{P} (X) |}{| U |} + (1 - u) * \frac{| {SBND}_{P} (X) |}{| U |} = 0.3 * 0 + 0.7 * \frac{2}{8} = 0.175$

In the next iteration, suppose the word chosen randomly is ‘E-mail’. Hence for the second iteration X = {Meetings, E-mail}.

Therefore, SLower_P (X) = F(Work) and SUpper_P (X) = F(Work), as F(Work) ⊆ X.

Hence, SBND_P (X) = ø, and $δ_{P, X} = 0.3 * \frac{2}{8} + 0.7 * 0 = 0.075$.

As the span score decreases, ‘E-mail’ is dropped from set X. For the third iteration one more word is chosen randomly; let that be ‘Gym’. Hence the new X = {Meetings, Gym}.

Now SLower_P (X) = ø, and SUpper_P (X) = F(Work) ∪ F(Exercise) = {Soccer, Meetings, Walk, Gym, E-mail}.

Hence, SBND_P (X) = {Soccer, Meetings, Walk, Gym, E-mail} and $δ_{P, X} = 0.3 * 0 + 0.7 * \frac{5}{8} = 0.4375$.

As the span score increases and set X now consists of two elements, the algorithm stops. The top two keywords are X = {Meetings, Gym}.

Case 2: Here we consider value of u = 0.8.

1st iteration: X = {Meetings}, $δ_{P, X} = 0.8 * 0 + 0.2 * \frac{2}{8} = 0.05 .$

2nd iteration: X = {Meetings, E-mail}, $δ_{P, X} = 0.8 * \frac{2}{8} + 0.2 * 0 = 0.2 .$

As the span score increases, ‘E-mail’ is not dropped from X in this case. The top two keywords therefore are X = {Meetings, E-mail}.

Case 1 and Case 2 give different sets of top two keywords depending on the choice of u. In Case 2 both keywords come from the same cluster, because a high value of u gives more weight to the elements in the Lower Approximation. Since drawing most keywords from the same cluster would bias the selection, the experiments that follow use a small value, u = 0.1.

4 Results and discussion

The task of keyword extraction is performed on a dataset [17] of one million Australian news headlines given by the Australian Broadcasting Corporation (ABC). The reason for choosing this dataset is that given a document containing a large number of news headlines, extracting keywords would help the user get an idea of the exhaustive set of domains covered in the headlines.

The experiment is conducted for a varying number of clusters, from 120 to 170 with a step size of 10. For each number of clusters the algorithm is run for 10 iterations, selecting 5000 random news headlines each time. In each iteration the top 50, 75 and 100 keywords are extracted for each number of clusters. The pre-processing step involves removing duplicate headlines from the dataset, deleting stop words using the Python NLTK [18] package, lemmatization using spaCy [19] and building a Word2Vec model of dimension 100. The window size is kept at 5, as the length of headlines after preprocessing does not exceed five. The user defined parameter is set to u = 0.1. The SRS Span score for a given number of clusters is the average of the SRS Span scores over the 10 iterations. The quality of keywords in each iteration is evaluated by comparing the extracted keywords with 400 keywords selected by a human expert after going through the ABC news database. The similarity score of a word in the extracted keyword set is taken to be the highest similarity score obtained when it is compared with each of the 400 words suggested by the human expert. For each iteration, the similarity score is then computed as the average of the similarity scores of all the words in the extracted keyword set. The final similarity score is the average taken over the 10 iterations. The respective SRS Span scores and similarity scores for varying numbers of clusters are presented in Table 4.
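The per-iteration similarity score described above (the best match of each extracted word against the reference keywords, averaged over the extracted set) can be sketched as follows. The two-dimensional vectors and word lists are hypothetical stand-ins for the 100-dimensional Word2Vec vectors and the 400 expert keywords, and cosine similarity is assumed as the word-to-word similarity measure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def keyword_set_score(extracted, reference, vectors):
    """Average, over extracted words, of the best similarity to any reference word."""
    best = [max(cosine(vectors[w], vectors[r]) for r in reference)
            for w in extracted]
    return sum(best) / len(best)

# Hypothetical toy embeddings (not from the trained model).
vectors = {
    "rain": (1.0, 0.0),
    "storm": (1.0, 1.0),
    "match": (0.0, 1.0),
    "weather": (1.0, 0.0),
    "sport": (0.0, 2.0),
}
extracted = ["rain", "storm", "match"]
reference = ["weather", "sport"]

score = keyword_set_score(extracted, reference, vectors)
print(round(score, 3))  # 0.902
```

The final score reported for a configuration would then be the mean of this quantity over the 10 iterations.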

Table 3
Ordering of SRS Span score with respect to u

Intervals of u	Ordering of span score
u ∈ [0, 0.33)	δ_P,Z > δ_P,Y > δ_P,X
u ∈ (0.33, 0.4)	δ_P,Y > δ_P,Z > δ_P,X
u ∈ (0.4, 0.5)	δ_P,Y > δ_P,X > δ_P,Z
u ∈ (0.5, 1]	δ_P,X > δ_P,Y > δ_P,Z

The similarity metrics used in this work are as follows:

Word Embedding based similarity. Word vectors are computed from Word2Vec [20]. The cosine similarity between the vectors of the extracted keywords and the manually selected keywords is computed to estimate the similarity between the two collections of keywords.

WordNet based semantic similarity [21]. In WordNet, each concept is represented by synonym sets, known as synsets, which have a common meaning. A synset consists of English nouns, verbs, adjectives and adverbs. A synset is represented by a 3-part name of the form: word.pos.nn, where,

word is the specific word to which the synset belongs,

pos is the part of speech of the synset, and

nn is the number associated with the specific synset

The WordNet similarity measure is used to find the similarity score of two words. The shorter the path between two synsets, the higher their similarity value. The score lies between 0 and 1, where 1 means absolute similarity and 0 means no similarity. The average similarity score over each pair of words drawn from the extracted keywords and the manually marked keywords is taken as the final score. WordNet similarity is calculated using the NLTK and spaCy packages of Python. Figure 3 displays the Word Cloud generated for the Top 100 keywords using 150 clusters. Similarly, Figs. 4, 5 and 6 display the Word Cloud for the Top 100 keywords using RAKE, Textrank and TF-IDF.

Fig. 3

Word Cloud of Top 100 keywords using SRS Span on ABC dataset.

Fig. 4

Word Cloud of Top 100 keywords using RAKE on ABC dataset.

Fig. 5

Word Cloud of Top 100 keywords using Textrank on ABC dataset.

Fig. 6

Word Cloud of Top 100 keywords using TF-IDF on ABC dataset.

Table 4 lists the SRS Span scores and the similarity scores of the proposed algorithm. The experiment is stopped after 170 clusters as there is a significant drop in the SRS Span score. On comparison it is seen that the span score decreases as the number of clusters increases. This is intuitive, as the number of words in the Boundary Region of the Soft Rough Set decreases when each cluster contains fewer words. It is observed that with the increase in the SRS Span score, there is an increase in the similarity score with respect to both the similarity metrics used. This is evident from Table 5, as the correlation between the SRS Span scores and the respective similarity scores is high.

Table 4

SRS Span score and similarity scores with varying clusters

Keywords	SRS Span Score	Word2Vec similarity	WordNet similarity
120 Clusters
Top 50	0.5324	0.631	0.158
Top 75	0.7361	0.697	0.195
Top 100	0.8837	0.723	0.283
130 Clusters
Top 50	0.4935	0.584	0.127
Top 75	0.7021	0.606	0.178
Top 100	0.8621	0.675	0.221
140 Clusters
Top 50	0.4643	0.615	0.141
Top 75	0.6936	0.648	0.236
Top 100	0.8422	0.718	0.297
150 Clusters
Top 50	0.4191	0.647	0.164
Top 75	0.6757	0.754	0.275
Top 100	0.8373	0.807	0.320
160 Clusters
Top 50	0.3972	0.644	0.145
Top 75	0.6077	0.660	0.286
Top 100	0.8084	0.689	0.356
170 Clusters
Top 50	0.2851	0.625	0.116
Top 75	0.3452	0.681	0.182
Top 100	0.4964	0.694	0.275

Table 5

Correlation coefficient between SRS Span score and respective similarity scores

Correlation	Word2Vec	WordNet
SRS Span score (120)	0.96799	0.94819
SRS Span score (130)	0.93384	0.98883
SRS Span score (140)	0.94695	0.99997
SRS Span score (150)	0.99751	0.99898
SRS Span score (160)	0.98401	0.97360
SRS Span score (170)	0.83130	0.98932

Next we compare SRS Span with other unsupervised keyword extraction techniques, namely, TF-IDF, RAKE and Textrank. Metrics such as Precision, Recall and F1-score, along with the Word2Vec and WordNet similarity scores of the Top 100 keywords with respect to the manually extracted keywords, are used for comparison. F1-score is defined as: $\text{F1-score} = 2 * \frac{\text{Precision} * \text{Recall}}{\text{Precision} + \text{Recall}}$

With respect to our experiments, Precision is the fraction of the extracted keywords that are correct, while Recall is the fraction of the manually assigned keywords that are present in the list of keywords obtained by the algorithm.
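A minimal sketch of these three metrics is given below. It assumes that an extracted keyword counts as correct only when it matches a manually assigned keyword exactly as a string, which is an assumption made here for illustration; the function name and the toy word lists are likewise hypothetical.

```python
def precision_recall_f1(extracted, reference):
    """Precision, Recall and F1-score of extracted keywords against a reference set.

    Assumes exact string matching between extracted and reference keywords.
    """
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical toy keyword lists.
extracted = ["rain", "flood", "council", "match"]
reference = ["rain", "council", "election", "budget", "drought"]

p, r, f1 = precision_recall_f1(extracted, reference)
print(p, r, round(f1, 3))  # 0.5 0.4 0.444
```

Since the number of reference keywords is typically larger than the number of extracted keywords, Recall can remain low even when Precision is moderate.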

In Table 6 the efficiency of the proposed method is compared with the existing techniques: TF-IDF, RAKE and Textrank. The SRS Span model with 150 clusters is considered for comparison. As SRS Span extracts semantically relevant keywords due to clustering, its similarity scores are higher than those of the existing techniques. To check consistency, the algorithm is run on a News Category Dataset [22] for five different domains, namely, Politics, Business, Crime, Entertainment and Sports. Table 7 presents the scores when the extracted keywords are compared with the manually extracted words for this dataset.

Table 6

Comparison with other unsupervised baseline techniques

Methods	Word2Vec	WordNet	Precision	Recall	F1-score
TF-IDF	0.689	0.258	0.21	0.052	0.083
RAKE	0.774	0.302	0.27	0.068	0.108
Textrank	0.586	0.271	0.19	0.048	0.076
SRS Span	0.807	0.320	0.26	0.058	0.094

Table 7

Experimental results across various domains of headlines

Keywords	SRS Span Score	Word2Vec similarity	WordNet similarity
Politics (4850 headlines)
Top 50	0.3463	0.431	0.131
Top 100	0.7692	0.695	0.266
Business (5900 headlines)
Top 50	0.4437	0.415	0.124
Top 100	0.7774	0.718	0.234
Crime (3405 headlines)
Top 50	0.3568	0.417	0.147
Top 100	0.7977	0.683	0.280
Entertainment (5100 headlines)
Top 50	0.4384	0.395	0.167
Top 100	0.7712	0.756	0.271
Sports (4884 headlines)
Top 50	0.3532	0.458	0.173
Top 100	0.8035	0.793	0.249

The cluster size for experiments in Table 7 also has been domains. The values indicate that the technique gives consistent scores across various domains. This demonstrates the efficacy of the algorithm across different domains. The values in bold indicate the highest score achieved among all the experiments conducted across domains.

5 Conclusion and future work

In this paper we present a novel concept of SRS Span based keyword extraction. It is applied to the problem of keyword extraction from text inputs, for which we have chosen news headlines from the Australian Broadcasting Corporation. The results have been found to be very encouraging across different domains of news headlines.

Unlike supervised techniques, which require a training corpus, this approach is unsupervised. To its merit, the proposed algorithm draws features from the text on its own, owing to the use of word embeddings. Additionally, the choice of the parameter value u gives the user the flexibility to apply this technique according to the problem requirements.

At present, we have conducted experiments with news headlines. In future we plan to extend the application to other types of short texts, such as tweets, and also to longer forms of documents, such as news articles. As part of further research, a comparison with other word embeddings would help gain more insight into this technique. Currently we have used K-Means clustering, which is a hard clustering method. In future, we want to extend it using Fuzzy C-Means Clustering and lexical chains for improved performance.

References

1. Pawlak, Rough sets and fuzzy sets, Fuzzy Sets and Systems 17(1) (1985), 99–102.

2. Feng, Liu, Leoreanu-Fotea and Jun Y.B., Soft sets and soft rough sets, Information Sciences 181(6) (2011), 1125–1137.

3. Salton and McGill M.J., Introduction to modern information retrieval. New York: McGraw-Hill. 1983.

4. Mihalcea and Tarau, Textrank: Bringing order into text. In Proceedings of the 2004 conference on empirical methods in natural language processing (2004), pp. 404–411.

5. Rose, Engel, Cramer and Cowley, Automatic keyword extraction from individual documents, Text mining: applications and theory 1 (2010), 1–20.

6. Mallick, et al., Graph-based text summarization using modified TextRank, Soft computing in data analytics. Springer, Singapore, (2019), 137–146.

7. Son and Shin, Music lyrics summarization method using textrank algorithm, Journal of Korea Multimedia Society 21(1) (2018), 45–50.

8. Balcerzak, Jaworski and Wierzbicki, Application of TextRank algorithm for credibility assessment, In 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) 1 (2014), pp. 451–454. IEEE.

9. Deguchi and Yamaguchi, Argument component classification by relation identification by neural network and TextRank. In Proceedings of the 6th Workshop on Argument Mining, (2019), pp. 83–91.

10. Petasis and Karkaletsis, Identifying argument components through textrank. In Proceedings of the Third Workshop on Argument Mining (ArgMining2016) (2016), pp. 94–102.

11. Indriani, Penerapan Metode Rough Sets Dalam Menentukan Pembelian Smartphone Android Oleh Konsumen, JTIK (Jurnal Teknik Informatika Kaputama) 2(1) (2018), 85–92.

12. Luo, Li, Chen, Fujita and Yi, Incremental rough sets approach for hierarchical multicriteria classification, Information Sciences 429 (2018), 72–87.

13. Selvalakshmi and Subramaniam, Intelligent ontology based semantic information retrieval using feature selection and classification, Cluster Computing 22(5) (2019), 12871–12881.

14. Swiniarski R.W. and Skowron, Rough sets methods in feature selection and recognition, Pattern Recognition Letters 24(6) (2003), 833–849.

15. Yadav and Chatterjee, Rough sets based span and its application to extractive text summarization, Journal of Intelligent & Fuzzy Systems 37(3) (2019), 4299–4309.

16. Molodtsov, Soft sets theory - first results, Computers & Mathematics with Applications 37(4-5) (1999), 19–31.

17. Kulkarni, A Million News Headlines, Kaggle, 2020. https://www.kaggle.com/therohk/million-headlines

18. Bird, NLTK: the natural language toolkit. Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions (2006), 69–72.

19. Al Omran F.N.A. and Treude, Choosing an NLP library for analyzing software documentation: a systematic literature review and a series of experiments, 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR). IEEE, (2017), 187–197.

20. Church K.W., Word2Vec, Natural Language Engineering 23(1) (2017), 155–162.

21. Meng, Huang and Gu, A review of semantic similarity measures in wordnet, International Journal of Hybrid Information Technology 6(1) (2013), 1–12.

22. Misra, News Category Dataset, 06 2018. https://www.kaggle.com/rmisra/news-category-dataset