Keyword extraction from social media via AHP

Abstract

Keyword extraction has been studied widely in natural-language processing. In promoting business-enterprise goods and services, however, a major challenge remains in extracting keywords effectively and efficiently from user-generated social-media data, for which traditional, language-dependent and supervised keyword-extraction techniques are typically employed. This study contributes the keyword extraction analytic hierarchy process (KEAHP), a language-independent and unsupervised keyword-extraction technique. Using four attributes of user-generated data, KEAHP identifies keywords from word co-occurrence in linguistic networks, based on a multiple-attribute decision-making approach. The proposed technique has been validated on a publicly available standard dataset, and the experimental results show the effectiveness and efficiency of the KEAHP algorithm. Despite its limitations, the study contends that KEAHP can drastically improve performance in promoting business-enterprise goods and services. Implications for future research and practice in keyword-extraction techniques are also discussed.

Keywords

Analytic hierarchy process, social media data, keyword extraction, KEAHP, word co-occurrence network, event detection

Waheed Yousuf Ramay is a Ph.D. candidate at the School of Computer and Communication, University of Science and Technology Beijing, China. Prior to the doctoral degree, Ramay worked as a Lecturer in the Department of Computer Science, COMSATS University. Ramay's research interests are sentiment analysis of social media data and the semantic web.

Xu Cheng Yin is a Professor at the School of Computer and Communication, University of Science and Technology Beijing, China. Professor Xu has written books and articles on artificial intelligence and machine learning and has led multiple projects on text mining and pattern recognition. His current interests are in robotics that assists people.

Inam Illahi is a Ph.D. candidate at the School of Computer Science, Beijing Institute of Technology, China. Prior to the doctoral degree, Inam worked as a Lecturer in the Department of Computer Science, COMSATS University, Pakistan. Inam's research interests are sentiment analysis of social media data and software engineering.

1 Introduction

In recent years, many traditional keyword extraction techniques have been studied to extract the theme of each document. The extracted keywords are helpful in identifying influential segments, framing the semantic web and other applications of natural language processing. This area of research is related to the domain of topic detection and tracking, which was proposed by James Allan in 2003. Various applications use keyword extraction techniques [2] for web search, report generation and cataloguing. Keyword extraction aims to identify the most useful terms and involves many sub-processes. A document may be supplied in Word, HTML or PDF format. Initially, the documents are pre-processed to remove redundant and unimportant information. The data are then processed through one of several keyword extraction approaches, including the statistical, linguistic, machine learning, network-based and topic modelling approaches.

In the statistical approach, term frequency-inverse document frequency (TF-IDF) is the most widely used technique for keyword extraction. Recently, many new techniques [3] have been developed for statistical keyword extraction; moreover, the PageRank and LexRank algorithms have been observed to perform better than TF-IDF. In the linguistic approach, automatically identifying keywords is similar to measuring semantic resemblance [1]. In the machine learning approach, keyword extraction is treated as a classification task. Different dictionaries, including WordNet, SentiNet and ConceptNet, are used in keyword extraction techniques. In network-based algorithms, the nature and semantics of word co-occurrence networks are studied to identify important terms: nodes represent words and edges carry co-occurrence frequencies. Many useful insights have been obtained from these algorithms for identifying influential segments and keywords. Other existing linguistic-network-based keyword extraction techniques include LexRank [3], SingleRank [4] and ExpandRank [4]. Topic modelling techniques were popularized by David M. Blei in 2003. The author introduced the Latent Dirichlet Allocation technique, which identifies which document is related to which topic and to what extent. This has been further improved by word co-occurrence network structures [5].

Although keyword extraction is an important area of research that has received a lot of attention from academic researchers and practitioners, the state of the art in keyword extraction still lags behind that of many other core NLP tasks [6]. Applications of keyword extraction include topic modelling and trend/event detection. In this research work, a new technique is proposed to extract keywords from homogeneous information networks using a multiple attribute decision making (MADM) technique. The key parameters used for keyword extraction are the network parameters of the word co-occurrence network. This could be useful for identifying important information.

The motivation behind this article is to identify the most influential nodes, which indicate important terms in the corpus. Traditional natural language processing techniques are supervised and language specific. However, there are circumstances in which the text cannot be handled in a supervised manner because no model can be trained on an existing dataset, and the text may also be multilingual. In such cases, structure-based analysis is required to identify the keywords. This technique therefore applies structure-based analysis with multiple attributes, and the keyword extraction decision is made on the basis of those attributes. Hence, the AHP approach is used for multiple attribute decision making (MADM).

Social media constitute one of the largest communication platforms, on which people voluntarily express and share their thoughts. The growth of multimedia data on social networking websites gives multiple clues about human predisposition. These user-generated data are present on the internet in different modalities, including text, images, audio, video and gesture. The purpose of this study is to extract keywords via multiple data attributes for event detection and analysis, covering customer data for business vendors, weather data, temporal data, geo-location data, traffic data, weekday data and so on, in order to contribute to the development of human society.

This paper is organized as follows. Section 2 describes related work and the evolution of different approaches. Section 3 proposes a technique for keyword extraction from a homogeneous information network. Section 4 discusses the experimental results. Finally, Section 5 concludes and outlines future scope.

2 Related work

Keyword extraction is one of the most widely studied research fields. Keywords carry important information about the topics and trends discussed in a corpus. Textual information contains several named entities, newly framed topics and commonly used slang as topics of discussion. The field of topic detection and tracking (TDT) was introduced by James Allan in 2003 [7]. Keyword extraction techniques are a major source of information that provides significant input to TDT. Many application domains, including text summarization and event detection [8], use keyword extraction as a source of information.

Although much research has been done in this area, there is still a gap in identifying keywords in a language-independent and unsupervised setting. In 2004, the authors of [3] proposed a graph-of-words model by introducing TextRank for identifying keywords. At the same time, LexRank was proposed for text summarization [3, 9]. This was improved further by SingleRank and ExpandRank [4]. In contrast, the word co-occurrence network was studied for keyword extraction with the introduction of DegExt [10], which uses degree centrality to identify important words represented as nodes in linguistic networks.

Word co-occurrence networks have since been used for identifying keywords in textual corpora, because a language-independent and unsupervised approach is required to map important keywords given the variety of languages and information present. In 2014, the Twitter Keyword Graph was proposed for identifying keywords [11] in social media data. Its authors used network science parameters to assess the importance of a word represented as a node in the linguistic network, and studied various types of word co-occurrence networks and different centralities. This was further improved by selectivity-based keyword extraction, which uses the strength [12] of the node, a parameter evaluated on the basis of node degree and the corresponding edge weight. The idea of this paper, however, is that an even better technique could be obtained by using multiple key parameters [13]. Thus, in this research work, multiple key parameters are used, and the MADM optimization technique known as the analytic hierarchy process (AHP) is applied to map important nodes in a homogeneous information network. This idea was inspired by the identification of influential segments from word co-occurrence networks using AHP [14]. The proposed technique is discussed in detail in Section 3.

3 Proposed methodology

The proposed technique extracts keywords from a given input text. The text is first pre-processed: stop words are removed, and the text is further filtered by removing all punctuation, URL links and reserved words such as RT (used for retweets). All words are converted to lowercase so that lowercase, camel-case and upper-case variants of a word map to the same node. The pre-processed text is then fed into a word co-occurrence network. The network is formed by tokenizing the words in each sentence and treating the sequence of words in the sentence as a path; the paths are then mapped into the network. Each time two words co-occur, the frequency on the corresponding edge is increased. Finally, the key parameters are evaluated.
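The pre-processing and network-construction steps can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the stop-word list, the function names `preprocess` and `build_cooccurrence`, and the sample tweets are hypothetical, and the network is assumed to be undirected with edges between adjacent words only.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption); "rt" is the
# retweet marker mentioned in the text.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "rt"}

def preprocess(text):
    """Lowercase, strip URLs and punctuation, then drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URL links
    text = re.sub(r"[^\w\s]", " ", text)        # remove punctuation
    return [w for w in text.split() if w not in STOP_WORDS]

def build_cooccurrence(sentences):
    """Treat each sentence as a path; adjacent words share an undirected edge."""
    edges = Counter()
    for sentence in sentences:
        words = preprocess(sentence)
        for left, right in zip(words, words[1:]):
            if left != right:                   # skip self-loops
                edges[frozenset((left, right))] += 1
    return edges

# Hypothetical sample input.
tweets = [
    "RT Atlantis lands safely http://example.com/x",
    "Shuttle Atlantis lands!",
]
network = build_cooccurrence(tweets)
for pair, weight in sorted(network.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), weight)
```

Storing each edge as a `frozenset` keeps the pair unordered, so repeated co-occurrences of the same two words accumulate on one edge regardless of word order.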

The keyword extraction technique using the analytic hierarchy process (KEAHP) is built on the word co-occurrence network and its key parameters. The analytic hierarchy process (AHP) is a multiple attribute decision-making approach used to rank the alternatives chosen for analysis. Here, the alternatives are the candidate words, which need to be ranked as keywords. AHP is useful for evaluating both beneficiary attributes (those for which a higher value is more desirable) and non-beneficiary attributes (those for which a lower value is more desirable). Because it handles multiple attributes when ranking alternatives, AHP is used for keyword extraction.

The key parameters used in this research work are drawn from the literature of this field, in which different authors have used different key parameters. The chosen attributes are degree, weighted degree, clustering coefficient and betweenness centrality. These attributes play a significant role in identifying appropriate keywords, which can be used in applications including text summarization, topic detection and event detection. Degree is the number of links that a node makes with other nodes. It measures term frequency. This is a significant parameter and is measured as $\text{Degree} = \sum (\text{outgoing links} + \text{incoming links})$ (1)

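Weighted degree measures the number of links connecting to the node, weighted by the frequency of co-occurrence. It is measured as $\text{Weighted degree} = \sum_{j = 0}^{n} \text{link}_{ij} \cdot \text{weight}_{ij}$ (2)

where j ranges over every node to which node i is linked. The clustering coefficient measures the degree of coupling between the neighbours of a node. With k the number of neighbours of the node and n the number of links among those neighbours (the usual definition, distinct from the summation limit in Eq. (2)), it is measured as $\text{Clustering coefficient} = \frac{2 n}{k (k - 1)}$ (3)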

A node with a high clustering coefficient is less important, as it could be a common or more frequent word, whereas a word with a low clustering coefficient is more important. Betweenness centrality measures how central a node is in the network, that is, how often it lies on the shortest paths between other nodes. An overview of the KEAHP architecture is shown in Fig. 1. The corpus is provided, and a word co-occurrence network is built from it. The different parameters are evaluated, and the nodes are ranked using AHP. The highly ranked nodes are considered keywords.
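As a rough illustration of Eqs. (1) to (3), the sketch below computes degree, weighted degree and clustering coefficient on a small hand-made network. The graph, its weights and the function names are hypothetical, and the network is assumed to be undirected. Betweenness centrality is omitted for brevity; a graph library such as NetworkX offers it as `betweenness_centrality`.

```python
from itertools import combinations

# Hypothetical undirected weighted network:
# node -> {neighbour: co-occurrence count}.
graph = {
    "atlantis": {"shuttle": 3, "lands": 2, "nasa": 1},
    "shuttle": {"atlantis": 3, "nasa": 2},
    "lands": {"atlantis": 2},
    "nasa": {"atlantis": 1, "shuttle": 2},
}

def degree(g, node):
    """Number of distinct neighbours of the node."""
    return len(g[node])

def weighted_degree(g, node):
    """Sum of co-occurrence weights over all links of the node."""
    return sum(g[node].values())

def clustering_coefficient(g, node):
    """2n / (k(k-1)), with n links among the k neighbours of the node."""
    neighbours = list(g[node])
    k = len(neighbours)
    if k < 2:
        return 0.0  # undefined for fewer than two neighbours; 0 by convention
    n = sum(1 for a, b in combinations(neighbours, 2) if b in g[a])
    return 2 * n / (k * (k - 1))

for word in graph:
    print(word, degree(graph, word), weighted_degree(graph, word),
          round(clustering_coefficient(graph, word), 3))
```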

Fig. 1

Framework for keyword extraction.

The word co-occurrence network is created using each sentence as a path. Each word in the sentence is marked as a node, and two co-occurring words are joined by an edge. The parameters used as attributes for identifying important keywords in the homogeneous information network are: term frequency, taken as the weighted degree (WD); the degree of the node (Deg), which, as noted in [6, 15] for word co-occurrence networks, signifies the number of different neighbours of the node; the clustering coefficient (CC), since networks formed from word co-occurrence have been observed to follow the assortativity law; and betweenness centrality (BC), which indicates the most important words connecting other words. The relative ranking matrix used for the multiple attributes is shown in Table 1.

Table 1

Relative importance matrix

A	WD	CC	Deg	BC
WD	1.00	5.00	3.00	5.00
CC	0.20	1.00	0.33	1.00
Deg	0.33	3.00	1.00	3.00
BC	0.20	1.00	0.33	1.00

Weighted degree and betweenness centrality are treated as beneficiary attributes, whereas clustering coefficient and degree are treated as non-beneficiary attributes. This is because an important word must be connected to different nodes that belong to different sets of statements, so the clustering coefficient is lower for an important node. Similarly, although term frequency, captured by the weighted degree of the node, is important, the degree of the node only indicates its number of neighbours. Words such as ‘has’ and ‘have’ can connect with all types of words and therefore have a high degree but a low co-occurrence frequency weight. Thus, degree is a non-beneficiary attribute. The values of the random index are taken from Table 2.
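To illustrate how beneficiary and non-beneficiary attributes might enter the ranking, the sketch below scores candidate words with a weighted sum of min-max normalised attributes. Several parts are assumptions: the text does not spell out the aggregation step, the weights are approximate priorities obtained from Table 1 by column normalisation, and the candidate values are invented for illustration.

```python
# Approximate attribute weights (assumption: Table 1, column normalisation).
WEIGHTS = {"WD": 0.555, "CC": 0.097, "Deg": 0.251, "BC": 0.097}
# True = beneficiary (higher is better), False = non-beneficiary.
BENEFICIARY = {"WD": True, "CC": False, "Deg": False, "BC": True}

def normalise(values, beneficiary):
    """Min-max scale to [0, 1]; invert non-beneficiary attributes."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]  # attribute cannot discriminate
    if beneficiary:
        return [(v - lo) / (hi - lo) for v in values]
    return [(hi - v) / (hi - lo) for v in values]

def rank_words(table):
    """table: word -> {attribute: raw value}. Words by descending score."""
    words = list(table)
    scores = dict.fromkeys(words, 0.0)
    for attr, weight in WEIGHTS.items():
        column = normalise([table[w][attr] for w in words], BENEFICIARY[attr])
        for word, value in zip(words, column):
            scores[word] += weight * value
    return sorted(words, key=scores.get, reverse=True)

# Hypothetical attribute values for three candidate words.
candidates = {
    "atlantis": {"WD": 40, "CC": 0.10, "Deg": 12, "BC": 0.30},
    "have":     {"WD": 15, "CC": 0.60, "Deg": 30, "BC": 0.05},
    "cnnbrk":   {"WD": 25, "CC": 0.45, "Deg": 20, "BC": 0.10},
}
print(rank_words(candidates))
```

In this toy input, a function word such as "have" has a high degree and clustering coefficient but a low weighted degree, so inverting the non-beneficiary attributes pushes it to the bottom of the ranking.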

Table 2

Values of the Random Index (RI) for small problems

m	2	3	4	5	6	7	8	9	10
RI	0	0.58	0.90	1.12	1.24	1.32	1.41	1.45	1.51

Since the relative importance matrix has four attributes (m = 4), the random index is taken as 0.90. The consistency ratio for the relative importance matrix has been computed, and an uncertainty of ±10% has been observed. The experimental results are discussed in Section 4.
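The consistency check can be reproduced approximately from Tables 1 and 2. The sketch below uses the common column-normalisation approximation of the principal eigenvector, which is an assumption, since the text does not state how the priorities were derived; the function names are hypothetical.

```python
# Pairwise comparison matrix from Table 1 (row/column order: WD, CC, Deg, BC).
MATRIX = [
    [1.00, 5.00, 3.00, 5.00],
    [0.20, 1.00, 0.33, 1.00],
    [0.33, 3.00, 1.00, 3.00],
    [0.20, 1.00, 0.33, 1.00],
]
RANDOM_INDEX = 0.90  # Table 2, m = 4

def priority_weights(matrix):
    """Approximate the principal eigenvector by averaging normalised columns."""
    m = len(matrix)
    col_sums = [sum(row[j] for row in matrix) for j in range(m)]
    return [sum(row[j] / col_sums[j] for j in range(m)) / m for row in matrix]

def consistency_ratio(matrix, weights, random_index):
    """CR = CI / RI, with CI = (lambda_max - m) / (m - 1)."""
    m = len(matrix)
    weighted = [sum(a * w for a, w in zip(row, weights)) for row in matrix]
    lambda_max = sum(v / w for v, w in zip(weighted, weights)) / m
    ci = (lambda_max - m) / (m - 1)
    return ci / random_index

weights = priority_weights(MATRIX)
cr = consistency_ratio(MATRIX, weights, RANDOM_INDEX)
print([round(w, 3) for w in weights], round(cr, 4))
```

Under this approximation, weighted degree receives the largest priority and the consistency ratio falls below the conventional 0.1 threshold for an acceptable pairwise matrix.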

4 Experimental results

The dataset used for the experiments is First Story Detection (FSD) [1]. It contains millions of tweets, which are labelled with their topics. In total, 27 topics have been detected in this dataset, and for every given set of tweet IDs, the data are retrieved using the Tweepy API. The tweets are labelled by topic, and the ground truth is given in the dataset. The results are evaluated on the basis of the recall measure, which is given as

$\text{Recall} = \frac{\text{Number of correct keywords obtained}}{\text{Number of correct keywords}}$

Recall represents the proportion of correct keywords obtained from the resulting phrases and is thus valuable for measuring and comparing the performance of all the techniques. The technique with the highest recall rate and the lowest redundancy rate is considered the best. Redundancy is defined as the number of correct keywords repeated in the resulting set, as given in the equation below. $\text{Redundancy} = \frac{\text{Number of repeated correct keywords}}{\text{Number of keywords } (\text{Resulting phrases} \cap \text{Ground truth topic})}$ (4)
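The two measures can be sketched as below. The reading of redundancy (extra occurrences of correct keywords divided by the number of distinct correct keywords obtained) is one interpretation of Eq. (4) and should be treated as an assumption, as should the function names and the sample word lists.

```python
from collections import Counter

def recall(extracted, ground_truth):
    """Correct keywords obtained divided by all ground-truth keywords."""
    truth = set(ground_truth)
    if not truth:
        return 0.0
    return len(set(extracted) & truth) / len(truth)

def redundancy(extracted, ground_truth):
    """Repeated correct keywords divided by distinct correct keywords obtained."""
    truth = set(ground_truth)
    counts = Counter(w for w in extracted if w in truth)
    if not counts:
        return 0.0
    repeated = sum(c - 1 for c in counts.values())  # occurrences beyond the first
    return repeated / len(counts)

# Hypothetical ground-truth keywords and extracted list for one topic.
truth_words = ["space", "shuttle", "atlantis", "lands",
               "nasa", "program", "ending", "safely"]
extracted_words = ["atlantis", "shuttle", "space", "boom",
                   "nasa", "nasa", "lands"]
print(recall(extracted_words, truth_words),
      redundancy(extracted_words, truth_words))
```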

The higher recall value for KEAHP indicates that the proposed technique performs better than the existing techniques. For each topic, the recall value is 20–25% higher than that of the existing techniques. The results were obtained by comparing the extracted keywords with those of the given topic. The experimental results for one instance are shown in Table 3. Keywords such as ‘cnnbrk’ and ‘bbcnews’ are removed by the proposed technique. These words appeared in the results of the existing techniques because of their high user-mention frequency in tweets; however, they do not indicate anything important about the topic. For each technique, namely selectivity-based keyword extraction (SBKE), Twitter Keyword Graph (TKG) and KEAHP, 30 words were extracted. As the number of extracted keywords increases, KEAHP gives an improved recall rate.

Table 3

Comparison of different approaches

Topic 3: Space shuttle Atlantis lands safely, ending NASA’s space shuttle program
Keyword Extraction Techniques	Keywords Obtained	Recall
SBKE	Spaceshuttle, remarkable, via, iweatheronline, marks, caps, cnnbrk, atlantis, nbc, hundreds	0.375
	put, abcnews, foxnewscorn, you’ve, that’s, bbcnews, completed, made, marking, savor	0.375
	boom, arrives, iwo, lab, moment, nasakennedy, orbit, goodbye, back, four	0.375
TKG	Atlantis, completed, spaceshuttle, marks, cnnbrk, safe, come, hear, amazing, thank	0.375
	breakingnews, threedecade, closing, savor, moment, nasakennedy, boom, remarkable, you’ve, made	0.5
	goodbye, caps, earth, nasa, around, nbc, miles, abcnews, landed, arrives	0.625
KEAHP	Atlantis, shuttle, space, year, boom, nasas, nasa, center, kennedy, lands	0.625
	Final, journey, earth, miles, landed, nasa’s, orbits, around, last, florida	0.875
	Altantis, completed, home, programme, bringing, mission, ending, touching, goodbye, first	0.875

It has been observed that the proposed technique gives better results for 10 sets of randomly selected topics from the given dataset. Further, because the data are uncertain and ill-formed, the corpus may contain noise. The appearance of noisy text in the results has been minimized by using multiple attributes. This is because well-formed words have a higher frequency than the scattered variant forms that denote the same word. Also, a higher weight over the edges indicates a lower degree and thus an important keyword.

5 Conclusion

In this research paper, efforts have been made to identify keywords from social media data, specifically from uncertain and ill-formed user-generated data in which slang and repetitive words are used in shorthand notation.

Based on bursts of outbreak information from social media, events such as the promotion of a specific product, natural disasters and the spread of contagious diseases can be controlled. This can be a path-breaking input for instant emergency management resources. Four user-generated data attributes were selected to study the latent patterns for keyword extraction from social media signals.

The proposed technique has been validated on a publicly available standard dataset, and the experimental results show the effectiveness and efficiency of the KEAHP algorithm. Despite its limitations, the study contends that KEAHP can drastically improve performance in promoting business-enterprise goods and services. Implications for future research and practice in keyword-extraction techniques are also discussed.

To this end, key parameters have been chosen to obtain an optimized ranking for each node and thereby extract keywords. It has been observed that the recall rate of the proposed technique KEAHP is 20–25% higher than that of existing word co-occurrence network-based techniques. By using multiple attributes, the proposed technique outperforms the existing techniques on social media data.

In future, we intend to work on real-time streaming data and on improving uncertain user-generated data using text normalization. Keyword extraction will facilitate indexing and browsing and significantly improve the quality of search engines. Keyword extraction is also widely used in text refinement.

References

1. Petrović, Osborne and Lavrenko, Streaming first story detection with application to Twitter, NAACL HLT 2010 - Hum. Lang. Technol. 2010 Annu. Conf. North Am. Chapter Assoc. Comput. Linguist. Proc. Main Conf, 2010, pp. 181–189. https://www-scopus-com.web.bisu.edu.cn/inward/record.url?eid=2-s2.0-80053272732&partnerID=tZOtx3y1

2. Fung G.P.C., Yu J.X., Yu P.S. and Lu, Parameter free bursty events detection in text streams, VLDB ’05 Proc 31st Int Conf Very Large Data Bases 1 (2005), 181–192. doi:10.1.1.60.2671

3. Mihalcea and Tarau, TextRank: Bringing order into text, Proc EMNLP (2004), 404–411.

4. Wan and Xiao, Single document keyphrase extraction using neighborhood knowledge, Proc 23rd Natl Conf Artif Intell 2 (2008), 855–860. doi:10.1145/1740592.1740596

5. Beliga, Meštrović and Martinčić-Ipčić S., An overview of graph-based keyword extraction methods and approaches, JIOS 39 (2015).

6. Feng, Zhang Y.Q. and Zhang, Improving the co-word analysis method based on semantic distance, Scientometrics 111 (2017), 1521–1531. doi:10.1007/s11192-017-2286-1

7. Allan, Topic detection and tracking: Event-based information organization, 2002. http://portal.acm.org/citation.cfm?id=772260

8. Garg and Kumar, Review on event detection techniques in social multimedia, Online Inf Rev 40 (2016), 347–361. doi:10.1108/OIR-08-2015-0281

9. Erkan and Radev D.R., LexRank: Graph-based lexical centrality as salience in text summarization, J Artif Intell Res 22 (2004), 457–479. doi:10.1613/jair.1523

10. Litvak, Last and Kandel, DegExt: A language-independent keyphrase extractor, J Ambient Intell Humaniz Comput 4 (2013), 377–387. doi:10.1007/s12652-012-0109-z

11. Abilhoa W.D. and De Castro L.N., A keyword extraction method from twitter messages represented as graphs, Appl Math Comput 240 (2014), 308–325. doi:10.1016/j.amc.2014.04.090

12. Zhang, Wang, Cao, Wang and Xu, A hybrid term-term relations analysis approach for topic detection, Knowledge-Based Syst 93 (2016), 109–120. doi:10.1016/j.knosys.2015.11.006

13. Lahiri, Choudhury S.R. and Caragea, Keyword and keyphrase extraction using centrality measures on collocation networks, (2014). http://arxiv.org/abs/1401.6571

14. Garg and Kumar, Identifying influential segments from word co-occurrence networks using AHP, Cogn Syst Res 47 (2018), 28–41. doi:10.1016/j.cogsys.2017.07.003

15. Akimushkin, Amancio D.R. and Oliveira O.N., Text authorship identified using the dynamics of word co-occurrence networks, PLoS One 12 (2017). doi:10.1371/journal.pone.0170527