A distributional semantics-based information retrieval framework for online social networks

Abstract

Online social networks are considered to be one of the most disruptive platforms where people communicate with each other on any topic ranging from funny cat videos to cancer support. The widespread diffusion of mobile platforms such as smart-phones causes the number of messages shared in such platforms to grow heavily, thus more intelligent and scalable algorithms are needed for efficient extraction of useful information. This paper proposes a method for retrieving relevant information from social network messages using a distributional semantics-based framework powered by topic modeling. The proposed framework combines the Latent Dirichlet Allocation and distributional representation of phrases (Phrase2Vec) for effective information retrieval from online social networks. Extensive and systematic experiments on messages collected from Twitter (tweets) show this approach outperforms some state-of-the-art approaches in terms of precision and accuracy and better information retrieval is possible using the proposed method.

Keywords

Social networks distributional semantics information retrieval topic modeling Phrase2Vec latent dirichlet allocation

1. Introduction

Recent years witnessed an exponential growth in the number of social network users due to the widespread popularity of mobile platforms and applications. This caused a huge volume of messages to be generated while these users communicate with each other [33]. People create and share pieces of text content that may be about some disasters such as earthquakes [20], or community view on elections [21], or new governmental agendas [25], or even opinions on a newly launched product in the market [22]. Thus retrieving relevant and useful information from those dynamically generated unstructured data is crucial as those messages may contain useful insights for any interested stakeholders. Information shared by social network users are archived at the service provider’s end and there may be situations where the users or other interested parties want to retrieve relevant messages from these enormous archives of data. So the need for intelligent and text understanding search, ranking, and retrieval algorithms is ever-increasing and crucial. The state-of-the-art algorithms that operate on a set of static documents may not always be a good choice in this situation due to the dynamic nature of information generated in social networks [23].

Text mining and information retrieval research communities devised many methods and algorithms for intelligent information retrieval from large unstructured text archives. These algorithms exhibited varying levels of accuracy when dealing with static documents but the performance was degraded on dynamic data such as social network messages [29]. Due to this dynamic nature, social networking platforms require more sophisticated algorithms to deal with the complexities in search and retrieval of relevant information. This is because social network platforms often use very short texts, non-standard language, and rich associated metadata. Even though a vast array of research on these dimensions has been reported to tackle this situation, there are still a lot of avenues where advanced natural language processing and information retrieval techniques can be used to design text understanding algorithms that can operate on such platforms.

Topic modeling is a suite of text understanding algorithms that can leverage hidden themes from large archives of unstructured text data [36]. Several variations of such algorithms with varying degrees of accuracy have been devised in the past with only difference in the assumption they made in the process of generating themes. Topic models were further extended for mining topics from social networks such as Twitter and are used by social network research communities in detecting change in topics over a while [34, 32, 24, 15]. On the other hand, neural network based approaches such as word2vec – distributed representation of words in a vector space [27], and phrase2vec – distributed representation of phrases in a vector space [28] etc. gained significant attention that considered the ‘context’ in which words or phrases have been used.

This work attempts to make use of topic modeling [36] and distributed representation of phrases (phrase2vec) [28] for scoring social network messages and shows the application on a phrase query-based information retrieval task. The main research objectives of this paper are (1) Introduce the task of topic modeling guided scoring of unstructured text and its applications in online social networks, (2) Propose a framework that combines topic modeling and distributional representation of phrases for better retrieval of relevant information from text messages, (3) Verify experimentally the effectiveness of the proposed framework in extracting relevant and useful information from social network messages, (4) Compare proposed approach with state-of-the-art models to establish the fitness of our method in scoring social network messages and phrase query-based information retrieval. The main contributions of this paper are summarized as follows:

1.
Proposes a framework for retrieving relevant and useful content from a large collection of social network messages.
2.
Shows that a hybrid approach combining topic modeling and distributed representation of phrases can better retrieve and rank relevant information.
3.
Rigorous experimental comparison of the proposed method with other state-of-the-art approaches establishes the effectiveness of the proposed method.

The major research questions addressed in this manuscript are:

•
Can we use the dynamics of the topics generated by the topic modeling and phrase embedding generated by phrase2vec for better ranking and retrieving messages?
•
Can the proposed approach be used to address the drawbacks of the existing approaches for information retrieval from social networks which are dynamic in nature.

Organization: The rest of this paper is organized as follows: Section 2 briefly outlines state-of-the-art in information retrieval from online social networks. Section 3 discusses the proposed framework in detail and Section 4 describes the experimental setup used in this work. A detailed evaluation of results is given in Section 5 and in Section 6, the authors give conclusions and discuss future work.
2. Related work

This section details some of the prominent works which are related to our proposed work that attempts to develop methods for retrieving information from a large collection of social network data. Due to the widespread development of social network platforms, the amount of data or messages accumulated in these platforms grew exponentially that attracted social network researchers to devise new methods to retrieve, analyze, curate and store such messages. Very recently, Xu et al. [1] proposed a novel approach that integrate social annotations into topic models for personalized document retrieval [1]. This approach uses social tags of documents to capture user preferences and reconstructs candidate documents for a given query. Then these reconstructed documents are used for achieving better retrieval performance. Experimental results showed that the proposed approach significantly outperformed state-of-the-art approaches for personalized retrieval [1].

An approach that automatically finds traffic-related reports and extract useful information is proposed by Vallejos et al. [2]. The authors addresses the problem of informal language usage in social networks and the lack of geographic metadata to extract traffic-related reports. When compared with a popular traffic-specific social network, the proposed approach obtained promising results [2]. A system called AMiner, which deals with the search and mining of academic social network was recently reported by Wan et al. [3]. The objective of AMiner is to help researchers and scientists gain a deeper understanding of the large and heterogeneous networks formed by authors, papers, conferences, journals and organizations [3]. The authors have used a generative probabilistic model to simultaneously model the different entities while providing a topic-level expertise search. Marco Brambilla et al. proposed a method for extracting emerging knowledge from social media [9]. Their proposed method discovered emerging entities by extracting them from social content and ranks the candidates by using their distance from the centroid of seeds, where those seed entities are provided by the experts. Another approach for information retrieval from social media was reported by Oostdijk et al. that used a linguistics based approach [16]. The authors specifically extracted traffic-related information from Tweets and claimed an accuracy rate of 74%.

An approach for retrieving disaster events from cross-platform social media was introduced by Shen et al. [8] in which the authors proposed a method to retrieve data on an event based on a preliminary collection of event-specific hashtags. This method could collect distinct sets of hashtags even for similar simultaneous events [8]. Another approach for text information retrieval by integrating global and local textual information [17] was reported recently in the literature by Wang et al., in which the authors combine Latent Dirichlet Allocation topic modeling algorithm and word2vec and proposed an approach for document retrieval. They claimed that the proposed method significantly enhanced the accuracy of classification analysis. A novel approach that implemented real-time social media retrieval with spatial, temporal and social constraints was introduced very recently by Gao et al. [10]. This paper proposed an interval-at-a-time framework to provide a solution to social media retrieval with spatial, temporal and social constraints. This algorithm relies on inverted lists and showed better performance when compared with other state-of-the-art methods [10].

A method for identifying search keywords for finding relevant social media posts was reported recently in the literature [18] by Wang et al. The paper proposed a novel technique to help the user identify topical search keywords. Their experiment on tweets dataset showed the effectiveness of their proposed approach in identifying potential keywords. Tolosa et al. introduced the concept of an integrated cache, a static cache that stores both individual postings list and its bigram intersections [11]. The authors devised an algorithm to re-write queries using this cache and showed that there is a speed improvement over previous methods. Kim et al. presented a simulation-based approach for selective search under many different hardware configurations. They used large query logs that provided the effectiveness of the selective search approach and distribution of shards across machines [12]. Another recent notable work that examines indexing for repetitive collections is reported by Gagie et al. [13]. Their proposed work included mechanisms for effective compression, methods for top- $k$ retrieval and identification of a number of documents containing a given string. The authors also showed how to perform ranked conjunctive and ranked disjunctive search using tf-idfranking. Recently, approaches such as social network based contextual ranking [5], attributed social network embedding [6], Query personalization using social network information and collaborative filtering techniques [7] etc. for retrieving information from social network data. An approach for automatic predictions using LDA for learning through social networking services has been introduced by Troussas et al. [14]. The paper presented a prototype facebook application for learning using LDA which was capable of making predictions about the interests of students by collecting their preferences and characteristics. The work concluded that automatic predictions using LDA to students in social networks can be a powerful idea in personalizing instruction [14].

On conducting this systematic literature review, we found that there exist an array of drawbacks with the state-of-the-art approaches in information retrieval, specifically from online social networks. A vast majority of the existing algorithms operate on static set of documents and significantly fails when it comes to the dynamic data. Also, the approaches discussed above are used for information retrieval from social networks such as Twitter that depends on techniques such as expert provided seed entities, global and local textual information, phrase embeddings, inverted lists, integrated static cache, and distributed query processing which may not be feasible in all the contexts. On the other hand, our proposed method makes use of the dynamics of the topics (generated by LDA) discovered from social network messages, key-phrases (from extracted phrases using natural language processing), phrase embedding (phrase2vec representation) all lead to multiple ways of reaching a target document/message from the user given query phrase, using which we may better retrieve and rank the messages.

2.1 Background: Latent Dirichlet Allocation (LDA)

Inspired from previous topic models, Blei et al. introduced a new topic modeling algorithm known as Latent Dirichlet Allocation (LDA) [37]. This model assumes that a document contains multiple topics and such topics are extracted using a Dirichlet Prior process. In the following section, we will briefly describe the underlying principle of LDA [37]. Even though LDA works well on broad ranges of discrete datasets, the text is considered to be a typical example to which the model can be best applied. The process of generating a document with $n$ words by LDA can be described as follows [37]:

1. 1.
Choose the number of words, $n$ , according to Poisson Distribution;
2.
Choose the distribution over topics, $\theta$ , for this document by Dirichlet Distribution;

(a)
Choose a topic $T^{(i)}\sim$ Multinomial $(\theta)$ .
(b)
Choose a word $W^{(i)}$ from $P(W^{(i)}|T^{(i)},\beta)$ .

Thus the marginal distribution of the document can be obtained from the above process as:

$\displaystyle P(d)=\int_{\theta}\left(\prod_{i=1}^{n}\sum_{T^{(i)}}P(W^{(i)})|% T^{(i)},\beta)P(T^{(i)}|\theta)\right)P(\theta|\alpha){\rm d}\theta$ (1)

where, $P(\theta|\alpha)$ is derived by Dirichlet Distribution parameterized by $\alpha$ , and $P(W^{(i)})|T^{(i)},\beta)$ is the probability of $W^{(i)}$ under topic $T^{(i)}$ parameterized by $\beta$ . The parameter $\alpha$ can be viewed as a prior observation counting on the number of times each topic is sampled in a document, before we actually seen any word from that document. The parameter $\beta$ is a hyper-paramter determining the number of times words are sampled from a topic [37], before any word of the corpus is observed. At the end, the probability of the whole corpus $D$ can be derived by taking the product of all documents’ marginal probability as:

$\displaystyle P(D)=\prod_{i=1}^{M}P(d_{i})$ (2)
2.2 Background: Distributional representation of phrases (Phrase2Vec)

This section briefly outlines the necessary background on the distributional representation of words and phrases which are current hot research areas in text mining and natural language processing. This area of research is attempting to develop methods for measuring the level of semantic similarities among linguistic elements in large samples of unstructured data. The idea of distributional representation of words and phrases got significant attention from text mining and natural language processing practitioners as these representations could project the semantic similarities of those linguistic units in high dimensional vector space. The seminal work in this dimension that explored beyond traditional syntactic and probabilistic approach was introduced by Tomas Mikolov which represented word unigrams in high dimensional vector spaces [27]. This paper contributed two strong architectures – Continuous Bag of Words (CBOW) and Skip-gram – where the former can predict the current word based on the context and the latter can predict the context based on a given word. These models were initially used to work with word unigrams, later methods for distributed representation of phrases [28], sentences, paragraphs and even documents [26] have been reported.

Figure 1.

Skip-gram model for phrases used in this work. $p(t)$ in the input layer denotes the input/query phrase and $p(t-2)$ , $p(t-1)$ , $p(t+1)$ , $p(t+2)$ in the output layer denotes phrases from distributed representation which are semantically closer to $p(t)$ .

Figure 2.

Overall workflow of the proposed approach.

In this work, we use the Skip-gram architecture for learning distributed representation of phrases identified from tweets and is shown in Fig. 1. In phrase2vec model, a large number of key-phrases are identified and during training, they are treated as individual tokens, to find phrase representations that may predict surrounding phrases in a sentence or given document [28]. Given a phrase (query), this work also utilizes this intuition to find semantically related phrases that are identified from tweets beforehand and represented in a distributed high dimensional vector space.

3. Distributional semantics based information retrieval framework for online social networks (DIRFOS)

This section details the proposed framework for topic modeling powered distributional semantics-based information retrieval from online social networks. The overall workflow of the proposed framework is shown in Fig. 2. The framework consists of two modules in which the first one requires online processing and the second one is executed offline. A detailed description of the modules is outlined in the subsequent sections.

3.1 Online processing

The online processing module of this framework is used for extracting noun-phrases, generating topics and creating distributed vector representation of identified noun-phrases. This proposed method uses an off-the-shelf tool for extracting noun-phrases from pre-processed tweets that will identify and extract semantically valid phrases. The decision of using an off-the-shelf tool was taken as we observed that the custom natural language processing rules based on part-of-speech tagging is computationally expensive in terms of both memory and time thus the scalability of such an approach is very weak. Latent Dirichlet Allocation (LDA) [37] algorithm is then executed on tweets for generating latent topics that are word unigrams. Now we introduce our formula for scoring noun-phrases in topics generated. The score for a noun-phrase $n$ in a given topic $t$ is computed as the product of the probability of $t$ over the document $d$ multiplied by the frequency of noun-phrase in the document. Notationally,

$\displaystyle\text{score}(n,t)=\sum_{\text{message}_{m}}p(t|m)\text{freq}(n,m)$ (3)

where $p(t|m)$ is the scoring for the message (tweets)-topic pair obtained from the LDA and $\text{freq}(n,m)$ is the frequency of noun-phrase $n$ in tweet message $m$ . After calculating the score for each noun-phrase, we filter top $k$ noun-phrases according to the score. Now we propose our algorithm, Algorithm 3.1, which takes a set of topics which are probability distribution over words, generated by Latent Dirichlet Allocation (LDA) [37] and a set of dynamic corpus of tweet messages $D$ , which are collected from Twitter, which is one of the widely used online social networks. Using an off-the-shelf noun-phrase tagger, the algorithm first tags all the noun-phrases from dynamic corpus $D$ . The frequencies of noun-phrase in documents are then computed. The probability distribution of topic over documents, $p(t|m)$ is computed from LDA output. Then the score of already tagged noun-phrases in documents, $\text{score}(n,m)$ calculated as the product of $p(t|m)$ and $\text{freq}(n,m)$ . Finally, we find top- $k$ noun-phrases according to the score, $\text{score}(n,t)$ calculated using Eq. (3).

Topical scoring noun-phrases from tweetsInputInput OutputOutput function ComputeScore $(C,T)$

Corpus of tweets, $C$ and LDA topics, $T$ top- $k$ phrases from tweets, for each topic each message $m$ from the tweet corpus $C$ extract noun-phrases from $m$ topic $t$ and noun-phrase $n$ score $(n,t)=\displaystyle\sum_{{\rm message}_{m}}p(t|m)\text{freq}(n,m)$ for each topic $t$ , find the top- $k$ noun-phrases according to the $\text{score}(n,t)$

Query enhancement and tweet messages scoringInputInput OutputOutput function RankAndRetrieveTweets $(q)$

A phrase query $q$ Ranked set of tweets according to the query (1) For $q$ , find phrases from phrase2vec and enhance the query $q$ to $q^{\prime}$

(2) Retrieve significant topics for $q^{\prime}$ using scoring function, score $(q^{\prime},t)$ , where score $(q^{\prime},t)=\displaystyle\sum_{{\rm message}_{m}}p(t|m)\text{freq}(q^{% \prime},m)$

(3) Use score $(q^{\prime},t)$ , to get topic – message distribution

(4) Select top- $k$ tweets according to the distribution

3.2 Offline processing

The offline module of this framework deals with ranking messages, given a phrase query by the user. The algorithm we have devised for retrieving relevant messages for a user-specified query is given in Algorithm 3.1. Firstly, the phrase query given by the user is enhanced and this is achieved by using the phrase2vec embedding we have created during the online processing stage. The query $q$ is passed to the phrase2vec model and the output (semantically similar phrases for $q$ ) $q^{\prime}$ is used to score the messages. Thus rather than passing a single phrase query, we have a set of phrases given by the phrase2vec model which are semantically related to the input query $q$ . The score of enhanced query $q^{\prime}$ is calculated using the formula $\text{score}(q^{\prime},t)=\sum_{\text{document}_{d}}p(t|m)\text{freq}(q^{% \prime},m)$ where $p(t|m)$ is the probability of a topic $t$ in a tweet message $d$ and $\text{freq}(q^{\prime},d)$ is the number of times enhanced query (obtained from phrase2vec, distributed representation of phrases) term appears in that document. After scoring, we select top- $k$ documents according to the score. Thus in this approach, rather than searching for the presence or absence of user-supplied query, we induct semantics using the distributional representation of phrases in the vector space that will lead multiple ways to score and retrieve relevant messages from the collection. We implemented Algorithm 3.1 and Algorithm 3.1 for showcasing the usefulness and efficiency of our proposed method and a detailed explanation is given in Section 4.

4. Experimental setup

This section details the experimental setup used for our proposed information extraction experiment. A complete description of the dataset used and the experimental testbed details are highlighted.

Dataset description

This work uses messages from Twitter (tweets) for this experiment and collected tweets for 5 hashtags – “#KashmirUnrest”, “#SurgicalStrike”, “#Demonetisation”, “#Election2016” and “#MakeInIndia”. We have collected 25000 tweet messages for each hashtag from March 2017 to August 2017, using SocioViz1

¹
http://www.socioviz.net [Last accessed on 13/03/2020].

platform. Thus a total of 125000 tweets are used for carrying out this experiment. A snapshot of the tweets collected for the hashtag “#DeMonetisation” is shown in Fig. 3.

Figure 3.

A snapshot of tweets collected for the hashtag “#DeMonetisation”

4.1 Phrase extraction

We generated noun-phrases from pre-processed tweets using two different methods. Initially, we have implemented a natural language processing (NLP) based approach and later using an off-the-shelf tool. For the NLP approach, we have first done a parts-of-speech (POS) tagging on pre-processed messages and written NLP rules by combining different POS tags extracted. Some of such rules used for keyphrase extraction are shown in Table 1. More information on POS tags can be obtained from the URL https://cs.nyu.edu/grishman/ jet/guide/PennPOS.html. POS tagged words were combined by using an underscore symbol (“_”) to form phrases. As we found NLP rules based on POS tagging took more time for generating keyphrases compared to TextBlob,2

²
https://textblob.readthedocs.io/en/dev/ [Last accessed on 13/03/ 2020].

which is an off-the-shelf-tool for key-phrase extraction, we decided to use TextBlob library for this experimentation. TextBlob uses NLTK [35], which is a collection of libraries and programs written in Python programming language and is used for symbolic and natural language processing tasks for the English language. Since NLTK [35] is used by industry-level applications that involve natural language processing and TextBlob uses NLTK [35] as its backbone, we found this library performs well in almost all domains in English language and specifically the messages (tweets) we have collected using the mentioned hashtags.

Table 1

Common NLP rules for extracting semantically valid phrases from tweets

Sl. no.	Pattern
1	JJ $+$ NN [Adjective $+$ Noun]
2	JJ $+$ NNS [Adjective $+$ Noun (Plural)]
3	JJ $+$ VBG [Adjective $+$ Verb (gerund or present participle)]
4	JJR $+$ VBG [Adjective (comparative) $+$ Verb (gerund or present participle)]
5	NN $+$ NN [Noun (singular or mass) $+$ Noun (singular or mass)]
6	NN $+$ NNS [Noun (singular or mass) $+$ Noun (plural)]
7	NN $+$ VB [Noun (singular or mass) $+$ Verb (base form)]
8	NN $+$ VBG [Noun (singular or mass) $+$ Verb (Verb, gerund or present participle)]
9	NN $+$ VBN [Noun (singular or mass) $+$ Verb (past participle)]
10	NNP $+$ NNS [Proper noun (singular) $+$ Noun (plural)]
11	NNS $+$ VB [Noun (plural) $+$ Verb (base form)]
12	NNP $+$ NNS [Proper noun (singular) $+$ Noun (plural)]

4.2 Experimental testbed

All algorithms proposed in this paper were implemented using Python 2.7 and run on a server computer with AMD Opteron 6376 2.3 GHz 16 core processor and 16 GB of main memory. The collected tweets were thoroughly pre-processed to remove noises such as web URLs and special characters. Pre-processed tweet messages were saved in plain text files and prepared an experiment ready copy of the dataset. For topic generation from these tweets, we have executed the Latent Dirichlet Allocation (LDA) algorithm and used MALLET (Machine Learning for Language Toolkit) tool available at http://mallet.cs.umass.edu/download.php. The number of iterations of Gibbs sampling used in this experiment is set to 300, as we found that Gibbs sampling usually approaches the target distribution after 300 iterations, using the trial and error approach. The other two parameters $\alpha$ and $\beta$ for LDA is set as $\alpha=50/Z$ and $\beta=0.01$ respectively for this experiment. For generating phrase2vec embedding of the data, we use the public code library available at https:// github.com/zseymour/phrase2vec, which is an implementation of phrase2vec from modified word2vec code using C language. Noun-phrase tagged files are given as input and this code generated the distributed phrase representation of messages.

After the online processing stage, the experiment has been extended for scoring or ranking messages, given a user query. For scoring, we make use of the dynamics of the topic (generated by LDA), phrases (generated by TextBlob) and the messages that all lead multiple ways of reaching the message from the query phrase. Firstly, the given query is enhanced using the phrase2vec embedding generated and stored in the online processing stage of the experiment. Given a phrase query, the pre-trained model will give a set of phrases that are semantically related to the given query and is calculated as the cosine distance between two vectors in the distributional vector space. We use this enhanced query set for extracting relevant messages, rather than using the simple query phrase based information retrieval. Thus the message would be scored high if it contains a lot of phrases that are similar to the query phrase which is our hypothesis for this experiment. We compare our proposed ranking approach with four state-of-the-art baselines such as BM25 [38], a flat position index-based approach [30], an inverted index-based approach [31] and an entity-based retrieval approach [19]. Out of these three approaches, BM25 [38] is a term matching based approach, other two phrase-based retrieval mechanisms that use flat position index and an inverted index as their primary data structures and an entity-language based approach. We implemented them using Python programming language with the help of public code repositories available at https://www.rosettacode.org/wiki/ Inverted_index_Python and https://github.com/matteo- bertozzi/blog-code/blob/master/pyinverted-index/. Detailed analysis and evaluation of results from these experiments are given in the Result and Evaluation section of this paper.

Table 2
Comparison of proposed method with BM25, flat position index, inverted index and entity based baselines for twitter hashtags – “#KashmirUnrest”, “#SurgicalStrike”, and “#Demonetisation”

Method	Hashtags
	“#KashmirUnrest”			“#SurgicalStrike”			“#Demonetisation”
	P	R	F	P	R	F	P	R	F
BM25 [38]	0.6879	0.5578	0.6160	0.6643	0.5210	0.5839	0.6996	0.5945	0.6427
FPI [30]	0.6952	0.6354	0.6639	0.6930	0.7102	0.7014	0.6104	0.6332	0.6215
Inv. idx [31]	0.7855	0.6985	0.7394	0.6969	0.7011	0.6989	0.8014	0.7685	0.7846
Entity based [19]	0.8021	0.8110	0.8065	0.8215	0.7985	0.8139	0.8129	0.8854	0.8476
DIRFOS	0.9114	0.9274	0.9193	0.9375	0.9412	0.9393	0.9199	0.9254	0.9226

Table 3

Comparison of proposed method with BM25, flat position index, inverted index and entity based baselines for twitter hashtags – “#Election2016”, “#MakeInIndia”, and “#TrumpRussia”

Method	Hashtags
	“#Election2016”			“#MakeInIndia”			“#TrumpRussia”
	P	R	F	P	R	F	P	R	F
BM25 [38]	0.6954	0.5874	0.6368	0.5685	0.4995	0.5317	0.6685	0.6235	0.6452
FPI [30]	0.7045	0.7125	0.7084	0.6696	0.7895	0.7246	0.6336	0.7452	0.6848
Inv. idx [31]	0.8022	0.8452	0.8231	0.7785	0.7452	0.7614	0.8457	0.8741	0.8596
Entity based [19]	0.8544	0.8630	0.8586	0.8741	0.8452	0.8594	0.9025	0.8944	0.8984
DIRFOS	0.9527	0.9320	0.9422	0.9740	0.8974	0.9341	0.8952	0.8041	0.8472

5. Results and evaluation

In this section, we describe and evaluate our proposed framework for efficient information retrieval from online social networks. We compare our scoring method with four other state-of-the-art baselines:

•
BM25 (Best Match 25) [38]: In information retrieval, BM25, known as Okapi BM25, is a widely used ranking function by search engines to rank a large collection of documents given a search query. It is a probabilistic retrieval method that act on bag-of-words created from the document collection using the following equation [38]:

$\displaystyle\text{score}(q,d)=\sum_{i=1}^{|q|}\text{idf}(q_{i})\cdot\frac{% \text{tf}(q_{i},d)\cdot(k_{1}+1)}{\text{tf}(q_{i},d)+k_{1}\cdot(1-b+b\cdot% \frac{|d|}{\text{avgdl}})}$ (4)

where, $tf(q_{i},d)$ is $q_{i}$ ’s term frequency in the document $d$ , $|d|$ is the length of the document $d$ in words, and avgdl is the average document length in the text collection from which documents are drawn. For this experiment, we have used the parameters $k1=1$ , $k2=0$ , $k3=1$ and $b=0.5$ for initial set of experiments and then later finalized the values of $k1$ as 1.7 and $b$ as 0.95 by trial and error approach and found these values give optimal results.

Figure 4.
Precision, recall and F-measure comparison for the hashtag – “#KashmirUnrest”.

Figure 5.
Precision, recall and F-measure comparison for the hashtag – “#SurgicalStrike”.

•
FPI (Flat Position Index) based approach [30]: In this method, the authors propose a flat position index for phrase query evaluation. The entire document collection is treated as huge sequence of tokens and each token is represented by one flat position. This method could reduce the index size and speed up phrase querying substantially.
•
Inverted Index based approach [31]: This method introduced a set of inverted indexes that work for strings and phrases that works well for phrase searching. This approach also highlighted top- $k$ based retrieval under relevance metrics like tf-idf and frequency [31].
•
Entity – Language Model based approach [19]: This paper introduced entity-based language models for document retrieval that utilizes information about single terms in the query and documents as term sequences marked as entities. This approach projected the merits of using language models for retrieval tasks and shows that language models can be effectively used for cluster based document retrieval [19].

Figure 6.
Precision, recall and F-measure comparison for the hashtag – “#Election2016”.

Figure 7.
Precision, recall and F-measure comparison for the hashtag – “#Demonetisation”.

Figure 8.
Precision, recall and F-measure comparison for the hashtag – “#MakeInIndia”.

Figure 9.
Precision, recall and F-measure comparison for the hashtag – “#TrumpRussia”.

We compared our proposed algorithm for information retrieval with four different state-of-the-art approaches [38, 30, 31, 19]. The performance comparison in terms of precision, recall and f-measure for hashtags “#KashmirUnrest”, “#SurgicalStrike”, and “#Demonetisation” are shown in Table 2 and the hashtags “#Election2016”, “#MakeInIndia”, and “#TrumpRussia” are shown in Table 3 respectively. From Tables 2 and 3, it is clear that our proposed framework outperforms the chosen baselines in term of precision, recall and f-measure. For the hashtag “#KashmirUnrest”, our method achieved 0.9114, 0.9274 and 0.9193 as precision, recall and f-measure respectively and the entity based approach [19], which is the closest competitor that could achieve an f-measure value of 0.8065 only. For the next four hashtags such as “#SurgicalStrike”, “#Demonetisation”, “#Election2016” and “#MakeInIndia”, our method showed better f-measure which was 0.9393, 0.9226, 0.9422 and 0.9341 respectively. But for the hashtag “#TrumpRussia”, we found that our closest competitor for other hashtags, the entity-based approach [19] showed better performance in terms of f-measure. This method showed 0.8984 as the f-measure and our method could achieve an f-measure of 0.8472 only. Upon further and detailed analysis, we found that the number of noun phrases generated from the tweets against the hashtag “#TrumpRussia” was less compared to the other hashtags. In future work, we may address this by comparing different tools for key-phrase extraction and enrich the process of phrase generation by incorporating custom natural language rules.
6. Conclusions and future work

This paper proposed a framework for better information retrieval from online social networks, using distributional phrase embedding and topic modeling. Our proposed method was successful in utilizing the dynamics of the topics generated by the topic model fused with the keyphrases and phrase embeddings. This addressed the major drawback of the vast majority of existing algorithms that fails to work with dynamic data. Experiments with messages collected from Twitter shows that better ranking and retrieval of relevant messages are possible if topic modeling is integrated with distributional semantics. We have conducted the baseline comparison with some of the state-of-the-art approaches and the proposed approach showed significant improvements. The main limitation of this study is that the proposed approach do not consider the informal nature of the messages and the amount of messages are limited. In the future work, the authors will be working on these dimensions. Also, some of the potential areas that worth investigating further are tuning the parameters of LDA algorithm for topic modeling and phrase extraction, use of sentence2Vec and doc2Vec with this experimental setup for better ranking and retrieval for a larger set of documents.

References

Lin

Guan

. Integrating social annotations into topic models for personalized document retrieval. Soft Computing. 2020; 24(3): 1707-1716.

Vallejos

Alonso

Caimmi

Berdun

Armentano

Soria

. Mining social networks to detect traffic incidents. Information Systems Frontiers. 2020; 24: 1-20.

Wan

Zhang

Tang

. Aminer: Search and mining of academic social networks. Data Intelligence. 2019; 1(1): 58-76.

Richter

Kelly

Haugen

Flores

. U.S. Patent Application. 2019; 10/296, 547.

Neystadt

Karidi

Weisfeild

Varshavsky

Oron

Radinsky

. U.S. Patent No. 9,870,424. Washington, DC: U.S. Patent and Trademark Office; 2018.

Liao

Zhang

Chua

. Attributed social network embedding. IEEE Transactions on Knowledge and Data Engineering. 2018; 30(12): 2257-2270.

Margaris

Vassilakis

Georgiadis

. Query personalization using social network information and collaborative filtering techniques. Future Generation Computer Systems. 2018; 78: 440-450.

Shen

Murzintcev

Song

Cheng

. Information retrieval of a disaster event from cross-platform social media. Information Discovery and Delivery; 2017.

Brambilla

Ceri

, Della Valle

Volonterio

, Acero Salazar

. Extracting emerging knowledge from social media. in: Proceedings of the 26th International Conference on World Wide Web. 2017; 795-804.

10.

Gao

Wang

Shao

Song

. Real-time social media retrieval with spatial, temporal and social constraints. Neurocomputing. 2017; 253: 77-88.

11.

Tolosa

Feuerstein

Becchetti

Marchetti-Spaccamela

. Performance improvements for search systems using an integrated cache of lists+ intersections. Information Retrieval Journal. 2017; 20(3): 172-198.

12.

Kim

Callan

Culpepper

Moffat

. Efficient distributed selective search. Information Retrieval Journal. 2017; 20(3): 221-252.

13.

Gagie

Hartikainen

Karhu

Kärkkäinen

Navarro

Puglisi

Sirén

. Document retrieval on repetitive string collections. Information Retrieval Journal. 2017; 20(3): 253-291.

14.

Troussas

Krouska

Virvou

. Automatic predictions using LDA for learning through social networking services. in: 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI). 2017; 747-751. IEEE.

15.

Steinskog

Therkelsen

Gambäck

. Twitter topic modeling by tweet aggregation. in: Proceedings of the 21st nordic conference on computational linguistics. 2017; 77-86.

16.

Oostdijk

Hürriyetoglu

Puts

Daas

, van den Bosch

. Information extraction from social media: A linguistically motivated approach.

17.

Wang

Zhang

. A text information retrieval method by integrating global and local textual information. in: 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC). 2016; 1: 504-505. IEEE.

18.

Wang

Chen

Liu

Emery

. Identifying search keywords for finding relevant social media posts. in: Proceedings of the AAAI Conference on Artificial Intelligence 2016; 30(1).

19.

Raviv

Kurland

Carmel

. Document retrieval using entity-based language models. in: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. 2016; 65-74.

20.

Yamamura

. Natural disasters and social capital formation: The impact of the Great Hanshin-Awaji earthquake. Papers in Regional Science. 2016; 95: S143-64.

21.

Halberstam

Knight

. Homophily, group size, and the diffusion of political information in social networks: Evidence from Twitter. Journal of Public Economics. 2016; 143: 73-88.

22.

Nakov

Rosenthal

Kiritchenko

Mohammad

Kozareva

Ritter

Stoyanov

Zhu

. Developing a successful SemEval task in sentiment analysis of Twitter and other social media texts. Language Resources and Evaluation. 2016; 50(1): 35-65.

23.

Sekara

Stopczynski

Lehmann

. Fundamental structures of dynamic social networks. Proceedings of the National Academy of Sciences. 2016; 113(36): 9977-9982.

24.

Lim

Chen

Buntine

. Twitter-network topic model: A full Bayesian treatment for social network and text modeling. arXiv preprint arXiv: 1609.06791. 2016.

25.

Anoop

Asharaf

Alessandro

. Generating and visualizing topic hierarchies from microblogs: An iterative latent dirichlet allocation approach. in: 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI). 2015; 824-828. IEEE.

26.

Mikolov

. Distributed representations of sentences and documents. in: International Conference on Machine Learning. 2014; 1188-1196. PMLR.

27.

Mikolov

Chen

Corrado

Dean

. Efficient estimation of word representations in vector space. arXiv preprint arXiv: 1301.3781. 2013.

28.

Mikolov

Sutskever

Chen

Corrado

Dean

. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv: 1310.4546. 2013.

29.

Boughanem

. Information retrieval and social media. in: Modeling Approaches and Algorithms for Advanced Computer Applications. 2013; 7-7. Springer, Cham.

30.

Shan

Zhao

Yan

. Efficient phrase querying with flat position index. in: Proceedings of the 20th ACM International Conference on Information and Knowledge Management. 2011; 2001-2004.

31.

Patil

Thankachan

Shah

Hon

Vitter

Chandrasekaran

. Inverted indexes for phrases and strings. in: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011; 555-564.

32.

Zhao

Jiang

Weng

Lim

Yan

. Comparing twitter and traditional media using topic models. in: European Conference on Information Retrieval. 2011; 338-349. Springer, Berlin, Heidelberg.

33.

Aggarwal

. An introduction to social network data analytics. in: Social Network Data Analytics. 2011; 1-15. Springer, Boston, MA.

34.

Hong

Davison

. Empirical study of topic modeling in twitter. in: Proceedings of the First Workshop on Social Media Analytics. 2010; 80-88.

35.

Loper

Bird

. Nltk: The natural language toolkit. arXiv preprint cs/0205028. 2002.

36.

Wallach

. Topic modeling: Beyond bag-of-words. in: Proceedings of the 23rd International Conference on Machine Learning. 2006; 977-984.

37.

Blei

Jordan

. Latent dirichlet allocation. The Journal of Machine Learning Research. 2003; 3: 993-1022.

38.

Robertson

Walker

Jones

Hancock-Beaulieu

Gatford

. Okapi at TREC-3. Nist Special Publication Sp. 1995; 109: 109.

39.

Fleiss

Cohen

. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement. 1973; 33(3): 613-619.

A distributional semantics-based information retrieval framework for online social networks

Abstract

Keywords

1. Introduction

2.1 Background: Latent Dirichlet Allocation (LDA)

3.1 Online processing

4. Experimental setup

Dataset description

1 http://www.socioviz.net [Last accessed on 13/03/2020].

2 https://textblob.readthedocs.io/en/dev/ [Last accessed on 13/03/ 2020].

Table 2 Comparison of proposed method with BM25, flat position index, inverted index and entity based baselines for twitter hashtags – “#KashmirUnrest”, “#SurgicalStrike”, and “#Demonetisation”

References

¹
http://www.socioviz.net [Last accessed on 13/03/2020].

²
https://textblob.readthedocs.io/en/dev/ [Last accessed on 13/03/ 2020].

Table 2
Comparison of proposed method with BM25, flat position index, inverted index and entity based baselines for twitter hashtags – “#KashmirUnrest”, “#SurgicalStrike”, and “#Demonetisation”