Microbloggers’ interest inference using a subgraph stream

Abstract

Inferring user interest over large-scale microblogs has attracted much attention in recent years. However, the massive data volume, the dynamic change of information and the persistence of microblogs pose challenges to interest inference. Most existing approaches do not consider these characteristics of microblogs jointly within one model, which may incur nontrivial information loss in real-time extraction of user interest and in massive social data processing. To address these problems, we propose a novel approach, User-Networked Interest Topic Extraction in the form of Subgraph Stream (UNITE_SS), for microbloggers’ interest inference. Specifically, we develop several strategies for constructing the subgraph stream and compare them to select the best strategy for user interest inference. Moreover, the information of the microblogs in each subgraph is utilized to obtain real-time and effective interest for microbloggers. The experimental evaluation on a large dataset from Sina Weibo, one of the most popular microblogs in China, demonstrates that the proposed approach outperforms the state-of-the-art baselines in terms of precision, mean reciprocal rank (MRR) and runtime.

Keywords

Information processing, microblog, social network, subgraph stream, user interest inference

1. Introduction

Microblogs, which are important social platforms, enable microbloggers to communicate and share information with each other. In the past decade, many microblogs have been built for business applications and have achieved tremendous success. For example, Facebook has more than two billion registered users, accounting for more than a quarter of the global population. In addition, Twitter has about 319 million monthly active users, and hundreds of millions of tweets are generated every day. Meanwhile, the number of registered users of Sina Weibo has exceeded 500 million, and the daily volume of microbloggers’ posts exceeds 100 million. Inferring microbloggers’ interest from microblogs plays an important role in improving user experience and satisfaction. With interest identified from microblogs, enterprises can provide suitable services such as information searching, targeted advertising and personalized recommendation for users. Therefore, inferring user interest from microblogs has become an emerging issue in recent years. However, the information in these microblogs has unique characteristics such as massive data size, dynamic change of information and persistence, which pose fundamental challenges for user interest inference. Figure 1 shows a simple example of inferring users’ interest in different environments. As shown in the left side of Fig. 1, in traditional environments, the large-scale information of microblogs, which is composed of both massive numbers of users and the posts published by these users, cannot be processed at once by traditional methods for user interest inference. In addition, these posts are updated as time goes by, which further increases the difficulty of inferring user interest.

Figure 1.

User interest inference in different environments.

Generally, most existing methods for microbloggers’ interest extraction pay little attention to the combination of the aforementioned characteristics [1, 2, 3], and take only one of these three characteristics of microblogs into consideration. For example, several significant findings focused only on the dynamic change of information in microblogs [4, 5]. They inferred microbloggers’ real-time interest by analyzing the information of microblogs in a given time interval. However, these methods are batch algorithms, and it is difficult for them to process massive information simultaneously given the ever-increasing number of microbloggers and posts. In addition, several promising findings focused only on the massive data size for microbloggers’ interest extraction [6, 7]. These findings coped with massive social data by utilizing big data technologies such as Hadoop. Nonetheless, characteristics of social data such as dynamicity and persistence are ignored during data processing, and as a consequence the effectiveness of mining user interest is low. Therefore, how to effectively and efficiently infer user interest over large-scale microblogs is a crucial research challenge.

To address the aforementioned challenge, we develop a microbloggers’ interest inference approach that makes use of a data structure called a subgraph stream. The proposed method, named User-Networked Interest Topic Extraction in the form of Subgraph Stream (UNITE_SS for short), is shown in the right side of Fig. 1. The data structure is motivated by the data stream [8]: the characteristics of microblogs are similar to those of a data stream, except that microblogs also carry social relations such as ‘following/follower’ and ‘retweet’. Hence, we name the data structure “subgraph stream”. In UNITE_SS, we transform the traditional processing framework into a “subgraph stream” by means of social relations, and then infer microbloggers’ interest from each subgraph. As can be seen from the right side of Fig. 1, a sequence of subgraphs $\{D_{0},D_{1},D_{2}\}$ is constructed in the time order $\{t_{0},t_{1},t_{2}\}$ by using a subgraph stream construction strategy (details are given in Section 4.3). In each subgraph, the vertices represent users together with the posts published by them, and the edges denote social relations between users. When a subgraph is constructed, the candidate interest of each user in the subgraph can be extracted by applying topic models to all the updated posts in the subgraph. The interest of each user who has not updated posts in the subgraph is then inferred from that of his/her neighbors by using the social relations. For example, in subgraph $D_{1}$, the follower list of $u_{3}$ is $\{u_{1},u_{2}\}$; thus the posts of $u_{1}$ and $u_{2}$ will be updated as those of $u_{3}$ are updated. According to the order in which users update their posts, we can infer each user’s interest in the subgraph. Hence, even if posts are updated as time goes by, we can still obtain real-time interest for microbloggers by constructing and processing subgraphs. Therefore, it is practically significant to infer user interest by taking advantage of the subgraph stream.

The main contributions of this study are summarized as follows:

• Since traditional methods for user interest inference ignore the combination of the massive data size, dynamic change of information and persistence characteristics of microblogs, we propose a microbloggers’ interest inference approach based on the data structure of subgraph stream. Note that the data structure provides not only an effective way for user interest inference but can also be extended to the problem of large-scale graph partitioning with constrained resources.

• To find an effective subgraph stream for user interest inference, several strategies for subgraph stream construction are proposed. To the best of our knowledge, we propose the first subgraph stream approach that considers the sequence of subgraphs and uses correlations between subgraphs to improve user interest inference. Furthermore, to evaluate the effectiveness of the subgraph stream, these strategies are applied to traditional user interest inference methods.

• Experiments are performed on a real dataset from Sina Weibo to demonstrate the efficiency and effectiveness of the proposed approach. Our proposed approach consistently outperforms the state-of-the-art methods in terms of precision, mean reciprocal rank (MRR) and runtime.

The remainder of the paper is organized as follows. Section 2 summarizes the related work. Section 3 gives the formulation of the problems. Section 4 reveals the details of our proposed approach. Furthermore, Section 5 experimentally compares our proposed approach with the state-of-the-art methods on the Sina Weibo dataset and discusses the experimental results. Finally, Section 6 draws conclusions.

2. Related work

There have been several fruitful findings related to inferring user interest from microblogs, and most of these findings have focused on only one characteristic of microblogs, such as the dynamic change of information or the massive data size. Therefore, these findings are classified into dynamic-based interest inference methods and large-scale-based interest inference methods. In addition, several theoretical achievements related to subgraph stream construction are introduced in this section.

2.1 Dynamic-based interest extraction methods

Most previous methods for user interest inference concentrated on the short-text feature of microblogs, without taking into account the massive data size, dynamic change of information and persistence characteristics of microblogs. Some studies enriched the short text of microblogs by using external knowledge bases such as Wikipedia [1, 2]. Meanwhile, other work extended the insufficient context of microblogs by utilizing the social graph structure. For example, Wang et al. [3] proposed a user interest propagation method to infer new users’ interest by combining textual content with social network relationships. User-Networked Interest Topic Extraction (UNITE) [11] was proposed to extract general microbloggers’ interest by using both the contextual information and the ‘following’ relations between microbloggers. Nonetheless, these methods utilized the static information of microblogs within a fixed period of time without considering the dynamic change of information in microblogs.

In recent years, several significant insights have focused on the dynamic characteristic of microblogs to infer user interest. For example, Zarrinkalam et al. [4] inferred users’ interest over the emerging topics on Twitter in a given time interval by exploiting external semantic concepts. Budak et al. [12] proposed a probabilistic generative model to infer online user interest from Twitter. The studies in [5, 13] collected microbloggers’ information in time order, and then proposed variant topic models to identify microbloggers’ interest at each time. Bao et al. [14] proposed a social probabilistic matrix factorization model to predict users’ interest by combining time information and social network structure. However, these methods are batch algorithms and, as the number of microbloggers and their published posts grows, cannot handle large amounts of social data because of resource restrictions. Therefore, these methods are not suitable for inferring microbloggers’ interest over large-scale microblogs.

2.2 Large-scale-based interest extraction methods

With the prevalence of social networks, research on large-scale social data analysis has emerged in several areas to find the value of information. For example, large-scale social network data has been analyzed to discover drug users and potential adverse events [15], user profiles [16], privacy issues [17, 18], influential users [19], user behavior [32] and social user interest [6].

Only a few of these studies focus on inferring users’ interest over large-scale microblogs. Pennacchiotti and Gurumurthy [6] discovered the interest of millions of microbloggers on the basis of Parallel LDA [20] and ran this work, called “ParallelLDA”, on a Hadoop cloud computing architecture of 1000 machines. Spasojevic et al. [7] proposed an infrastructure for inferring users’ topical interest and built it on the Hadoop platform and the querying infrastructure of Hive. However, all these methods directly utilized big data technologies such as Hadoop, Cloud Computing and Big Table to improve the processing performance on large-scale data, and fail to address key characteristics of microblogs such as dynamicity and persistence. Hence, the effectiveness of mining user interest over large-scale microblogs is often poor.

2.3 Community detection methods

Similar to the work of designing a subgraph stream, some community detection approaches [9, 10] can be applied to divide the large-scale information of microblogs into small subgraphs (communities). More specifically, Newman [10] divided the network into communities based on the similarity or the strength of the connection between nodes. Mitrović and Tadić [21] detected communities in the network based on spectral algorithms. Blondel et al. [9] extracted the community structure of large networks based on Modularity [22] optimization. However, these approaches have several drawbacks, because they do not consider 1) the directivity of social relations between users, 2) the sequence of communities (subgraphs), and 3) the balance of nodes across communities, so that some large communities still cannot be processed.

These approaches aim to construct communities (subgraphs) such that the nodes inside a subgraph have strong structural relations. However, nodes belonging to different communities (subgraphs) are weakly connected, and these methods may produce communities (subgraphs) with imbalanced numbers of nodes. In contrast, the goal of our subgraph stream construction is to design a sequence of directed and balanced subgraphs with associations between adjacent subgraphs, which is more consistent with the characteristics of microblogs.

Recently, graph partitioning has been prevalent in scientific computing [26, 27, 28, 29]. Similar to community detection, graph partitioning methods are designed to construct many subgraphs for parallel computing. However, in contrast with our problem setup, these methods do not consider the correlations between subgraphs. Because of these differences, we do not discuss graph partitioning further.

3. Problem formulation

As described in Section 1, due to the massive data size, dynamicity and persistence of microblogs, we propose a subgraph stream for user interest inference. In the following, we formally define a subgraph stream, and then illustrate the research problems.

Definition 1 (Subgraph). A subgraph $D_{\alpha}$ is constructed as a directed graph $G_{\alpha}=\{V,E,W\}$, where $V$ is the set of users, each of whom has an attribute holding the posts published by him/her; $E\subseteq V\times V$ is a set of edges indicating relationships or interactions among the users; and $W\in R^{m_{\alpha}\times m_{\alpha}}$, where $W_{ij}$ indicates the affinity or strength of the relationship between users $v_{i},v_{j}\in V$.

Definition 2 (Subgraph stream). A subgraph stream $S$ is a sequence of $N$ subgraphs denoted by $D=\{D_{1},D_{2},\dots,D_{\alpha},\dots,D_{N}\}$; each subgraph $D_{\alpha}$ is composed of both a number of users connected by social relationships and the posts published by these users, namely $D_{\alpha}=\{\{\langle u_{\alpha_{1}},p_{\alpha_{1}}\rangle\}^{r_{1}}_{1},\dots,\{\langle u_{\alpha_{i}},p_{\alpha_{i}}\rangle\}^{r_{i}}_{1},\dots,\{\langle u_{\alpha_{m_{\alpha}}},p_{\alpha_{m_{\alpha}}}\rangle\}^{r_{m_{\alpha}}}_{1}\}$, where $m_{\alpha}$ $(m_{\alpha}\leqslant M)$ indicates the number of users in the $\alpha^{\text{th}}$ subgraph $D_{\alpha}$, $M$ is the maximum number of users for a subgraph, $p_{\alpha_{i}}$ is a post published by user $u_{\alpha_{i}}$, and $r_{i}$ denotes the number of posts published by user $u_{\alpha_{i}}$; there are some associations between adjacent subgraphs.

Definition 3 (User interest). Given a set of topics $\Gamma_{\alpha}$ extracted from a subgraph $D_{\alpha}$, the interest of a user $u$ is defined as a set of weighted topics $T_{u}=\{(\tau,w(u,\tau))|\tau\in\Gamma_{\alpha},u\in U\}$, where $w(u,\tau)$ denotes the importance of the topic $\tau$ among the topics in $\Gamma_{\alpha}$, and $U$ stands for the set of users.
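The three definitions above can be pictured as plain data containers. The following is a minimal sketch under stated assumptions, not the authors’ implementation: the class name `Subgraph`, its methods and the alias `Interest` are hypothetical, users and topics are represented as strings, and $W$ is stored sparsely as a dictionary keyed by directed edges.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Hypothetical alias: user interest T_u as topic -> weight w(u, tau).
Interest = Dict[str, float]


@dataclass
class Subgraph:
    """One subgraph D_alpha: users with their posts plus weighted directed edges."""
    posts: Dict[str, List[str]] = field(default_factory=dict)           # user -> posts
    weights: Dict[Tuple[str, str], float] = field(default_factory=dict)  # (v_i, v_j) -> W_ij

    def add_user(self, user: str, user_posts: List[str], max_users: int) -> None:
        # Enforce m_alpha <= M from the subgraph stream definition.
        if user not in self.posts and len(self.posts) >= max_users:
            raise ValueError("subgraph already holds M users")
        self.posts.setdefault(user, []).extend(user_posts)

    def add_edge(self, src: str, dst: str, weight: float = 1.0) -> None:
        # E is a subset of V x V, so both endpoints must already be users.
        if src not in self.posts or dst not in self.posts:
            raise KeyError("both endpoints must belong to the subgraph")
        self.weights[(src, dst)] = weight

    @property
    def users(self) -> Set[str]:
        return set(self.posts)


# A subgraph stream is then simply an ordered list of subgraphs.
stream: List[Subgraph] = [Subgraph()]
stream[0].add_user("u1", ["first post"], max_users=4)
stream[0].add_user("u2", [], max_users=4)
stream[0].add_edge("u1", "u2", 0.5)  # u1 follows u2 with strength 0.5
```

Storing $W$ as a dictionary rather than a dense $m_{\alpha}\times m_{\alpha}$ matrix is a design choice of this sketch only; it keeps the example dependency-free.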

Additionally, our work relies on the following two hypotheses.

H1: Candidate users and their graph structural information of microblog are known.

H2: The interest of each user in previous subgraph $D_{\alpha-1}$ is known to all the users in next subgraph $D_{\alpha}$ .

The Problems. We focus on addressing the following problems:

Problem 1: Supposing that a massive number of candidate users in a user list named List and their graph structural information of microblog are known, the objective is to construct a subgraph stream $S$ such that (1) each subgraph has almost the same number of users, (2) the users within a subgraph are closely related, and (3) adjacent subgraphs are closely associated.

Problem 2: Given a subgraph stream $S$, the objective is to infer microbloggers’ interest based on the data structure of the subgraph stream $S$.

To handle these problems effectively, our approach first proposes strategies for constructing a subgraph stream $S$ that meets the conditions of Problem 1. Then, for each subgraph $D_{\alpha}$, user interest is inferred by topic modelling and the social relationships between users, which solves Problem 2.

4. Our proposed UNITE_SS approach

In this section, we first give an overview of our proposed approach (Section 4.1) and its corresponding algorithm UNITE_SS (Section 4.2), and then explore various subgraph stream construction strategies to choose the best subgraph stream for microbloggers’ interest inference (Section 4.3). Finally, we describe some traditional user interest extraction methods and explain how to apply these methods on the data structure of subgraph streams (Section 4.4).

4.1 Overview

To overcome the challenge of the dynamicity and persistence characteristics of microblog information, we propose an approach named UNITE_SS (User-Networked Interest Topic Extraction in the form of Subgraph Stream), which can be divided into two main parts: subgraph stream construction and user interest inference based on the data structure of subgraph stream, as shown in Fig. 2. When a subgraph is constructed, traditional methods for user interest extraction can be performed on each subgraph to infer microbloggers’ interest, where these methods contain 1) candidate user interest extraction, 2) social relations graph building, and 3) user interest ranking.

Figure 2.

The workflow of our proposed UNITE_SS approach.

4.2 UNITE_SS algorithm

Based on the above descriptions, we present the framework of UNITE_SS in the form of pseudo-code, as shown in Algorithm 1. The algorithm takes the candidate users in List and their graph structural information of microblog as input, and then iteratively generates subgraphs by choosing from List the users that satisfy certain conditions and crawling the posts of these users immediately. When a subgraph $D_{\alpha}$ is constructed, the interest of each microblogger in $D_{\alpha}$ is inferred by using traditional user interest methods. The UNITE_SS algorithm proceeds as follows.

Step 1: initialize the subgraph stream $D$ with $D_{1}$, where subgraph $D_{1}$ is a set of users and the posts published by them, and the relations between these users form a strongly connected graph (Line 2 of Algorithm 4.2).

Step 2: delete the users contained in $D_{1}$ from List to ensure that a new subgraph does not contain users that already belong to a previous subgraph (Line 3 of Algorithm 4.2).

Step 3: choose one of the subgraph stream construction strategies proposed in Section 4.3 for selecting users from List (Line 5 of Algorithm 4.2). According to the chosen strategy, select a user $u_{\alpha_{j}}$ from List, and put the user and his/her crawled posts into $D_{\alpha}$ (Lines 5–8 of Algorithm 4.2).

Step 4: delete the user $u_{\alpha_{j}}$ from List and update List to ensure that the user will not be contained in a new subgraph (Lines 9 and 13 of Algorithm 4.2).

Step 5: according to the chosen strategy (detailed in Section 4.3), select from List a user $u_{\alpha_{j}}$ that can form a strongly connected graph with $G_{\alpha}$, and put the user and his/her crawled posts into $D_{\alpha}$ (Lines 11–16 of Algorithm 4.2).

Step 6: once subgraph $D_{\alpha}$ is constructed, the interest of its microbloggers is inferred immediately by traditional methods for user interest extraction (Line 17 of Algorithm 4.2).

Step 7: repeat Steps 3 to 6 until $\alpha>N$ (Lines 6–19 of Algorithm 4.2).

More details regarding the UNITE_SS algorithm are illustrated in Algorithm 4.2.

UNITE_SS algorithm

Input: Candidate users in List and their graph structural information of microblog.

Output: Interest of microbloggers.

$D=\{D_{1}\}$;
$\textit{List}=\{\textit{the followees of every user in }D_{1}\}\setminus D_{1}$;
$\alpha=2$;
Choose a subgraph stream construction strategy (the details are described in Section 4.3);
while $\alpha\leqslant N$ do
  $j=1$; // count the number of users for a subgraph
  $D_{\alpha}=\{\langle u_{\alpha_{j}},p_{\alpha_{j}}\rangle\}$;
  $\textit{List}=\textit{List}\setminus\{u_{\alpha_{j}}\}$; // delete the user from List
  $j=j+1$;
  while $j\leqslant M$ and there are still some users in List that can form a strongly connected graph with $G_{\alpha}$ and $\textit{List}\neq\varnothing$ do
    $D_{\alpha}=D_{\alpha}+\{\langle u_{\alpha_{j}},p_{\alpha_{j}}\rangle\}$; update $G_{\alpha}$;
    $\textit{List}=\textit{List}\setminus\{u_{\alpha_{j}}\}$;
    $j=j+1$;
  end while
  $D_{\alpha}=\{\langle u_{\alpha_{1}},p_{\alpha_{1}}\rangle,\dots,\langle u_{\alpha_{i}},p_{\alpha_{i}}\rangle,\dots,\langle u_{\alpha_{m_{\alpha}}},p_{\alpha_{m_{\alpha}}}\rangle\}$;
  Infer the interest of each microblogger in $D_{\alpha}$ by traditional user interest extraction methods;
  $\alpha=\alpha+1$;
end while
Return Interest of microbloggers.
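To make the control flow of the UNITE_SS algorithm concrete, the loop can be sketched in Python as follows. This is an illustrative sketch under several assumptions, not the authors’ implementation: the function names `strongly_connected` and `unite_ss`, the callbacks `order` (a construction strategy that ranks the remaining candidates) and `infer` (any per-subgraph interest extraction method) are hypothetical; crawling of posts is omitted; the candidate list is passed in directly; and each subgraph is built in a single pass over the ranked candidates, with rejected candidates staying available for later subgraphs.

```python
def strongly_connected(users, followees):
    """True if the follow graph restricted to `users` is strongly connected."""
    users = set(users)
    forward = {u: followees.get(u, set()) & users for u in users}
    backward = {u: set() for u in users}
    for u, outs in forward.items():
        for v in outs:
            backward[v].add(u)

    def reachable(adjacency):
        start = next(iter(users))
        seen, stack = {start}, [start]
        while stack:
            for v in adjacency[stack.pop()]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    # Strongly connected iff one node reaches all others along and against edges.
    return len(users) <= 1 or (
        reachable(forward) == users and reachable(backward) == users
    )


def unite_ss(first, candidates, followees, n_subgraphs, max_users, order, infer):
    """Build up to `n_subgraphs` subgraphs of at most `max_users` users each."""
    stream = [list(first)]                        # D = {D_1}
    pool = [u for u in candidates if u not in first]
    interests = {}
    while len(stream) < n_subgraphs and pool:     # while alpha <= N
        ranked = list(order(pool, stream[-1]))    # chosen construction strategy
        current = [ranked[0]]                     # first user of D_alpha
        pool.remove(ranked[0])
        for user in ranked[1:]:
            if len(current) >= max_users:         # j <= M
                break
            if strongly_connected(current + [user], followees):
                current.append(user)
                pool.remove(user)
        stream.append(current)
        interests.update(infer(current))          # infer interest on D_alpha
    return stream, interests


# Hypothetical six-user follow graph: two mutually following triangles.
followees = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
stream, interests = unite_ss(
    ["a", "b", "c"], ["d", "e", "f"], followees, n_subgraphs=2, max_users=3,
    order=lambda pool, previous: pool,
    infer=lambda users: {u: "topic" for u in users},
)
```

Passing the strategy as a callback mirrors the line “Choose a subgraph stream construction strategy”: any of the strategies of Section 4.3 could be plugged in as `order` without changing the loop.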

4.3 Subgraph stream construction

In this subsection, we propose various strategies for subgraph stream construction to solve Problem 1 mentioned in Section 3. Furthermore, we discuss which subgraph stream construction strategy better improves the quality of the inferred user interest.

Before introducing the subgraph stream construction strategies, we prepare a simple toy example based on Fig. 1 to demonstrate them. There are ten users, $\textit{List}=\{u_{1},u_{2},\dots,u_{10}\}$. The relations among users are presented in Fig. 1, and an edge from $u_{i}$ to $u_{j}$ denotes that $u_{i}$ follows $u_{j}$, namely, $\langle u_{i},u_{j}\rangle$. Hence, the 26 social relations between users are denoted as $R=\{\langle u_{1},u_{2}\rangle,\langle u_{1},u_{3}\rangle,\langle u_{3},u_{1}\rangle,\dots,\langle u_{10},u_{7}\rangle\}$. The total number $N$ of subgraphs is set to 3. Consequently, the maximum number $M$ of users for a subgraph is $M=\lceil 10/3\rceil=4$.

As mentioned in the UNITE_SS algorithm, subgraph stream construction can be transformed into the problem of how to select candidate users from List to construct a strongly connected graph that contains at most $M$ users. Therefore, in what follows, we propose five subgraph stream construction strategies, namely, Random, MaxAssociation, HighFrequency, MaxAssociation $\&$ HighFrequency and ShortDistance.

4.3.1 Random strategy for subgraph stream construction

We propose a subgraph stream construction strategy named Random.

(1) Idea of Random strategy

As mentioned in the UNITE_SS algorithm, to continually select a candidate user $u_{\alpha_{j}}$ from List in order to construct a subgraph $D_{\alpha}$, we generate a random integer $z$ with $0\leqslant z<\textit{sizeOf}(\textit{List})$ as the sequence number in List, where $\textit{sizeOf}(\textit{List})$ denotes the size of List. The $z$-th user of List (counting from 0) is then assigned to $u_{\alpha_{j}}$ and selected as the candidate user to construct the subgraph.
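The selection step of the Random strategy can be sketched in a few lines of Python. The function name `pick_random` and the explicit `rng` parameter are assumptions made for illustration; the index semantics (0 inclusive, list size exclusive) follow the description above.

```python
import random


def pick_random(candidates, rng=random):
    """Remove and return a uniformly chosen candidate from the list."""
    z = rng.randrange(len(candidates))  # 0 <= z < sizeOf(List)
    return candidates.pop(z)            # the z-th user, counting from 0


users = ["u%d" % i for i in range(1, 11)]
chosen = pick_random(users, random.Random(42))  # seeded only for repeatability
```

Because `pop` removes the chosen user, the candidate list shrinks exactly as List does in the UNITE_SS algorithm; with $z=3$ on the full ten-user list, the selected user would be $u_{4}$.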

Figure 3.

Running Random strategy with an example.

(2) Run Random strategy by the toy example

By incorporating Random strategy into UNITE_SS algorithm, a subgraph stream can be obtained. The process of constructing a subgraph stream is given in Fig. 3.

Given $\textit{List}=\{u_{1},u_{2},u_{3},u_{4},u_{5},u_{6},u_{7},u_{8},u_{9},u_{10}\}$, suppose that the initial random number generated is $z=3$. The user $u_{4}$ and the posts published by $u_{4}$ are first selected into the subgraph $D_{1}$, namely, $D_{1}=\{u_{4}\}$. At the same time, $u_{4}$ is deleted from List. As a consequence, List is updated as $\textit{List}=\{u_{1},u_{2},u_{3},u_{5},u_{6},u_{7},u_{8},u_{9},u_{10}\}$.

Furthermore, if the condition described in Line 11 of the UNITE_SS algorithm is true, suppose that a random number $z=0$ is generated; the candidate user $u_{1}$ is then tested for whether it can form a strongly connected graph with $D_{1}=\{u_{4}\}$. If the test succeeds, $u_{1}$ and the posts published by $u_{1}$ are selected into $D_{1}$, $D_{1}$ is updated as $D_{1}=\{u_{1},u_{4}\}$, and $u_{1}$ is deleted from List, so that List is updated as $\textit{List}=\{u_{2},u_{3},u_{5},u_{6},u_{7},u_{8},u_{9},u_{10}\}$; otherwise, $u_{1}$ is also deleted from List and another user randomly chosen from List is tried for $D_{1}$. The construction of $D_{1}$ completes when the condition described in Line 11 of the UNITE_SS algorithm becomes false, at which point $D_{1}=\{u_{4},u_{5},u_{6},u_{7}\}$.

Finally, following the same iterative process, we can respectively generate the subgraphs $D_{2}=\{u_{1},u_{2},u_{3}\}$ and $D_{3}=\{u_{8},u_{9},u_{10}\}$, as shown in Fig. 3.

4.3.2 MaxAssociation strategy for subgraph stream construction

As discussed previously, a subgraph stream can be generated by Random strategy. However, each time Random strategy is executed, the result of subgraph stream is different. Therefore, it is desired to explore relatively stable subgraph stream construction strategies for overcoming this limitation. Here, we propose a subgraph stream construction strategy named MaxAssociation.

(1) Idea of MaxAssociation strategy

Given hypothesis H1 described in Section 3, MaxAssociation follows the principle that the candidate user selected from List to construct a subgraph $D_{\alpha}$ should have the most associations with the users in the previous subgraph $D_{\alpha-1}$. Hence, List is sorted by this principle, and the first user in the sorted List is always selected as the candidate user for constructing subgraph $D_{\alpha}$.

Algorithm 4.3 presents the pseudo-code of MaxAssociation. In Line 3 of Algorithm 4.3, $\textit{sizeOf}(.)$ denotes the size of $D_{\alpha-1}\cap\textit{followees}(u)$ . In Line 5 of Algorithm 4.3, thresholdF is set to filter the influence of the inactive users or spam users, because a user’s followers have more inactive users than his/her followees. In Line 11 of Algorithm 4.3, when a user is selected from List to construct a subgraph, the user will be deleted from List. Hence, List is constantly updated.

MaxAssociation

Input: List and their graph structural information, subgraph $D_{\alpha-1}$.

Output: subgraph $D_{\alpha}$.

for each user $u$ in List do
  $\textit{countU}=\textit{sizeOf}(D_{\alpha-1}\cap\textit{followees}(u))$
  for each user $u_{f}$ in $\textit{followees}(u)$ do
    if $\textit{follows}(u_{f})\geqslant\textit{ThresholdF}$ then
      $\textit{countU}=\textit{countU}+\textit{sizeOf}(D_{\alpha-1}\cap\textit{followees}(u))$
    end if
  end for
end for
Sort users in List by the value of countU in descending order
Continuously select the first user $u$ from the updated List as the candidate user for constructing $D_{\alpha}$
Return subgraph $D_{\alpha}$.

(2) Run MaxAssociation strategy by the toy example

By incorporating the MaxAssociation strategy into the UNITE_SS algorithm, a subgraph stream can be obtained. The process of constructing the subgraph stream is shown in Fig. 4.

Figure 4.

Running MaxAssociation strategy with an example.

Preliminary: Given $D_{1}=\{u_{1},u_{2},u_{3}\}$ and $\textit{List}=\{u_{4},u_{5},u_{6},u_{7},u_{8},u_{9},u_{10}\}$, the followees/followers of each user in List, denoted by $F(.)$, are as follows: $F(u_{4})=\{u_{3},u_{5},u_{6}\}$, $F(u_{5})=\{u_{2},u_{3},u_{4},u_{6}\}$, $F(u_{6})=\{u_{1},u_{4},u_{5},u_{7},u_{10}\}$, $F(u_{7})=\{u_{6},u_{8},u_{10}\}$, $F(u_{8})=\{u_{3},u_{7},u_{9}\}$, $F(u_{9})=\{u_{8},u_{10}\}$ and $F(u_{10})=\{u_{6},u_{7},u_{9}\}$.

First, we can get $D_{1}\cap F(u_{4})=\{u_{3}\}$, $D_{1}\cap F(u_{5})=\{u_{2},u_{3}\}$, $D_{1}\cap F(u_{6})=\{u_{1}\}$ and $D_{1}\cap F(u_{8})=\{u_{3}\}$, whereas the followees/followers of the other users in List have no overlapping users with $D_{1}$. Hence, List can be sorted as $\textit{List}=\{u_{5},u_{4},u_{8},u_{6},u_{7},u_{9},u_{10}\}$ by the number of overlapping users between $D_{1}=\{u_{1},u_{2},u_{3}\}$ and the followees of each user in List.
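The ranking step of this toy example can be reproduced with a short Python sketch. It is a simplification under stated assumptions: the function name `rank_by_association` is hypothetical, the ThresholdF filter of the pseudo-code is omitted, and ties are broken by the original order of List, so users with equal overlap ($u_{4}$, $u_{6}$, $u_{8}$) may appear in a different relative order than in the sorted List above.

```python
def rank_by_association(candidates, neighbours, previous):
    """Sort candidates by how many of their neighbours lie in the previous subgraph."""
    previous = set(previous)
    return sorted(
        candidates,
        key=lambda u: len(previous & neighbours.get(u, set())),
        reverse=True,  # largest overlap first; sorted() keeps ties in input order
    )


# F(.) and D_1 copied from the toy example.
F = {
    "u4": {"u3", "u5", "u6"},
    "u5": {"u2", "u3", "u4", "u6"},
    "u6": {"u1", "u4", "u5", "u7", "u10"},
    "u7": {"u6", "u8", "u10"},
    "u8": {"u3", "u7", "u9"},
    "u9": {"u8", "u10"},
    "u10": {"u6", "u7", "u9"},
}
D1 = {"u1", "u2", "u3"}
ranked = rank_by_association(["u4", "u5", "u6", "u7", "u8", "u9", "u10"], F, D1)
```

With these inputs, $u_{5}$ is ranked first because it is the only user whose neighbourhood overlaps $D_{1}$ in two users.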

Then, based on the principle that the overlap between the followers/followees of the candidate user selected from List and $D_{1}=\{u_{1},u_{2},u_{3}\}$ should be the largest, $u_{5}$ and the posts of $u_{5}$ are selected into $D_{2}$. As a consequence, $D_{2}$ is updated as $D_{2}=\{u_{5}\}$, and List is updated as $\textit{List}=\{u_{4},u_{8},u_{6},u_{7},u_{9},u_{10}\}$.

Next, if the condition described in Line 11 of the UNITE_SS algorithm is true and $u_{4}$ can form a strongly connected graph with $D_{2}=\{u_{5}\}$, then $u_{4}$ and the posts of $u_{4}$ are selected into $D_{2}$, and List is updated as $\textit{List}=\{u_{8},u_{6},u_{7},u_{9},u_{10}\}$; otherwise $u_{4}$ is deleted from List, and the current first user $u_{8}$ in List is tried for $D_{2}$. The construction of $D_{2}$ completes when the condition described in Line 11 of the UNITE_SS algorithm becomes false, at which point $D_{2}=\{u_{4},u_{5},u_{6},u_{10}\}$.

Finally, according to the aforementioned iterative process, we can generate $D_{3}=\{u_{7},u_{8},u_{9}\}$ , as shown in Fig. 4.

4.3.3 HighFrequency strategy for subgraph stream construction

Although MaxAssociation can construct a relatively stable subgraph stream, the effectiveness of this strategy often depends on the initial subgraph. Hence, we propose another strategy named HighFrequency.

(1) Idea of HighFrequency strategy

HighFrequency follows the principle that the candidate user selected from List to construct a subgraph $D_{\alpha}$ should be the one who updates his/her microblog most frequently within a period of time $t$. List is sorted by this principle, and the first user in the sorted List is always selected as the candidate user for constructing $D_{\alpha}$. Algorithm 4.3 presents the pseudo-code of HighFrequency. In Line 3 of Algorithm 4.3, $\textit{count}(.)$ is the number of updates of the user’s microblog. For example, $\textit{count}(\textit{post})$ is the number of posts published by the user; countU is the sum of the numbers of posts, retweets and interactions with other users, which is used to measure how frequently a user updates his/her microblog.

HighFrequency

Input: List and their graph structural information.

Output: subgraph $D_{\alpha}$.

for each user $u$ in List within a period of time $t$ do
  $\textit{countU}=\textit{count}(\textit{posts})+\textit{count}(\textit{retweet})+\textit{count}(\textit{mention})$
end for
Sort List by the value of countU in descending order
Continuously select the first user $u$ from the updated List as the candidate user for constructing $D_{\alpha}$
Return subgraph $D_{\alpha}$.
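The sorting step of HighFrequency can be sketched as follows. The function name `rank_by_activity` and the activity counts are hypothetical and chosen only for illustration; each tuple is assumed to hold the numbers of posts, retweets and mentions of a user within the period $t$.

```python
def rank_by_activity(activity):
    """Sort users by countU = posts + retweets + mentions, most active first."""
    return sorted(activity, key=lambda u: sum(activity[u]), reverse=True)


# Hypothetical counts (posts, retweets, mentions) within one week.
activity = {
    "u1": (2, 0, 1),   # countU = 3
    "u4": (5, 2, 1),   # countU = 8
    "u5": (7, 3, 2),   # countU = 12
}
ranked = rank_by_activity(activity)
```

The first element of `ranked` plays the role of the first user in the sorted List, that is, the next candidate for $D_{\alpha}$.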

(2) Run HighFrequency strategy by the toy example

By incorporating the HighFrequency strategy into the UNITE_SS algorithm, a subgraph stream can be obtained. The process of constructing the subgraph stream is shown in Fig. 5.

Figure 5.

Running HighFrequency strategy with an example.

First, based on the high frequency of each user in List within a week, $\textit{List}=\{u_{1},u_{2},u_{3},u_{4},u_{5},u_{6},u_{7},\linebreak u_{8},u_{% 9},u_{10}\}$ is sorted in descending order by the high frequency of each user in List, which is updated as $\textit{List}=\{u_{5},u_{4},u_{8},u_{9},u_{1},u_{2},u_{3},u_{6},u_{7},u_{10}\}$ .

Furthermore, $u_{5}$ and the posts of $u_{5}$ are selected into $D_{1}$. Hence, $D_{1}$ is updated as $D_{1}=\{u_{5}\}$ and List is updated as $\textit{List}=\{u_{4},u_{8},u_{9},u_{1},u_{2},u_{3},u_{6},u_{7},u_{10}\}$.

Subsequently, if the condition described in Line 11 of the UNITE_SS algorithm is true and $u_{4}$ forms a strongly connected graph with $D_{1}=\{u_{5}\}$, then $u_{4}$ and the posts of $u_{4}$ are selected into $D_{1}$, and List is updated as $\textit{List}=\{u_{8},u_{9},u_{1},u_{2},u_{3},u_{6},u_{7},u_{10}\}$; otherwise $u_{4}$ is deleted from List and the current first user $u_{8}$ is tried in the same way as $u_{4}$. This is repeated until the condition described in Line 11 of the UNITE_SS algorithm becomes false, at which point the construction of $D_{1}$ is complete and $D_{1}=\{u_{5},u_{4},u_{6},u_{7}\}$.

Finally, based on the iterative process mentioned above, we produce $D_{2}=\{u_{8},u_{9},u_{10}\}$ and $D_{3}=\{u_{1},u_{2},u_{3}\}$ in turn.

4.3.4 ShortDistance strategy for subgraph stream construction

The previous strategies may yield weaker associations between adjacent subgraphs because of the uncertainty in selecting the initial users from List to construct subgraphs. Therefore, we propose another strategy named ShortDistance.

(1) Idea of ShortDistance strategy

Given the previous subgraph $D_{\alpha-1}$, we first calculate the distance between the current temporary subgraph $\textit{tmpSubgraph}_{p}$ (obtained by adding user $u_{p}$ from List) and the previous subgraph as

$\displaystyle\textit{Dist}(\textit{tmpSubgraph}_{p},D_{\alpha-1})=\max\{\textit{Dist}(u_{1},D_{\alpha-1}),\dots,\textit{Dist}(u_{p},D_{\alpha-1}),\dots\},\quad 1\leqslant p\leqslant\textit{sizeOf}(\textit{tmpSubgraph}_{p}),$ (1)

where $\textit{Dist}(u_{p},D_{\alpha-1})$ is the distance from user $u_{p}$ in $\textit{tmpSubgraph}_{p}$ to subgraph $D_{\alpha-1}$, calculated on a combined graph composed of $\textit{tmpSubgraph}_{p}$, $D_{\alpha-1}$ and the social relations between them, and $\max\{.\}$ takes the maximum distance from any user in $\textit{tmpSubgraph}_{p}$ to $D_{\alpha-1}$.

Then, the user $u_{q}$ in List that is closest to $D_{\alpha-1}$ is selected, i.e., the user whose temporary subgraph attains the following minimum:

$\displaystyle\min\{\textit{Dist}(\textit{tmpSubgraph}_{1},D_{\alpha-1}),\dots,\textit{Dist}(\textit{tmpSubgraph}_{q},D_{\alpha-1}),\dots\},\quad 1\leqslant q\leqslant\textit{sizeOf}(\textit{List}),$ (2)

where $\min\{.\}$ takes the shortest distance between any candidate $\textit{tmpSubgraph}_{q}$ and $D_{\alpha-1}$.

Based on the above preparations, Algorithm 4.3 presents the ShortDistance algorithm for subgraph stream construction.

Algorithm ShortDistance
Input: List and their graph structural information, subgraph $D_{\alpha-1}$.
Output: subgraph $D_{\alpha}$.
for $q=0;1<q<\textit{List.size}();q++$ do
    put $u_{q}$ into tmpSubgraph
    for each user $u$ in tmpSubgraph do
        if $\max\textit{Dist}_{q}<\textit{dis}(u,D_{\alpha-1})$ then
            $\max\textit{Dist}_{q}=\textit{dis}(u,D_{\alpha-1})$
        end if
    end for
    if $\min\textit{GlobleDist}>\max\textit{Dist}_{q}$ then
        $\min\textit{GlobleDist}=\max\textit{Dist}_{q}$
        $\textit{GlobleDistID}=q$
    end if
    $\textit{tmpSubgraph}=\textit{tmpSubgraph}\setminus u$
end for
Continuously select the first user $u_{\min\textit{GlobleID}}$ from the updated List as the candidate user for constructing $D_{\alpha}$
Return subgraph $D_{\alpha}$.
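The selection rule of Eqs. (1) and (2) can be illustrated with a small Python sketch. It assumes that the distance from a user to $D_{\alpha-1}$ is the hop count to the nearest member of $D_{\alpha-1}$ on an undirected view of the combined graph; the paper does not spell out the distance measure, so this choice, the function names and the toy graph are all assumptions.

```python
from collections import deque


def dist_to_subgraph(adj, u, prev):
    """Hop count from u to the nearest user in prev (BFS); inf if unreachable."""
    if u in prev:
        return 0
    seen, queue = {u}, deque([(u, 0)])
    while queue:
        node, d = queue.popleft()
        for nb in adj.get(node, ()):
            if nb in prev:
                return d + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return float("inf")


def pick_short_distance(adj, current, candidates, prev):
    """Return the candidate whose temporary subgraph is closest to prev."""
    best_user, best_dist = None, float("inf")
    for cand in candidates:
        tmp = current | {cand}  # temporary subgraph for this candidate
        # Eq. (1): distance of the temporary subgraph is the maximum user distance
        max_dist = max(dist_to_subgraph(adj, u, prev) for u in tmp)
        # Eq. (2): keep the candidate with the smallest such distance
        if max_dist < best_dist:
            best_user, best_dist = cand, max_dist
    return best_user, best_dist


# Toy undirected graph: a-b, a-e, b-c, c-d (made-up data).
adj = {"a": {"b", "e"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}, "e": {"a"}}
choice = pick_short_distance(adj, current={"b"}, candidates=["d", "c", "e"], prev={"a"})
```

Because every candidate triggers a fresh search over the graph, the cost of one selection grows with the size of List, which is in line with the runtime behaviour reported for ShortDistance in the experiments.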

(2) Run the strategy of ShortDistance by the toy example

By incorporating the ShortDistance strategy into the UNITE_SS algorithm, a subgraph stream can be obtained as shown in Fig. 6. Compared with the HighFrequency strategy shown in Fig. 5, the subgraph stream in Fig. 6 has more associations between adjacent subgraphs, which benefits user interest extraction. The reason is that the closer the associations between subgraphs are, the more network information can be obtained to further improve the quality of microbloggers’ interest.

Figure 6.
Comparison of ShortDistance with another strategy using an example.

4.3.5 MaxAssociation&HighFrequency strategy for subgraph stream construction

HighFrequency finds active users early, and therefore tends to generate more relations in a subgraph. However, active users who are not influential may lead to fewer users in a subgraph, which degrades the quality of the extracted user interest. Meanwhile, the effectiveness of MaxAssociation depends on the initial subgraph. Therefore, we propose another strategy named MaxAssociation&HighFrequency.

(1) Idea of MaxAssociation $\&$ HighFrequency strategy

This strategy combines MaxAssociation and HighFrequency. List is first sorted by associations, so that subgraph $D_{\alpha}$ is constructed from the candidate users in List that have the most associations with the users in the previous subgraph $D_{\alpha-1}$. If two or more users in List have the same number of overlapping users between their followers/followees and $D_{\alpha-1}$, these users are further sorted by the frequency with which they update their microblogs.
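The two-level ordering described above amounts to sorting with a composite key. The following Python sketch shows one way to express it; the function name, the neighbor sets and the countU values are hypothetical and only serve the illustration.

```python
def combined_order(candidates, neighbors, prev_subgraph, count_u):
    """Order candidates by association with prev_subgraph, then by activity.

    neighbors: user -> set of that user's followers and followees (assumed input).
    count_u:   user -> updating frequency countU (assumed input).
    """
    def overlap(u):
        # Number of the user's followers/followees that belong to D_{alpha-1}
        return len(neighbors[u] & prev_subgraph)

    # Primary key: more overlapping users first; tie-break: higher countU first.
    return sorted(candidates, key=lambda u: (-overlap(u), -count_u[u]))


# Made-up data: u4, u6 and u8 tie on overlap (2 each), u9 has overlap 1.
prev_subgraph = {"u1", "u2", "u3"}
neighbors = {
    "u4": {"u1", "u2"},
    "u6": {"u2", "u3"},
    "u8": {"u1", "u3"},
    "u9": {"u1"},
}
count_u = {"u4": 11, "u6": 5, "u8": 9, "u9": 20}
order = combined_order(["u6", "u9", "u8", "u4"], neighbors, prev_subgraph, count_u)
```

Note that `u9` stays last despite being the most active user in this made-up data, because activity is only consulted to break ties on association.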

(2) Run MaxAssociation $\&$ HighFrequency strategy by the toy example

In the example described for the MaxAssociation strategy, candidate users $u_{4}$, $u_{6}$ and $u_{8}$ have the same number of overlapping users with the previous subgraph $D_{\alpha-1}$, so we sort them by updating frequency and List is updated as $\textit{List}=\{u_{5},u_{4},u_{8},u_{6},u_{7},u_{9},u_{10}\}$. Apart from the order of List, subgraph construction with MaxAssociation $\&$ HighFrequency proceeds in the same way as with MaxAssociation and HighFrequency, and the result is the same as in Fig. 4. Therefore, we do not illustrate the process again with the toy example.

In summary, four algorithms are discussed in Sections 4.2 and 4.3. To better illustrate the associations between the proposed algorithms, a diagram is shown in Fig. 7. As seen from Fig. 7, the algorithm list consists of five strategies for subgraph stream construction, three of which are given a corresponding algorithm, while the MaxAssociation&HighFrequency strategy is a combination of Algorithm 4.3 and Algorithm 4.3. Besides, when Step 3 of the UNITE_SS algorithm is executed, a subgraph stream strategy must be chosen from the algorithm list.

Figure 7.

An illustration diagram showing links between the proposed algorithms.

4.4 User interest inference based on a subgraph

Traditional methods such as UNITE (User-Networked Interest Topic Extraction) [11] and RWTP (Random Walk Topic Propagation) [3] can be extended to the subgraph stream data structure to infer microbloggers’ interest from the dynamic information of microblogs. UNITE is effective for mining microbloggers’ interest from the static information of microblogs [11]. Therefore, we choose UNITE as the example and extend it to each subgraph in three steps: candidate interest topic generation, network construction and interest topic ranking.

(1) Candidate interest generation

Preliminary: Given a subgraph stream $D=\{D_{1},D_{2},\dots,D_{\alpha},\dots,D_{N}\}$, the textual content of each subgraph $D_{\alpha}=\{\{<u_{\alpha_{1}},p_{\alpha_{1}}>\}^{r_{1}}_{1},\dots,\{<u_{\alpha_{i}},p_{\alpha_{i}}>\}^{r_{i}}_{1},\dots,\{<u_{\alpha_{m_{\alpha}}},p_{\alpha_{m_{\alpha}}}>\}^{r_{m_{\alpha}}}_{1}\}$ (as described in Section 3) can be aggregated by user as $D_{\alpha}=\{d_{\alpha,n}\}^{m_{\alpha}}_{n=1}$, where $d_{\alpha,n}$ is a single large document that aggregates all posts of the $n$-th user.
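As a small illustration of this aggregation, the sketch below groups the (user, post) pairs of one subgraph into one document per user. The function name and the sample posts are assumptions for the example only.

```python
from collections import defaultdict


def aggregate_posts(pairs):
    """Merge the (user, post) pairs of a subgraph into one document per user."""
    docs = defaultdict(list)
    for user, post in pairs:
        docs[user].append(post)
    # Join each user's posts, in their original order, into a single document.
    return {user: " ".join(posts) for user, posts in docs.items()}


# Made-up posts for two users of a subgraph.
pairs = [("u1", "football match"), ("u2", "new phone"), ("u1", "league final")]
documents = aggregate_posts(pairs)
```

Each resulting document plays the role of $d_{\alpha,n}$ and can be passed to the topic model in place of the individual short posts.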

To extract candidate interest topics for a user on a subgraph, we run a topic model (such as Twitter-user [23]) over the posts of each subgraph $D_{\alpha}$ to obtain $\Gamma_{\alpha}$ topics. A candidate interest topic list for a user can then be denoted as follows:

$\displaystyle u_{i}:<t_{i1},s_{i1}>;\dots<t_{ij},s_{ij}>\dots$ (3)

where $t_{ij}$ and $s_{ij}$ $(1\leqslant j\leqslant\Gamma_{\alpha})$ denote the index of the $j$-th topic for $u_{i}$ and the initial score of that topic, respectively.

(2) Network construction

Besides the network between the users within a subgraph, there are associations between adjacent subgraphs. We therefore construct, for subgraph $D_{\alpha}$, a directed graph $G_{\alpha}=(V_{\alpha},E_{\alpha},W_{\alpha})$ that aggregates the associations between $D_{\alpha}$ and its previous subgraph $D_{\alpha-1}$, where $V_{\alpha}$ denotes the set of users in subgraphs $D_{\alpha-1}$ and $D_{\alpha}$; $E_{\alpha}$ denotes the set of edges given by the “following” relationships between these users; and $W_{\alpha}$ represents the set of edge weights. The weight of an edge is set to the number of common followees of the two users it connects; the details are described in [11].
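A minimal Python sketch of this graph construction is given below. It assumes that a followee set is available for every user and that an edge $(u,v)$ means "$u$ follows $v$"; the function name and the sample data are hypothetical, and the full weighting scheme of [11] may contain details not reproduced here.

```python
def build_weighted_graph(users, followees):
    """Build the directed, weighted edge set of G_alpha.

    users:     set of users in D_{alpha-1} and D_alpha (the vertex set V_alpha).
    followees: user -> set of accounts that user follows (assumed input).
    Returns a dict mapping each edge (u, v) to its weight.
    """
    edges = {}
    for u in users:
        for v in followees.get(u, set()):
            if v in users and v != u:
                # Edge weight: number of followees shared by u and v
                common = followees.get(u, set()) & followees.get(v, set())
                edges[(u, v)] = len(common)
    return edges


# Made-up followee sets; x, y and z are accounts outside the two subgraphs.
followees = {
    "a": {"b", "x", "y"},
    "b": {"x", "y", "z"},
    "c": {"a", "x"},
}
graph = build_weighted_graph({"a", "b", "c"}, followees)
```

Accounts outside $V_{\alpha}$ never become vertices, but they still contribute to the weights through the common-followee counts.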

(3) Interest ranking

Based on each user’s candidate interest and the graph $G_{\alpha}=(V_{\alpha},E_{\alpha},W_{\alpha})$, we briefly describe the graph-based algorithm of [11] used to rank user interest on a subgraph.

$\textit{topicList}(u_{i})=\{\{<t_{im},s_{im}>\}^{T_{\alpha}}_{m=1}\}$ denotes user $u_{i}$’s candidate topic list, where $T_{\alpha}$ is the total number of topics in a user’s candidate topic list, $t_{im}$ is the $m$-th candidate topic, and $s_{im}$ is the initial score of $t_{im}$ obtained by topic modelling. Suppose $u_{k}$ $(u_{k}\in\textit{neighbor}(u_{i}))$ is a neighbor of $u_{i}$ obtained from the graph $G_{\alpha}=(V_{\alpha},E_{\alpha},W_{\alpha})$, and $\textit{topicList}(u_{k})$ is $u_{k}$’s candidate topic list. If $\textit{topicList}(u_{i})$ and $\textit{topicList}(u_{k})$ share overlapping topics, namely $\textit{overlapList}(u_{ik})$, the score of $t_{ip}$ is influenced by this neighbor, where $p$ $(1\leqslant p\leqslant T_{\alpha})$ is the position of $t_{ip}$ in $\textit{topicList}(u_{i})$. The updated score $s_{ip}$ of $t_{ip}$ under the influence of all neighbors of $u_{i}$ is

$\displaystyle s_{ip}=s_{ip}+\sum_{u_{k}\in\textit{neighbor}(u_{i})}\frac{w_{ik}}{\sum w_{ik}}\ast s_{ip}\ast\textit{isOverLap}(u_{ik})$ (4)

where

$\displaystyle\textit{isOverLap}(u_{ik})=\left\{\begin{array}{ll}1,&\textit{if overlapList}(u_{ik})\neq\varnothing;\\ 0,&\textit{if overlapList}(u_{ik})=\varnothing,\end{array}\right.$

where $\textit{overlapList}(u_{ik})$ can be calculated by running a topic model on the posts of all users in a subgraph. However, each subgraph has a different topic distribution, so we use the cosine semantic similarity [24] to identify overlapping topics, with $\textit{similarity}\geqslant\textit{thresholdS}$ as the “overlap” criterion, where thresholdS is the similarity threshold. That is, if the semantic similarity between two topics is at least thresholdS (empirically set to 0.8), the two topics are considered to share the same meaning, and there is an overlap topic list $\textit{overlapList}(u_{ik})$ between $\textit{topicList}(u_{i})$ and $\textit{topicList}(u_{k})$.
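The score update of Eq. (4) together with the cosine-based overlap test can be sketched as follows. This is a literal reading of Eq. (4) as printed here, in which isOverLap depends only on the user pair; the representation of a topic as a numeric vector, the function names and the sample values are assumptions, and the complete procedure in [11] may differ.

```python
import math

THRESHOLD_S = 0.8  # thresholdS, the empirical similarity threshold from the text


def cosine(a, b):
    """Cosine similarity of two equal-length topic vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def is_overlap(topics_i, topics_k, threshold=THRESHOLD_S):
    # 1 if at least one topic pair is similar enough (overlapList non-empty), else 0
    return int(any(cosine(a, b) >= threshold for a in topics_i for b in topics_k))


def update_scores(scores_i, topics_i, neighbors):
    """Apply Eq. (4) to every topic score of user u_i.

    neighbors: list of (w_ik, topics_k) pairs, one per neighbor u_k.
    """
    total_w = sum(w for w, _ in neighbors)
    if total_w == 0:
        return list(scores_i)
    # Sum of normalized edge weights over neighbors that share a topic with u_i
    boost = sum(w / total_w * is_overlap(topics_i, t_k) for w, t_k in neighbors)
    return [s + s * boost for s in scores_i]


# Made-up example: two topics for u_i and two neighbors with weights 3 and 1.
topics_i = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
scores_i = [0.6, 0.4]
neighbors = [(3, [[0.9, 0.1, 0.0]]), (1, [[0.0, 0.0, 1.0]])]
new_scores = update_scores(scores_i, topics_i, neighbors)
```

In this example only the first neighbor has a topic whose similarity reaches 0.8, so each score is scaled by $1 + 3/4$.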

5. Experimental evaluations

In this section, the process of data collection and the general statistics of the data are described. Subsequently, the evaluation measures are given. Furthermore, the influence of the subgraph construction strategies is explored. Finally, experimental results are presented to show the advantages of our approach over the baselines.

5.1 Dataset

To verify the effectiveness of our proposed approach, we collect data from Sina Weibo, which is one of the most famous and representative microblogs in China. The dataset is crawled as follows.

Initially, we add the user Kai-fu Lee to the initial user list named List, because Kai-fu Lee is a famous and active user of Sina Weibo. Based on the intuition that the followees of a microblogger are most likely to be active users, we collect the followees of each user $v$ in List to form the set $F(v)$, and then set $\textit{List}=\textit{List}\cup F(v)$. This process is repeated until 4 iterations are completed.
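The iterative expansion of List can be sketched as a breadth-first traversal over followee links. In the sketch below, `get_followees` is a hypothetical callback standing in for whatever crawler returns a user's followees; it is not a real Sina Weibo API. Only newly discovered users are expanded in each round, which gives the same List as re-scanning every user as long as followee lists do not change during the crawl.

```python
def expand_user_list(seed, get_followees, iterations=4):
    """Grow List from a seed user by repeatedly adding followees."""
    users = {seed}
    frontier = {seed}
    for _ in range(iterations):
        new_users = set()
        for v in frontier:
            # F(v): followees of v that are not yet in List
            new_users |= set(get_followees(v)) - users
        users |= new_users  # List = List ∪ F(v)
        frontier = new_users
    return users


# Made-up followee graph used in place of a real crawler.
fake_followees = {"s": ["a", "b"], "a": ["c"], "b": ["a"], "c": ["d"], "d": ["e"], "e": ["f"]}
crawled = expand_user_list("s", lambda v: fake_followees.get(v, []))
```

In the toy graph, user `f` is five hops from the seed and is therefore not reached within 4 iterations.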

After completing the iterative process, we obtain a large social stream named SubGraphData by collecting the users in List and their information such as posts, the follower list and the followee list. The most recent 100 posts of each user are collected, and up to 200 followees or followers can be crawled for each user because the maximum visible number of followees or followers in Sina Weibo is 200.

SubGraphData initially contains 259,185,126 original posts created by 2,433,429 users. Some users have more than 200 followees or followers, and on average 42 followees or followers per user are missing due to the crawling restrictions imposed by Sina Weibo and network delay. In addition, because the posts published by users are noisy, we apply the same pre-processing method as [11] to denoise them. After pre-processing, SubGraphData contains 192,309,705 posts (a denoising rate of 25.8%) and 2,072,727 users (a denoising rate of 14.8%). The general statistics of our experimental dataset, which has an average of 54.31 following relations per user, are shown in Table 1. To the best of our knowledge, SubGraphData has more users than any dataset reported in the field of user interest inference.

Table 1
General statistics of SubGraphData

#Users	#Posts	#Relations (following)	#Followers	#Average nouns per post	#Noun vocabulary
2,072,727	192,309,705	112,577,078	319,199,958	6.64	154,703
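The denoising rates and the average number of following relations quoted in the text follow directly from the reported counts. The short Python check below recomputes them; the variable names are chosen for the example, and the reading of "denoising rate" as the fraction of items removed is an assumption that matches the reported figures.

```python
# Counts reported in Section 5.1 and Table 1
raw_posts, clean_posts = 259_185_126, 192_309_705
raw_users, clean_users = 2_433_429, 2_072_727
following_relations = 112_577_078

# Fraction of posts and users removed by pre-processing
post_denoise_rate = 1 - clean_posts / raw_posts
user_denoise_rate = 1 - clean_users / raw_users

# Average number of following relations per remaining user
avg_relations = following_relations / clean_users

print(f"{post_denoise_rate:.1%} {user_denoise_rate:.1%} {avg_relations:.2f}")
```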

5.2 Evaluation measures

To evaluate the quality of user interest inference from microblogs, we use Precision and Mean Reciprocal Rank (MRR), two well-known metrics in natural language processing for assessing the quality of ranked results. Both are widely used in the literature [1, 11, 30, 31] for measuring the quality of user interest. Precision and MRR are briefly described as follows.

(1) Precision

$\displaystyle\textit{Precision}=\frac{c_{\textit{correct}}}{c_{\textit{exact}}}$ (5)

where $c_{\textit{correct}}$ is the total number of correct user interests extracted by an approach, and $c_{\textit{exact}}$ is the total number of automatically extracted interests. Since there is no human-labelled set of correct interests for microblogs that can serve as a benchmark, we follow previous work [1, 11] to construct the benchmark interest set.

(2) MRR

MRR [25] is used to evaluate how the first correct interest for each user is ranked. MRR for the top- $K$ extracted interest is defined as

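$\displaystyle\textit{MRR@}K=\frac{1}{|U|}\sum_{u\in U}\frac{1}{\textit{rank}_{u}}$ (6)

where $K$ denotes the number of extracted interests for each user, $U$ is the set of users, and $\textit{rank}_{u}$ is the rank of the first correct interest for user $u$.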
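Both measures can be computed with a few lines of Python. The sketch below assumes that each user has a ranked list of extracted interests and a benchmark set of correct interests, and that a user with no correct interest among the top $K$ contributes 0 to MRR, which is a common convention rather than something stated in the paper. All names and sample data are hypothetical.

```python
def precision_at_k(extracted, correct, k):
    """Eq. (5): correct extracted interests divided by all extracted interests."""
    n_correct = n_extracted = 0
    for user, ranked in extracted.items():
        top = ranked[:k]
        n_extracted += len(top)
        n_correct += sum(1 for t in top if t in correct.get(user, set()))
    return n_correct / n_extracted if n_extracted else 0.0


def mrr_at_k(extracted, correct, k):
    """Eq. (6): mean reciprocal rank of the first correct interest per user."""
    total = 0.0
    for user, ranked in extracted.items():
        for rank, t in enumerate(ranked[:k], start=1):
            if t in correct.get(user, set()):
                total += 1.0 / rank
                break  # only the first correct interest counts
    return total / len(extracted) if extracted else 0.0


# Made-up ranked interests and benchmark sets for two users.
extracted = {"u1": ["sports", "music", "food"], "u2": ["tech", "travel", "art"]}
correct = {"u1": {"music", "food"}, "u2": {"tech"}}
p3 = precision_at_k(extracted, correct, k=3)
mrr3 = mrr_at_k(extracted, correct, k=3)
```

Here three of the six extracted interests are correct, and the first correct interests appear at ranks 2 and 1.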

5.3 Comparison of subgraph stream strategies

In our experiments, we compare various subgraph stream construction strategies to identify the best strategy for user interest inference in microblogs. To make the comparison fair, we use UNITE as the unified user interest extraction method for all subgraph stream strategies.

5.3.1 Effectiveness comparison of subgraph stream construction strategies

Baseline: We compare four subgraph construction strategies with a community detection method, which serves as the baseline because its aim is similar to that of our approach: users within a subgraph (community) should be densely connected. Specifically, we implement a typical community detection method, referred to as CommunityDetection [9], which is a modularity-based algorithm that performs well in terms of both efficiency and effectiveness.

Figure 8 compares the subgraph construction strategies in terms of MRR, where $R$ users (2, 50, 100, 150 and 200 thousand) are sampled randomly from SubGraphData to construct a subgraph stream for user interest extraction. From the experimental results in Fig. 8, we make the following observations.

Figure 8.

Comparison of all subgraph stream construction strategies using MRR.

First, MaxAssociation $\&$ HighFrequency performs best among all the strategies. The MaxAssociation strategy is inclined to find socially rich users, while the HighFrequency strategy tends to find active users. Combining the two therefore gives priority to influential users when constructing a subgraph, so that the users in the subgraph have a wide range of network relations between them. This kind of network relationship is of great significance for improving the quality of users’ interest.

Second, as the number of sampled users $R$ increases, the value of MRR increases accordingly. This conclusion demonstrates that the greater the number of users in the dataset, the more information it provides to improve the quality of user interest extraction.

Third, CommunityDetection does not perform best. The reason is that this method ignores the direction of social relations between users, and the communities (subgraphs) it constructs contain either a great many users or very few. Furthermore, user interest extracted from communities (subgraphs) with many users cannot keep pace with the users’ updated posts, so the extracted interest is out of date, and some communities cannot be handled well at all because of the large number of nodes they contain.

Fourth, Random shows the worst performance. This is because an inactive user is often selected as the first node for subgraph construction, which may cause the subgraph to have fewer users and fewer associations between these users.

5.3.2 Efficiency comparison of subgraph stream construction strategies

To validate the efficiency of constructing a subgraph stream, Table 2 shows the Runtime of each subgraph stream construction strategy, where the Runtime is divided into two main procedures: the Runtime of subgraph stream construction and the Runtime of user interest extraction on each subgraph.

Table 2
The Runtime of different subgraph stream strategies for user interest extraction

Strategy	Random	MaxAssociation	HighFrequency	ShortDistance	MaxAssociation&HighFrequency	CommunityDetection
Runtime of subgraph stream construction (h)	7.45	8.37	8.26	20.45	8.73	104.28
Runtime of user interest extraction (h)	2.03	2.75	2.7	2.74	2.82	5.95
Total Runtime (h)	9.48	11.12	10.96	23.19	11.55	110.23

According to the results presented in Table 2, we can obtain the following observations.

First, CommunityDetection performs worst in terms of efficiency. This strategy constructs communities (subgraphs) based on the global network, and thus spends more time than the other strategies on constructing subgraphs. Moreover, the numbers of users in the subgraphs constructed by CommunityDetection are not balanced: some subgraphs contain hundreds of thousands of users, whereas the subgraphs constructed by our proposed strategies are balanced and contain no more than $M$ users (described in Section 3). Therefore, CommunityDetection also spends more time than the other strategies on topic modelling and topic ranking.

Second, ShortDistance consumes the most time among our proposed subgraph stream construction strategies, because selecting each candidate user for a subgraph requires searching the whole List, which is time-consuming.

Based on the above experimental results, we choose MaxAssociation $\&$ HighFrequency as the best solution for constructing a subgraph stream, since it achieves the best MRR while its total Runtime (11.55 h) is close to that of the fastest strategies.

5.4 Comparison with baseline approaches using a subgraph stream

To validate the effectiveness of our proposed approach, we compare the performance of different user interest extraction methods by leveraging the same subgraph stream. The baselines and our proposed methods are outlined as follows:

• RWTP [3] based on a subgraph stream, where RWTP is a user interest propagation method that extracts microbloggers’ interest by combining both text and link information. Here, we use the MaxAssociation&HighFrequency strategy for subgraph stream construction and run RWTP on each subgraph to extract users’ interest.
• IFT (Interest Found by Time) [4] based on a subgraph stream, where IFT identifies users’ interest within a specific time interval. The subgraph stream used in IFT is the same as that in RWTP.
• ParallelLDA [6], which extracts user interest by building a large-scale LDA model and implementing it on Hadoop.
• UNITE_SS, which applies UNITE [11] to a subgraph stream, where UNITE is a user interest inference method that extracts microbloggers’ interest from both the contextual information and the ‘following’ relations between microbloggers. The subgraph stream used in UNITE_SS is the same as that in RWTP.

Additionally, we do not directly apply approaches such as RWTP [3], IFT [4] and UNITE [11] to SubGraphData, because these approaches cannot be performed on SubGraphData due to resource constraints.

Table 3 compares the different approaches for user interest inference on SubGraphData in terms of MRR, Precision and Runtime. As can be seen from Table 3, it is feasible to extend traditional approaches for user interest extraction to a subgraph stream, which addresses the problem that user interest cannot otherwise be extracted from large-scale microblogs. Additionally, three observations can be made.

Table 3
Results of different approaches for user interest inference on SubGraphData

Approaches	A subgraph stream based on IFT	ParallelLDA	A subgraph stream based on RWTP	UNITE_SS
Runtime (h)	20.78	3.07	12.34	11.55
MRR	0.623	0.478	0.745	0.834
Precision@3	0.601	0.426	0.719	0.813

First, UNITE_SS is more effective and more efficient than RWTP based on a subgraph stream, which is consistent with the results of [3]. RWTP ranks users’ candidate interest relying heavily on the influence of their neighbors and only slightly on their posts, which is not suitable for extracting the interest of general users. In contrast, UNITE balances the effects of textual posts and social neighbors to improve performance. The details can be found in [11].

Second, IFT based on a subgraph stream does not perform better than RWTP based on a subgraph stream. IFT identifies user interest relying only on the posts published by microbloggers and ignores the social relations between them, which leads to poor quality of the inferred interest. In addition, IFT spends more time than RWTP because it incorporates external knowledge such as Wikipedia.

Third, ParallelLDA performs best on efficiency and worst on effectiveness. It is the most efficient because big data technology such as Hadoop is applied to implement LDA for user interest extraction. On the other hand, ParallelLDA only uses the short and noisy posts published by microbloggers, and LDA is more effective on regular documents such as news than on short posts, which leads to poor effectiveness.

In summary, our proposed approach UNITE_SS takes the trade-off between effectiveness and efficiency into consideration, and provides a better solution for inferring user interest over large-scale microblogs than the state-of-the-art baseline approaches.
6. Conclusions

In this paper, we proposed a microbloggers’ interest inference approach based on a subgraph stream. We first proposed different strategies for subgraph stream construction. Furthermore, we introduced the steps for extending traditional user interest extraction approaches to each subgraph. Compared with previous work, which only leveraged the static information of microblogs or little information about microbloggers to extract interest, our proposed approach UNITE_SS can extract real-time and effective interest for microbloggers from dynamic, persistent and large-scale microblogs. The subgraph stream data structure proposed in our work addresses the problem of extracting microbloggers’ interest at this scale, and the idea of a subgraph stream can also provide a solution to the problem of large-scale graph partitioning with constrained resources. In our future work, we will study how to improve the efficiency of subgraph stream construction and use more test data to validate the robustness of our proposed approach.

Acknowledgments

This work was supported in part by the National Key Research and Development Program of China under grant 2016YFB1000900, the National Natural Science Foundation of China (NSFC) under grants 61503114 and 61906060, the Key Project of the Natural Science Foundation of Educational Commission of Anhui Province under grants KJ2019A0642 and KJ2018A0432, the Anhui Provincial Natural Science Foundation under grants 1808085MF177 and 2008085MF224, and the Project of the Natural Science Foundation of Educational Commission of Anhui Province under grant KJ2018B03.

References

1. Fan, Zhou and Zheng T.F., Mining the personal interests of microbloggers via exploiting wikipedia knowledge, in: Proceedings of the 15th International Conference on Intelligent Text Processing and Computational Linguistics, Kathmandu, Nepal, 2014, pp. 188–200.

2. Michelson and Macskassy S.A., Discovering users’ topics of interest on twitter: A first look, in: Proceedings of the 4th Workshop on Analytics for Noisy Unstructured Text Data, Toronto, Canada, 2010, pp. 73–80.

3. Wang T.T., Liu H.Y. and Du, Mining user interests from information sharing behaviors in social media, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Gold Coast, Australia, Vol. 7819, 2013, pp. 85–98.

4. Zarrinkalam, Fani, Bagheri, Kahani and Du, Semantics-enabled user interest detection from twitter, in: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Singapore, 2015, pp. 469–476.

5. Nie, Jia, Zhu and Zhou, Identifying users across social networks based on dynamic core interests, Neurocomputing 210 (2016), 107–115.

6. Pennacchiotti and Gurumurthy, Investigating topic models for social media user recommendation, in: Proceedings of the 20th International Conference on World Wide Web, Hyderabad, India, 2011, pp. 101–102.

7. Spasojevic, Yan, Rao and Bhattacharyya, LASTA: large scale topic assignment on multiple social networks, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, USA, 2014, pp. 1809–1818.

8. Henzinger M.R., Raghavan and Rajagopalan, Computing on data streams, Tech. Rep. SRC-TN-1998-011, May, 1998, 107–118.

9. Blondel V.D., Guillaume J.L., Lambiotte and Lefebvre, Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment P10008 (2008), 155–168.

10. Newman M.E., Fast algorithm for detecting community structure in networks, Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 69(6133) (2004), 1–5.

11. Wang, Huang X.L. and Li, Microblog oriented interest extraction with both content and network structure, Intelligent Data Analysis 22(3) (2018), 515–532.

12. Budak, Kannan, Agrawal and Pedersen, Inferring user interests from microblogs, Microsoft Research Lab, New York, USA, Tech. Rep. MSR-TR-2014-68, May, 2014.

13. Sang J.T., D.Y. and Xu C.S., A probabilistic framework for temporal user modeling on microblogs, in: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, Melbourne, Australia, 2015, pp. 961–970.

14. Bao, Liao S.S., Song and Gao, A new temporal and social PMF-based method to predict users’ interests in micro-blogging, Decision Support Systems 55(3) (2013), 698–709.

15. Bian, Topaloglu and Yu, Towards large-scale twitter mining for drug-related adverse events, in: Proceedings of the 2012 International Workshop on Smart Health and Wellbeing, Maui, Hawaii, USA, Vol. 286, 2012, pp. 25–32.

16. Shmueli-Scheuer, Roitman, Carmel, Mass and Konopnicki, Extracting user profiles from large scale data, in: Proceedings of the Workshop on Massive Data Analytics on the Cloud, Vol. 4, 2010, pp. 1–6.

17. Smith, Szongott, Henne and Voigt G.V., Big data privacy issues in public social media, in: The 6th IEEE International Conference on Digital Ecosystems and Technologies, Campione d’Italia, Italy, 2012, pp. 1–6.

18. Abu-Salih, Wongthontham and Chan K.Y., Twitter mining for ontology-based domain discovery incorporating machine learning, Journal of Knowledge Management 22(5) (2018), 949–981.

19. Herzig, Mass and Roitman, An author-reader influence model for detecting topic-based influencers in social media, in: Proceedings of the 25th ACM Conference on Hypertext & Social Media, Santiago, Chile, 2014, pp. 46–55.

20. Smola A.J. and Narayanamurthy S.M., An architecture for parallel topic models, in: Proceedings of the Very Large Data Bases Endowment, Vol. 3(1), 2010, pp. 703–710.

21. Mitrović and Tadić, Spectral and dynamical properties in classes of sparse networks with mesoscopic inhomogeneities, Physical Review E 80(26123) (2009), 1–11.

22. Newman M.E.J. and Girvan, Finding and evaluating community structure in networks, Physical Review E 69(26113) (2004), 1–15.

23. Z.H., Xiang and Yang, Discovering user interest on twitter with a modified author-topic model, in: IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, Lyon, France, 2011, pp. 422–429.

24. Manning C.D., Raghavan and Schütze, Introduction to information retrieval, USA: Cambridge University Press, 2008.

25. Voorhees E.M., The TREC-8 question answering track report, in: Proceedings of the 8th Text Retrieval Conference, 1999, pp. 77–82.

26. Zainab, Vasiliki, Paris and Vladimir, Streaming Graph Partitioning: An Experimental Study, in: Proceedings of the VLDB Endowment, Vol. 11, 2018, pp. 1590–1603.

27. Lin, Ooi B.C., Wan and Yu, Scalable Distributed Stream Join Processing, in: ACM Sigmod International Conference, 2015, pp. 811–824.

28. Pacaci and Tamer Özsu, Experimental Analysis of Streaming Algorithms for Graph Partitioning, in: 2019 International Conference on Management of Data, 2019, pp. 1375–1392.

29. Chaudhry H.N., FlowGraph: Distributed temporal pattern detection over dynamically evolving graphs, in: Proceedings of DEBS’19, 2019, pp. 272–275.

30. Zhen and Lin C.Y., Improving user interest inference from social neighbors, in: ACM International Conference on Information & Knowledge Management ACM, 2011, pp. 1001–1006.

31. Piao G.Y. and Breslin J.G., Inferring User Interests for Passive Users on Twitter by Leveraging Followee Biographies, in: European Conference on Information Retrieval, 2017, pp. 122–133.

32. Chen H.H., Jin and Wu S.L., Minimizing inter-server communications by exploiting self-similarity in online social networks, IEEE Transactions on Parallel and Distributed Systems 27(4) (2016), 1116–1130.