TSPS: A Topic based Shortest Path Set algorithm for influence maximization

Abstract

This paper focuses on the influence maximization problem in social networks, which aims to find a set of influential nodes that maximizes the spread of information. Most existing approaches adopt a uniform propagation probability and do not consider topic information. Moreover, the classic Independent Cascade Model and its approximations suffer from long running times. To overcome these limitations, this paper proposes a Topic based Shortest Path Set algorithm (TSPS). A comprehensive set of experiments conducted on large real-world networks shows that the proposed algorithm achieves better results in terms of both influence spread and running time.

Keywords

Influence maximization, social networks, topic-based, heuristic algorithm

1. Introduction

As we all know, the social media cloud [1, 2, 3, 4] has gradually become a data storage platform for various social networks, such as Facebook, Bright Cloud, and Twitter. There is little doubt that the number of people publishing ideas or opinions on social networks is steadily increasing. Thus a large amount of information, such as ideas and topics, can propagate by “word-of-mouth” in social networks [5].

Consider the following example from product marketing. To advertise a new product, a viral marketing campaign selects a few initial users who are likely to recommend the product to a large number of other users, so that the influence is maximized, mostly in the form of “word-of-mouth”. Other examples, such as rumor blocking and the analysis of people’s behavior, can be treated similarly. In all these examples, the natural question is how to identify the influential users whose influence over others is the largest. This problem, named influence maximization [6], is to select the top-K influential nodes that maximize the total number of affected nodes in a social network. Following the literature [7, 8], this paper focuses on influence maximization in large-scale networks.

Over the last several years, influence maximization has attracted considerable research interest. Several models simulate information dissemination in networks, among which the Independent Cascade (IC) Model [6, 9] is widely used. The IC model is a well-known probabilistic model: each node is associated with the same influence probability for activating its neighboring nodes, and this probability indicates the power of one node to influence its neighbors. Note that the probability is predefined and does not take important information into account, such as topics and user interest in different topics. However, topic sensitivity has been observed in many social networks [10, 11, 12, 13]: a user usually follows his friends because they share the same topics, and his friends probably follow back for the same reason. In other words, different topics propagate differently in real networks. Therefore, how to define the probability deserves consideration in order to obtain more realistic results. Moreover, both the IC model and its approximate approaches, such as the greedy algorithm [6], have a serious drawback: their long running time prevents them from scaling to large networks, as discussed in Section 2.

Since real social networks are usually topic-sensitive and large-scale, it is necessary to develop a novel algorithm that considers both topics and running time. To address the above problems, this paper proposes the heuristic algorithm TSPS, which models topic-level influence maximization on large-scale networks. Specifically, the proposed TSPS combines (1) the results of a predefined topic distillation based on Latent Dirichlet Allocation (LDA) [14, 15, 16]; (2) an extended IC model that accommodates user interest in different topics; and (3) a heuristic strategy based on the shortest path set [17, 18]. Extensive experiments illustrate that, compared with other methods (including the original greedy method and several existing heuristic algorithms), the TSPS algorithm achieves significantly better results in terms of influence spread and running time for specific topics.

In short, the main contribution of this paper is to propose and evaluate a topic-based heuristic method that addresses the influence maximization problem in social networks. The contributions are summarized below.

•
Having formally stated the topic-level influence maximization problem, we present a generative EIC model that captures the topic link information in social networks. Unlike the original IC model, which simply assumes uniform propagation in the influence diffusion process [9], we quantitatively measure the strength of topic-level influence. In particular, we take the results of topic distillation based on LDA and extend the traditional IC model so that it adapts to the user preferences derived from those results.
•
We propose a novel heuristic strategy based on the shortest path set to select the most influential users. Specifically, we first compute the maximum propagation probability paths between node pairs and then select the maximum ones to construct the shortest path set, which maximizes the number of nodes activated by the selected seeds.
•
Contrast experiments against related algorithms on large real-world networks show that the proposed algorithm provides significantly better results, owing to the combination of topic information and the heuristic strategy.

The rest of this paper is structured as follows. Section 2 outlines the related works. Section 3 describes the TSPS algorithm we developed. Section 4 presents the experimental results and validates the computational efficiency of our method. Finally, Section 5 concludes the paper.

2. Related works

Our work lies at the intersection of influence maximization and topic targeting. Here we give an overview of the related literature on these topics.

Much effort has been devoted to influence maximization, and there is a large body of pioneering work. Usually, users are modeled as nodes and the connections among them as edges, so that the resulting graph represents the social network. Influence propagates through the network under a stochastic cascade model [9] with specific rules; here we take the IC Model as an example. Consider a directed graph $G=(V,E)$. Influence spreads between nodes as follows. Suppose that a node $u\in V$ becomes active at step $t-1$. It then has a single chance to activate each of its inactive neighbors $v$ [15] through edge $(u,v)$ with probability $p$. If $u$ succeeds, the inactive node $v$ becomes active at step $t$. Whether or not $u$ succeeds, it has no further chance to activate $v$ in subsequent steps [9]. The process stops when no further activation is possible.
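The diffusion rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the adjacency-list layout of `graph`, the function names `simulate_ic` and `estimate_spread`, and the default number of runs are all assumptions made for the example.

```python
import random


def simulate_ic(graph, seeds, p, rng):
    """Run one Independent Cascade simulation and return the set of active nodes.

    graph: dict mapping a node to a list of its out-neighbors (assumed layout).
    seeds: iterable of initially active nodes.
    p:     uniform activation probability used on every edge.
    rng:   a random.Random instance supplied by the caller.
    """
    newly_active = list(dict.fromkeys(seeds))  # de-duplicated, order preserved
    active = set(newly_active)
    while newly_active:
        next_round = []
        for u in newly_active:
            for v in graph.get(u, ()):
                # u gets exactly one chance to activate each inactive neighbor v.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_round.append(v)
        newly_active = next_round  # only fresh nodes try again in the next step
    return active


def estimate_spread(graph, seeds, p, runs=1000, seed=0):
    """Average the number of activated nodes over repeated simulations."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(graph, seeds, p, rng)) for _ in range(runs))
    return total / runs


toy_graph = {"a": ["b", "c"], "b": ["c"], "c": []}
spread = estimate_spread(toy_graph, ["a"], p=0.1)
```

Because every simulation is random, the expected spread of a seed set is usually estimated by averaging many runs, which is what `estimate_spread` does.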

Note that the probability in the IC model is a predefined uniform parameter. As a result, the model neglects important information about users, such as their interest in different topics. Different topics propagate differently in a network: the more interested a node and its neighbors are in a topic they share, the more likely the influence is to happen. Therefore, without topic information, the results of influence maximization may be inaccurate. Consequently, this paper focuses on quantitatively measuring the strength of topic-level influence by using the LDA algorithm. The works most similar to ours are LeaderRank [20] and TwitterRank [10]. The LeaderRank algorithm is designed to detect opinion leaders in BBS, which involves finding interest groups of users based on topic analysis. The TwitterRank algorithm measures topic-sensitive influence on Twitter by using LDA. The reported experimental results show that both LeaderRank and TwitterRank outperform the original PageRank [21] and other related algorithms, and both works conclude that taking topical similarity into account addresses shortcomings of PageRank. However, LeaderRank and TwitterRank are both PageRank-based methods for detecting opinion leaders in networks, and they are not suitable for the influence maximization problem.

In pioneering work, influence maximization has been proved to be an NP-hard problem [7, 22]. Consequently, a greedy approximation was proposed, which is clearly superior to classic methods such as the degree centrality heuristic. However, the greedy approach is time-consuming on large-scale social networks. Later studies therefore try to improve the efficiency of the algorithm so that it scales to large social networks. Even after these improvements, the greedy algorithm and its speed-ups, such as NewGreedy [22], MixGreedy [23] and TLGA [24], still take a long time on large social networks and can become infeasible for real networks. Consequently, several recent studies address the efficiency issue by proposing fast heuristic algorithms, such as Degree Discount Heuristic [23] and TLPA [24]. Their results shed new light on the influence maximization problem: rather than further improving the time efficiency of the original greedy algorithm, it may be more fruitful to propose new heuristic methods.

Following the above discussion, to overcome the efficiency limitation of existing algorithms, it is necessary to look for alternatives, such as combining topic information with an efficient heuristic strategy. Along this line, this paper develops a topic-based heuristic influence maximization algorithm, TSPS, which takes the influence of topics into account. Additionally, it adopts a heuristic strategy based on the shortest path set to avoid excessive running time.

3. Topic-based influence maximization

3.1 Overview of our method

This paper introduces the topic-based heuristic algorithm TSPS, which combines topic information and a heuristic strategy efficiently; its effectiveness is verified in Section 4. The general structure of our method is illustrated in Fig. 1. The TSPS algorithm works in three stages. First, to automatically recognize the topics of interest to the users, we conduct topic distillation based on LDA. Second, based on the distilled topics, we extend the original IC model to incorporate the user interest obtained in the first stage, which provides the input for the subsequent stage. Finally, considering both the link structure and the topic similarity between users, we apply a novel heuristic strategy based on the shortest path set to evaluate the influence of each user.

Figure 1.

The general structure of the proposed approach TSPS.

3.2 Topic distillation based on LDA

Topic distillation is intended for quantitative topic-level analysis based on the input network and the topic distribution of each node. For this purpose we apply LDA, a hierarchical Bayesian model with Dirichlet priors that can recognize latent topic information in a large document corpus. LDA analyzes a corpus to produce a word distribution for each topic and a topic distribution for each document.

In particular, each document in a corpus of $D$ documents is associated with a distribution over $T$ topics, denoted by $\varTheta$. Also, each topic is associated with a distribution over the words $w=\{w^{(1)},w^{(2)},\ldots,w^{(D)}\}$, denoted by $\phi$. $\varTheta$ and $\phi$ have Dirichlet priors with concentration parameters $\alpha$ and $\beta$, respectively. For each word position in a document $d$, a topic $z$ is sampled from the distribution $\varTheta$ of the document, and then a word $w$ is sampled from the distribution $\phi$ of topic $z$. The generation of document $d$ terminates when the number of repetitions reaches $N_{d}$, the total number of words in document $d$.
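To make the generative process concrete, the following sketch draws one toy document. It is only an illustration under stated assumptions: the helper names `sample_dirichlet` and `generate_document`, the toy sizes, and the concentration values are hypothetical, and words are represented by integer ids instead of strings.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_document(n_words, phi, alpha, rng):
    """Generate the topic mixture and the word ids of one document.

    phi[z] is the word distribution of topic z (a list of probabilities).
    alpha is the concentration parameter of the Dirichlet prior on the topic mixture.
    """
    n_topics = len(phi)
    theta = sample_dirichlet(alpha, n_topics, rng)  # topic mixture of this document
    words = []
    for _ in range(n_words):
        z = rng.choices(range(n_topics), weights=theta)[0]      # sample a topic
        w = rng.choices(range(len(phi[z])), weights=phi[z])[0]  # sample a word of that topic
        words.append(w)
    return theta, words


rng = random.Random(0)
n_topics, vocab_size = 3, 6  # toy sizes, chosen only for the example
phi = [sample_dirichlet(0.5, vocab_size, rng) for _ in range(n_topics)]
theta, words = generate_document(8, phi, alpha=1.0, rng=rng)
```

Inference runs in the opposite direction: given only the observed words, it estimates the topic mixtures and the per-topic word distributions.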

Formally, given the hyper-parameters $\alpha$ and $\beta$ , the joint distribution of a topic distribution $\varTheta$ , a set of $N$ topics $z$ , and a set of $N$ words $w$ is defined as:

$\displaystyle p(\varTheta,z,w\mid\alpha,\beta)=p(\varTheta\mid\alpha)\prod_{n=1}^{N}p(z_{n}\mid\varTheta)\,p(w_{n}\mid z_{n},\beta)$ (1)

where $p(z_{n}\mid\varTheta)$ is simply $\varTheta_{i}$ for the unique $i$ such that $z_{n}^{i}=1$. Integrating over $\varTheta$ and summing over $z$, we obtain the marginal distribution of a document:

$\displaystyle p(w\mid\alpha,\beta)=\int p(\varTheta\mid\alpha)\left(\prod_{n=1}^{N}\sum_{z_{n}}p(z_{n}\mid\varTheta)\,p(w_{n}\mid z_{n},\beta)\right)d\varTheta$ (2)

Finally, the marginal probabilities of the $D$ individual documents are multiplied to obtain the probability of the corpus:

$\displaystyle p(D\mid\alpha,\beta)=\prod_{d=1}^{D}\int p(\varTheta_{d}\mid\alpha)\left(\prod_{n=1}^{N_{d}}\sum_{z_{d_{n}}}p(z_{d_{n}}\mid\varTheta_{d})\,p(w_{d_{n}}\mid z_{d_{n}},\beta)\right)d\varTheta_{d}$ (3)

Following the generative process, the LDA model can be represented as the probabilistic graphical model shown in Fig. 2. The model has three levels: corpus level, document level, and word level. The corpus-level parameters $\alpha$ and $\beta$ are sampled once in the process of generating a corpus. The document-level variables $\varTheta_{d}$ and $\varphi_{d}$ are sampled once per document. The word-level variables $z_{d_{n}}$ and $w_{d_{n}}$ are sampled once for each word in each document.

Figure 2.

Graphical representation of LDA model.

To use LDA for distilling the topics users prefer, we treat each published post as a document. In addition, we apply the Gibbs sampling algorithm to estimate the model parameters $\varTheta$ and $\phi$.

3.3 IC model extension

Based on the results of the predefined topic distillation on the networks, we obtain the topic distribution, that is, the probability distribution of users’ interest over the different topics. Formally, given a directed network $G=(V,E)$, each node $u\in V$ is associated with a $T$-dimensional topic distribution vector $r_{u}\in\mathbb{R}^{T}$ with $\sum_{z}r_{u}^{z}=1$. Each element $r_{u}^{z}$ is derived from the number of times words in the posts of $u$ are assigned to topic $z$, and it captures the probability that user $u$ is interested in topic $z$. With these vectors, the topical similarity between users can be defined.

Definition 1 (Topical similarity). The topical similarity between users $u$ and $v$ can be defined as:

$\displaystyle\textit{SIM}^{z}_{uv}=\frac{r_{u}^{z}*r_{v}^{z}}{\sqrt{\sum^{T}_{z=1}(r_{u}^{2}*r_{v}^{2})}}$ (4)

where $r_{u}$ and $r_{v}$ are the topic distributions of users $u$ and $v$, respectively (obtained from the topic distillation stage), and $T$ is the total number of topics.

The topical similarity describes the cosine similarity between two nodes, and it ranges from 0 to 1. The value 1 implies that the users are closely related in terms of topics, while the value 0 implies that the users are quite different.

From Definition 1, we obtain a set of topical similarity vectors that describes the users’ influence on each topic. This set can likewise be used to evaluate the overall influence of users on their neighbors. As discussed in Section 2, a uniform probability, as used in the traditional IC model, cannot capture the users’ interest in different topics. Therefore, we let the probability depend on user preference: the more interested users $u$ and $v$ are in a topic $z$, the more likely the influence is to happen. The activation probability between users $u$ and $v$ is introduced to capture this, as follows:

Definition 2 (Activation probability). The activation probability $p_{uv}$ between users can be measured by the topical similarity vectors set in different topics, as shown below.

$\displaystyle p_{uv}=\sum_{z=1}^{T}{\textit{SIM}^{z}}_{uv}r^{z}_{u}r^{z}_{v}$ (5)
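As a rough numerical illustration of Definitions 1 and 2, the sketch below computes the per-topic similarity and the activation probability for two toy topic vectors. It assumes that the denominator of Eq. (4) is read as $\sqrt{\sum_{z}(r_{u}^{z}r_{v}^{z})^{2}}$, which is only one possible reading of the printed formula; the function names and the toy vectors are hypothetical.

```python
import math


def topical_similarity(r_u, r_v):
    """Per-topic similarity SIM^z_uv for every topic z, under the assumed reading of Eq. (4)."""
    products = [a * b for a, b in zip(r_u, r_v)]
    norm = math.sqrt(sum(x * x for x in products))
    if norm == 0.0:
        # The two users share no topic with non-zero weight.
        return [0.0] * len(products)
    return [x / norm for x in products]


def activation_probability(r_u, r_v):
    """p_uv from Eq. (5): sum over topics of SIM^z_uv * r_u^z * r_v^z."""
    sim = topical_similarity(r_u, r_v)
    return sum(s * a * b for s, a, b in zip(sim, r_u, r_v))


r_u = [0.7, 0.2, 0.1]  # toy topic distribution of user u
r_v = [0.6, 0.3, 0.1]  # toy topic distribution of user v
p_uv = activation_probability(r_u, r_v)
```

Under this reading, $p_{uv}$ simplifies to $\sqrt{\sum_{z}(r_{u}^{z}r_{v}^{z})^{2}}$, so it stays between 0 and 1 and grows when both users put weight on the same topics.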

By taking topic information into account, the traditional IC model is extended to the Extended Independent Cascade (EIC) model, in which, depending on the application, different probabilities can be assigned to describe the topical influence between $u$ and $v$. The EIC model works like the original IC model: an active node $u$ has one opportunity to activate each inactive neighbor $v$, and $v$ becomes active with probability $p_{uv}$ (defined in Eq. (5)), which indicates the power of $u$ to influence $v$. The process continues until no more activation is possible. If there are $c_{u,v}$ parallel edges between $u$ and $v$, then $u$ activates $v$ with the overall probability $1-(1-p)^{c_{u,v}}$, where $p$ is the activation probability of a single edge.
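The rule for parallel edges is the probability that at least one of several independent attempts succeeds. A small sketch, with a hypothetical function name and toy values, makes the arithmetic explicit:

```python
def combined_activation_probability(p, c):
    """Probability that at least one of c independent attempts with success rate p succeeds."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if c < 0:
        raise ValueError("c must be non-negative")
    # All c attempts fail with probability (1 - p) ** c.
    return 1.0 - (1.0 - p) ** c


two_edges = combined_activation_probability(0.5, 2)
```

With two parallel edges of probability 0.5 each, both attempts fail with probability 0.25, so the overall activation probability is 0.75.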

3.4 User influence measurement

As with the original IC model, finding the most influential nodes under the proposed EIC model is NP-hard, but the problem can be approximated with a heuristic algorithm. Consequently, we propose a novel heuristic strategy based on the shortest path set, which considers both the link structure and the similarity of users’ topics when weighing the influence of a user.

Suppose that $u$ can reach $v$ through the path $u\to u_{1}\to u_{2}\to\cdots\to u_{m}\to v$. Writing $u_{0}=u$ and $u_{m+1}=v$, the propagation probability of the path is the product of the activation probabilities of its edges:

$\displaystyle p(\textit{path})=\prod_{i=0}^{m}p_{u_{i}u_{i+1}}$ (6)

Since all nodes need to be activated along the path, the probability of $u$ activating $v$ along the path is indeed $p(\textit{path})$ . Then the paper utilizes the maximum influence path to evaluate the influence from $u$ to $v$ , trying to approximate the actual expectation of influence.

Definition 3 (Maximum influence path). In a given graph $G$, the maximum influence path from $u$ to $v$ is:

$\displaystyle\textit{path}_{\max}(u\to v)=\arg\max_{\textit{path}\in P}p(\textit{path})$ (7)

where $P$ consists of all possible paths from $u$ to $v$ in the network. Ties are broken in a consistent and predetermined way, so that $\textit{path}_{\max}$ is always unique.

For a pair of nodes $u$ and $v$, the propagation probability of each edge can be converted to a distance weight on that edge, for example $-\log p_{uv}$. Because the logarithm turns the product in Eq. (6) into a sum, the path with the maximum propagation probability is the path with the minimum total distance, so the maximum influence path can be treated as the shortest path. Consequently, for a given node $u$ in the network, the collection of its maximum influence paths to a target seed set can be obtained as follows:
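The reduction to a shortest path problem can be sketched with Dijkstra's algorithm, taking $-\log p_{uv}$ as the distance weight (one common choice). The function name, the nested-dictionary layout of `prob`, and the toy probabilities are assumptions made for illustration, and node labels are assumed to be mutually comparable (for example, all strings).

```python
import heapq
import math


def maximum_influence_path(prob, source, target):
    """Return (probability, path) of the most probable path from source to target.

    prob: dict of dicts, prob[u][v] is the activation probability of edge (u, v).
    Returns (0.0, []) when target cannot be reached.
    """
    dist = {source: 0.0}   # best known sum of -log(p) from source
    parent = {source: None}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, p in prob.get(u, {}).items():
            if p <= 0.0:
                continue  # an edge that can never fire is ignored
            nd = d - math.log(p)  # -log(p) >= 0 because 0 < p <= 1
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return 0.0, []
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return math.exp(-dist[target]), path


toy_prob = {"a": {"b": 0.5, "c": 0.2}, "b": {"c": 0.5}}
best_p, best_path = maximum_influence_path(toy_prob, "a", "c")
```

In the toy graph, the two-hop path through `b` has probability 0.5 × 0.5 = 0.25, which beats the direct edge with probability 0.2, so the longer path is the maximum influence path.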

Definition 4 (Shortest path set). In a given graph $G=(V,E)$ , for a node $u\in V\setminus S$ and a target seed set $S$ , the shortest path set can be defined as:

$\displaystyle\textit{pathSet}_{\textit{shortest}}(u,S)=\{\textit{path}_{\max}(u\to v)\mid v\in S\}$ (8)

The influence of node $u$ on the target seed set $S$ can be approximated by $\sum_{v\in S}\textit{path}_{\max}(u\to v)$, where each term stands for the propagation probability of the corresponding maximum influence path. As a result, given $S$, this approximation can be used to estimate the probability that $u$ is activated.

Thus, user influence can be measured with the TSPS algorithm, given as Algorithm 1.

Algorithm 1: TSPS algorithm: the topic-based heuristic algorithm for finding top-$K$ influential nodes
Input: graph $G=(V,E)$, number $K$, LDA parameters $\alpha$ and $\beta$
Output: seed set $S$
1: $S=\emptyset$
2: // Topic distillation
3: Apply LDA algorithm
4: // IC model extension
5: for each $(u,v)\in E$ do
6:     $P_{uv}=\sum_{z=1}^{T}\textit{SIM}_{uv}^{z}r^{z}_{u}r^{z}_{v}$
7: end for
8: // User influence measurement
9: for each $u\in V\setminus S$ do
10:     compute $\textit{pathSet}_{\textit{shortest}}(u,S)$
11: end for
12: while $|S|<K$ do
13:     $S=S\cup\textit{argmax}_{u\in V\setminus S}(\sum_{v\in S}\textit{path}_{\max}(u\to v))$
14: end while
15: return $S$

The proposed TSPS algorithm runs $K$ rounds, and in each round it searches for the node that maximizes the incremental influence spread. First, we conduct topic distillation based on LDA to automatically recognize the topics that appeal to users (lines 2–3). Then, for each pair of connected nodes, we extend the original IC model by computing the activation probability based on their topical similarity (lines 4–7). Finally, we apply the heuristic strategy based on the shortest path set to measure users’ influence (lines 8–15). Since the EIC model extends the original IC model, the influence maximization objective remains submodular. Thanks to this property, the incremental influence spread of a node does not have to be re-measured in a round if its value from a previous round is already smaller than the value of another node in the current round.
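The saving that submodularity allows can be illustrated with a generic lazy selection loop. This is a sketch of the general lazy-evaluation idea, not the TSPS implementation: the `marginal_gain` callback, the function name `lazy_greedy`, and the toy coverage data are all hypothetical.

```python
import heapq


def lazy_greedy(candidates, marginal_gain, k):
    """Pick up to k seeds, re-evaluating a node only when its cached gain is stale.

    marginal_gain(node, seeds) is assumed to be submodular in seeds, so a gain
    computed in an earlier round is an upper bound on the current gain.
    """
    seeds = []
    # Max-heap via negated gains; each entry records the round it was computed in.
    heap = [(-marginal_gain(u, seeds), i, u, 0) for i, u in enumerate(candidates)]
    heapq.heapify(heap)
    while heap and len(seeds) < k:
        neg_gain, i, u, computed_at = heapq.heappop(heap)
        if computed_at == len(seeds):
            # The gain is current and no cached upper bound exceeds it.
            seeds.append(u)
        else:
            # Stale entry: refresh its gain for the current seed set.
            heapq.heappush(heap, (-marginal_gain(u, seeds), i, u, len(seeds)))
    return seeds


# Toy submodular objective: number of newly covered items.
coverage = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}


def coverage_gain(node, seeds):
    covered = set()
    for s in seeds:
        covered |= coverage[s]
    return len(coverage[node] - covered)


chosen = lazy_greedy(list(coverage), coverage_gain, 2)
```

In this toy run, the nodes `b` and `d` are never re-evaluated in the second round, because their cached gains are already below the refreshed gain of `a`.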

Time complexity. The total cost of the TSPS algorithm is the sum of the costs of the topic distillation stage, the IC model extension stage, and the user influence measurement stage.

During the topic distillation stage, the computation requires $O(n\log n)$ time, where $n$ is the number of nodes. The IC model extension stage takes $O(n)$ time. Finally, the user influence measurement stage selects $k$ influential seeds through the heuristic strategy based on the shortest path set, which takes $O(kn^{\prime}m^{\prime})$ time, where $n^{\prime}$ is the number of candidate nodes and $m^{\prime}$ is the number of candidate edges. In total, TSPS runs in $O(n\log n+n+kn^{\prime}m^{\prime})$ time.

4. Experimental results

Extensive experiments are carried out with the proposed TSPS algorithm and several baseline algorithms on several types of real-world networks. This section illustrates the performance of the proposed algorithm from three perspectives: (a) topic distillation, (b) influence spread compared with other algorithms, and (c) running time compared with other algorithms.

4.1 Experiment setup

Datasets. We conduct simulation experiments on four real-world datasets, chosen because they cover networks with a variety of structural features. The numbers of nodes and edges range from about 6K to 36K and from about 13.9K to 621K, respectively.

•
CSDN. The dataset is crawled from the “CSDN Forum”, the largest IT technology exchange platform in China, over the period from November 2016 to August 2017 (10 months).
•
Twitter. This is a much larger network, taken from one of the most renowned micro-blogging services and provided by Kwak et al. [25] and Yang et al. [26]. The dataset includes user IDs, link information and tweets.
•
Sina Sport Forum. It gathers all the posts and comments in “Sina Sport Forum” from November 2016 to November 2017.
•
Sina Microblog. It includes microblog content, forwarding relations and friend relations, collected in May 2014.

Table 1 depicts the concrete attributes of the above datasets.

Table 1
Concrete attributes of the datasets

Datasets	CSDN	Twitter	Sports Forum	Sina Microblog
Type	Directed, weighted	Directed, weighted	Directed, weighted	Directed, weighted
Node	5736	6301	11475	36117
Edge	13893	20777	45300	621269
Average degree	2.977	6.595	7.895	34.403
Maximal degree	428	97	3200	5987
Connected Component	5703	4234	8574	26208
Weakly Connected Components	33	2068	2487	8993

Influence models. We carry out extensive experiments to compare the related algorithms on the original IC model, where the activation probability $p$ is set uniformly to 0.01. By contrast, the proposed TSPS algorithm uses the probability $p_{uv}$ computed by Eq. (5), which takes topical similarity into account.

Algorithms. We compare the proposed algorithm with both the greedy algorithm and several other heuristic algorithms. The specific algorithms are listed below.

•
TSPS: Our proposed Algorithm 1, described in Section 3.
•
Greedy [6]: The classic greedy algorithm on influence expectation with lazy-forward optimization. We set the activation probability of a node to 1/indegree.
•
PMIA [27]: A heuristic algorithm based on the maximum influence arborescence. When simulating the algorithm, we set $\theta=1/320$.
•
CELF [28]: An optimization of the greedy algorithm that improves its efficiency. We set the activation probability of a node to 1/indegree.
•
Degree Discount [19]: The heuristic method with a propagation probability of $p=0.01$.
•
PageRank [21]: The ranking algorithm widely applied in search engines. We set the restart probability to 0.15.
•
Degree [6]: The degree centrality heuristic, which selects nodes in decreasing order of degree. It is a standard way of picking high-degree nodes in social networks.
•
Random: The baseline algorithm choosing random nodes in the networks.
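The 1/indegree setting used for Greedy and CELF can be sketched as follows, assuming it means that every edge $(u,v)$ receives probability 1/indegree($v$), as in the usual weighted cascade convention. The function name and the edge-list layout are hypothetical.

```python
def indegree_probabilities(edges):
    """Assign to each directed edge (u, v) the probability 1 / indegree(v)."""
    indegree = {}
    for _, v in edges:
        indegree[v] = indegree.get(v, 0) + 1
    return {(u, v): 1.0 / indegree[v] for u, v in edges}


toy_edges = [("a", "c"), ("b", "c"), ("a", "b")]
edge_prob = indegree_probabilities(toy_edges)
```

Under this assignment, the incoming probabilities of every node with at least one in-edge sum to 1, so a node with many followers is harder to activate through any single edge.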

In preliminary experiments, we observed no notable difference between the spread estimates obtained with 10000 simulations and those obtained with more than 10000 simulations. Accordingly, for each seed set, we report the average over 10000 simulation runs.

4.2 Results of topic distillation

Topic distillation. We first evaluate the effect of topic distillation. Following the preceding definitions, we use LDA to cluster and classify posts and comments by their title and content, and thereby form different topics. To keep things simple, the LDA parameters $\alpha$ and $\beta$ are preset to 1 and 0.01, respectively, and the number of topics is set to 50.

The distribution of topics for the datasets is shown in Table 2. Each topic is associated with the 5 keywords that best represent the content of the comments, which provides a meaningful description of the topic. The LDA value of each keyword is given in parentheses.

Table 2
The five topics extracted from the datasets

Topic	CSDN	Twitter	Sports Forum	Sina Microblog
Topic1	reply (0.035), where (0.022), influence (0.014), register (0.096), Quote (0.008)	Follow (0.0281), breve (0.0210), por (0.0202), mayor (0.0192), travel (0.0184)	Team-member (0.142), Security staff (0.078), division (0.077), Ligament (0.075), Competition season (0.071)	Spring (0.0187), Long-johns (0.0183), experts (0.0186), people (0.0158), youthfulness (0.0134)
Topic2	report (0.0252), tweet (0.0174), Senior-leadership (0.0144), Corrupt-officials (0.0084), Pattern (0.0062)	Will (0.0309), great (0.0156), trump (0.0148), people (0.0147), country (0.0105)	Sponsor (0.054), bonus (0.039), manlian (0.038), Penalty Shootout (0.036), Match (0.032)	Korean (0.0143), annoyance (0.0094), experience (0.0047), video (0.0046), Love-drama (0.0033)
Topic3	world (0.0163), embedded-system (0.0075), assistant (0.0062), Java (0.0044), reason (0.0042)	Para (0.0374), con (0.0246), todos (0.0191), Buenos (0.0139), siempre (0.0131)	Premier League (0.1661), webcast (0.1043), Confrontation (0.0831), state (0.0802), remember (0.0765)	You at the same table (0.0033), Regret leaving (0.0033), graduation (0.0032), unfortunately (0.0032), Farewell family (0.0031)
Topic4	share (0.0068), learn (0.0066), support (0.0045), Interface (0.0041), medium (0.0034)	Traffic (0.0332), vote (0.0226), trump (0.0105), see (0.0099), tomorrow (0.0097)	times (0.1312), Milan (0.0224), sports (0.0120), project (0.0119), noontide (0.0107)	Sharing-experience (0.014), Headlines today (0.0057), Beat-an opponent (0.0049), Rockets (0.0023), Take the crown (0.0022)
Topic5	proposal (0.0062), building master (0.0044), Internet (0.0041), network Environment (0.0034), LAN (0.0023)	Pero (0.052), como (0.0283), los (0.0226), la (0.0214), con (0.0208)	Championship Cup (0.0303), Community (0.0174), Physical fitness training (0.0129), Midfield (0.0120), rest (0.0101)	Bowen (0.0209), Chinese football (0.0091), CSL (0.0043), AFC Champions League-news (0.0036), NetEase Sports (0.0034)

As the table shows, LDA obtains meaningful clusters that are consistent with reality, except for the Twitter network. A possible reason is that Twitter limits the number of characters in each post, and processing such short texts is quite different from traditional natural language processing.

Influence spread. We simulate the algorithms on the datasets to obtain influence results under the IC model. To keep the results concise, the seed set size $k$ ranges from 1 to 50.

Figure 3 compares the influence spread of the various algorithms on the four datasets. The comparison illustrates that, in all cases, considering topic information yields better results. As the baseline, Random performs very badly, which means that the initial nodes must be selected carefully. Because the TSPS algorithm takes topic information into account, it outperforms the other algorithms (Greedy, PMIA, CELF, Degree Discount, Degree, PageRank, and Random) by remarkable margins, as expected. More concretely, the TSPS algorithm is at least 89%, 82.9%, 67.6%, and 37.9% better than the other algorithms on the four datasets, respectively. It is worth noticing that the performance of TSPS on Sina Microblog is not as good as on the other datasets. In addition, its curve is nearly horizontal: it reaches about 2816 activated nodes but makes no further progress as $k$ increases. An investigation of this behavior shows that the Sina Microblog dataset is a highly sparse network in which a few influential users account for most of the impact.

Figure 3.

The comparison of influence spread on the four datasets.

Running time. Figure 4 compares the running time of the algorithms on the different datasets. To keep it simple, the reported time is the time for selecting the top-50 initial seed set, i.e., $k=50$. The comparison demonstrates that TSPS has very good scalability and can handle large social networks with satisfactory running time. Random takes the least time to select the initial seed set, although it performs badly in terms of influence spread. The reason is probably that Random does not look for influential nodes at all, so its node selection reaches the termination condition after only a few attempts, which drastically decreases the running time. Apart from Random, TSPS is the fastest algorithm in most cases; only in a minority of cases (mainly on Sina Microblog) do the competing algorithms perform very well. A possible reason is that Sina Microblog, as a loosely structured social network, has relatively many topics to distill, so TSPS spends relatively much time on topic distillation. As expected, Greedy takes the most time to select seed sets. For example, Greedy takes 243.1 seconds on Sina Microblog, while TSPS takes only 15.1 seconds, which is more than 16 times faster. As an optimization of Greedy, CELF has a better running time than Greedy. However, the core idea of CELF is still greedy-based, and TSPS is 5 times faster than CELF. Besides, TSPS runs faster than PMIA, because PMIA spends a lot of time on path selection.

Figure 4.

A comparison of running time on the different datasets.

5. Conclusions

This paper puts forward TSPS, a topic-based heuristic algorithm for maximizing the spread of influence in a social network. Existing work typically sets a uniform probability for influence propagation and does not take topic information into account. In addition, both the IC model and its approximations suffer from long running times. TSPS addresses these issues in three stages. First, it conducts topic distillation based on LDA in order to automatically recognize the topics that appeal to users. Second, it extends the original IC model to incorporate the user preferences obtained in the first stage. Finally, it applies a novel heuristic strategy based on the shortest path set, which considers both the link structure and the similarity of users’ topics when weighing a user’s influence. Experiments on four datasets show that TSPS provides better results in both influence spread and running time.
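The idea of replacing a uniform propagation probability with a topic-dependent one can be illustrated with a small simulation. The following is a minimal sketch and not the paper's actual formulation: it assumes each user has a topic distribution (such as one produced by LDA) and that the activation probability on an edge is a hypothetical `base_prob` scaled by the cosine similarity of the two users' topic vectors. The function names, the similarity measure, and the scaling rule are all assumptions made for illustration, and the shortest path set heuristic itself is not implemented here.

```python
import random


def topic_similarity(p, q):
    """Cosine similarity between two topic distributions (assumed measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = sum(a * a for a in p) ** 0.5
    norm_q = sum(b * b for b in q) ** 0.5
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def simulate_topic_ic(graph, topics, seeds, base_prob=0.1, rng=None):
    """Run one independent cascade with topic-weighted edge probabilities.

    graph:  dict mapping a node to a list of its out-neighbours
    topics: dict mapping a node to its topic distribution (list of floats)
    seeds:  iterable of initially active nodes
    Returns the set of nodes active when the cascade stops.
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = list(active)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, []):
                if v in active:
                    continue
                # Hypothetical rule: uniform base probability scaled by topic similarity.
                p = base_prob * topic_similarity(topics[u], topics[v])
                if rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


def estimate_spread(graph, topics, seeds, runs=1000, base_prob=0.1, seed=0):
    """Monte Carlo estimate of the expected number of activated nodes."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        total += len(simulate_topic_ic(graph, topics, seeds, base_prob, rng))
    return total / runs
```

Under these assumptions, calling `estimate_spread(graph, topics, seeds)` for candidate seed sets of size $k$ gives the kind of influence spread value that is compared across algorithms; edges between users with dissimilar topics contribute little to the spread.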

In the future, we plan to explore further advantages of the TSPS algorithm. For example, we will try to apply the proposal to other models, including the Linear Threshold (LT) model [6] and the Weighted Cascade model, among others. Another direction for future work is to design an adaptive strategy for dynamic social networks, since information propagation involves many aspects of uncertainty.

Acknowledgments

This work is supported by the Natural Science Foundation of China (Nos. 61502281, 71772107).

References

1. Laghari A.A., Karim, Shah H.A. and Karn N.K., Quality of experience assessment of video quality in social clouds, Wireless Communications and Mobile Computing 2017 (2017), 1–10.

2. Laghari A.A., Shafiq and Khan, Assessment of quality of experience (QoE) of image compression in social cloud computing, Multiagent and Grid Systems 14(2) (2018), 125–143.

3. Karim, Mallah G.A., Laghari A.A., Madiha and Larik R.S.A., The Impact of Using Facebook on the Academic Performance of University Students, in: International Conference on Artificial Intelligence and Security, 2019, pp. 405–418.

4. Karim, Laghari A.A., Memon K.A., Khan and Magsi A.H., The evaluation video quality in social clouds, Entertainment Computing 35 (2020), 100370.

5. Hamzehei, Wong R.K., Koutra and Chen, Collaborative topic regression for predicting topic-based social influence, Machine Learning 108(10) (2019), 1831–1850.

6. Kempe, Kleinberg J.M. and Éva Tardos, Maximizing the spread of influence through a social network, Theory of Computing 11(1) (2015), 105–147.

7. Chen Y.-C., Zhu W.-Y., Peng W.-C., Lee W.-C. and Lee S.-Y., CIM: Community-based influence maximization in social networks, ACM Transactions on Intelligent Systems and Technology 5(2) (2014), 25.

8. Yang, Liao, Shen, Cheng and Chen, Relative influence maximization in competitive social networks, Science in China Series F: Information Sciences 60(10) (2017), 108101.

9. Maximize Information Coverage Algorithm for Target Market, Chinese Journal of Computers, 2014.

10. Weng, Lim E.-P., Jiang and He, TwitterRank: finding topic-sensitive influential twitterers, in: Proceedings of the Third ACM International Conference on Web Search and Data Mining, 2010, pp. 261–270.

11. Barbieri, Bonchi and Manco, Topic-aware social influence propagation models, Knowledge and Information Systems 37(3) (2013), 555–584.

12. Zhou, Zhang and Cheng, Preference-based mining of top-K influential nodes in social networks, Future Generation Computer Systems 31 (2014), 40–47.

13. Fang, Sang and Rui, Topic-sensitive influencer mining in interest-based social media networks via hypergraph learning, IEEE Transactions on Multimedia 16(3) (2014), 796–812.

14. Quercia, Askham and Crowcroft, TweetLDA: supervised topic classification and link prediction in Twitter, in: Proceedings of the 4th Annual ACM Web Science Conference, 2012, pp. 247–250.

15. Misra, Cappe and Yvon, Using LDA to detect semantically incoherent documents, in: CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning, 2008, pp. 41–48.

16. Hoffman, Bach F.R. and Blei D.M., Online Learning for Latent Dirichlet Allocation, in: Advances in Neural Information Processing Systems 23, 2010, pp. 856–864.

17. Gong, Wang and Tian, An efficient shortest path approach for social networks based on community structure, CAAI Transactions on Intelligence Technology 1(1) (2016), 114–123.

18. Liu, Jin, Yang and Zhou, Finding Top-k Shortest Paths with Diversity, in: 2018 IEEE 34th International Conference on Data Engineering (ICDE), 2018, pp. 1761–1762.

19. Wang, Jin, Lin, Cheng and Yang, Influence maximization in social networks under an independent cascade-based model, Physica A: Statistical Mechanics and its Applications 444 (2016), 20–34.

20. Wei and Lin, Algorithms of BBS opinion leader mining based on sentiment analysis, in: WISM’10 Proceedings of the 2010 International Conference on Web Information Systems and Mining, 2010, pp. 360–369.

21. Kandiah and Shepelyansky D.L., PageRank model of opinion formation on social networks, Physica A: Statistical Mechanics and its Applications 391(22) (2012), 5779–5793.

22. Lee J.-R. and Chung C.-W., A query approach for influence maximization on specific users in social networks, IEEE Transactions on Knowledge and Data Engineering 27(2) (2015), 340–353.

23. Chen, Wang and Yang, Efficient influence maximization in social networks, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 199–208.

24. Shengnan, Yong, Jinghua and Long, Research on topic-based local influence maximization algorithm in social network, Journal of Frontiers of Computer Science and Technology 10(5) (2016), 646–656.

25. Kwak, Lee, Park and Moon, What is Twitter, a social network or a news media, in: Proceedings of the 19th International Conference on World Wide Web, 2010, pp. 591–600.

26. Yang and Leskovec, Patterns of temporal variation in online media, in: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, 2011, pp. 177–186.

27. Wang, Chen and Wang, Scalable influence maximization for independent cascade model in large-scale social networks, Data Mining and Knowledge Discovery 25(3) (2012), 545–576.

28. Leskovec, Krause, Guestrin, Faloutsos, VanBriesen and Glance, Cost-effective outbreak detection in networks, in: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007, pp. 420–429.

