Credit distribution for influence maximization in online social networks with node features 1

Abstract

Influence maximization is a problem of identifying a small set of highly influential individuals such that obtaining the maximum value of influence spread in social networks. How to evaluate the influence is essential to solve the influence maximization problem. Meanwhile, finding out influence propagation paths is one of key factors in the assessment of influence spread. However, since nodes’ degrees are utilized by most of existent models and algorithms to estimate the activation probabilities on edges, node features are always ignored in the evaluation of influence ability for different users. In this paper, besides the node features, the Credit Distribution (CD) model is extended to incorporate the time-critical aspect of influence in online social networks. After assigning credit along with the action propagation paths, we pick up the node which has maximal marginal gain in each iteration to form the seed set. The experiments we performed on real datasets demonstrate that our approach is efficient and reasonable for identifying seed nodes, and the influence spread prediction by our approach is more accurate than that of original method which disregards node features in the influence evaluation and diffusion process.

Keywords

Online social networks influence evaluation influence maximization credit distribution greedy algorithm

1 Introduction

The developments of Internet not only bring us an unprecedented convenient life-style, but also change our communicating methods. With the increasing number of people making friends and exchanging ideas in online social networks, the structure of our society becomes more complex and diverse than ever before. Different from most traditional media such as newspaper or TV, the process of influence spread in online social networks mainly bases on the fictitious relationship between individuals. Without distance constraints, the information can be propagated extremely rapidly. At the same time, diffusion of influence in online social networks provides a great challenge for viral marketing, which aims at finding a set of individuals in the network to maximize the word-of-mouth propagation.

In general, there are three challenges for successfully viral marketing in online social networks [1 –3]. First, how to evaluate the influence ability of user and what components of it. Second, what the model is and how it reflects the influence diffusion process appropriately. Third, how to design an efficient algorithm to identify a set of seed nodes which have maximal social influence in the target society.

Domingos and Richardson firstly modeled influence maximization as an optimization algorithmic problem [4, 5]. Kempe et al. formulated the problem of selecting the seed set as a discrete optimization problem [6]. Meanwhile, they presented two cascade models i.e. the Independent Cascade (IC) model and the Linear Threshold (LT) model. Specially, they demonstrated that the functions of influence spread under both IC and LT models maintain desired properties such as the monotonicity and the submodularity, which allow a greedy algorithm to approximate the optimal solution with a radio of (1-1/e). However, one of the main constraints of existed models proposed by Kempe et al. and their follow-ups, is that they need running Monte-Carlo (MC) simulations to obtain an accurate estimate of the expected spread. This is a time-consuming process which makes the deployment of these techniques on large-scale social networks infeasible [7]. Therefore, it is desired to overcome this defect and find both efficient and scalable approaches for solving the influence maximization problem [8]. Many effective heuristic algorithms such as LDAG [9] and PMIA [7, 10] have been proposed to achieve high efficiency at the cost of sacrificing the accuracy of consequences. On the other hand, based on users’ action records, a data-based model has been proposed. In particular, it has been proved that the quality of seed nodes identified by Credit Distribution (CD) model could be guaranteed [11].

However, all of these aforementioned methods do not fully incorporate node features that have been well observed in the influence evaluation and diffusion. Motivated by this idea, we propose two concepts of influence, the User Static Influence and the User Dynamic Influence. We also propose a quantifiable method to assess the user static influence which incorporates with node features. To be mentioned, one node’s user static influence is made up of three components. The first one is User Activity which reflects vitality of the node in a period of time. The second is User Sensitivity which describes sensitivity between the user and actions he performed. The last is User Affinity that reflects similarity among different users by counting their common actions. In addition, we also leverage the temporal nature of influence to measure the user dynamic influence. In particular, we propose a new model, namely Credit Distribution with Node Features (CD-NF), and let the credit assigned between two adjacent nodes by user dynamic influence. We also design a heuristic algorithm named Greedy Algorithm with Node Features (GNF) to tackle influence maximization under the CD-NF model. After assigning credit along with the action propagation traces, we select the node which has maximal marginal gain in each iteration to form the seed set. Our experiments demonstrate that the quality of seed nodes identified by our approach is better than the others. Besides, the prediction of influence spread by our approach on real testing set is more accurate than that of original method which disregards node features in the influence evaluation and diffusion process. Figure 1 shows essential steps of our approach.

The rest of this paper is organized as follows. Section 2 briefly summarizes related work. Section 3 describes the definition of problem and the measurement we adopted to evaluate the node influence. Section 4 specifies the model and algorithm we proposed to solve the influence maximization problem. Section 5 presents the experiments and simulations. Finally, Section 6 concludes the paper and discusses the future work. Our paper makes the following contributions.

We present three quantitative measures to evaluate the node features respectively: user activity, user sensitivity and user affinity.

We merge these aspects of node features into the user static influence and use the continuous exponential decay function to transform the strength of user dynamic influence between two adjacent users, which reflects the temporal feature of influence in practice.

We propose the CD-NF model and redefine the credit which is given to individuals for influencing their adjacent neighbors as the user dynamic influence.

We design the GNF algorithm on the established CD-NF model. After implementing the experiments and simulations on Flickr and Reddit respectively, we demonstrate that our approach provides a high quality of seed set and a more accurate prediction of influence spread w.r.t. testing actions.

2 Related work

Domingos and Richardson [4, 5] firstly modeled the Influence Maximization (IM) as an optimization algorithmic problem. Kempe et al. [6] formulated it as a discrete optimization problem and proved that the optimization problem is NP-hard. Moreover, they presented two cascade models, i.e., IC and LT models, and showed that the greedy approximation algorithm whose influence spread is within (1-1/e) of the optimal result [6]. However, since its time complexity is restricted by the number of Monte-Carlo simulations, it is not feasible to be implemented in large-scale social networks. To address this inefficient issue, Leskovec et al. [8] presented a Cost Effective Lazy Forward (CELF) optimization which takes advantage of properties such as the monotonicity and the submodularity to reduce the iteration number of evaluations. Goyal et al. further improved the CELF scheme and named it as CELF++ optimization [12] (exploiting the submodularity to avoid unnecessary recomputation of the marginal gains), but their method still requires a large number of MC simulations to guarantee the quality of seed set.

Chen et al. [13] put forward the MIA model and the PMIA algorithm which use the approximated local tree structures to tackle with the problem of influence maximization. In addition, they proposed the MixGreedy algorithm [7] to compute the marginal gains and utilized the CELF approach [8] for updating. By leveraging the MIA structure to approximate influence propagated results, they developed two heuristic algorithms for influence maximization over IC model [14], but they disregarded users’ characters and actions’ records in online social networks.

D. N. Yang et al. [15] proposed an approach to maximize the probability that the target user would accept an invitation from the initial application individual. Besides, they formulated this new optimization problem and developed a polynomial time algorithm named SITINA to find the optimal solution. They demonstrated that it can effectively maximize the acceptance probability of the target user in online social networks. H. Li et al. [16] proposed a novel conformity-aware cascade model which leverages the interplay between influence and conformity to estimate the influence spread. They also proposed a variant of models which support context-specific influence and conformity of nodes. Based on these models, they proved that the novel greedy algorithm called CINEMA can obtain a high-quality seed set for influence maximization.

H. Zhang et al. [17] proposed a new model for influence maximization called Opinion-based Cascading (OC) model. They maximized the number of users who hold positive opinions instead of the amount of activated nodes for influence maximization. Furthermore, they proved that their model could distinguish the negative and positive opinions. Goyal et al. [11, 18] leveraged historical records to derive a more accurate diffusion model. Based on users’ action records and chronological allocated credit, a data-based approach has been proposed to solve the problem of influence maximization. The efficiency could be significantly improved, but it was not easy to obtain the action records of real users. In particular, they proved that the quality of seed nodes identified by Credit Distribution (CD) model could be guaranteed [11].

3 Problem and node influence definition

A. Problem definition

We first describe the problem of influence maximization, and then show our extensions that incorporate node features. In general, a social network can be modeled as a directed graph G = (V, E, p) with V as node sets, E corresponding to social ties, and edge weights p capturing influence probabilities. S denotes the set of seed nodes in the graph. Intuitively, the result of influence spread function σ (S) is a set of nodes influenced after process of influence diffusion from the seed set S. The influence maximization is the problem of finding a seed set S with k users such that the expected influence spread (defined as the expected number of influenced users) is maximized, i.e. finding $S^{'} = \arg \max_{S \subseteq V} σ (S)$ . In this paper, our target is to maximize the expected value of credit distribution, and the detail we will further discuss in Section 4.

B. Node influence definition

In online social networks, the relationship among users is virtualized. Because of lacking region restrictions, making friends mostly base on factors such as similar preferences or characters. Since people participate in different groups by their respective interests, diverse communities lead to form spontaneously. Besides, the relationship between individuals is fragile and time sensitive. For instance, the strength of credit between two users always decays with time elapsing, and it is not easy for one to persuade the other if both of them are not contacted for a long time. Hence, one common way to solve this problem is utilizing extensively repeated recommendations to persuade audiences, just like TV advertisements, while this method will consume lots of extra resources. Compared with wasting of resources to make blindness influence, the recommendation qualified with guidance will receive much better profits and revenues.

Several possibility can be considered in mapping the weights on nodes to reflect the specificity of different individuals. On one hand, just like different people have different interest-orientations, various people whether to accept the same viewpoint is determined by the attitude in each person’s mind. On the other hand, every individual’s influence ability is also different, for example, the recommendations from friends or relatives, may have much stronger effect than the ones from strangers. This specificity can be traced back to the affinity and confidential relationship between individuals. Therefore, if we consider influence with specificity of individuals, the node feature is an important factor in determining the optimal seed set. By scanning the action-logs and analyzing users’ behaviors, we build and imitate action propagation process through tracking influence dissemination in real online social networks. Now with pre-determined components of node influence, we evaluate the influence ability for every individual by his specificity.

We now describe the definition of user static influence, and formulate its three components: user activity, user sensitivity and user affinity. Then we propose the definition of user dynamic influence. Related important variables in this paper are listed in Table 1.

1) User static influence:

In this paper, we conduct the strategy that the influence ability of user is represented by the node influence we proposed. Note that the original node influence we considered is a special case of user dynamic influence which has not incorporated the time-critical aspect of influence, denoted by user static influence. More importantly, the three components of user static influence we evaluated, i.e., user activity, user sensitivity, user affinity are mutual independent. We elaborate the definition of each component as below.

User Activity: Imaging the scenario, for a node u ∈ G, we assume that its degree is large, while its ability of receiving or forwarding messages is much lower than others in a long period of time, then it is not reasonable that we still regard u as highly influential candidate. Thus if we only consider the node’s degree as the judgement of node influence, there must be some errors in the evaluation of user influence ability, so that the user activity is one of critical aspects for evaluating node influence.

Intuitively, user activity for node u can be represented as the ratio of specific actions he performed with the number of all actions recorded in the action-log. More specifically, we further divide user activity into two ingredients, the user positive activity and the user passive activity. The user positive activity can be used in the mining of opinion leaders who act as the information sources, while the application of mining users who have high values of the user passive activity can help us to find the hub nodes which have huge potential influence ability.

We now consolidate these two categories of user activity to evaluate the integral user activity. After traversing the training action-log once, we get the values of user positive activity $A_{u}^{P}$ and user passivity activity $A_{u}^{N}$ respectively, then we assign parameter λ to positive activity and parameter δ to passivity activity. The user activity can be defined as $act (u) = \frac{λ A_{u}^{P} + δ A_{u}^{N}}{| A |} .$ (1)

Note that when both λ and δ equals to 1, user activity act (u) changes to the original form $a c t^{'} (u) = \frac{| A_{u} |}{| A |}$ . More importantly, we only need one parameter λ to keep balance between those components, thus the user activity can be written as $act (u) = \frac{λ A_{u}^{P} + (1 - λ) A_{u}^{N}}{| A | / 2} .$ (2)

Another form is given by $act (u) = \frac{λ A_{u} + (1 - 2 λ) A_{u}^{N}}{| A | / 2} .$ (3)

Let $A_{u}$ be the number of specific actions which u performed. The range of λ meets [0,1]. When λ = 0.5 the formula equals to the original form act′(u). If u performs the action a in the process of influence propagation, we consider that u has been activated, and it will maintain activity mode until the end of propagation. In addition, we also assume that there is no delayed time between activation and execution, i.e., if u performs an action a, u will put influence on all adjacent neighbors who have not performed the same action.

Average Delayed Time: Since the strength of node influence is not stable and constant, just like hot topics restricted by the time constraints, the influence between two nodes will reduce smoothly with time elapsing, which we called the time-critical feature of influence in online social networks. Hence, we define the average delayed time τ_v,u to measure the execution time span between two adjacent nodes for the same action a. The equation of average delayed time is defined as follows, $τ_{v, u} = \frac{\sum_{a \in A} (t_{u} (a) - t_{v} (a))}{A_{v 2 u}},$ (4) where t_u (a) and t_v (a) are the moments that u and v perform action a respectively. $A_{v 2 u}$ is the number of actions propagated from v to u, and $A$ is the universal set of actions. In addition, for each edge (u, v) ∈ E that v is directly adjacent to u, the relationship between u and v is undirected. More generally, we use another simplified variable to measure the average delayed time for u to all of its adjacent neighbors, denoted by τ_u.

User Sensitivity: There always exist some inherently ties between user and the actions he performed. In practical scenario, given the time stamp that u firstly heard the action a from his arbitrary adjacent neighbor, we can infer that the longer time span from it to his finial execution, the less sensitivity of user u to the action a, i.e., u does not interested in a. Therefore, we propose the user sensitivity to measure the strength of sensitivity by analyzing the appointed user and his performances.

Based on the action-log, we scan the log records to obtain the user actions record matrix (UARM) which denoted by E_n∗q. The element e_u,a in URAM represents u performed the corresponding action a or not. Intuitively, e_u,a = 1 as long as node u executed the action a, or vise versa.

Note that the strength of user sensitivity will be weaken with time elapsing. To take the time-critical feature into consideration, we convert each element in UARM by Logistic Distribution Function, thus the user sensitivity is given by ${sen}_{u} (a) = (1 + \exp (\frac{t_{s}^{0} (a) - t_{u} (a)}{- τ_{u}}))^{- 1},$ (5) where $t_{u}^{0} (a)$ is the moment of initial performer for the action a among u’s adjacent neighbors. t_u (a) is u’s final execution time in terms of the action a. Obviously, the time span $t_{u} a - t_{u}^{0} (a)$ is inverse proportional to sen_u (a). Then we replace the element e_u,a with sen_u (a) and update each element in UARM. The average value of user sensitivity is defined as follows: $\bar{{sen}_{u}} = \frac{\sum_{a \in A} {sen}_{u} (a)}{A_{u}}$ (6)

To be mentioned, the user sensitivity sen_u (a) is defined as the symmetric form of logistic distribution function. Because the functional graph is odd, and it reduces from 1 to 0 smoothly, it is more practical to fit the attenuation trend than the exponential method.

User Affinity: In online social networks, the affinity always exists among users who have similar characters or preferences. In contemporary society, the affinity is largely dependent on real environmental factors such as family, school, workplace, etc., which is largely restricted by geographical limitations. However, the mode of communication in online social networks breaks those constraints and makes the relationship among members more complex and diverse.

For any nodes u and v in graph G, intuitively, their common actions can reflect the similarity of their preferences. Therefore, we analyze their records of performances and measure their affinity by the reciprocal form of Jaccard Distance.

Let $J_{u, v}$ be the Jaccard Similarity Coefficient, i.e., $J_{u, v} = \frac{A_{v & u}}{A_{u | v}}$ , where $A_{u | v}$ is the number of actions either u or v executed, and $A_{v & u}$ is the number of actions performed by both u and v. Similarly, Jaccard Distance is defined as $J_{δ} = 1 - J u, v$ , thus we use the reciprocal of Jaccard Distance to measure the user affinity. The reciprocal of $J_{δ}$ between u and v can be computed as follows: $\frac{1}{J_{δ}} = \frac{1}{1 - J u, v} = \frac{A_{u | v}}{A_{u | v} - A_{v & u}} .$ (7)

Note that the user affinity is inversely proportional to Jaccard distance. Compared with $J_{u, v}$ , $\frac{1}{J_{δ}}$ can enlarge the similarity of performances among adjacent users. To be mentioned, when we calculate the user affinity, we ignore the execution sequence temporarily, which means that $A_{v & u}$ equals to $A_{u & v}$ .

In particular, when $A_{v | u} = A_{v & u}$ , this equation loses its efficacy because the denominator equals to 0. To overcome this defect, we use Laplace Smoothing Method to transform the original formula, so that we add the value of $1 + A_{v | u}$ to denominator, and at the same time, the value of numerator is incremented by 1. By the way, we treat $\frac{A_{v & u}}{A_{v | u}}$ as an prior conditional probability. This leads to the user affinity between v and u is given by

$\begin{matrix} {aff}_{v, u} = \frac{A_{v & u}}{A_{v | u}} \cdot \frac{A_{v | u} + 1}{2 A_{v | u} - A_{v & u} + 1} \\ = \frac{A_{v & u} A_{v | u} + A_{v & u}}{2 A_{v | u}^{2} - A_{v & u} A_{v | u} + A_{v | u}} . \end{matrix}$ (8)

To be mentioned, after eliminating the person who did not perform any actions, the affinity between two adjacent users is corresponding to the proportion of $A_{v & u}$ and $A_{v | u}$ reciprocally. When $mathcal A_{v | u} = A_{v & u} \neq 0$ , the similarity of performances between v and u are extremely the same, so that their affinity is maximum, i.e., aff_v,u = 1. Meanwhile, if $A_{v & u} = 0$ , there is no similarity of behaviors reflected by the training set, so that aff_v,u = 0. For the reason that this method simply utilizes the number of actions instead of the vectors or the elements in E_n∗q, we can achieve the user affinity results rapidly and efficiently. Next, the average value of user affinity is defined as $\bar{{aff}_{u}} = \frac{1}{| N (u) |} \sum_{v \in N (u)} {aff}_{v, u},$ (9) where N (u) denotes the set of adjacent neighbors of u, v ∈ N (u). For the case that if there is a high value of user affinity between a pair of nodes, we can assert that there is a strong similarity of behaviors between them, so as to the probability of their performance consistency. Therefore, we can conclude that the highly value of user affinity indicates the strong credit relationship between the pair of two individuals.

2) Comprehensive influence evaluation: For these components we propose above, we compute the user static influence for node u as logistic function, i.e., $ini (u) = \frac{1}{1 + \exp [9 - 6 (act (u) + \bar{{sen}_{u}} + \bar{{aff}_{u}})]} .$ (10)

Actually, we can compute the value of ini (u) (i.e. the original form of node u’s influence) with great rapidity. Note that the expression $9 - 6 (act (u) + \bar{{sen}_{u}} + \bar{{aff}_{u}})$ is within [-9, 9]. Hence, ini (u) is approximated between 0 and 1, and when $act (u) + \bar{{sen}_{u}} + \bar{{aff}_{u}}$ closes to 3, ini (u) tends to be 1. Next, based on the time-critical feature, we propose the user dynamic influence and use it to redefine the direct credit given to nodes for influencing others.

3) User dynamic influence: Since the node influence is not constant and permanent, just like the attention degree of hot topics in our society, we suppose that the strength of node influence will decay with time elapsing. Given τ_v,u and user static influence ini (u), we use continuous exponential decay function to transform user static influence between two adjacent nodes v and u. the equation is defined as $Ψ_{u \in N_{out} (v)}^{v, u} (a) = ini (v) \cdot \exp (- \frac{(t_{u} (a) - t_{v}^{0} (a))^{2}}{\sqrt{4 π \cdot τ_{v, u}}}) .$ (11)

Here, $Ψ_{u \in N_{out} (v)}^{v, u} (a)$ is defined as the user dynamic influence for a certain action a between v and u. N_out (v) is the set of v’s out-degree neighbors in the network, u ∈ N_out (v). τ_v,u is the average delayed time. $t_{v}^{0} (a)$ is the moment that node v performs the action a. Similarly, t_u (a) is the moment that the node u been influenced and performed the same action. To be mentioned, due to the time delay phenomenon of influence diffusion in social networks, there still exists an uncertain period of time between activation and execution. In the paper. To simplify our model, we temporarily ignore this factor, which means that there is no time delaying to perform actions. Similarly, nodes will put influence on their neighbors who have not performed the same action immediately. In addition, the node influence is shown as the credit given by every user for influencing others in online social networks.

4 The model and algorithm for influence maximization

In this section, we describe our extensions on the CD model, and propose the CD-NF (Credit Distribution with Node Features) model. We also present an algorithm that implements on CD-NF model to identify seed nodes for influence maximization. The influence spread is equivalent to the adoption spread which is the expected number of users who adopt the information [19 –21]. There are numerous approaches for seed nodes identification and influence spread evaluation. Most of them are based on the edge-weighted graph and the greedy algorithm (Algorithm 1) to obtain a stable result, while they

Algorithm 1 Greedy Algorithm

Require:G = (V, E), k, σ.

ensure: seed set S.

1: S← ∅;

2: fori = 1 ← kdo

3: $u \leftarrow {argmax}_{v \in V ∖ S} {σ (S + x) - σ (S)}$ ;

4: S← S ⋃ {x};

5: end for

6: return S;

need to execute expensive Monte-Carlo (MC) simulations to get a accurate evaluation of influence spread. Some other approaches simplify the influence spread function to get heuristics approximately solving the problem of influence maximization. Our work is a step towards building a more realistic model of influence evaluation.

Based on the work of Goyal et al. [11], CD model has been shown as multiple advantages. For example, since it is learning from real records to get action propagation paths, we do not need any MC simulations. Besides, the quality of seed set for influence maximization can be guaranteed. However, it does not concern node features such that the evaluation of the credit is only by node’s degree. Hence, based on the definition of user dynamic influence we present above, we extend CD model and redefined the credit assignment between adjacent nodes. The GNF algorithm (Greedy Algorithm with Node Features) is to invoke the process of credit assignment and select the node which has maximal marginal gain recursively to obtain the seed set S. It is noticing that we evaluate the node features and characters of relationship between users and actions before credit distribution. Our model is comprised of two steps, the calculation of user dynamic influence and the assignment of credit.

A. The credit distribution with node features model

The CD-NF model can be described as follows, when user u executes an action a, we define u should be either initial performer or passivity transmitter. Meanwhile, u will influence on other adjacent users who have not performed the same action a in graph G. Similarly, we define the credit value be given to v for influencing its neighbor u, when u previously performs an action a than his adjacent neighbor v. For some certain pair of nodes u and v, the credit given to node v for influencing u equals to the user dynamic influence we proposed above, i.e., $Ψ_{u \in N_{out} (v)}^{v, u} (a) = γ_{v, u} (a)$ , v ∈ N_in (u). In addition, we define the credit assignment by mode of cascade propagation, thus not only u’s adjacent neighbors but also their predecessors should be given credit to influence u. As for any pairs of nodes v and u, the total credit given to v for influencing u is defined as $Γ_{v, u} (a) = \sum_{w \in N_{in} (u)} Γ_{v, w} (a) \cdot γ_{w, u} (a) .$ (12)

Let w be the in-neighbor of u, and γ_w,u (a) is the direct credit given to w for influencing u on the action a, the total credit for a set of nodes S to influence u can be given by $Γ_{S, u} (a) = \sum_{w \in N_{in} (u) \cap w \notin S} Γ_{S, w} (a) \cdot γ_{w, u} (a) .$ (13)

To be mentioned, the CD-NF model is not just like the IC or LT model. On the contrary, it traces the action propagation paths and assigns the credit by learning records of action-log. More importantly, we design a new algorithm to identify seed nodes and predict the influence spread for appointed actions in online social networks. Specially, this algorithm is efficient because it obviates the need of running expensive MC simulations for the purpose of estimating influence spread.

B. The marginal gain of nodes

The accumulated credit indicates the amount of influence the node has. We use σ_cd-nf (S + x) to represent the influence spread function, which equals to the total credit given to current seed set S for influencing others in the networks, i.e. $\sum_{u \in V} \frac{1}{A_{u}} \sum_{a \in A_{u}} Γ_{S, u} (a)$ . The marginal gain of node x on all actions [11] is given by

$\begin{matrix} σ_{cd - nf} (S + x) - σ_{cd - nf} (S) \\ = \sum_{a \in A_{u}} {(1 - Γ_{S, x} (a)) \cdot \sum_{u \in V} (\frac{1}{A_{u}} Γ_{x, u}^{V - S} (a))} \end{matrix}$ (14) where $Γ_{x, u}^{V - S} (a)$ is defined as the credit given to x for influencing u on the action a among node set V - S. Based on this equation, if we want to calculate the marginal gain of x, we only need to calculate $\sum_{u \in V} (\frac{1}{A_{u}} Γ_{v, u}^{V - S} (a))$ , i.e., the total credit given to v for influencing others except the current seed set S, and Γ_S,x (a), i.e. the total credit given to current seed set S in terms of the action a.

Algorithm 1 Greedy Algorithm with Node Features

Require:G, k, λ

Ensure: seed set S.

1: /*initialization*/

2: set S← ∅ , Q ← ∅

3: for each u ∈ Vdo

4: generate $A_{u}, A_{u}^{N}, t_{v} (a), t_{u} (a), A_{v 2 u}, t_{S}^{0} (a), A_{u | v}, A_{v & u}$ , τ_v,u, τ_u, N (u);

5: end for

6: /*main loop*/

7: for eachu ∈ Vdo

8: compute user activity act (u), average user sensitivity $\bar{{sen}_{u}}$ , and average user affinity $\bar{{aff}_{u}}$ ;

9: generate user dynamic influence $Ψ_{u \in N_{out} (v)}^{v, u} (a)$ ;

10: computer marginal gain (σ_cd-nf (S + x) - σ_cd-nf (S)) of u, set u . it = 0, push u into queue Q in decreasing order;

11: end for

12: fori = 1 to kdo

13: x = pop(Q);

14: if i = 1 to k

15: S = S⋃ {x};

16: update credit distribution of V - S, resort Q;

17: else

18: re-compute marginal gain (σ_cd-nf (S + x) - σ_cd-nf (S)) of x;

19: x . it = |S|, insert x into Q, resort Q;

20: end if

21: end for

22: return S;

C. The greedy algorithm with node features

To deal with the problem of influence maximization, many efficient heuristic algorithms have been proposed, while they always sacrifice the accuracy of consequences to guarantee the efficiency of executions. In this paper, we present the Greedy Algorithm with Node Features (GNF) to incorporate CD-NF model. Obviously, similar to the influence spread function which has been demonstrated as monotone and sub-modular in [11], it is easy to demonstrate that our approach can also provide a (1-1/e) approximation to the optimal solution. Meanwhile, our algorithm ignores the need of MC simulations, so that the calculation of marginal gain for each node is efficient. We select the node which has maximum marginal gain in each iteration to form the seed set for influence maximization problem. Algorithm 2 shows our full algorithm of selecting k seed nodes. Line 2–11:endinitial are the preparation phase, and each component of node influence is calculated in this phase. Next, we use queue Q to maintain the value of marginal gain for each node u ∈ V, which is the average incremental influence of u to all actions if u is selected as the seed node. The parameter λ is initialized to 0.5 for keeping balance between $A_{u}^{P}$ and $A_{u}^{N}$ . The main loop in lines 12–21 is the greedy approach with the CELF optimization to select k seed nodes. In each iteration, the top element x of Q is analyzed, if x is analyzed before in the current iteration (x . it = |S|), it can be treated as the next seed node which has the highest marginal gain with respect to the current seed set S. After selecting the node x (line 15), we need to update the credit distribution for nodes related in V - S, then the step can jump to the next iteration and identify a new influential node. However, if x . it < |S|, we need to re-compute the marginal gain of x with respect to S, then let x . it = |S| and reinsert x into Q (line 19).

5 Experimental evaluation

We conduct our experiments on real online social networks to test and verify the performances of CD-NF model and GNF algorithm. More specifically, we compare our approach with different models and algorithms, and focus on validating the following aspects: (i) The influence spread of seed set identified by our approach compared with other solutions; (ii) the running time of our approach compared with other techniques; (iii) The accuracy of the predicted influence spread compared with real values w.r.t. specific actions in testing set.

All experiments were conducted on an Quad-Core Intel(R) Xeon(R) CPU E5506 @ 2.13GHz machine with 8G RAM running Ubuntu 12.04.5. The algorithms were implemented in C++.

A. Dataset describe

We first conduct our experiments in real dataset from Flickr [22]. As Table 2 shown, it contains 105938 images from four images resources. We select one of them as the original material which contains 2602 authors, 222292 links and 24648 images. Since the credit assignment should comply with the temporal nature and topological structure, we dispose the original dateset and generate two files: the graph file and the action-log. Intuitively, the graph file reflects the relationship among image authors, and the action-log contains the user’s actions in chronological order. Next, we divide the action-log’s records into two parts. The training set is made up of 2724 kinds of actions, and the testing set contains 1816 kinds of actions. The total number of records in action-log is 75269.

In contrast, we implement the same experiments in Reddit [23], which is a collection of 132308 image submissions from reddit.com. For each submission, we collect the labels (author, time, image etc.) as original resources. The relational network contains 58428 nodes and 71096 edges. In addition, we treat each image as an individual action, after analyzing the whole submission records, we generate the action-log which includes 115775 items of records. We also separate those records into two parts. The training part consists of 10041 kinds of actions, and the testing part is composed of 6695 kinds of actions.

B. Methods compared

CD model: The traditional CD model is still defined as [11].

IC model: The edge probabilities can be learnt from the training set by means of the EM method [24, 25]. Besides, we run 10000 MC simulations to make the experiments.

LT model: We define the weight on edges as $p_{u, v} = \frac{A_{v 2 u}}{N_{in} (u)}$ where $A_{v 2 u}$ is the number of actions propagated from v to u in the training set and N_in (u) is the number of neighbors who have an arc pointto u.

C. Experimental results

We perform our GNF algorithm on the CD-NF model. Compared with three different models (the order is CD, IC, LT) and algorithms on the same training dataset, we want to explore the performances of our approach.

1) Quality of seed sets: First of all, we structure the CD-NF model and utilize the GNF algorithm to identify the seed set for influence maximization. Meanwhile, we establish and select the same amount of seed nodes under the CD, IC and LT models respectively. After the seed sets obtained by these four kinds of approaches, we compare the influence spread separately to verify their qualities. Both left subgraphs of Fig. 2 illustrates the influence spread of seed nodes picked by GNF algorithm is larger than those running on IC and LT models. Specially, it is also larger than the performance of seed nodes identified by the algorithm running on CD model. For instance, when k = 4, the influence spread of the seed set picked by GNF algorithm in Reddit is 490.29, while the influence spread performed by seed set obtained from other three models (CD, IC, LT) is 449.28, 389.71 and 436.21. Similarly, when k = 50, the results of those approaches implemented in Flickr are 102.68, 91.63, 84.91 and 78.70. In order to facilitate observation, the comparable consequences of influence spread under CD-NF and CD models are shown as the right subgraphs of Fig. 2. Obviously, based on the curves reflected in these two subgraphs, we find that the gaps between those two approaches are increasing, which mean that the superiority of our approach is increasing with the growth of seed nodes’ quantity. This is not only attribute to the real action-log of information resources we are learning from, but also owing to the objective and reliable measurement of node influence associated with node features. Therefore, it can be conclude that our approach has an improved capability in identifying seed nodes and describing the influence spread in online social networks.

2) Running time:Figure 3(a) illustrates the running time (in minutes) of identifying seed nodes under different models in Flickr. It can be seen that both GNF algorithm on CD-NF model and CELF algorithm on CD model are several orders of magnitude faster than the ones on IC and LT models. For instance, to select 50 seed nodes on Flickr training set, the greedy algorithm takes 30.53 and 168.35 minutes under IC and LT models respectively, while our algorithm takes only 1.89 minutes (against CD model of 1.76 minutes). After shrinking the scope of y-axis, the expanded detail of comparable results under CD-NF and CD models are shown as the right subgraph in Fig. 2. Obviously, although there exists a tiny gap between the curves of identifying seed sets under CD-NF and CD model correspondingly, the linear growth tendency of time consuming demonstrates that our GNF algorithm is much efficient than other works of exponential growth running on IC and LT models.

Similarly, Fig. 3(b) shows the time usage of identifying k seed nodes in Reddit w.r.t. relatively algorithms. It should be noticed that the time spent on GNF algorithm under CD-NF model is tiny bigger than selecting the same size of seed nodes on CD model, but significantly less than the works running on IC and LT models, which stems from the same reasons that the additional calculation on node features and the elimination of Monte-Carlo simulations. For example, it takes 14.57 (mins) and 11.98 (mins) to select 30 seed nodes under CD-NF and CD model respectively, while 2382.90 (mins) incurred under LT models (for the experiment under IC model running for 10 days only picked 4 seed nodes). In addition, the IC model improves the efficiency by adopting the CELF optimized strategy, while the growth trend of time consuming by algorithm running on LT models is exponential. It can be seen that the time usage of our approach is linearly proportional to the number of seed nodes we target. In general, the GNF algorithm running on CD-NF model has many advantages in the efficiency of identifying seed nodes.

3) The accuracy of spread prediction for testing set: With the intensified requirements of behavior-pattern analysis in viral marketing of online social networks, how to find an efficient approach to predict influence diffusion for specific information has become an valuable problem to deal with. To have a better understanding about the performance of our approach, we perform simulations to predict the influence spread on CD and CD-NF models in terms of each action recorded in testing set of Flickr. By scanning the testing part of action-log, we record the first performer for each action and treat him as an initiator to trigger the progress of influence spread in the whole network. After obtaining the results of simulations separately, we sort the predicted values in increasing order and add real values as comparison. As Fig. 4 shown, there are 1816 kinds of actions totally in testing set. The true value with No. 605 action is 8, by contrast, the prediction values on CD-NF and CD models are 5.52 and 4.91 correspondingly. The situation also suitable for the No. 1657 action, the real value and the predicted ones are 45, 36.26 and 33.79. Thus it can be seen that except for some special samples, the predicted outcomes under both CD-NF and CD models are less than the real values. However, the predicted results by our approach are more close to the real ones, which means the prediction of our approach is more accurate.

6 Conclusion

In this work, we extend the traditional Credit Distribution model and propose an approach that incorporates node features and temporal nature of influence for influence maximization. More specifically, we describe node features from three different aspects and combine those components into user static influence for evaluating the original node influence. Furthermore, we first adopt the user dynamic influence to improve the credit assignment among adjacent nodes. After learning from action-log and relational network, we track action propagation paths and assign credit in opposite direction for each action listed in training set. Then, based on the GNF algorithm, we calculate the average marginal gain for each node and identify the node which has maximal marginal gain to form the seed set S recursively. It can be proved that the CD-NF model can guarantee the seed set quality compared with the optimal result, which means our algorithm is scalable for influence maximization. Meanwhile, by learning from the real records on action-log, we need not to perform any Monte-Carlo simulations to get the accurate evaluation of influence spread. This is one of the key reasons that our approach is more efficient and reliable than others in identifying seed nodes and predicting influence diffusion for influence maximization.

There are many places to further explore the approach we proposed. For example, we believe that the time constraint can be added to the measurement of influence spread. Besides, as for an arbitrary node u in the graph G, u should contain the same interest to similar actions, how to utilize the distribution of user’s actions to evaluate the influence is another issue to deal with. Meanwhile, by learning from the action tags, we can distinguish the categories of actions accurately and veritably. For future work, we plan to further study the topic level influence mining in other social networks and design more reasonable and efficient algorithms for influence maximization.

References

Cheng

, Sun

and Zeng

, Information Overload and Viral Marketing: Countermeasures and Strategies, in Advances in Social Computing, 2010, pp. 108–117.

Even-Dar

and Shapira

, A note on maximizing the spread of influence in social networks, Internet and Network Economics, 2007, pp. 281–286.

Deng

X.H.

, Liu

and Chen

Z.G.

, Memory-based evolutionary game on small-world network with tunable heterogeneity, Physica A Statistical Mechanics and Its Applications389 (2010), 5173–5181.

Domingos

and Richardson

, Mining the network value of customers, Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, United States, 2001, pp. 57–66.

Richardson

and Domingos

, Mining knowledge-sharing sites for viral marketing, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alta, Canada, 2002, pp. 61–70.

Kempe

, Kleinberg

and Tardos

, Maximizing the spread of influence through a social network, Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, United States, 2003, pp. 137–146 .

Chen

, Wang

and Yang

, Efficient influence maximization in social networks, Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, France, Paris, 2009, pp. 199–207 .

Leskovec

, Krause

, Guestrin

, Faloutsos

, Vanbriesen

and Glance

, Cost-effective outbreak detection in networks, Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, United States, 2007, 420–429.

Chen

, Yuan

, Zhang

, Scalable Influence Maximization in Social Networks under the Linear Threshold Model, Data Mining (ICDM), 2010 IEEE 10th International Conference, Los Alamitos, CA, United States, 2010, 88–89.

10.

Chen

, Collins

, Cummings

, Ke

, Liu

, Rincon

, Sun

, Wang

, Wei

and Yuan

, Influence maximization in social networks when negative opinions may emerge and propagate, SDM,Mesa,AZ,United states, vol. 112011, 379–390.

11.

Goyal

, Bonchi

and Lakshmanan

L.V.S.

, A data-based approach to social influence maximization, in Proceedings of the VLDB Endowment5, 2011, pp. 73–84.

12.

Goyal

, Lu

and Lakshmanan

L.V.S.

, CELF++: Optimizing the greedy algorithm for influence maximization in social networks, Proceedings of the 20th International Conference Companion on World Wide Web, Hyderabad, India, 2011, pp. 47–48.

13.

Wang

, Chen

and Wang

, Scalable influence maximization for independent cascade model in large-scale social networks, Data Mining and Knowledge Discovery25 (2012), 545–576.

14.

Chen

, Lu

and Zhang

, Time-Critical Influence Maximization in Social Networks with Time-Delayed Diffusion Process, CoRR, vol. abs/1204.3074, 2012.

15.

Yang

D.N.

, Hung

H.J.

, Lee

W.C.

and Chen

, Maximizing Acceptance Probability for Active Friending in On-Line Social Networks, in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, pp. 713–721.

16.

, Bhowmick

S.S.

, Sun

and Cui

, Conformity-aware influence maximization in online social networks, VLDB Journal24 (2015), 117–141.

17.

Zhang

, Dinh

T.N.

and Thai

M.T.

, Maximizing the Spread of Positive Influence in Online Social Networks, Distributed Computing Systems (ICDCS), 2013 IEEE 33rd International Conference, 2013 IEEE, 2013, pp. 317–326. IEEE.

18.

Goyal

, Bonchi

and Lakshmanan

L.V.S.

, Learning influence probabilities in social networks, in Proceedings of the Third ACM International Conference on Web Search and Data Mining (2010), 241–250.

19.

Zhou

, Jiao

and Lei

, A linear threshold-hurdle model for product adoption prediction incorporating social network effects, Information Sciences307 (2015), 95–109.

20.

Liu

, Xiang

, Chen

, Xiong

, Tang

and X.

, Yu, Influence Maximization over Large-Scale Social Networks: A Bounded Linear Approach, in Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, 2014, pp. 171–180.

21.

, Zhou

, Tang

, Tian

, Al-Dhelaan

, Al-Rodhaan

and Lee

, Social network and tag sources based augmenting collaborative recommender system, IEICE Transactions on Information and Systems98 (2015), 902–910.

22.

McAuley

and Leskovec

, Image Labeling on a Network: Using Social-Network Metadata for Image Classification, ECCV, 2012.

23.

Lakkaraju

, McAuley

J.J.

and Leskovec

, What’s in a name? Understanding the interplay between titles, content, and communities in social media, ICWSM2013.

24.

Saito

, Nakano

and Kimura

, Prediction of information diffusion probabilities for independent cascade model, Knowledge-Based Intelligent Information and Engineering Systems, 2008, pp. 67–75.

25.

Wang

, Chen

, Li

, Deng

X.H.

and Li

, Memory does not necessarily promote cooperation in dilemma games, Physica A Statistical Mechanics and Its Applications395 (2014), 218–227.

26.

Deng

X.H.

, Pan

, Wu

and Gui

J.S.

, Credit Distribution and Influence Maximization in Online Social Networks Using Node Features, in Proceedings of the 12th International Conference on Fuzzy Systems and Knowledge Discovery, Zhangjiajie, China, 2015, 2137–2144.