Finding reinforced structural hole spanners in social networks via node embedding

Abstract

Identifying structural hole spanners that benefit from acting as bridges between communities is a core study in social network analysis. Existing methods for identification mainly focus on measuring the ability of users to control information propagation by bridging holes, while ignoring the impact of reinforcement of the holes themselves on the benefits of bridging spanners. A recent sociological study shows that the more reinforced a hole is, the more likely it is to bring high benefits to its spanners. In this paper, we propose a node embedding-based method ReHSe for identifying reinforced structural hole spanners in social networks. Specifically, an integrated embedding method is devised to extract features encoding reinforcement properties of nodes into a low-dimensional space. Further, to improve the robustness and accuracy of identification, an incremental learning strategy based on a reserved set is employed to train a scoring network in this subspace, to find top- $k$ reinforced hole spanners. Extensive experimental results show that the performance of hole spanners identified by the proposed method outperforms several existing methods.

Keywords

Reinforced structural hole spanners, social networks, node embedding, incremental learning

1. Introduction

Nowadays, social networks (e.g., Twitter, Weibo, and TikTok) have completely changed our lifestyle and become our main channels for accessing information and establishing contacts. A pervasive and important concept in social network analysis is the structural hole theory proposed by the leading sociologist Burt [2]. Intuitively, a structural hole is a gap between disconnected communities that limits interactions and communications between the communities. Individuals who bridge the gaps can take advantage of the limitations to gain benefits, and such individuals are referred to as structural hole spanners [22].

It is of great value to identify structural hole spanners in social networks. For example, serving as bridges, structural hole spanners can more easily manipulate the information flow across communities in social networks [14, 16, 22, 37]. Finding a few important structural hole spanners can, on the one hand, help information spread faster to different communities, and on the other hand, dramatically reduce the spread of rumors by filtering the misinformation passing through the hole spanners, thus avoiding the harm and panic caused by the shaped public opinion [31, 39]. In addition, identifying employees who occupy bridging positions facilitates innovation in the enterprise, because these employees have more opportunities to develop innovative ideas by merging diverse and valuable information from the non-redundant communities (i.e., companies or departments) they bridge, thereby creating more value [3, 5, 23]. Other valuable applications include community detection [22], link prediction [13], and node or edge classification [9, 29].

Considerable effort has been devoted to identifying structural hole spanners [3, 16, 17, 20, 22, 29, 36, 37]. Most traditional studies detect hole spanners using the network topology directly or by analyzing the information flow. Burt [3] considered that the number of redundant contacts of a user determines the constraints imposed on him/her by a network, and higher constraints imply fewer opportunities to span structural holes, where redundant contacts are contacts of a user that are connected to each other. Lou and Tang [22] identified important hole spanners based on the two-step information flow theory, which suggests that nodes connecting with opinion leaders in different communities are more likely to act as hole spanners. He et al. [16] devised a harmonic modularity method to select the nodes with neighbors belonging to more different communities as the structural hole spanners.

Due to the effectiveness of machine learning in network analysis, some identification methods based on this have been proposed, which are more expressive and can mine more potential information than traditional methods. Integrating the idea of algorithm [16], Luo et al. [20] designed a graph convolutional network model based on an auto-encoder to simultaneously detect communities and hole spanners. Jiang et al. [17] adopted a non-backtracking strategy to embed the graph into a low-dimensional space, and then distinguished hole spanners in this space by evaluating the topological importance of nodes.

1.1 Motivations

Although the hole spanners identified by the existing methods yield good results in different applications, they all fail to capture an important property: the influence of a structural hole itself on its spanner. Burt [4] found that the more reinforced a structural hole is, the greater the difficulty in bridging it, but the more likely a successful bridge is to bring a high profit to the user acting as the bridge. Here, a structural hole being reinforced implies the existence of strong internal cohesion within communities on each side of the hole, and we refer to the user bridging the hole as a reinforced hole spanner.

Figure 1.

An illustration of reinforced structural hole spanners, where there are six communities $C_{1},\ldots,C_{6}$ , and nodes $A$ and $B$ are the hole spanners with different degrees of reinforcement between communities.

We use Fig. 1 as an example to explain why a more reinforced hole spanner benefits more. A network $G$ contains six communities and two hole spanners. User $A$ spans the structural hole formed by communities $C_{1}$ , $C_{2}$ , and $C_{3}$ , and user $B$ spans the structural hole formed by communities $C_{3}$ , $C_{4}$ , $C_{5}$ , and $C_{6}$ . Existing studies may regard user $B$ as a more important spanner than user $A$ , because $B$ bridges more communities and lies on a larger number of shortest paths. However, the members within each of the communities $C_{4}$ , $C_{5}$ , and $C_{6}$ are loosely connected with each other. The sparse ties reduce the efficiency of collaboration within the community, and the less frequent information exchange increases the probability of disagreement among members [3], so that the information propagated by user $B$ has difficulty in gaining unanimous trust from members of the community.

In contrast, each neighbor of user $A$ is embedded in a more tightly connected community, so the hole spanned by $A$ is deepened. Information within communities (i.e., $C_{1}$ , $C_{2}$ , and $C_{3}$ ) on either side of user $A$ is more likely to be homogeneous and redundant, while the disconnection between communities enables different communities to operate with diverse ideas. Therefore, although user $A$ bridges only three communities, fewer than the four communities spanned by user $B$ , information within a community propagated by user $A$ is more likely to be innovative for other communities and thus adopted by members in the communities. Also, the successful bridge provides user $A$ with the opportunity to innovate by synthesizing ideas from both sides or to learn ideas and techniques from one side to solve the difficulties faced by the other side [3, 4]. As a result, user $A$ with access to a more reinforced structural hole can benefit more.

Motivated by the above considerations, in this paper, we combine this new research progress in sociology, i.e., the concept of reinforced structural holes, to detect hole spanners by capturing the influence of the properties of holes themselves on the benefits of their spanners, instead of relying on measuring the bridging ability of the users themselves in a network, as in existing studies.

Therefore, in this paper, we focus on the problem of identifying reinforced structural hole spanners in social networks. The primary challenge of the problem is to find a way to represent the extent to which each structural hole spanner is reinforced. With the capability of the graph embedding in capturing structural features, we consider generating vector representations to preserve the structural information of nodes as well as the connections within surrounding communities, thus capturing the reinforcement properties of each node so that it can be easily employed by the downstream identification task.

1.2 Contributions

To find the important structural hole spanners in social networks that not only have neighbors in many different communities, but also access reinforced structural holes, in this paper, we devise a novel method ReHSe, identifying top- $k$ REinforced structural Hole Spanners based on node Embedding, which considers only the topological structure of a network for generality reasons. Our main contributions are summarized as follows:

• This is the first attempt to identify top- $k$ structural hole spanners using node embedding from the perspective of structural holes being reinforced.
• We present an integrated embedding method that utilizes first-order proximity and $r$ -order reinforcement to characterize the reinforcement properties of each node in a social network, where $r$ can be flexibly tuned to allow the method to focus on local or global structure. We then detect hole spanners by learning a score for each node in the embedding subspace.
• To improve the accuracy and robustness of the model, we employ a reserved set-based incremental learning strategy for training, which combines the old and new samples to continuously optimize the feature representation space of nodes.
• Finally, we conduct extensive experiments on real-world datasets. The experimental results demonstrate that the reinforced score of the top- $k$ hole spanners identified by the proposed method is about 17% higher than that of the spanners identified by existing methods, and that these spanners have a stronger ability to spread information across communities.

The rest of the paper is organized as follows. Section 2 reviews related works. Section 3 presents key definitions used in this paper and formally defines the problem. Section 4 proposes a method for identifying reinforced hole spanners. Section 5 evaluates the performance of the proposed method and other benchmark methods through extensive experiments, and analyzes the parameters by ablation experiments. Section 6 concludes the paper.

2. Related work

This paper is mainly related to node embedding and structural hole spanner identification. In this section, we briefly discuss both.

2.1 Node embedding

The goal of node embedding is to map high-dimensional and intractable nodes from an original graph to low-dimensional vectors in an embedding space, while encoding the connection information between nodes, such that the low-dimensional embeddings can be conveniently applied to various graph analysis tasks, such as node classification, clustering, link prediction, and recommendation.

Depending on how the existing methods model the defined “close” (i.e., node/edge/subgraph similarities), they can be classified into three types [35]. The first type is the matrix decomposition-based method, which expresses the close relationship between nodes as a matrix, and generates the embeddings by decomposing the matrix. The typical approaches either apply Laplacian eigenmaps [1, 30], which model first-order proximity, to learn node embeddings, or decompose the proximity matrix [38, 40].

The second type is the random walk-based method, which attempts to capture high-order proximity between nodes. Different from the above methods using deterministic measures of nodes proximity, these methods describe the co-occurrence relationships between nodes with a stochastic strategy. Perozzi et al. [25] pioneered the application of the famous neural language model Skip-Gram [24] into network analysis, representing the high-order neighborhood structure of nodes through a truncated random walk, so that node similarity and community information can be provided for downstream tasks. Moreover, a number of studies inspired by this have been proposed [11, 12, 33, 41], either using modified random walk strategies to sample paths or applying Skip-Gram with different optimization ways (e.g., Skip-Gram with negative sampling, heterogeneous Skip-Gram) on the generated paths to encode relationships between nodes. For example, in order to address the problem that method [25] can only capture the local structural properties of networks, Grover et al. [12] generated sampling paths utilizing a biased random walk, which controls the model by adjusting two hyper-parameters to approximate the depth-first search that captures global structural information, or the breadth-first search that captures local structural information, resulting in higher quality and more comprehensive embeddings. Dong et al. [11] devised a node embedding framework based on observations of the heterogeneity of node and edge types in most social networks. They first captured the relationships between different types of nodes using meta-paths-based random walk and employed a heterogeneous Skip-Gram on sampled paths to obtain node embeddings in heterogeneous information networks.

With the success of deep learning in various research fields (e.g., natural language processing and computer vision), several deep learning-based methods to encode proximity between nodes have been proposed [8, 21, 32]. For example, Wang et al. [32] developed an embedding method to capture highly nonlinear network structures that adopts a deep auto-encoder to preserve the first- and second-order similarities between nodes. Cao et al. [8] first obtained node co-occurrence information by random surfing, which avoids the limitation of fixed-length paths in traditional sampling, and then input the transformed positive pointwise mutual information matrix into an auto-encoder to extract node features.

The aforementioned methods aim to capture local neighbor similarity or higher-order global neighborhood information. However, they cannot be directly applied to encode the reinforcement properties of nodes. Therefore, to identify reinforced hole spanners, we propose an integrated embedding method for extracting the structural features of nodes and the connections between members within the communities they bridge.

2.2 Structural hole spanner identification

The identification of structural hole spanners has attracted a lot of attention from researchers due to its wide applications in various fields, such as the promotion and obstruction of information diffusion in social networks, enterprise settings, community detection, and node classification. However, finding these hole spanners remains a challenging task.

Most of the existing studies identify the top- $k$ hole spanners by measuring the ability of each node itself to bridge structural holes in the network. Burt [2] proposed a set of measures based on the ego network, with two key metrics both focusing on the concept of redundancy, where a user has redundant connections if his/her neighbors are also connected to each other. One is called network constraint, which measures the extent to which a node has access to redundant contacts. Higher constraint scores indicate fewer opportunities to access favorable resources, so the top- $k$ hole spanners identified are the $k$ nodes with the lowest scores. The other is effective size, which measures the number of non-redundant contacts of each node. Nodes with higher effective size scores have more opportunities for information and control advantages, thereby acting as important hole spanners. Goyal et al. [15] ranked nodes according to their betweenness centrality, that is, they measured how often a node lies on the shortest paths between all pairs of nodes. The time complexity is, however, unsatisfactory. Tang et al. [29] improved it by counting only the number of two-step shortest paths on which a node lies, extending its applicability to large graphs. Lou et al. [22] argued that the wide spread of information in a network depends largely on the delivery between opinion leaders within communities and hole spanners across communities. Therefore, they selected the $k$ nodes that have connections with more opinion leaders in different communities as top- $k$ hole spanners. The prerequisite for the execution of the method is a well-suited community detection algorithm. He et al. [16] devised a harmonic function, which can be used to learn the community indicators of nodes and simultaneously help to select the top- $k$ hole spanners connected to more different communities. However, the method requires the ground-truth community labels and is not applicable to large-scale networks. Xu et al. [36] observed that the top- $k$ spanners are the nodes whose removal from the network leads to the maximum increase in the average shortest-path distance between all pairs of nodes. They then proposed two novel yet scalable algorithms for this task. Later, Xu et al. [37] suggested that important hole spanners should not only build relationships with multiple communities, but also maintain frequent and close ties with these communities. Therefore, they measured the bridging capability of each node by calculating the amount of information diffusion blocked after removing the node from the network, and the top- $k$ spanners are the nodes that block the largest amount of information diffusion.

The advantages of machine learning in performance improvement enable its application to network analysis. Compared with traditional methods, machine learning-based methods are more expressive and can capture more potential information. Therefore, a few studies that identify hole spanners on this basis have been presented recently. Jiang et al. [17] utilized a non-backtracking random walk strategy to capture the neighborhood information of nodes, and factorized the combinatorial Laplacian matrix of an oriented line graph to obtain node embeddings. They then detected communities and structural hole spanners in this low-dimensional space. Luo et al. [20] designed a deep learning framework based on an auto-encoder to implement the idea of the algorithm proposed in [16] and improve its performance for joint detection of communities and hole spanners under specific scenarios.

Unlike the aforementioned methods, we identify top- $k$ hole spanners by capturing the extent to which the structural holes themselves are reinforced.

3. Preliminaries

Let $G=(V,E)$ be an undirected unweighted social network, where $V$ is the set of nodes $v_{i}$ , and $E$ is the set of edges $e_{ij}$ , representing the users and the interactions between users in the social network, respectively. Let $n=|V|$ and $m=|E|$ . Denote by $\bm{A}=[a_{ij}]_{n\times n}$ the adjacency matrix of network $G$ , where $a_{ij}=1$ if $e_{ij}\in E$ and $a_{ij}=0$ otherwise. For a weighted network, $a_{ij}$ is the edge weight with $0<a_{ij}\leqslant 1$ . Denote by $\bm{A}^{r}=[c_{ij}]_{n\times n}$ the $r$ -th power of matrix $\bm{A}$ , where $c_{ij}$ represents the number of paths (more precisely, walks) of length exactly $r$ from node $v_{i}$ to node $v_{j}$ . Let the diagonal matrix $\bm{D}=\textit{diag}[d_{ii}]_{n\times n}$ be the degree matrix of network $G$ , where $d_{ii}=\sum_{j=1}^{n}a_{ij}$ is the degree of node $v_{i}$ . Note that the proposed method ReHSe is also applicable when $G$ is a directed and/or weighted social network.
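The notation above can be illustrated with a minimal sketch in plain Python. The toy edge list and the helper names `mat_mul` and `mat_power` are assumptions made for illustration and are not part of ReHSe. The sketch builds the adjacency matrix, reads the degrees $d_{ii}$ off its rows, and computes $\bm{A}^{2}$ , whose entries count the length-2 walks between node pairs.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(A, r):
    """Return A^r for an integer r >= 1 by repeated multiplication."""
    result = A
    for _ in range(r - 1):
        result = mat_mul(result, A)
    return result


# Hypothetical toy network: v1 is connected to v0, v2 and v3.
edges = [(0, 1), (1, 2), (1, 3)]
n = 4
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1  # undirected, unweighted: a_ij = a_ji = 1

degrees = [sum(row) for row in A]  # diagonal entries d_ii of D
A2 = mat_power(A, 2)               # c_ij for r = 2

print(degrees)  # [1, 3, 1, 1]
print(A2[0])    # [1, 0, 1, 1]
```

Note that the diagonal entry `A2[1][1]` equals 3 because a walk may return to its starting node, so the entries of $\bm{A}^{r}$ generally include walks that revisit nodes.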

In this paper, we aim to identify reinforced structural hole spanners by measuring the internal cohesion of the communities to which each node is connected. For this purpose, the structural features within the $r$ -length range around the node should be preserved. However, the existing graph embedding methods that attempt to preserve local pairwise proximity or high-order proximity between nodes cannot be directly applied to encode such structural information. We thus employ an integrated node embedding method to extract features of each node, which involves the first-order proximity and the $r$ -order reinforcement. We define them as follows.

(First-order proximity).

The first-order proximity represents the local pairwise similarity between nodes. For any two nodes $v_{i}$ and $v_{j}$ in the network $G$ , the first-order proximity between them is the edge weight $a_{ij}$ .

The first-order proximity emphasizes the relationship between a pair of nodes. If two nodes are directly connected in network $G$ , their embedding vectors in the latent space tend to be similar. Since the contacts between intra-community nodes are relatively frequent compared to those between inter-community nodes, if a node and its neighbors belong to the same community, they are more likely to have similar embedding vectors. Otherwise, the embedding vector of the node may be less similar to that of its neighbors belonging to different communities, indicating that it may be a hole spanner.

However, communities connected by different hole spanners vary greatly in their extent of cohesion, and the first-order proximity fails to capture the extent to which the structural hole bridged by a spanner is reinforced. Therefore, we introduce the $r$ -order reinforcement to extend the structural features of each node to a larger range.

(The $r$ th-order local relation).

For any node $v_{i}\in V$ in the network $G$ , let $N_{r}(v_{i})$ be the set of the $r$ th-order neighbors of $v_{i}$ (e.g., $r=1,2,3\ldots$ ), then the $r$ th-order local relation of $v_{i}$ is the difference between the embedding vector of $v_{i}$ and that of the nodes in $N_{r}(v_{i})$ .

( $r$ -order reinforcement).

The $r$ -order reinforcement of each node $v_{i}$ is the reinforcement that node $v_{i}$ obtains from the combination of the first-order local relation to the $r$ th-order local relation.

The $r$ -order reinforcement of a node captures the network structure around the node within the $r$ -order range, and the structure varies with the value of $r$ . By investigating such structural features, we can obtain information about the location advantage of this node. On the one hand, the differences in vector representations between a node and its $r$ th-order neighbors reflect the number of $r$ th-order neighbors of the node, which helps to find important hole spanners that have contacts in multiple communities. On the other hand, they can reveal how tightly connected the members are within the communities it bridges.

Taking both the first-order proximity and the $r$ -order reinforcement into consideration, the node embedding problem can be formalized as follows.

(Node embedding).

Given a network $G=(V,E)$ , the node embedding problem aims to learn a mapping function $f_{G}:v_{i}\rightarrow\bm{y}_{i}\in\mathbb{R}^{d}(d\ll n)$ that embeds each node ${v_{i}}\in V$ into a low-dimensional space, in which the first-order proximity and $r$ -order reinforcement are preserved.

(Top- $k$ reinforced structural hole spanners).

Given a social network $G=(V,E)$ and the mapping function $f_{G}:V\rightarrow\mathbb{R}^{d}(d\ll n)$ , the top- $k$ reinforced hole spanners problem in $G$ is to find a set $S$ of $k$ nodes $(k\leqslant n)$ , such that the nodes in $S$ have the maximum reinforcement, i.e.,

$\displaystyle\underset{S\subseteq V,\lvert S\rvert=k}{\text{maximize}}\ \{Q(f_{G}(S))\}.$ (1)

4. Methodology

In this section, we first present the general framework of the proposed method ReHSe. We then elaborate on the details of each of its components.

4.1 Model description

A desirable method for detecting structural hole spanners in real-world social networks should meet the following three requirements. The first is adaptability: it can handle arbitrary types of networks (directed or undirected, weighted or unweighted). The second is effectiveness: the structural hole spanners it identifies can play stronger bridging roles in the network, yielding more benefits in information and control. The third is robustness: since the structural features of hole spanners differ slightly in different social networks, by learning from various types of networks to expand the feature representations of hole spanners, it is able to deliver better accuracy even for a new network to be identified, and its performance continuously improves as the number of training networks increases. In this paper, we develop a method ReHSe based on node embedding for identifying reinforced structural hole spanners, which satisfies the above requirements, and its framework is shown in Fig. 2.

Specifically, we first characterize the topological structure features of each node in network $G$ by preserving first-order proximity and $r$ -order reinforcement, such that the low-dimensional vectors are generated. Then, we input the obtained node embeddings into a fully connected neural network, which assigns a score to each node embedding by learning from the labeled hole spanners identified by the algorithm in [4], and outputs the top- $k$ reinforced hole spanners ordered by the score. Finally, to enhance the accuracy and robustness of the model for identification, we use an incremental learning strategy based on a reserved set: whenever a new network to be identified is added, it is merged with old training samples to continuously optimize the feature representation space of nodes.

Figure 2.

An illustration of the ReHSe framework for identifying reinforced structural hole spanners. The model input is the relationships of nodes in a social network $G$ and the output is the top- $k$ reinforced structural hole spanners.

4.2 Learning of node reinforcement features

Recall that to measure the extent to which each node is reinforced in the network $G$ , we first need to extract features covering first-order proximity and $r$ -order reinforcement about each node, i.e., the final embedding matrix of nodes, defined as $\bm{Y}$ , is a combination of the structural information of each order of the nodes (see the first stage in Fig. 2). Note that for ease of presentation, here we set the order $r=2$ to focus on the connection relationships within the second-order range. In fact, $r$ can be extended to a higher order, allowing the model to capture a wider range of structural features, and we discuss the impact of different values of $r$ on model performance in the ablation study in Section 5.3.

4.2.1 Embedding with first-order proximity

We first model the first-order proximity. Let matrix ${\bm{Y}^{(0)}}=[\bm{y}^{(0)}_{1},\bm{y}^{(0)}_{2},\ldots,\bm{y}^{(0)}_{n}]^{\mathsf{T}}\in\mathbb{R}^{n\times d}$ be the $d$ -dimensional representation ( $d\ll n$ ) of network $G$ , where the $i$ -th row $\bm{y}^{(0)}_{i}$ of $\bm{Y}^{(0)}$ is the embedding vector of node $v_{i}$ , and $\|\bm{y}^{(0)}_{i}-\bm{y}^{(0)}_{j}\|_{2}^{2}$ represents the distance between nodes $v_{i}$ and $v_{j}$ in the latent space. We aim to find the optimal embedding that keeps the connected nodes in $G$ close to each other in the $d$ -dimensional space, and the natural way is to minimize the following objective function:

$\displaystyle{\min}\sum_{{v_{i}},{v_{j}}\in V}a_{ij}\|{\bm{y}^{(0)}_{i}-\bm{y}^{(0)}_{j}}\|_{2}^{2},$ (2)

where $a_{ij}$ indicates whether node $v_{i}$ is connected to $v_{j}$ . Based on the Laplacian Eigenmaps [1], denote by $\bm{L}=\bm{D}-\bm{A}$ the Laplacian matrix of $G$ . Then Eq. (2) can be rewritten as:

$\displaystyle\operatorname*{arg\,min}_{{\bm{Y}^{(0)}}^{\mathsf{T}}\bm{D}{\bm{Y}^{(0)}}=\bm{I}}\textit{trace}({\bm{Y}^{(0)}}^{\mathsf{T}}\bm{L}{\bm{Y}^{(0)}}),$ (3)

where the constraint ${\bm{Y}^{(0)}}^{\mathsf{T}}\bm{D}{\bm{Y}^{(0)}}=\bm{I}$ ensures the scaling invariance of the objective function in the embedding. The optimal solution to this problem is given by the following generalized eigenvalue problem:

$\displaystyle\bm{L}{\bm{y}^{(0)}}=\lambda\bm{D}{\bm{y}^{(0)}}.$ (4)

We sort the obtained eigenvalues in ascending order $0=\lambda_{0}\leqslant\lambda_{1}\leqslant\dots\leqslant\lambda_{d}$ . The eigenvector corresponding to $\lambda_{0}$ is disregarded since it is a constant vector. We then select the $d$ eigenvectors corresponding to $\lambda_{1},\ldots,\lambda_{d}$ as the columns of the matrix ${\bm{Y}^{(0)}}\in\mathbb{R}^{n\times d}$ , whose $i$ -th row is the $d$ -dimensional embedding $\bm{y}^{(0)}_{i}$ of node $v_{i}\in V$ .
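The rewriting of Eq. (2) as the trace form in Eq. (3) rests on the standard identity $\sum_{i,j}a_{ij}\|\bm{y}_{i}-\bm{y}_{j}\|_{2}^{2}=2\,\textit{trace}(\bm{Y}^{\mathsf{T}}\bm{L}\bm{Y})$ , where the constant factor 2 does not affect the minimizer. The following plain-Python sketch checks this identity numerically and also verifies that the all-ones vector satisfies Eq. (4) with $\lambda=0$ , which is why that eigenvector carries no information. The toy graph, the arbitrary matrix `Y`, and all function names are illustrative assumptions; the sketch does not solve the eigenvalue problem itself.

```python
def laplacian(A):
    """L = D - A for an adjacency matrix given as a list of lists."""
    n = len(A)
    deg = [sum(row) for row in A]
    return [[(deg[i] if i == j else 0) - A[i][j] for j in range(n)]
            for i in range(n)]


def pairwise_objective(A, Y):
    """Sum over ordered pairs (i, j) of a_ij * ||y_i - y_j||^2, as in Eq. (2)."""
    n = len(A)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if A[i][j]:
                dist2 = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                total += A[i][j] * dist2
    return total


def trace_form(A, Y):
    """trace(Y^T L Y), the objective of Eq. (3) without the constraint."""
    L = laplacian(A)
    n, d = len(Y), len(Y[0])
    return sum(Y[i][q] * L[i][j] * Y[j][q]
               for q in range(d) for i in range(n) for j in range(n))


# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant node 3.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
# Arbitrary 2-dimensional vectors; any real values work for the identity.
Y = [[0.5, -1.0], [1.5, 0.0], [-2.0, 2.0], [3.0, 1.0]]

lhs = pairwise_objective(A, Y)
rhs = 2 * trace_form(A, Y)

# L applied to the all-ones vector gives zero, i.e. L y = 0 * D y.
L = laplacian(A)
L_times_ones = [sum(row) for row in L]
```

In practice, the generalized eigenvalue problem of Eq. (4) would typically be handed to a numerical linear algebra routine; the check above only illustrates why the trace form and the pairwise form share the same minimizer.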

4.2.2 Embedding with r-order reinforcement

Then, we adopt a simple way to represent the second-order reinforcement features of each node $v_{i}\in V$ , i.e., we calculate the first- and second-order local relations. Specifically, consider any node $v_{i}$ with an embedding vector $\bm{y}^{(0)}_{i}=[y^{(0)}_{i1},y^{(0)}_{i2},\ldots,y^{(0)}_{id}]$ , where $y^{(0)}_{iq}$ denotes the $q$ th element in vector $\bm{y}^{(0)}_{i}$ . Denote by $\bm{y}^{(1st)}_{i}\in\mathbb{R}^{1\times d}$ the first-order local relation embedding of node $v_{i}$ , which indicates the differences between $v_{i}$ and its direct neighbors. We have:

$\displaystyle\bm{y}^{(1st)}_{i}=\left[y^{(0)}_{i1}-\sum_{A_{ij}\neq 0}y^{(0)}_{j1},y^{(0)}_{i2}-\sum_{A_{ij}\neq 0}y^{(0)}_{j2},\ldots,y^{(0)}_{id}-\sum_{A_{ij}\neq 0}y^{(0)}_{jd}\right].$ (5)

We thus have the first-order local relation matrix $\bm{Y}^{(1st)}=[\bm{y}^{(1st)}_{1},\bm{y}^{(1st)}_{2},\ldots,\bm{y}^{(1st)}_{n}]^{\mathsf{T}}\in\mathbb{R}^{n\times d}$ over all nodes $v_{i}\in V$ . Similarly, the second-order local relation embedding $\bm{y}^{(2nd)}_{i}\in\mathbb{R}^{1\times d}$ of node $v_{i}$ refers to the differences between $v_{i}$ and its indirect neighbors reachable via a path of length two, such that

$\displaystyle\bm{y}^{(2nd)}_{i}=\left[y^{(0)}_{i1}-\sum_{A_{ip}^{2}\neq 0}y^{(0)}_{p1},y^{(0)}_{i2}-\sum_{A_{ip}^{2}\neq 0}y^{(0)}_{p2},\ldots,y^{(0)}_{id}-\sum_{A_{ip}^{2}\neq 0}y^{(0)}_{pd}\right],$ (6)

where $A_{ip}^{2}\neq 0$ means node $v_{i}$ can reach $v_{p}$ via a two-step path. The second-order local relation matrix is thus $\bm{Y}^{(2nd)}=[\bm{y}^{(2nd)}_{1},\bm{y}^{(2nd)}_{2},\ldots,\bm{y}^{(2nd)}_{n}]^{\mathsf{T}}\in\mathbb{R}^{n\times d}$ .

By concatenating the matrices $\bm{Y}^{(0)},\bm{Y}^{(1st)},\bm{Y}^{(2nd)}$ horizontally, we obtain the final node embeddings $\bm{Y}=[\bm{y}_{1},\bm{y}_{2},\ldots,\bm{y}_{n}]^{\mathsf{T}}\in\mathbb{R}^{n\times(r+1)d}$ , i.e., each node is represented by a vector of length $3d$ for $r=2$ .
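Eqs. (5) and (6) and the final concatenation can be sketched in a few lines of plain Python. This is a minimal illustration on a hypothetical three-node path with made-up one-dimensional base embeddings. The helper names are assumptions, and the sketch follows the equations literally: because the diagonal of $\bm{A}^{2}$ is nonzero for any node with a neighbor, the second-order sum as written also includes $v_{i}$ itself. Whether the original implementation excludes it is not stated here.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def local_relation(Y0, reach):
    """Row i is y_i minus the sum of y_p over all p with reach[i][p] != 0."""
    n, d = len(Y0), len(Y0[0])
    out = []
    for i in range(n):
        row = []
        for q in range(d):
            neighbor_sum = sum(Y0[p][q] for p in range(n) if reach[i][p] != 0)
            row.append(Y0[i][q] - neighbor_sum)
        out.append(row)
    return out


# Hypothetical path graph v0 - v1 - v2 with made-up base embeddings (d = 1).
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
Y0 = [[1.0], [2.0], [4.0]]

A2 = mat_mul(A, A)
Y1 = local_relation(Y0, A)    # Eq. (5): first-order local relation
Y2 = local_relation(Y0, A2)   # Eq. (6): second-order local relation

# Horizontal concatenation: each node gets a vector of length (r + 1) * d = 3.
Y = [r0 + r1 + r2 for r0, r1, r2 in zip(Y0, Y1, Y2)]

print(Y)  # [[1.0, -1.0, -4.0], [2.0, -3.0, 0.0], [4.0, 2.0, -1.0]]
```

For the middle node $v_{1}$ , the first-order entry is $2-(1+4)=-3$ , while its second-order entry is $2-2=0$ because the only node it reaches in exactly two steps is itself.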

4.3 Prediction of reinforcement score

For each node $v_{i}$ with embedding vector $\bm{y}_{i}$ in a given network $G$ to be identified, we predict the score of $\bm{y}_{i}$ by training a reinforcement scoring network to determine the extent to which each node is reinforced. Specifically, we aim to evaluate the effect of each component (i.e., $\bm{y}_{i}^{(0)},\bm{y}_{i}^{(1st)},\bm{y}_{i}^{(2nd)}$ ) of a vector on its score by training. Therefore, we take the embedding vectors $\bm{y}^{\prime}_{i}$ of nodes in a training network $G^{\prime}$ as input, and then apply the reinforcement scoring network with four fully connected layers to map the input into the target representation space (see the second stage of Fig. 2).

To introduce nonlinearity in the scoring network, we use the Rectified Linear Unit (ReLU) as the activation function, since it can reduce the possibility of vanishing gradients, which cause the parameters to be updated slowly in each iteration. In addition, compared to saturated nonlinear functions (e.g., sigmoid, tanh), training with gradient descent is faster with the non-saturated nonlinear function ReLU, i.e., ReLU is more efficient in learning, such that networks with ReLU typically have better convergence performance in practice [18]. Therefore, batch normalization and the ReLU function are used between the layers of the network. The hidden representation of each layer is formulated as:

$\displaystyle\bm{H}^{(1)}=\textit{ReLU}(\bm{W}^{(1)}\bm{Y}+\bm{b}^{(1)}),$ (7)

$\displaystyle\bm{H}^{(i)}=\textit{ReLU}(\bm{W}^{(i)}\bm{H}^{(i-1)}+\bm{b}^{(i)}),$

where $\bm{W}^{(i)}$ is the $i$ -th layer weight matrix, and $\bm{b}^{(i)}$ is the $i$ -th layer bias vector with $i\in\{1,2,3,4\}$ .
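As a minimal sketch of the layer recursion in Eq. (7), the following plain-Python code pushes a single node embedding through four fully connected layers with ReLU. The layer widths, weights, and input vector are made-up values chosen so the result can be checked by hand, and the function names are assumptions. Batch normalization is omitted, and ReLU is applied after every layer, including the last, exactly as the equation is written.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def dense(W, b, h):
    """Affine map W h + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, h)) + bias
            for row, bias in zip(W, b)]


def score(y, layers):
    """Apply H^(i) = ReLU(W^(i) H^(i-1) + b^(i)) for each layer in turn."""
    h = y
    for W, b in layers:
        h = relu(dense(W, b, h))
    return h[0]  # the last layer has a single output unit


# Hypothetical parameters for a 2 -> 2 -> 2 -> 1 -> 1 network.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[2.0, 3.0], [1.0, 1.0]], [-1.0, 0.0]),
    ([[1.0, -2.0]], [0.5]),
    ([[5.0]], [3.0]),
]

x = score([1.0, -2.0], layers)
print(x)  # 3.0
```

The intermediate activations are `[1.0, 0.0]`, `[1.0, 1.0]`, and `[0.0]`, so the third layer is clipped by ReLU and the output equals the last bias.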

We use the algorithm RSH [4] proposed by Burt to calculate the ground-truth score $\textit{rsh}(v_{i})$ of each node $v_{i}$ , which indicates how much the node is reinforced by the networks around its neighbors. Given a set of $n$ training samples $\{(\bm{y}^{\prime}_{1},\textit{rsh}(v_{1})),\ldots,(\bm{y}^{\prime}_{n},% \textit{rsh}(v_{n}))\}$ with $\bm{y}^{\prime}_{i}\in\bm{Y}^{\prime}$ , the proposed model is then trained with the following loss function:

$\displaystyle L(\bm{y}^{\prime}_{i},\textit{rsh}(v_{i}),\bm{\theta})=\sqrt{% \frac{1}{n}\sum_{i=1}^{n}{(x_{i}-\textit{rsh}(v_{i}))}^{2}},$ (8)

where $x_{i}$ is the score predicted by the scoring network for the $i$ th sample, $\bm{\theta}=\{\bm{W}^{(i)},\bm{b}^{(i)}\}$ represents the set of parameters, and the model is optimized with stochastic gradient descent.
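As a small worked example of the loss in Eq. (8), the sketch below computes the root mean squared error between predicted scores and rsh labels. The function name and the toy values are illustrative assumptions.

```python
import math

def rmse_loss(predicted, target):
    """Root mean squared error between predicted scores x_i and labels rsh(v_i), as in Eq. (8)."""
    n = len(predicted)
    return math.sqrt(sum((x - t) ** 2 for x, t in zip(predicted, target)) / n)

# Toy example: errors are 0, -2 and 2, so the loss is sqrt(8 / 3).
loss = rmse_loss([1.0, 2.0, 4.0], [1.0, 4.0, 2.0])
```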

Finally, the predicted scores $X=\{x_{1},\ldots,x_{n}\}$ of the nodes in $G$ can be obtained by feeding the embedding matrix $\bm{Y}$ of $G$ directly into the scoring network, and top- $k$ reinforced hole spanners are found by sorting the scores in non-increasing order. The process is presented in Algorithm 4.3.

Algorithm 4.3: A nimble method for identifying top- $k$ reinforced hole spanners.

Input: A social network $G=(V,E)$ for test, a training network $G^{\prime}=(V^{\prime},E^{\prime})$ , a training label set $L^{*}=\{\textit{rsh}(v_{1}),\ldots,\textit{rsh}(v_{n})\}$ , a set $X$ of predicted scores of nodes in $G$ , embedding dimension $d$ , embedding order $r$ , and a positive integer $k$
Output: A set $S$ of $k$ reinforced hole spanners
1: Let set $S\leftarrow\emptyset$ and $X\leftarrow\emptyset$ ;
2: Generate $d$ -dimensional embedding matrices $\bm{Y}$ and $\bm{Y}^{\prime}$ with $r$ -order for the networks $G$ and $G^{\prime}$ , respectively, by Eqs (4)–(6);
3: repeat
4:     Randomly select a batch from $\{(\bm{y}^{\prime}_{i},\textit{rsh}(v_{i}))\}$ ;
5:     Train model with loss function $L(\bm{y}^{\prime}_{i},\textit{rsh}(v_{i}),\bm{\theta})\leftarrow\sqrt{\frac{1}{n}\sum_{i=1}^{n}{(x_{i}-\textit{rsh}(v_{i}))}^{2}}$ ;
6: until convergence
7: Predict the score $x_{i}$ of each node embedding $\bm{y}_{i}$ in $G$ with the model;
8: Sort the nodes in $G$ in non-increasing order by the predicted scores in $X$ , and select the top- $k$ nodes to $S$ ;
9: return $S$ .
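The final selection step, sorting the nodes by predicted score and keeping the first $k$, can be sketched as follows. The function name and the toy scores are assumptions for illustration.

```python
def top_k_spanners(scores, k):
    """Return the k node ids with the largest predicted scores, in non-increasing order."""
    ranked = sorted(scores, key=lambda v: scores[v], reverse=True)
    return ranked[:k]

scores = {"a": 0.2, "b": 1.7, "c": 0.9, "d": 1.1}  # toy predicted scores x_i
S = top_k_spanners(scores, k=2)
```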

4.4 An incremental learning method

Since network properties (e.g., density, average degree, clustering coefficient) vary across social networks, the structural features of hole spanners may also differ slightly. A trained model may therefore work well on some test networks but fail on others, because the addition of networks with new structural features changes the feature representation space of nodes. However, we cannot include all types of networks in the dataset at once to obtain a robust model for identifying reinforced hole spanners.

Therefore, to address this problem, we introduce an incremental strategy to train the scoring network, so that the information provided by newly added training networks helps the model adjust its parameters and expand the structural feature space of nodes, continuously improving its accuracy and robustness. This corresponds to the third part in Fig. 2. Denote by $\mathcal{G}_{t}=\{G^{1},\ldots,G^{p}\}$ the stream of $p$ training networks, and by $G$ the given test network to be identified. In line with the idea of most incremental learning algorithms [7, 10, 34], in each incremental step (i.e., whenever a new training network enters), we construct a new training set for the model, which includes two parts: a sample set of the current new network, and a reserved set containing samples selected from the old networks that have already been used for training, as illustrated in Fig. 3.

Figure 3.

An illustration of incremental learning, where orange dots represent the embedding vectors of nodes in the current training network, red squares represent the embedding vectors of nodes in the test network, and blue triangles indicate the representative samples selected for the reserved set (best viewed in color).

4.4.1 Domain alignment

Before we construct the training set for the current task, we first need to ensure that feature vectors from all training networks, which have different distributions, can be represented in a uniform latent space for subsequent sample selection. Therefore, we consider the feature space of each newly added training network as source domain, and align it toward that of the test network $G$ , which is viewed as target domain.

Specifically, when the $b$ th training network $G^{b}$ arrives, denote by $\bm{Y}^{b}=[{\bm{y}_{1}^{b},\ldots,\bm{y}_{n}^{b}}]^{\mathsf{T}}\in\mathbb{R}^{n\times(r+1)\times d}$ and $\bm{Y}=[{\bm{y}_{1},\ldots,\bm{y}_{n}}]^{\mathsf{T}}\in\mathbb{R}^{n\times(r+1)\times d}$ the sets of feature vectors of the network $G^{b}$ and the test network $G$ , respectively. To align the feature distributions of the source domain $G^{b}$ and the target domain $G$ , we first normalize the features on each dimension (i.e., $d$ dimensions) to z-scores, where the z-score of a raw value measures its distance from the mean in units of standard deviation. Each element $y_{ij}$ in $\bm{Y}$ is normalized as follows:

$\displaystyle\overline{y_{ij}}=\frac{y_{ij}-{\mu_{\bm{Y}_{(j)}}}}{\sigma_{\bm{Y}_{(j)}}},$ (9)

where $\mu_{\bm{Y}_{(j)}}$ is the mean of the $j$ th column of $\bm{Y}$ , and $\sigma_{\bm{Y}_{(j)}}$ is the standard deviation of the $j$ th column of $\bm{Y}$ . We thus obtain the normalized target domain $\overline{\bm{Y}}$ and, in the same way, the normalized source domain $\overline{\bm{Y}^{b}}$ .
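The column-wise normalization of Eq. (9) can be sketched as below. The use of the population standard deviation and the small `eps` guard for constant columns are assumptions added for illustration, not details given in the text.

```python
import numpy as np

def zscore(Y, eps=1e-12):
    """Column-wise z-score of Eq. (9); eps guards constant columns (an added safeguard)."""
    mu = Y.mean(axis=0)
    sigma = Y.std(axis=0)  # population standard deviation (assumed)
    return (Y - mu) / np.maximum(sigma, eps)

Y_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
Z_toy = zscore(Y_toy)
```

After this step every column of `Z_toy` has zero mean and unit standard deviation, regardless of its original scale.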

Although we have the normalized features of both the source and target domains, their distributions may still differ because of differences in correlations. Here, we perform correlation alignment based on the non-parametric feature learning method CORAL [27], which matches the distribution of the source features to that of the target features by aligning the second-order statistics (i.e., covariance), without requiring the target to be labeled. Applying the method to the normalized $\overline{\bm{Y}^{b}}$ of the training network, we have

$\displaystyle\widehat{\bm{Y}^{b}}=\overline{\bm{Y}^{b}}\cdot(\textit{cov}(\overline{\bm{Y}^{b}})+\bm{I}_{\bm{Y}^{b}})^{-\frac{1}{2}}\cdot(\textit{cov}(\overline{\bm{Y}})+\bm{I_{Y}})^{\frac{1}{2}}$ (10)

where $\textit{cov}(\cdot)$ is the covariance matrix, and $\bm{I}_{\bm{Y}^{b}}$ and $\bm{I_{Y}}$ are identity matrices of dimension $((r+1)\times d)$ . This process can be viewed as whitening the source to decorrelate it, and then recoloring it with the covariance of the target [27].

4.4.2 Construct a reserved set

We then adopt a clustering center-based strategy to construct the reserved set, which is updated in each incremental step to include representative exemplars sampled from the feature space. That is, after the ( $b-1$ )th incremental training, we select a given number of samples that can represent the structural features of nodes and store them in the memory unit to prepare for the $b$ th incremental task (see Fig. 3). Denote by $\bm{T}^{(b-1)}$ the training set of the ( $b-1$ )th task, by $\bm{R}$ the reserved set, and by $M$ the number of samples in $\bm{R}$ , where $M$ is a positive integer, e.g., $M=100$ .

Since nodes with different embedding vectors correspond to different rsh scores, which is a quantification of the reinforcement properties of nodes, we first divide the nodes with vectors $\bm{t}_{i}\in\bm{T}^{(b-1)}$ into two categories according to the score $\textit{rsh}(v_{i})$ : nodes with score $\textit{rsh}(v_{i})>0$ and $\textit{rsh}(v_{i})=0$ . The centers of each category are calculated as follows.

$\displaystyle\mu_{{c_{j}}q}=\frac{1}{n_{c_{j}}}\sum_{i=1}^{n_{c_{j}}}t_{iq},$ (11)

where $t_{iq}$ is the $q$ th element in vector $\bm{t}_{i}$ , with $1\leqslant q\leqslant{((r+1)\times d)}$ , $j\in\{0,1\}$ indicates the two categories, and $n_{c_{j}}$ is the number of samples in category ${c_{j}}$ . Thus, the mean vector for each category ${c_{j}}$ is $\bm{\mu}_{c_{j}}=[\mu_{{c_{j}}1},\mu_{{c_{j}}2},\ldots,\mu_{{c_{j}}{((r+1)\times d)}}]$ .

Then, we calculate the Euclidean distance between each sample $\bm{t}_{i}$ in category ${c_{j}}$ and the center $\bm{\mu}_{c_{j}}$ :

$\displaystyle r_{i{c_{j}}}={\left(\sum_{q=1}^{(r+1)\times d}{|t_{iq}-{\mu}_{{c% _{j}}q}|}^{2}\right)}^{\frac{1}{2}}.$ (12)

For each category, we retain the $M/2$ samples with the smallest distance $r_{i{c_{j}}}$ to the center, and construct the reserved set $\bm{R}$ by merging the retained samples of the two categories. The procedure is described in Algorithm 4.4.2.

We finally generate a training set $\bm{T}^{b}$ for the $b$ th incremental training task by combining the aligned feature vectors $\widehat{\bm{Y}^{b}}$ of the current training network $G^{b}$ and the constructed reserved set $\bm{R}$ .

Algorithm 4.4.2: Construct a reserved set after the ( $b-1$ )th incremental task.

Input: The training set $\bm{T}^{(b-1)}$ of the ( $b-1$ )th incremental task, and memory size $M$
Output: A reserved set $\bm{R}$ of representative samples
1: Let set $\bm{R}\leftarrow\emptyset$ ;
2: For the embedding vector $\bm{t}_{i}\in\bm{T}^{(b-1)}$ of each node $v_{i}$ , classify it by the score $\textit{rsh}(v_{i})$ ;
3: Calculate the mean vector $\bm{\mu}_{c_{0}}=[\mu_{{c_{0}}1},\ldots,\mu_{{c_{0}}{((r+1)\times d)}}]$ and $\bm{\mu}_{c_{1}}=[\mu_{{c_{1}}1},\ldots,\mu_{{c_{1}}{((r+1)\times d)}}]$ for category ${c_{0}}$ and category ${c_{1}}$ , respectively;
4: For each vector $\bm{t}_{i}\in{c_{j}}$ , calculate its Euclidean distance $r_{i{c_{j}}}$ to the mean vector $\bm{\mu}_{c_{j}}$ ;
5: Sort vectors $\bm{t}_{i}$ in each category ${c_{j}}$ separately in non-decreasing order by the distance $r_{i{c_{j}}}$ ;
6: $\bm{R}\leftarrow\{\bm{t}_{{c_{0}}1},\ldots,\bm{t}_{{c_{0}}{M/2}}\}\cup\{\bm{t}_{{c_{1}}1},\ldots,\bm{t}_{{c_{1}}{M/2}}\}$ .
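The reserved-set construction can be sketched in a few lines. The function name, the returned row indices and the handling of an empty category are illustrative assumptions; the selection rule follows Eqs. (11) and (12).

```python
import numpy as np

def build_reserved_set(T, rsh, M):
    """Keep, per category (rsh == 0 and rsh > 0), the M/2 samples closest to the category mean."""
    keep = []
    for mask in (rsh == 0, rsh > 0):
        idx = np.flatnonzero(mask)
        if idx.size == 0:
            continue  # empty category: nothing to reserve (added guard)
        center = T[idx].mean(axis=0)                    # Eq. (11)
        dist = np.linalg.norm(T[idx] - center, axis=1)  # Eq. (12)
        order = np.argsort(dist, kind="stable")         # non-decreasing distance
        keep.extend(idx[order[: M // 2]].tolist())
    return keep  # row indices of T that form the reserved set R

T_toy = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [5.0, 20.0]])
rsh_toy = np.array([0, 0, 0, 2, 1, 3])
R_idx = build_reserved_set(T_toy, rsh_toy, M=2)
```

In this toy case one sample is kept per category: row 1 is the closest to the mean of the zero-score group, and row 4 is the closest to the mean of the positive-score group.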

4.5 Identifying the top-k reinforced hole spanners

Algorithm 4.5: ReHSe: REinforced Hole Spanners detection based on Embedding.

Input: A social network $G=(V,E)$ for test, the set $\mathcal{G}_{t}$ of $p$ training networks, a set of labeled data $\bm{L}^{*}=\{L^{*}_{1},\ldots,L^{*}_{p}\}$ of $\mathcal{G}_{t}$ generated by algorithm [4], a reserved set $\bm{R}$ , a set $X$ of predicted scores of nodes in $G$ , embedding dimension $d$ , embedding order $r$ , and a positive integer $k$
Output: A set $S$ of $k$ reinforced structural hole spanners
1: Let set $S\leftarrow\emptyset$ and $\bm{R}\leftarrow\emptyset$ ;
2: Generate $d$ -dimensional embedding matrix $\bm{Y}$ with $r$ -order of network $G$ , by Eqs (4)–(6);
3: Calculate the normalized target domain $\overline{\bm{Y}}\leftarrow\textit{zscore}(\bm{Y})$ , by Eq. (9);
4: for each training network $G^{b}$ in $\mathcal{G}_{t}$ do
5:     Generate $d$ -dimensional embedding matrix $\bm{Y}^{b}$ with $r$ -order of network $G^{b}$ , by Eqs (4)–(6);
6:     Compute the normalized source domain $\overline{\bm{Y}^{b}}\leftarrow\textit{zscore}(\bm{Y}^{b})$ , by Eq. (9);
7:     Obtain the aligned feature matrix $\widehat{\bm{Y}^{b}}$ by matching the distributions toward the target domain $\overline{\bm{Y}}$ , by Eq. (10);
8:     Construct a training set $\bm{T}^{b}\leftarrow\{\bm{y}_{i}^{b}|\bm{y}_{i}^{b}\in\widehat{\bm{Y}^{b}},i=1,\ldots,n\}\cup\{\bm{r}_{i}|\bm{r}_{i}\in\bm{R},i=1,\ldots,M\}$ for the $b$ th incremental task;
9:     repeat
10:         Randomly select a batch from $\{(\bm{t}_{i}\in\bm{T}^{b},\textit{rsh}(v_{i}))\}$ ;
11:         Minimize the loss function $L(\bm{t}_{i},\textit{rsh}(v_{i}),\bm{\theta})\leftarrow\sqrt{\frac{1}{n}\sum_{i=1}^{n}{(x_{i}-\textit{rsh}(v_{i}))}^{2}}$ by gradient descent;
12:     until convergence
13:     Update the reserved set $\bm{R}$ by invoking Algorithm 4.4.2;
14: end for
15: Predict the score $x_{i}$ of each node embedding $\bm{y}_{i}\in\bm{Y}$ with the model;
16: Sort the nodes in $G$ non-increasingly by the score $x_{i}$ , select the top- $k$ nodes, and update set $S$ ;
17: return $S$ .

For the problem of identifying the top- $k$ reinforced hole spanners, we first extract features encoding the reinforcement properties of nodes into a low-dimensional space. To improve the quality of identification, we then train the reinforcement scoring network with an incremental learning strategy: whenever a new network is available, we update the feature space to generate a new training set. Finally, we identify the hole spanners with the model. The detailed algorithm is given in Algorithm 4.5.

5. Experiments

In this section, we first evaluate the effectiveness of the proposed method ReHSe on eleven real-world networks, and compare it with various benchmark methods. We then conduct ablation experiments to analyze the effects of different parameters on the performance of the proposed method ReHSe.

5.1 Experimental setting

Datasets. We adopt eleven real-world network datasets of different types to study the effectiveness of the algorithms, which are shown in Table 1. They include six social network datasets: karate club (social ties between members), fb-Reed98 (friendship ties between users), email-Eu-core (communication between members), hamsterster (friendship and family ties between users), wiki-Vote (voting data between users), LastFM-asia (mutual follower relationships between users); two collaboration network datasets: ca-GrQc, ca-HepTh; two entity network datasets: football (American football games between Division IA colleges), fb-politician (mutual links between Facebook pages); and a biological network dataset: HC-BioGrid (physical and genetic interactions). The datasets email-Eu-core, wiki-Vote, LastFM-asia, ca-GrQc and ca-HepTh are obtained from the SNAP repository [19], the dataset HC-BioGrid is acquired from [28], and the other five datasets are available on the network repository platform [26].

Table 1
Statistics of eleven real-world networks

Datasets	$\lvert V\rvert$	$\lvert E\rvert$	Avg. degree	Avg. clustering coefficient
Karate club	34	156	4.6	0.5706
Football	115	1,226	10.7	0.4032
Fb-Reed98	962	37,624	39.1	0.3184
Email-Eu-core	1,005	25,571	25.4	0.3994
Hamsterster	2,426	33,260	13.7	0.5375
Ca-GrQc	5,241	28,968	5.5	0.5296
Ca-HepTh	9,875	51,946	5.3	0.4714
Fb-politician	5,908	83,458	14.1	0.3851
Wiki-Vote	7,115	103,689	14.6	0.1409
LastFM-asia	7,624	55,612	7.3	0.1786
HC-BioGrid	4,039	20,642	5.1	0.0595

Benchmark methods. In this paper, we compare the proposed method ReHSe with the following seven benchmark methods. The first method identifies reinforced hole spanners, while the other six methods detect hole spanners without considering reinforcement.

(1) RSH [4] first calculates the extent to which each node is reinforced by each pair of its neighbors, and then selects the most reinforced top- $k$ nodes as the hole spanners.

(2) Constraint [3] measures the value of each node constrained by its neighbors, and selects the top- $k$ hole spanners with lowest values.

(3) HIS [22] first assigns each node a score that is the importance of its neighbors in different communities, then chooses the $k$ nodes with the highest scores as hole spanners, where the ground-truth communities need to be predefined.

(4) HAM [16] considers the number of communities that neighbors of each node belong to, then selects the top- $k$ nodes.

(5) MaxD [22] detects $k$ nodes whose removals will result in maximizing the decrease in the number of minimal cut of communities in a network, assuming that the community labels are already known.

(6) maxBlockFast [37] identifies $k$ nodes to maximize the number of blocked information propagation in a network after removing the $k$ nodes.

(7) APGreedy [36] finds $k$ nodes, so that the increase of communication cost in the residual network will be maximized by their removals.

Note that since the methods HIS, HAM and MaxD require the ground-truth communities of networks, we use a community detection algorithm [6] to reveal the community structure of the datasets, except for karate club, football and email-Eu-core, whose communities are available. The codes of algorithms HAM and ReHSe are written in Python, and the other algorithms are implemented in C++. All algorithms are run on a server configured with an Intel(R) Xeon(R) E5-2680 v4 CPU at 2.4 GHz and 64 GB of RAM.

5.2 Performance evaluation

5.2.1 Evaluation criteria

In order to quantitatively evaluate the quality of the top- $k$ identified hole spanners, we consider the following two evaluation criteria.

(1) Percent Reinforced hole spanners Score (Percent-RS) uses the metric Normalized RSH defined by Burt [4] to measure the sum of the average extent to which the $k$ spanners are reinforced by their pairs of neighbors. The Normalized RSH standardizes the score $\textit{rsh}(v_{i})$ by dividing it by the number of pairs of neighbors of a node $v_{i}$ (i.e., $\frac{N(v_{i})(N(v_{i})-1)}{2}$ ). It varies from zero to one, with a larger value indicating that each of the spanner’s neighbors lies in a larger and more cohesive community, thus bringing more benefits to the spanner. Given a network $G=(V,E)$ and a set $S$ of top- $k$ hole spanners identified by an algorithm, the Percent-RS of the set $S$ is

$\displaystyle\textit{Percent-RS}(S)=\sum\nolimits_{{v_{i}}\in S}\frac{\textit{% rsh}(v_{i})}{\frac{N(v_{i})(N(v_{i})-1)}{2}}.$ (13)
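Eq. (13) can be evaluated with a short function such as the sketch below. The dictionary-based inputs, the function name and the guard for nodes with fewer than two neighbors are assumptions made for illustration.

```python
def percent_rs(S, rsh, num_neighbors):
    """Eq. (13): sum over spanners of rsh(v) divided by the number of neighbor pairs of v."""
    total = 0.0
    for v in S:
        pairs = num_neighbors[v] * (num_neighbors[v] - 1) / 2
        if pairs > 0:  # a node with fewer than two neighbors has no neighbor pair (added guard)
            total += rsh[v] / pairs
    return total

# Toy example: u has 4 neighbors (6 pairs), v has 2 neighbors (1 pair).
rsh_demo = {"u": 3.0, "v": 1.0}
nn_demo = {"u": 4, "v": 2}
score_demo = percent_rs(["u", "v"], rsh_demo, nn_demo)  # 3/6 + 1/1
```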
(2) Structural Hole Influence Index (SHII) [16] measures the influence of each hole spanner on information propagation across communities, and a higher SHII score indicates better performance. For each identified hole spanner $v_{i}$ , we compute its SHII score in $G$ as follows. We first construct a seed set by combining the spanner and a certain number of randomly sampled neighbors from the community $C_{v_{i}}$ to which $v_{i}$ belongs, and then run an information propagation model (i.e., independent cascade (IC) or linear threshold model (LT)) several times to find the set of influenced nodes. The SHII of spanner $v_{i}$ is defined as

$\displaystyle\textit{SHII}(v_{i})=\frac{\sum_{{C_{i}}\in C\setminus{C_{v_{i}}}% }\sum_{u\in{C_{i}}}I_{u}}{\sum_{{C_{i}}\in C}\sum_{u\in{C_{i}}}I_{u}},$ (14)

where $C$ is the set of communities in $G$ , $I_{u}$ is an indicator of whether a node $u$ is activated by the seed set. In our experiment, for each hole spanner, we set the ratio of randomly sampled neighbors to 5% for constructing the seed set, and we repeat the sampling 100 times and run the IC and LT models 1,000 times for each sampled seed set. Finally, we show the average values of SHII for the top- $k$ hole spanners identified by each method.
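For a single propagation run, Eq. (14) reduces to the fraction of activated nodes that lie outside the spanner's own community. The sketch below assumes that the communities partition the nodes and that the set of activated nodes has already been produced by an IC or LT simulation; the function and argument names are hypothetical.

```python
def shii(activated, community_of, spanner):
    """Eq. (14): fraction of activated nodes that lie outside the spanner's own community."""
    if not activated:
        return 0.0  # nothing was activated (added guard against division by zero)
    home = community_of[spanner]
    outside = sum(1 for u in activated if community_of[u] != home)
    return outside / len(activated)

community_demo = {0: "A", 1: "A", 2: "B", 3: "B", 4: "C"}
shii_demo = shii({0, 1, 2, 4}, community_demo, spanner=0)  # nodes 2 and 4 are outside "A"
```

Averaging this value over the repeated samplings and simulation runs gives the per-spanner score reported in the experiments.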

5.2.2 Percent reinforced hole spanners score

Figure 4.

Percent reinforced hole spanners score of different methods in various networks.

We first investigate the Percent-RS of top- $k$ hole spanners identified by different methods in slightly smaller networks. It can be seen from Fig. 4a–c that ReHSe outperforms the other seven methods. Specifically, Fig. 4a shows that in the fb-Reed98 network, the Percent-RS of top-50 hole spanners identified by ReHSe is about 3% ( $\approx\frac{35.5-34.5}{34.5}$ ) higher than that of the best existing method RSH, which is almost identical to the results of methods Constraint and MaxD, and about 11% ( $\approx\frac{35.5-31.9}{31.9}$ ) higher than that of method maxBlockFast, where the Percent-RS of top-50 hole spanners found by methods ReHSe, RSH, Constraint, HIS, HAM, MaxD, maxBlockFast, and APGreedy are 29.1, 28.9, 23.1, 24.7, 18.5, 28.5, 26.2, 24.6, respectively. Fig. 4b shows that method ReHSe exhibits similar performance in the email-Eu-core network as in the fb-Reed98 network. Moreover, Fig. 4c demonstrates that, when $k=50$ , the Percent-RS of hole spanners found by ReHSe in the hamsterster network is about 3% higher than that of the best existing method Constraint and about 2.3 ( $\approx\frac{37}{16}$ ) times that of the poorest method HAM.

We then study the performance of the different methods in the other six networks ca-GrQc, ca-HepTh, fb-politician, wiki-Vote, LastFM-asia and HC-BioGrid. Figure 4d–i shows that the advantages of hole spanners detected by the proposed method ReHSe are more visible in these relatively large networks, especially in networks ca-GrQc and HC-BioGrid. For example, Fig. 4d demonstrates that the Percent-RS of top-50 nodes found by method ReHSe in the ca-GrQc network is approximately 17% ( $\approx\frac{32.8-28.1}{28.1}$ ) and 29% higher than that of the best two existing methods APGreedy and RSH among the seven benchmark methods, respectively. Moreover, Fig. 4i shows that the Percent-RS of top-50 hole spanners detected by method ReHSe in the HC-BioGrid network is about 12% ( $\approx\frac{35.9-32.1}{32.1}$ ) and 20% higher than that of the best two methods RSH and Constraint, respectively.

In contrast, although the method RSH identifies spanners from the perspective of structural holes being reinforced by the spanners’ neighbors, its Percent-RS is lower than that of the proposed method ReHSe in all nine networks, because the method RSH focuses only on the reinforcement from two-step neighbors, while ReHSe can be extended to multi-order neighbors. In addition, the incremental strategy allows ReHSe to improve the accuracy of identification by preserving the structural features of nodes in different networks, thus detecting spanners that are connected to a larger number of more cohesive communities.

5.2.3 Structural hole influence index

Table 2
Results of average SHII for different methods in various datasets under the independent cascade influence model

		Comparative methods
Datasets	# of Hole spanners	ReHSe	RSH	Constraint	HIS	HAM	MaxD	maxBlockFast	APGreedy
Karate club	3	0.536	0.536	0.536	0.495	0.501	0.506	0.489	0.457
	Hole spanners	[0, 33, 2]	[0, 33, 2]	[0, 33, 2]	[8, 13, 19]	[2, 19, 8]	[33, 0, 32]	[33, 32, 2]	[0, 1, 3]
Football	20	0.932	0.916	0.917	0.925	0.907	0.91	0.906	0.91
Email-Eu-core	50	0.963	0.967	0.958	0.961	0.956	0.968	0.955	0.959
Ca-GrQc	50	0.902	0.889	0.865	0.822	0.857	0.844	0.876	0.933
Ca-HepTh	50	0.862	0.821	0.81	0.83	0.771	0.879	0.846	0.907
Fb-Reed98	50	0.443	0.449	0.418	0.574	0.382	0.449	0.283	0.334
Fb-politician	50	0.914	0.842	0.871	0.855	0.773	0.901	0.819	0.85
Hamsterster	50	0.392	0.390	0.365	0.374	0.409	0.39	0.407	0.484
HC-BioGrid	50	0.964	0.919	0.921	0.921	0.953	0.923	0.926	0.911
LastFM-asia	50	0.942	0.91	0.9	0.914	0.867	0.912	0.938	0.896
wiki-Vote	50	0.837	0.818	0.809	0.822	0.878	0.836	0.813	0.756

Table 3

Results of average SHII for different methods in various datasets under the linear threshold influence model

		Comparative methods
Datasets	# of Hole spanners	ReHSe	RSH	Constraint	HIS	HAM	MaxD	maxBlockFast	APGreedy
Karate club	3	0.382	0.382	0.414	0.661	0.661	0.382	0.321	0.422
Football	20	0.343	0.329	0.341	0.367	0.288	0.299	0.267	0.285
Email-Eu-core	50	0.833	0.847	0.631	0.788	0.565	0.851	0.8	0.77
Ca-GrQc	50	0.668	0.522	0.509	0.397	0.650	0.404	0.377	0.411
Ca-HepTh	50	0.698	0.615	0.611	0.623	0.551	0.655	0.576	0.576
Fb-Reed98	50	0.427	0.433	0.403	0.536	0.337	0.432	0.257	0.3
Fb-politician	50	0.665	0.651	0.673	0.661	0.597	0.696	0.633	0.657
Hamsterster	50	0.292	0.293	0.268	0.281	0.301	0.290	0.308	0.368
HC-BioGrid	50	0.712	0.536	0.506	0.522	0.665	0.502	0.516	0.462
LastFM-asia	50	0.598	0.59	0.589	0.588	0.657	0.62	0.625	0.591
wiki-Vote	50	0.653	0.582	0.575	0.584	0.625	0.595	0.578	0.538

Tables 2 and 3 present the average structural hole influence index of each method in different networks under the IC and LT models, respectively. We count the number of times each method achieves the best result in a total of 22 evaluations across 11 networks, with ties counted for every tied method. The results show that ReHSe leads 9 times, HIS and APGreedy each lead 4 times, HAM and MaxD each lead 3 times, RSH and Constraint each lead once, and maxBlockFast never leads. It can be observed that the top- $k$ hole spanners found by ReHSe have a stronger influence on information propagation across communities, even compared to the methods HIS, HAM and MaxD, which require ground-truth community information, especially in relatively sparse networks. In addition, although the hole spanners detected by APGreedy have the best SHII in several networks, their Percent-RS is lower in most networks.
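The lead counts can be reproduced directly from the values in Tables 2 and 3. The sketch below copies the SHII rows in table order and counts, for each evaluation, every method that attains the row maximum; treating ties as a lead for each tied method is the assumption under which the counts match.

```python
METHODS = ["ReHSe", "RSH", "Constraint", "HIS", "HAM", "MaxD", "maxBlockFast", "APGreedy"]

IC = [  # SHII rows of Table 2, in table order
    [0.536, 0.536, 0.536, 0.495, 0.501, 0.506, 0.489, 0.457],
    [0.932, 0.916, 0.917, 0.925, 0.907, 0.91, 0.906, 0.91],
    [0.963, 0.967, 0.958, 0.961, 0.956, 0.968, 0.955, 0.959],
    [0.902, 0.889, 0.865, 0.822, 0.857, 0.844, 0.876, 0.933],
    [0.862, 0.821, 0.81, 0.83, 0.771, 0.879, 0.846, 0.907],
    [0.443, 0.449, 0.418, 0.574, 0.382, 0.449, 0.283, 0.334],
    [0.914, 0.842, 0.871, 0.855, 0.773, 0.901, 0.819, 0.85],
    [0.392, 0.390, 0.365, 0.374, 0.409, 0.39, 0.407, 0.484],
    [0.964, 0.919, 0.921, 0.921, 0.953, 0.923, 0.926, 0.911],
    [0.942, 0.91, 0.9, 0.914, 0.867, 0.912, 0.938, 0.896],
    [0.837, 0.818, 0.809, 0.822, 0.878, 0.836, 0.813, 0.756],
]

LT = [  # SHII rows of Table 3, in table order
    [0.382, 0.382, 0.414, 0.661, 0.661, 0.382, 0.321, 0.422],
    [0.343, 0.329, 0.341, 0.367, 0.288, 0.299, 0.267, 0.285],
    [0.833, 0.847, 0.631, 0.788, 0.565, 0.851, 0.8, 0.77],
    [0.668, 0.522, 0.509, 0.397, 0.650, 0.404, 0.377, 0.411],
    [0.698, 0.615, 0.611, 0.623, 0.551, 0.655, 0.576, 0.576],
    [0.427, 0.433, 0.403, 0.536, 0.337, 0.432, 0.257, 0.3],
    [0.665, 0.651, 0.673, 0.661, 0.597, 0.696, 0.633, 0.657],
    [0.292, 0.293, 0.268, 0.281, 0.301, 0.290, 0.308, 0.368],
    [0.712, 0.536, 0.506, 0.522, 0.665, 0.502, 0.516, 0.462],
    [0.598, 0.59, 0.589, 0.588, 0.657, 0.62, 0.625, 0.591],
    [0.653, 0.582, 0.575, 0.584, 0.625, 0.595, 0.578, 0.538],
]

def count_leads(*tables):
    """Count how often each method attains the row maximum; ties credit every tied method."""
    wins = dict.fromkeys(METHODS, 0)
    for table in tables:
        for row in table:
            best = max(row)
            for name, value in zip(METHODS, row):
                if value == best:
                    wins[name] += 1
    return wins

wins = count_leads(IC, LT)
```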

To be more intuitive, we plot the karate club network and label the top-3 hole spanners found by ReHSe, i.e., $S=\{0,33,2\}$ , see Fig. 5, and the top-3 spanners found by other methods are listed in Table 2. We can observe that ReHSe identifies nodes that not only connect with multiple communities, but also have access to more reinforced structural holes. For example, we zoom in on a hole formed by nodes 6 and 31 from different communities bridged by node 0, and it can be seen from the subgraph that nodes 6 and 31 are closely connected with the members in their respective communities, which strengthens the bridge role of node 0. Moreover, although methods RSH and Constraint find the same top-3 spanners as ReHSe in the karate club network, the Percent-RS of the spanners found by them is lower than that of ReHSe when $k$ takes larger values in other networks, see Fig. 4a–g, i.e., the top- $k$ spanners found by methods RSH and Constraint are less reinforced in these networks.

Figure 5.

A visualization of the karate club network, where the ground-truth communities are colored differently, and the shaded nodes are the top-3 hole spanners identified by ReHSe.

In summary, Fig. 4 and Tables 2 and 3 validate our claim in Section 1.2 that the top- $k$ hole spanners identified by the method ReHSe bridge a larger number of reinforced holes, such that more benefits (e.g., greater propagation influence) can be achieved, compared to other benchmark methods.

5.3 Ablation study

We now discuss the impacts of components on the proposed method ReHSe, including the value of embedding order $r$ , the scoring network model, the number of incremental steps, and the correlation alignment. We first investigate the effect of the value of embedding order $r$ on the model performance in the network ca-GrQc with $r=\{1,2,3,4,6\}$ . It can be observed from Fig. 6a that the Percent-RS of the top- $k$ hole spanners found by ReHSe are the highest and almost equal when $r=2$ and $r=3$ (i.e., each node $v_{i}\in V$ with $\bm{y}_{i}=\textit{cat}(\bm{y}_{i}^{(0)},\bm{y}_{i}^{(1st)},\bm{y}_{i}^{(2nd)})$ and $\bm{y}_{i}=\textit{cat}(\bm{y}_{i}^{(0)},\bm{y}_{i}^{(1st)},\bm{y}_{i}^{(2nd)}% ,\bm{y}_{i}^{(3rd)})$ ), among all values of the order $r$ . ReHSe has a lower Percent-RS of hole spanners identified with $r=1$ than that identified with $r=2$ , which may be due to the fact that the idea of ReHSe with $r=1$ is similar to the method Constraint, which measures the extent to which a user has access to many non-redundant neighbors, without considering the reinforcement information. Moreover, when $r=4$ and $r=6$ , although there may be lower Percent-RS due to the limitation of labeled data, they are still comparable to the best two existing methods APGreedy and RSH in the ca-GrQc network.

Figure 6.

Ablation study on the performance of the proposed method ReHSe.

We then study the effect of different scoring models on performance in the ca-GrQc network. We compare two models: the model with fully connected layers, and the model with a multi-head attention mechanism. It can be seen from Fig. 6b that the Percent-RS of the hole spanners found by training with a fully connected network is about 86% higher than that of the spanners detected by training with the attention mechanism. The fully connected layers provide a simple way to capture the dependencies of a node embedding vector on each of its components (i.e., the $r$ -order neighborhood relation embeddings at the corresponding positions). By contrast, although the multi-head attention mechanism helps to obtain richer features, it cannot capture the location information, i.e., the sequential relationship of the components in an embedding vector cannot be learned, thus leading to a performance drop in identification.

Furthermore, Fig. 6c demonstrates that the Percent-RS of the hole spanners identified by ReHSe increases with the number of training networks for incremental learning in the LastFM-asia network. The results validate our claim in Section 4.4 that ReHSe on the one hand adjusts the model parameters by introducing additional information from the new networks, and on the other hand, with the feature distribution changes, it updates the reserved set to continuously improve the accuracy of detecting spanners.

We finally analyze the role of correlation alignment on method ReHSe in the fb-Reed98 network. We reshape the embedding vectors of nodes into a two-dimensional space to observe the changes in the distribution covariances. Figure 7 plots the transformation of the distribution covariances of the training network fb-Reed98 (source) towards the test network football (target). We can see that although the initial source and target domains have different feature distributions, they can align well in the feature space after the transformation. Moreover, compared with the method without alignment, the performance of the identified spanners is improved by about 30% by aligning the training network toward the given test network first at each incremental step, see Fig. 6d.

Figure 7.

An illustration of correlation alignment, where football with 115 nodes and 1,226 edges is taken as the test network, and fb-Reed98 with 962 nodes and 37,624 edges is regarded as the training network.

6. Conclusion

In this paper, we proposed a node embedding-based method ReHSe for identifying reinforced hole spanners in social networks that not only connect to multiple communities, but also access reinforced structural holes. Specifically, we first devised an integrated embedding method to extract features encoding the reinforcement properties of nodes into a low-dimensional space. To improve the robustness and accuracy of identification, we then employed an incremental learning strategy based on a reserved set to train a reinforcement scoring network in this subspace, to find top- $k$ reinforced structural hole spanners. We finally evaluated the performance of the proposed method on eleven real-world datasets. The results showed that the reinforced score of the top- $k$ hole spanners found by ReHSe is up to about 17% higher than that of existing methods.

The theory of structural holes is an important concept for social network analysis, which indicates the social advantages of individuals. In the future, it is worthwhile to further improve the proposed method. For example, because of its matrix operations, ReHSe suffers from expensive computational and memory costs in large-scale social networks with millions of nodes, and may even become inapplicable. In addition, ReHSe only utilizes the network topology to detect hole spanners, and how to introduce the social behavior features of users into ReHSe is also an interesting research point. Moreover, the study of novel methods for identifying hole spanners in dynamic networks would also be of great interest.

Acknowledgments

The work by Feihu Huang was supported by Sichuan Science and Technology Program with grant number 22ZDYF3599. The work by Jian Peng was supported by the National Key Research and Development Program of China with grant number 2020YFB0704502.
