Cross-community shortcut detection based on network representation learning and structural features

Abstract

As social networks continue to expand, an increasing number of people use them to post comments and express their feelings, and the information contained in social networks has grown explosively as a result. Effectively extracting valuable information from social networks has attracted the attention of many researchers, because it can reveal hidden information and promote the development of social network structures. At present, many node-ranking approaches, such as those for structural hole spanners and opinion leaders, are widely adopted to extract valuable information and knowledge. However, approaches for analyzing the influence of edges are seldom considered. In this study, we propose an edge PageRank to mine shortcuts (edges whose endpoints have no direct mutual friends) that are located among communities and play an important role in the spread of public opinion. We first use a network-embedding algorithm to order the spanners and determine the direction of every edge. Then, we transform the graphs of social networks into edge graphs according to this ordering, treating the nodes and edges of a social network graph as the edges and nodes of its edge graph, respectively. Finally, we improve the PageRank algorithm on the edge graph to obtain the edge ranking and extract the shortcuts of social networks.

Experiments on five social networks of different sizes (email, YouTube, DBLP-L, DBLP-M, and DBLP-S) test whether the inferred shortcuts are indeed more useful for information dissemination. We compare the utility of the edge sets inferred by different methods, namely the edges inferred by ER and the edges ranked by the Jaccard index. The ER approach improves by approximately 10%, 9.9%, and 8.3% on DBLP, YouTube, and Orkut, respectively, and is more effective than ranking edges by the Jaccard index.

Keywords

Community detection, link prediction, network embedding, shortcut

1. Introduction

Many complex network systems can be described as graphs, where nodes represent individuals and links represent interactions between individuals. As one of the most intensively studied types of network, social networks play an important role in connecting people with each other and disseminating various types of information. The connections between individuals form the edges of the graph. We usually divide social networks into communities using graph theory, so that the edges and nodes of a social network graph are divided into intra-community edges and nodes and cross-community edges and nodes. Cross-community edges and nodes play an important role in diffusing information among different communities. Many studies have focused on key cross-community nodes, for example through structural hole spanner detection. However, few studies have focused on detecting key cross-community edges, such as shortcuts and bridges. According to Easley and Kleinberg[1], if the two endpoints of an edge do not have a direct common friend, the edge is called a bridge or, in a broad sense, a shortcut. If we remove a shortcut, any other path from one of its endpoints to the other has length at least 3. In Fig. 1, the edge (1, 2) forms a shortcut, but it is not the only path connecting nodes 1 and 2. In addition to the path $1\rightarrow 2$ , there are other paths of length $\geqslant 3$ , such as $1\rightarrow 3\rightarrow 4\rightarrow 5\rightarrow 6\rightarrow 2$ . In fact, structures like this are more common in real networks than bridges. In other words, deleting edge $1\rightarrow 2$ increases the distance from node 1 to node 2, so we call edge (1, 2) a shortcut. The other edges in Fig. 1 are not shortcuts. Chen and Zhang[2] showed that shortcuts play an important role in information dissemination. Finding shortcuts therefore has a significant effect on public opinion research, because shortcuts carry the flow of information dissemination.

Figure 1.

Shortcuts and structural hole spanners among communities C1, C2, and C3. The graph has three communities and three shortcuts. Each community has six nodes. The internal nodes in communities connect with each other, and there are three structural hole spanners and three shortcuts.

Shortcut detection is a typical link prediction task for social networks [3]. Link prediction, one of the most important tasks in social network analysis and mining, studies the formation of missing or new links based on current and historical social networks. It is broadly used in applications such as recommendation[4], pre-warning systems[5], and biomedical discovery[5]. Researchers have extensively studied effective link prediction methods for different types of social networks and different application scenarios. Some simple heuristics, such as common neighbors and the Katz index, work well in practice and scale to large social networks. Other early studies also achieved good performance, including those based on Markov chains [6], probabilistic graphical methods[7, 8], and supervised learning[9]. Recently, studies have predicted links in dynamic social networks[10] and heterogeneous social networks[11].

The two endpoints of a shortcut directly touch two different communities in a social network. This feature makes it possible to obtain information that would otherwise be far away in the network, and we use it to find shortcuts in social networks. In this study, we propose an approach to detect shortcuts in social networks. Our main contributions are as follows.

• We use an improved Q-HAM to detect communities and calculate the H value of each node. In contrast to other node-ranking methods, such as PageRank, the nodes we detect are those that connect communities, rather than nodes with higher influence.

• To apply node-ranking methods to edges, we propose a transformed social network graph.

• We propose edge rank (ER), an edge quantification method that borrows from PageRank. The traditional PageRank algorithm calculates the influence of nodes, and we apply an improved version of it to edges. In existing research on edges, we found that few studies consider shortcuts; we reviewed the relevant literature on weak ties and identified the relationship between weak ties and shortcuts.

• We use a small network to illustrate how to quantify edges. Experiments on finding shortcuts were then performed on three real datasets, comparing against PageRank, HITS, and other methods, and the flow of information dissemination was used to verify the effectiveness of our approach.

The remainder of this paper is organized as follows. Section 2 introduces related work. Section 3 formalizes the problem. In Section 4, we propose an approach for shortcut detection. Section 5 discusses the experimental evaluation. In Section 6, we present our conclusions and directions for future research. Because the shortcut problem in social networks has received little attention so far, few existing results can be used directly for reference. The related works mainly include network embedding for network representation, Q-HAM for determining cross-community edges by using community detection, and the PageRank algorithm for sorting edges.

2. Related works

2.1 Community detection

Community detection, also known as community discovery, is a technique for analyzing aggregation behavior in social networks. It is a form of network clustering, in which a community can be understood as a collection of nodes with similar features.

Community detection is mainly based on the graph-theory division method. Karypis proposed a classic graph-partitioning community detection algorithm[12]. The core idea of the algorithm is to first divide the network randomly to obtain two initial communities, and then use a greedy algorithm to continuously exchange nodes in the two communities until the optimal partition is obtained. However, this method relies too much on the initial random partition, the detection results of different initial communities are different, and the algorithm complexity is high. Newman later proposed the concept of modularity, which has a significant impact on community detection. The researchers then proposed a series of community detection methods based on modularity. This makes the modularity-based method the most mainstream community detection method. Modularity is a metric used to evaluate the stability of network communities, and can also measure the quality of community detection results. For a well-divided community network, modularity can be expressed as

$\displaystyle Q=\frac{1}{2m}\sum_{ij}\left[A_{ij}-\frac{k_{i}k_{j}}{2m}\right]\delta(C_{i},C_{j})$ (1)

where $A$ is the adjacency matrix, and $A_{ij}=1$ indicates that there is an edge between node $i$ and node $j$ . $k_{i}$ represents the degree of node $i$ in the graph, and $m$ is the number of edges. $C_{i}$ denotes the community of node $i$ : $\delta(C_{i},C_{j})=0$ means that nodes $i$ and $j$ are not in the same community, and $\delta(C_{i},C_{j})=1$ means that they are in the same community. The community detection algorithms based on modularity can be divided into three categories:
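As an illustration of Eq. (1), the following minimal Python sketch evaluates modularity for a given partition. The function name `modularity` and the toy graph (two triangles joined by a single edge) are assumptions introduced here for illustration and do not come from the original experiments.

```python
import numpy as np

def modularity(A, labels):
    """Modularity of a partition following Eq. (1); labels[i] is the community of node i."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    k = A.sum(axis=1)                           # node degrees k_i
    two_m = k.sum()                             # 2m, twice the number of edges
    same = labels[:, None] == labels[None, :]   # delta(C_i, C_j) as a boolean matrix
    return float(((A - np.outer(k, k) / two_m) * same).sum() / two_m)

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1

q_split = modularity(A, [0, 0, 0, 1, 1, 1])    # one community per triangle
q_single = modularity(A, [0, 0, 0, 0, 0, 0])   # everything in one community
print(q_split, q_single)
```

On this toy graph the two-triangle partition scores higher than the trivial single-community partition, which is the behavior that modularity-based methods rely on.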

(1) Algorithm based on the idea of aggregation. This method uses a bottom-up approach to discover communities. The main idea is to treat each node as a community at the beginning, examine all neighbors of each node, and then add the neighbor that yields the highest modularity to the community to which the node belongs, until the modularity no longer changes. The most typical methods are the Newman fast algorithm, the CNM algorithm, and so on[13]. However, these methods cannot control the size of the communities, and the sizes of the resulting communities are very uneven. Fan et al.[14] made the first attempt to employ a deep learning technique for attributed multi-view graph clustering, and proposed a novel task-guided One2Multi graph auto-encoder clustering framework.

(2) Classification algorithm. This type of method uses a top-down approach to discover communities. The main idea is that, at the beginning, all nodes are grouped into the same community, and then the edges with high betweenness are successively deleted to split the communities until each node represents a community. The most typical method of this type is the GN algorithm proposed by Newman[15]. However, the time complexity of this type of method is too high, and it is difficult to scale to large networks. Luo et al.[16] proposed a novel random-walk model.

(3) Optimization algorithm. This type of method assumes that the higher the modularity, the better the result of community detection. Therefore, the key to this type of method is to design an objective function for modularity and to optimize it, for example with a greedy algorithm, an ant colony algorithm, or simulated annealing, in order to obtain the optimal modularity.

Raghavan et al.[17] proposed a semi-supervised learning community detection algorithm based on label propagation. The basic idea is to give nodes different labels at the beginning and then use the label propagation algorithm to spread the labels, so that nodes with the same label form a community. In addition, Rosvall et al. introduced information theory for community detection, which can obtain high-quality community division results. Hu [5] developed a novel framework called HeGAN for HIN embedding. Luo et al.[18] proposed a deep learning framework to solve both community and SH detection, and Du et al.[19] improved the harmonic modularity algorithm based on Q-modularity gain, which can infer community partitions and rank SH nodes. We draw on the experience of these predecessors and regard community detection and node ranking as a joint task.

2.2 Harmonic

The main idea of harmonic modularity is that the community indicator vector of a node $i$ , $f_{i}$ is the average vector of the community indicator vector of all its neighbor nodes.

$\displaystyle{\bm{f}}_{i}=\frac{1}{d_{i}}\sum_{j\in\text{neighbor}(i)}{\bm{f}}_{j}$ (2)

For a node whose neighbors are all in the same community, the modulus of its community indicator vector $||f_{i}||_{2}$ should be 1. A structural hole node has some cross-community neighbors, so its community indicator vector is no longer pure, and under the averaged representation its modulus $||f_{i}||_{2}$ can hardly be 1. The more neighbors it has in different communities, the closer $||f_{i}||_{2}$ is to 0. HAM proposes that this kind of representation gives the learned community indicator vectors both coordination and difference, so that community division and structural hole discovery improve each other.
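The effect described above can be checked numerically. The sketch below applies the averaging of Eq. (2) once to one-hot community indicator vectors and reports the resulting norms. The helper name `averaged_indicator`, the toy graph, and the one-hot starting vectors are assumptions for illustration only; HAM learns the vectors jointly rather than fixing them in advance.

```python
import numpy as np

def averaged_indicator(A, F):
    """Row i is the mean of the indicator vectors of node i's neighbors (Eq. (2))."""
    A = np.asarray(A, dtype=float)
    F = np.asarray(F, dtype=float)
    d = A.sum(axis=1, keepdims=True)   # degrees d_i; assumes no isolated nodes
    return (A @ F) / d

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1

# One-hot indicators: nodes 0-2 in the first community, nodes 3-5 in the second.
F = np.array([[1, 0]] * 3 + [[0, 1]] * 3)
norms = np.linalg.norm(averaged_indicator(A, F), axis=1)
print(norms)
```

Nodes 0 and 1 have all neighbors in one community and keep norm 1, whereas node 2, which has one cross-community neighbor, drops below 1.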

Solving this learning problem directly is NP-hard, so HAM introduces $l_{2,1}$ regularization and orthogonal constraints to solve its optimization problem. The optimization problem is therefore transformed into the following formula.

$\displaystyle\min\left(\left\|H-D^{-1}AH\right\|_{2,1}\right),\text{ s.t. }H^{T}H=I_{m}$ (3)

$D$ is the degree matrix and $I_{m}$ is the identity matrix. HAM used the non-negative matrix factorization method in the literature[20] to solve this optimization problem. It then clusters the rows $h_{i}$ of $H$ to divide the nodes into communities and uses $||h_{i}||_{2}$ to judge the structural hole ranking of each node.

Researchers often denote a social network as an undirected graph $G=(V;E)$ , where $V=\{v_{1},v_{2},\cdots,v_{n}\}$ represents the node set of $G$ , and $E\subseteq V\times V$ represents the edge set of $G$ , whose element $e_{ij}=(v_{i},v_{j})$ indicates that node $v_{i}$ is connected to node $v_{j}$ . We denote the adjacency matrix of $G$ as $A=[a_{ij}]$ , where $1\leqslant i,j\leqslant\lvert V\rvert$ ; $a_{ij}=1$ if node $v_{i}$ is connected to node $v_{j}$ by an edge $(v_{i},v_{j})$ , and $a_{ij}=0$ otherwise. $D=[d_{ij}]$ is a diagonal matrix with $d_{ii}=\sum_{j\in\text{neighbor}(i)}a_{ij}$ and $d_{ij}=0$ for $i\neq j$ . The community detection task groups the nodes of the social network into $m$ communities $C=\{C_{1},C_{2},\cdots,C_{m}\}$ such that $V=C_{1}\cup C_{2}\cup\cdots\cup C_{m}$ and $C_{i}\cap C_{j}=\emptyset$ for every pair $(i,j)$ with $i\neq j$ . Next, we use the example in Fig. 1 to clarify the definitions and main properties of nodes.

$\displaystyle h(v_{i})\equiv\frac{1}{d_{i}}\sum_{(v_{i},v_{j})\in E}h(v_{j})$ (4)

where $d_{i}=\sum_{j}a_{ij}$ is the degree of node $v_{i}$ . This harmonic function states that the value $h(v_{i})$ of node $v_{i}$ is equal to the average of the values $h(v_{j})$ of its neighboring nodes. We combine the direct modularity increment (DMI) and indirect modularity increment (IMI) to improve the objective function of HAM.

2.3 PageRank

PageRank was originally used to rank web pages on the Internet [21]. PageRank is essentially an application of the random walk model on a Markov chain: the nodes of the Markov chain are web pages, and each arc is a link that connects one page to another. The walker represents a general web surfer who moves from one page to another with a certain probability according to the network structure and who occasionally “gets bored” and jumps to a random node in the network. The steady-state probability vector of the random walk process holds the PageRank value of each node, which can be used to determine the global ranking.
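The random-surfer description above can be sketched as a short power iteration. This is a minimal illustration rather than the improved algorithm proposed in this paper; the damping factor 0.85, the uniform jump for pages without out-links, and the three-page example graph are assumptions chosen for the example.

```python
import numpy as np

def pagerank(A, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration for PageRank; A[i, j] = 1 if page i links to page j."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    out = A.sum(axis=1)
    # Row-stochastic transition matrix of the walker.
    T = A / np.where(out > 0, out, 1.0)[:, None]
    T[out == 0] = 1.0 / n          # pages without out-links jump uniformly
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # With probability (1 - damping) the surfer "gets bored" and jumps.
        r_new = (1.0 - damping) / n + damping * (r @ T)
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
    return r

# Hypothetical three-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
ranks = pagerank(A)
print(ranks)
```

In this example page 2 receives links from both other pages and obtains the highest steady-state probability.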

We briefly introduce the web ranking indicators that form the foundation of modern search engines. The advantages and disadvantages of the PageRank algorithm have been thoroughly analyzed. Although PageRank was proposed a long time ago, it is still the backbone of multiple technologies, not limited to the web domain. For example, in [22], personalized PageRank is cited as an algorithm that may be used in Twitter’s “following” architecture. In [21], the author showed how the mathematical principles behind PageRank are used in a large number of applications that are not limited to web pages. In [23], another PageRank extension appeared as a temporary replacement for the publication h-index.

Other studies have focused either on providing shorter run times or on adapting PageRank to different fields. For example, Bahmani et al. [24] proposed a Monte Carlo technique to perform fast calculations based on random walk algorithms (such as PageRank). There is at least one distributed version of PageRank [25], which can better handle the ever-increasing number of pages to be ranked. One of the main disadvantages of basic PageRank is that a large link matrix cannot be stored completely in main memory, and the resulting I/O operations slow the computation down; the distributed version of the algorithm addresses this problem. Another method [24] introduced PageRank into the big data world using MapReduce. Other types of optimization methods can be found in surveys [26, 27]. An important extension of PageRank, which runs on trusted network domains [28, 29, 30], is EigenTrust [31]. The TrustRank algorithm was proposed in [32]. Because the application scenario involves a trust network, the entities involved change slightly: web pages are replaced with network nodes, and links are replaced with arcs. However, the basic mathematics remains unchanged. None of these methods applies PageRank to edges. We improved PageRank so that it can be applied to edges.

3. Formalization

Community Detection: Community is an important feature of many social networks. Links within the same community are dense, whereas links between different communities are sparse. Given a network $G=<V,E>$ , the task is to divide all nodes $v_{i}$ into different subsets to obtain the collection of communities $\textit{Com}=\{\textit{Com}_{i}\}^{K}_{i=1}$ , where $\textit{Com}_{i}\subseteq V$ . Community detection methods can be classified as non-overlapping methods, overlapping methods [33], and hierarchical methods [34]. A non-overlapping method outputs Com, where $\textit{Com}_{i}\cap\textit{Com}_{j}=\phi$ for any $i\neq j$ . An overlapping method outputs Com, where $\textit{Com}_{i}\cap\textit{Com}_{j}\neq\phi$ is allowed for $i\neq j$ . The hierarchical method is an iteration of the non-overlapping method, where each $\textit{Com}_{i}$ can be further divided into smaller communities.

Cross-link across non-overlapping communities. Let $C_{N}(v_{i})$ denote the community set of node $v_{i}$ . Here, N denotes a non-overlapping community detection method. Because the communities are non-overlapping, for any $v_{i}\in V,|C_{N}(v_{i})|=1$ .

Definition 1. (bridge edge) $B_{N}(e_{ij})$ denotes whether any given edge $e_{ij}$ is a bridge edge or not, which is defined as:

$\displaystyle B_{N}(e_{ij})=\left\{\begin{array}{ll}1&\text{if}\ C_{N}(v_{i})\neq C_{N}(v_{j})\\ 0&\text{otherwise}\end{array}\right.$ (5)
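Definition 1 translates directly into a small indicator function. In the sketch below, the dictionary `community` plays the role of $C_{N}$ for a non-overlapping partition; the node labels, community names, and edge list are hypothetical and only serve to show the definition at work.

```python
def bridge_indicator(edge, community):
    """B_N(e_ij) from Eq. (5): 1 if the endpoints lie in different communities, else 0."""
    i, j = edge
    return 1 if community[i] != community[j] else 0

# Hypothetical non-overlapping partition and edge list.
community = {1: "C1", 2: "C1", 3: "C1", 4: "C2", 5: "C2", 6: "C2"}
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]

cross_edges = [e for e in edges if bridge_indicator(e, community) == 1]
print(cross_edges)
```

Only the edge whose endpoints fall into different communities is flagged, which is the candidate set from which shortcuts are later ranked.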

3.1 Effectiveness of shortcut

Figure 2.

Examples of edges across communities.

We define a utility function based on the average shortest path to quantify and verify the effectiveness of shortcuts in information dissemination.

The average shortest path distance $\overline{\textit{Dist}}$ is often used to measure the ability of social networks to spread information [35]. In general, the smaller the value of $\overline{\textit{Dist}}$ , the shorter the distances between pairs of nodes in the network, which is more conducive to spreading information. Here, n represents the total number of nodes in the network, and $d(v_{i},v_{j})$ represents the shortest path length between nodes $v_{i}$ and $v_{j}$ .

$\displaystyle\overline{\textit{Dist}}=\sum\limits_{v_{i},v_{j}\in V}\frac{d(v_{i},v_{j})}{n(n-1)}$ (6)

For large-scale networks, the impact of a single edge on information diffusion is too small to be calculated. Therefore, we define the utility function $\Phi(E_{k})$ to measure the rate of change of the shortest path distance after deleting the $k$ edges.

$\displaystyle\Phi\left(E_{k}\right)=\frac{\sum_{e_{ij}\in E-E_{k}}\frac{d\left(v_{i},v_{j}\right)}{n(n-1)}-\sum_{e_{ij}\in E}\frac{d\left(v_{i},v_{j}\right)}{n(n-1)}}{\sum_{e_{ij}\in E}\frac{d\left(v_{i},v_{j}\right)}{n(n-1)}}$ (7)

where $E_{k}$ represents the set of the first $k$ deleted edges, and $E-E_{k}$ represents the network after removing these $k$ edges. Generally, the larger the value of $k$ , the more edges are deleted, and the more significant the rate of change. When the value of $k$ is fixed, the higher the value of $\Phi$ , the greater the effect of the deleted $k$ edges on information dissemination.

In Eq. (7), the numerator represents the increment in the average shortest path distance in the network after the first $k$ edges are deleted. The denominator represents the average shortest path distance of the original network. Therefore, the entire formula is the growth rate of the average shortest path in the network after deleting the first $k$ edges.
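The utility function can be sketched in a few lines of Python. The code follows the verbal description of Eq. (7), that is, the relative growth of the average shortest path distance of Eq. (6) after removing a set of edges. The function names, the use of breadth-first search, and the six-node example (a cycle mirroring the path $1\rightarrow 3\rightarrow 4\rightarrow 5\rightarrow 6\rightarrow 2$ of Fig. 1) are assumptions for illustration; the sketch also assumes that the graph stays connected after the edges are removed.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_distance(nodes, edges):
    """Average shortest path distance over ordered node pairs, as in Eq. (6)."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    n = len(nodes)
    total = 0
    for s in nodes:
        dist = bfs_distances(adj, s)
        if len(dist) < n:
            raise ValueError("graph is disconnected; Eq. (6) is undefined")
        total += sum(dist.values())
    return total / (n * (n - 1))

def utility(nodes, edges, removed):
    """Relative growth of the average distance after deleting the edges in `removed`."""
    removed = {frozenset(e) for e in removed}
    kept = [e for e in edges if frozenset(e) not in removed]
    base = average_distance(nodes, edges)
    return (average_distance(nodes, kept) - base) / base

# Hypothetical example: shortcut (1, 2) plus the detour 1-3-4-5-6-2.
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (1, 3), (3, 4), (4, 5), (5, 6), (6, 2)]
phi = utility(nodes, edges, [(1, 2)])
print(phi)
```

Deleting the single edge (1, 2) turns the cycle into a path and raises the average distance from 9/5 to 7/3, a relative growth of 8/27.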

4. Shortcut detection

In a social network, shortcuts are often weak ties. Strong ties usually form small friend circles or communities, while weak ties gradually connect these communities into large networks [1]. Easley and Kleinberg’s experiments indicate that strong connections require more time to maintain interactions among users. This has a crowding effect on social time, making a person’s social relation network smaller and smaller. Without weak ties, such as shortcuts, the information shared among a group of good friends (strong ties) cannot leave that small friend circle. In other words, information is transmitted from one community to another through weak ties, and weak ties therefore play a very important role in information transmission. Building a large social network requires shortcuts among its communities, so that information can be obtained faster. Bakshy[36] mentioned that people are more likely to share information posted by close friends with whom they interact frequently. However, most of the information people receive comes from others with whom they are not often in contact. It is therefore important to obtain useful information by using weak ties, such as shortcuts. To find shortcuts in social networks, we propose a general framework for shortcut detection, as shown in Fig. 3.

• Construct graph: First, we construct the social network graph from social media, such as QQ, Microblog, WeChat, BBS, etc. For example, for Microblog, we consider the users who post and repost as the nodes of the social network graph. When a user posts and other users repost, we consider the relation between the poster and the reposters as the edges of the social network graph.

• Q-HAM Function: Second, we use Q-HAM to process the social network structure to get the H values of the nodes. The H value of a node reflects its degree of harmony. The H value of nodes connected to different communities is higher, and the H value of nodes connected to the same community is lower. Q-HAM can also detect communities from a social network graph.

• Computing $\Delta H$ : Third, we compute $\Delta H$ of the edges by using the H values of the nodes. The $\Delta H$ value of each edge indicates that information spreads from nodes with a high H value to nodes with a low H value. It plays an important role in determining the direction of edges in the edge graph.

• Transform Graph: Fourth, we transform the social network graph into an edge graph by using the detected communities and the $\Delta H$ of each edge. The nodes and edges in the edge graph are the edges and nodes of the social network graph. The direction in the edge graph is determined according to the $\Delta H$ of the edges, so the edge graph is a directed graph.

• Edge Rank (ER): Finally, we improve the PageRank algorithm to compute the ordering scores of the nodes in the edge graph. The nodes in the edge graph represent the edges of the social network graph, so this ranks the edges in the social network graph. We select the edges with the top-k ordering scores in the social network graph as shortcuts.

Figure 3.

Our proposed framework of shortcut detection.

In Fig. 3, Q-HAM, Transform Graph, and ER are the three most important modules. We compute the harmonic values of the nodes and detect communities using Q-HAM. In the Transform Graph module, we use the resulting H values and communities to transform a social network graph into an edge graph. Traditional PageRank can only sort nodes. In the ER module, an improved PageRank computes the ordering score of every edge in the social network graph. The following is a detailed introduction to the three modules.

4.1 An improved Q-HAM for community detection and computing the H value of each node

Communities are common in social networks. The connections (ties) among nodes in the same community are denser than the connections to nodes outside the community.

Community detection algorithms usually take the maximization of modularity as their objective. However, maximizing modularity cannot distinguish between the external and internal nodes of a community. In addition, the vector representations of internal nodes and their neighbors in the community should be as harmonious as possible: the distribution vectors of intra-community nodes are in harmony, whereas the distribution vectors of intra-community and outer-community nodes are diverse. On this basis, we can order the nodes and detect communities. The objective function of HAM balances the harmony and diversity of the distribution vectors.

ComSHAE [18] analyzed the similarity between HAM and random-walk-based spectral clustering and used it to improve HAM. To overcome the shortcomings of HAM, Luo et al. adopted the direct modularity increment (DMI) and indirect modularity increment (IMI) to improve the objective function of HAM [20], measuring the harmony and diversity between intra-community and outer-community nodes. Because the original problem is NP-hard, we use Eq. (8) instead of Eq. (3), which achieves the same goal.

$\displaystyle\min_{H}\text{Tr}\left(H^{T}\left(I-D^{-1}A\right)^{T}P\left(I-D^{-1}A\right)H\right),\text{ s.t. }H^{T}H=I_{m}$ (8)

where $A$ is the social network adjacency matrix, $D$ is the social network degree matrix, $I$ is the identity matrix with the same dimensions as $A$ , and $H$ is the community distribution matrix. Subject to $H^{T}H=I_{m}$ , our goal is Eq. (8). Let $L_{P}=\left(I-D^{-1}A\right)^{T}P\left(I-D^{-1}A\right)$ and $P=\textit{Diag}(p_{i})$ , where $p$ is an auxiliary vector of the $l_{2,1}$ norm. $\text{Tr}\left(H^{T}L_{P}H\right)$ denotes the trace of the matrix $H^{T}L_{P}H$ , so Eq. (8) minimizes $\text{Tr}\left(H^{T}L_{P}H\right)$ . $D^{-1}A$ is the random-walk transition matrix, and $(I-D^{-1}A)$ is the random-walk normalized Laplacian matrix.

We adjust the value of H to minimize the objective function. DMI measures a reasonable degree of difference between a node and its neighbors.

$\displaystyle\textit{DMI}\left(v_{i}\right)=\frac{1}{d_{i}}\sum_{j\in\text{neighbor}(i)}\left(A_{ij}-\frac{d_{i}d_{j}}{m_{e}}\right).$ (9)

where $d_{i}$ represents the degree of node $v_{i}$ , and $m_{e}$ is the number of edges in the social network graph.

IMI considers that the second-order neighbors of node $v_{i}$ , that is, the neighbors of its neighbors, also affect the detection of the communities of the neighbors of $v_{i}$ .

$\displaystyle\text{IMI}\left(v_{i}\right)=\sum_{j\in Nb\left(v_{i}\right)}\sum_{k\in\text{Neighbor}\left(v_{j}\right)}r_{k}\left(1-\frac{d_{j}d_{k}}{2m_{e}}\right).$ (10)

where $r_{k}$ represents the probability of staying in the second-order neighbor $v_{k}$ after taking the node $v_{i}$ as the starting point for two random walks.

The random walk normalized Laplacian $L_{\textit{asym}}=D^{-1}L$ is one of the commonly used Laplacian matrices, alongside the combinatorial Laplacian $L=D-A$ and the symmetric normalized Laplacian $L_{\textit{sym}}=D^{-1/2}LD^{-1/2}$ . The matrix $L_{P}$ that we build from it is symmetric and can therefore be eigen-decomposed.

$\displaystyle\min_{H}\text{Tr}\left(H^{T}L_{P}H\right),\text{ s.t. }H^{T}H=I_{m}$ (11)

The optimal solution for Eq. (11) can be calculated by solving the eigenvector problem of the matrix.
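For a fixed $P$ , the minimizer of a trace objective under an orthonormality constraint is given by the eigenvectors belonging to the $m$ smallest eigenvalues of $L_{P}$ . The sketch below shows this single step with NumPy. The function name `solve_H`, the choice $P=I$ , and the toy graph are assumptions for illustration and do not reproduce the full iterative procedure.

```python
import numpy as np

def solve_H(A, p, m):
    """Minimize Tr(H^T L_P H) s.t. H^T H = I_m for a fixed auxiliary vector p."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    d = A.sum(axis=1)                        # degrees; assumes no isolated nodes
    L_asym = np.eye(n) - A / d[:, None]      # I - D^{-1} A
    L_P = L_asym.T @ np.diag(p) @ L_asym     # symmetric by construction
    eigvals, eigvecs = np.linalg.eigh(L_P)   # eigenvalues in ascending order
    return eigvecs[:, :m], eigvals[:m]

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1

H, smallest = solve_H(A, p=np.ones(6), m=2)
print(np.linalg.norm(H, axis=1))   # row norms ||h_i||_2
```

The returned columns are orthonormal, so the constraint $H^{T}H=I_{m}$ holds, and the smallest eigenvalue is zero because the constant vector lies in the null space of $I-D^{-1}A$ .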

The elements $p_{i}$ of $P$ are computed as follows:

$\displaystyle p_{i}=\frac{1}{2\sqrt{\left\|\mathbf{h}_{i}-\frac{1}{d_{i}}\sum_{\left(v_{i},v_{j}\right)\in\mathbf{E}}\mathbf{h}_{j}\right\|_{2}^{2}+\varepsilon}}.$ (12)

where $h_{i},h_{j}\in H$ , the sum runs over the edges $(v_{i},v_{j})\in\mathbf{E}$ that connect node $v_{i}$ to its neighbors, and $\left\|\mathbf{h}_{i}-\frac{1}{d_{i}}\sum_{\left(v_{i},v_{j}\right)\in\mathbf{E}}\mathbf{h}_{j}\right\|_{2}^{2}$ reflects the harmony and difference between node $v_{i}$ and its neighbors $v_{j}$ in detecting communities. $\varepsilon$ represents a small constant value that prevents the denominator from being 0.
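The update of Eq. (12) can be written compactly with matrix operations. The following sketch is an illustration under stated assumptions: the function name `update_p`, the value `eps=1e-8`, the toy graph, and the one-hot matrix used as $H$ are all hypothetical.

```python
import numpy as np

def update_p(A, H, eps=1e-8):
    """Auxiliary weights p_i of Eq. (12) for the current matrix H."""
    A = np.asarray(A, dtype=float)
    H = np.asarray(H, dtype=float)
    d = A.sum(axis=1, keepdims=True)       # degrees d_i; assumes no isolated nodes
    residual = H - (A @ H) / d             # h_i minus the mean of its neighbors
    sq_norm = (residual ** 2).sum(axis=1)  # squared l2 norm per node
    return 1.0 / (2.0 * np.sqrt(sq_norm + eps))

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1

H = np.array([[1, 0]] * 3 + [[0, 1]] * 3, dtype=float)
p = update_p(A, H)
print(p)
```

Nodes that agree perfectly with their neighbors receive a very large weight that is bounded only by $\varepsilon$ , whereas the two boundary nodes receive a much smaller weight.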

We find the optimal solution by calculating the eigenvectors of the matrix $\mathbf{R}$ given by Eq. (13).

$\displaystyle\mathbf{R}=H^{T}\left(I-D^{-1}A\right)^{T}P\left(I-D^{-1}A\right)H$ (13)

We select a node $x$ as the cluster center and calculate the distances from all nodes to $x$ using Eq. (14).

$\displaystyle E=\sum_{i=1}^{c}\sum_{x\in C_{i}}\left\|x-\mu_{i}\right\|_{2}^{2}$ (14)

The above steps are summarized in Algorithm 4.1.

In Algorithm 4.1, steps 2-8 calculate the $H$ values of the nodes from harmony and diversity and compute the auxiliary matrix $P$ . Using the eigenvectors of matrix $R$ places this step in the family of spectral clustering approaches. We obtain $H$ by minimizing the objective function in step 6. Steps 2-8 are repeated to compute $H$ from iteration $t$ to $t+1$ until convergence. Steps 10-15 cluster $H$ via K-means to obtain the communities of the social network. We select $k$ samples as the initial cluster centers. For each sample $x$ in the dataset, we calculate the distance from $x$ to the c cluster centers and assign $x$ to the cluster whose center is closest. For each cluster, we then recalculate the cluster center. We repeat these two steps until a stop condition is reached.

Algorithm 4.1 Q-HAM.
Input: Adjacency matrix $A$ ; number of communities c.
Output: Communities c; value H of nodes.
1: Initialize: compute the asymmetric Laplacian matrix $L_{\textit{asym}}=D^{-1}L$ .
2: while not converged do
3: Set $P_{i}$ by Eq. (12).
4: Compute $R_{i}$ by Eq. (13).
5: Compute the objective matrix $L_{p}=L^{T}_{\textit{asym}}PL_{\textit{asym}}$ .
6: Compute $H_{i+1}$ corresponding to the first m smallest eigenvalues.
7: $i=i+1$
8: end while
9: Sort the nodes in ascending order of $\left\|H_{i}\right\|_{2}$ via Eq. (11).
10: Select a sample as the cluster center.
11: Calculate the shortest distance between each sample and the currently existing cluster centers by Eq. (14).
12: Calculate the probability of each sample being selected as the next cluster center.
13: Select c cluster centers.
14: Get the communities list.

To illustrate our algorithm better, we consider Fig. 4 as an example.

Figure 4.

Example of social network graph.

The number of communities is two. The adjacency matrix $A$ , inverse degree matrix $D^{-1}$ , and Laplacian matrix $L$ are as follows:

$\displaystyle A=\left(\begin{array}{cccccc}0&1&1&0&0&0\\ 1&0&1&0&0&0\\ 1&1&0&1&0&0\\ 0&0&1&0&1&1\\ 0&0&0&1&0&1\\ 0&0&0&1&1&0\\ \end{array}\right),\quad D^{-1}=\left(\begin{array}{cccccc}0.5&0&0&0&0&0\\ 0&0.5&0&0&0&0\\ 0&0&0.33&0&0&0\\ 0&0&0&0.33&0&0\\ 0&0&0&0&0.5&0\\ 0&0&0&0&0&0.5\\ \end{array}\right)$

$\displaystyle L=\left(\begin{array}{cccccc}0.5&1&1&0&0&0\\ 1&0.5&1&0&0&0\\ 1&1&0.33&1&0&0\\ 0&0&1&0.33&1&1\\ 0&0&0&1&0.5&1\\ 0&0&0&1&1&0.5\\ \end{array}\right)$

For the example in Fig. 4, the DMI values of the six nodes computed using Eq. (9) are 9.02914483, 9.02914483, 8.57158047, 8.57158047, 9.02914483, and 9.02914483. The IMI values of the six nodes computed using Eq. (10) are 9.80953963, 9.80953963, 9.44462166, 9.44462166, 9.80953963, and 9.80953963. The auxiliary matrix $P$ is

$\displaystyle P=\left(\begin{array}[]{cccccc}20.2541&0&0&0&0&0\\ 0&20.2541&0&0&0&0\\ 0&0&19.8892&0&0&0\\ 0&0&0&19.8892&0&0\\ 0&0&0&0&20.2541&0\\ 0&0&0&0&0&20.2541\\ \end{array}\right)$

We then transform the Laplacian matrix by $L_{p}=L^{T}_{\textit{asym}}PL^{T}_{\textit{asym}}$ :

$\displaystyle L_{p}=\left(\begin{array}[]{cccccc}1.4051&-0.9400&-0.5674&0.1022&0&0\\ -0.9400&1.4051&-0.5674&0.1022&0&0\\ -0.5674&-0.5674&1.5441&-0.6137&0.1022&0.1022\\ 0.1022&0.1022&-0.6137&1.5441&-0.5674&-0.5674\\ 0&0&0.1022&-0.5674&1.4051&-0.9400\\ 0&0&0.1022&-0.5674&-0.9400&1.4051\\ \end{array}\right)$

We repeat the computation of the $H$ value of every node, using the eigenvectors of the $m=2$ smallest eigenvalues, until convergence. Algorithm 4.1 finally outputs the $H$ values of every node in Fig. 4 as follows:

$\displaystyle H=\left[0.6123\ \ 0.6123\ \ 0.5000\ \ 0.5000\ \ 0.6123\ \ 0.6123\right]$

We obtain a vector corresponding to the community number for every node: 0, 0, 0, 1, 1, and 1.
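The two-community split of this example can also be reproduced with a much simpler stand-in. The sketch below is standard spectral bisection on the sign of the Fiedler vector, not Q-HAM itself, and it assumes the same reading of Fig. 4 as two triangles joined by one edge (nodes are 0-based here).

```python
import numpy as np

# Assumed edge list for Fig. 4, 0-based node labels.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
L = np.diag(A.sum(axis=1)) - A     # combinatorial Laplacian D - A

# eigh returns eigenvalues of a symmetric matrix in ascending order.
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]            # eigenvector of the second-smallest eigenvalue
if fiedler[0] > 0:                 # eigenvector sign is arbitrary; make node 0 label 0
    fiedler = -fiedler
labels = (fiedler > 0).astype(int)
print(labels)
```

Under these assumptions the labels agree with the community vector reported above, which suggests the example graph has a single clear cut at the edge joining the two triangles.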

4.2 Transfer social network graph to edge graph

At present, many node-ranking tasks, such as detecting structural hole spanners and opinion leaders, are widely adopted to extract valuable information and knowledge. However, approaches for analyzing edge influence are seldom considered. In this paper, inspired by the PageRank algorithm, we propose an edge PageRank to mine shortcuts (edges whose endpoints have no direct mutual friends) that are located among communities and play an important role in the spread of public opinion. In the PageRank algorithm, information is disseminated in one direction, as indicated by the directed edges, and the ranking score of every node is computed from its in-degree and out-degree. In the Q-HAM model, the center vector of a community in the social network graph is the average of all its node vectors. The $H$ value of a node reflects the importance of that node in the social network. The $H$ values of two neighboring nodes determine the direction of information flow in the social network graph. We transfer the social network graph into an edge graph according to this direction. We consider the nodes and edges of the social network graph as the edges and nodes of the edge graph, respectively. Figure 5 illustrates the construction of the edge graph.

Figure 5.

Mapping the social network graph into the edge graph.

Definition 2. (Edge graph). Given a social network graph $G=<V,E>$ , its edge graph is a directed graph $G_{e}=<V_{e},E_{e}>$ , where the edge set $E$ of $G$ constructs the node set $V_{e}$ of $G_{e}$ , $V_{e}=\{f(E)\}$ , and $f$ is a one-to-one map function from $E$ to $V_{e}$ . The node set $V$ of $G$ constructs the edge set $E_{e}$ of $G_{e}$ , $E_{e}=\{g(V)\}$ , and $g$ is a one-to-many map function from $V$ to $E_{e}$ . Let $d_{e}$ be an edge in $E_{e}$ ; we determine the direction of $d_{e}$ in $G_{e}$ by using $H(A,B)$ in Eq. (15).

Figure 6.

The example of edge direction.

Let $d_{e}$ be an edge in the edge graph $G_{e}=<V_{e},E_{e}>$ . We can find the corresponding node $b\in V$ of $d_{e}$ in the given social network graph $G=<V,E>$ . Node $b$ has at least two incident edges $(a,b)\in E$ and $(b,c)\in E$ . Let $A\in V_{e}$ and $B\in V_{e}$ be the two endpoints of $d_{e}$ in $G_{e}$ . We map the two edges $(a,b)$ and $(b,c)$ in $G$ into the two nodes $A$ and $B$ in $G_{e}$ , that is, $f((a,b))=A$ , $f((b,c))=B$ , and $g(b)=d_{e}$ .

$\displaystyle\begin{aligned}H(A)&=h(a)-h(b)\\ H(B)&=h(b)-h(c)\\ H(A,B)&=H(A)-H(B)\end{aligned}$ (15)

where $H(A)$ represents the potential energy of node $A$ in $G_{e}$ , and $H(A)-H(B)$ represents the potential energy difference between the two nodes $A$ and $B$ of $G_{e}$ (two edges of $G$ ). This difference gives the direction of the information flow along edge $(A,B)$ in the edge graph $G_{e}$ . When $H(A,B)>0$ , the edge $d_{e}$ is directed from node $A$ to node $B$ in $G_{e}$ ; when $H(A,B)<0$ , $d_{e}$ is directed from node $B$ to node $A$ ; when $H(A,B)=0$ , $d_{e}$ is bidirectional, from $A$ to $B$ and from $B$ to $A$ . We summarize the above in the following steps:

First, we construct a new directed graph $G_{e}(V_{e},E_{e})$ . Each edge of $G$ becomes a node of $G_{e}$ by $V_{e}=\{f(E)\}$ , and each node of $G$ yields edges of $G_{e}$ by $E_{e}=\{g(V)\}$ . The direction of each edge is determined by the value of $H(A,B)$ , from high to low potential energy.

Second, we use $A_{e}[i][j]=1$ , with $i,j\in V_{e}$ , to indicate a directed edge from node $i$ to node $j$ ; $A_{e}[i][j]=0$ indicates that there is no directed edge from node $i$ to node $j$ .

Finally, we obtain the adjacency matrix $A_{e}$ . Summing each row of $A_{e}$ gives the out-degree of each node, and summing each column gives the in-degree of each node.

In summary, we obtain Algorithm 4.2:

Algorithm: Social network graph converted to the edge graph
Input: The social network graph $G$ ; the Q-HAM value $h(n)$ of each node.
Output: Edge graph $G_{e}$ .
Initialize the adjacency matrix $A_{e}\Leftarrow\mathbf{0}$ , $A_{e}\in\mathbb{R}^{n\times n}$ .
Get $m$ . ( $m$ is the number of edges of $G$ .)
Get $H$ by using Eq. (15). ( $H$ is the HAM of each edge.)
for each edge $e\in E$ do
    $V_{e}\Leftarrow V_{e}\cup f(e)$
end for
for each node $v\in V$ do
    $E_{e}\Leftarrow E_{e}\cup g(v)$
end for
for $i\in V_{e}$ , $i<m$ do
    for $j\in\textit{neighbor}(i)$ , $j<\textit{length}(\textit{neighbor}(i))$ do
        if $H(i)\geqslant H(j)$ then
            Set $A_{e}[i][j]=1$ . (The direction is from edge $i$ to edge $j$ .)
        else
            Set $A_{e}[j][i]=1$ . (The direction is from edge $j$ to edge $i$ .)
        end if
    end for
end for
return $A_{e},G_{e}$
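The conversion can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function `build_edge_graph` is hypothetical, the per-edge potential is passed in as a precomputed dictionary `edge_score` instead of being derived from Eq. (15), and equal scores produce edges in both directions, as described in the text. The scores used in the usage example are made up.

```python
from itertools import combinations
import numpy as np

def build_edge_graph(edges, edge_score):
    """Return (nodes, A_e): the nodes of the edge graph and its directed adjacency matrix."""
    nodes = list(edges)                       # f: each edge of G becomes a node of G_e
    index = {e: i for i, e in enumerate(nodes)}
    incident = {}                             # node of G -> edges of G that touch it
    for e in nodes:
        for v in e:
            incident.setdefault(v, []).append(e)
    A_e = np.zeros((len(nodes), len(nodes)), dtype=int)
    for touching in incident.values():        # g: each node of G yields edges of G_e
        for e1, e2 in combinations(touching, 2):
            i, j = index[e1], index[e2]
            if edge_score[e1] >= edge_score[e2]:
                A_e[i, j] = 1                 # flow from higher (or equal) potential
            if edge_score[e2] >= edge_score[e1]:
                A_e[j, i] = 1                 # equal scores give both directions
    return nodes, A_e

edges = [("a", "c"), ("a", "b"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]
# Hypothetical scores: every edge has potential 1 except (c, d), which has 0.
score = {e: 1.0 for e in edges}
score[("c", "d")] = 0.0
nodes, A_e = build_edge_graph(edges, score)
out_degree = A_e.sum(axis=1)   # row sums
in_degree = A_e.sum(axis=0)    # column sums
print(out_degree, in_degree)
```

With these made-up scores, the node for $(c,d)$ receives four incoming edges and has none going out, while every other pair of adjacent edges is linked in both directions.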

To illustrate the proposed algorithm better, we built a six-node social network graph. The six nodes $a$ , $b$ , $c$ , $d$ , $e$ , and $f$ of the social network graph in Fig. 5 are shown in blue. The seven edges $(a,b)$ , $(b,c)$ , $(a,c)$ , $(c,d)$ , $(d,e)$ , $(d,f)$ , and $(e,f)$ are shown in red. We converted the social network graph into the edge graph in Fig. 5. Its seven nodes $A$ , $B$ , $C$ , $D$ , $E$ , $F$ , and $G$ , shown in red, correspond to $(a,c)$ , $(a,b)$ , $(b,c)$ , $(c,d)$ , $(d,e)$ , $(d,f)$ , and $(e,f)$ , respectively. Table 1 lists the correspondence between the nodes and edges.

Table 1

Social network graph mapping into edge graph

Edges in social network graph mapping into nodes in edge graph		Nodes in social network graph mapping into edges in edge graph
$E$	$V_{e}$	$V$	$E_{e}$
$(a,c)$	$A$	$a$	$(A,B)$
$(a,b)$	$B$	$b$	$(B,C)$
$(b,c)$	$C$	$c$	$(A,C),(A,D),(C,D)$
$(c,d)$	$D$	$d$	$(D,E),(D,F),(E,F)$
$(d,e)$	$E$	$e$	$(E,G)$
$(d,f)$	$F$	$f$	$(F,G)$
$(e,f)$	$G$

4.3 Edge Rank (ER) for shortcuts detection

Shortcuts often play an important role in the community structure. A single bridge between two communities often carries a large amount of information dissemination. However, when communities are closely connected, each connection between them carries only a small amount of information dissemination. Therefore, in social networks with many communities, we look for shortcuts among the edges with the highest or the lowest importance.

To solve the problem of ranking important pages on the Internet, Sergey Brin and Larry Page proposed the PageRank algorithm [37], which evaluates the importance of web pages by using the hyperlink relationships between pages. That is, the more pages link to a page, the more important that page is. We use the idea of PageRank to rank the nodes in the edge graph and discover shortcuts in the social network graph. The importance $ER(v_{e}^{i})$ of node $v_{e}^{i}$ is computed iteratively using Eq. (16):

$\displaystyle\text{ER}\left(v_{e}^{i}\right)^{t+1}=\frac{1-\alpha}{N}+\alpha\sum_{v_{e}^{j}}\frac{\text{ER}\left(v_{e}^{j}\right)^{t}}{L\left(v_{e}^{j}\right)}$ (16)

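where $v_{e}^{i},v_{e}^{j}\in V_{e}$ , the sum runs over the nodes $v_{e}^{j}$ that point to $v_{e}^{i}$ , $L(v_{e}^{j})$ is the out-degree of node $v_{e}^{j}$ , $N$ is the number of nodes of the edge graph (that is, the number of edges of the social network graph), and $\alpha$ is the damping factor (0.85).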

Eq. (16) is the foundation for finding shortcuts in a social network graph. It evaluates the importance of the nodes in the edge graph of a social network graph. We denote the importance of node $v_{e}^{i}$ in this directed graph as $ER(v_{e}^{i})$ . The importance of every node in the edge graph is computed iteratively until the values change only slightly. Because the nodes in the edge graph represent the edges of the social network, the importance of every node in the edge graph can be restored as the importance of the corresponding edge in the social network. Figure 7 gives an example of restoring the importance of every edge. $f^{-1}$ and $g^{-1}$ are the inverses of the mappings $f$ and $g$ used in transferring the social network graph to the edge graph: $f^{-1}$ maps from $V_{e}$ to $E$ , and $g^{-1}$ maps from $E_{e}$ to $V$ . The importance of every node in the edge graph thus becomes the importance of every edge in the social network graph. The shortcuts of social networks are the edges with maximal or minimal importance.

The ER algorithm can efficiently and accurately identify the most influential edges in a social network graph to discover shortcuts in the social network graph.

Based on the above, we conclude the main idea of shortcut detection in the social network in the following steps:

First, we use the adjacency matrix $A_{e}$ to represent the relationship between the edges of the social network graph, which are the nodes of the edge graph. If there is a directed edge from node $i$ to node $j$ in the edge graph, then $A_{e}[i][j]=1$ ; otherwise, $A_{e}[i][j]=0$ . If the total number of edges of the social network graph is $N$ , then $A_{e}\in\mathbb{R}^{N\times N}$ . We divide each row by the number of non-zero entries in that row, which is the out-degree of the corresponding node. The result records the probability of each node disseminating information to other nodes; that is, the value in row $i$ and column $j$ is the probability that the information goes from node $i$ to node $j$ . We then iterate Eq. (16) until two successive results are approximately the same, that is, until $\left\|\textit{ER}_{t+1}-\textit{ER}_{t}\right\|_{1}\leqslant\epsilon$ . The final ER gives the ER value of each node in the edge graph.

Second, we restore the edge graph $G_{e}$ into the social network graph $G$ and assign an importance value to each edge of the social network graph. We map the nodes $V_{e}$ of the edge graph $G_{e}$ to the edges $E$ of the social network graph $G$ by using the map function $E=f^{-1}(V_{e})$ , and we map the edges $E_{e}$ of $G_{e}$ to the nodes $V$ of $G$ by using the map function $V=g^{-1}(E_{e})$ . We assign the importance $\textit{ER}(v_{e})$ of node $v_{e}\in V_{e}$ in the edge graph to the corresponding edge $e\in E$ of the social network $G$ , so that edge $e$ has importance $\textit{ER}(e)$ .

Finally, we sort the edges by their ER values. We consider the shortcuts of the social network to be among the edges with the lowest or the highest importance. However, the top-ranked edges are not necessarily shortcuts. We therefore filter them according to the definition of a shortcut: the two endpoints have no common neighbor (the neighborhood overlap is 0).

In summary, we obtain Algorithm 4.3:

Algorithm: Edge Rank
Input: Edge graph $G_{e}$ ; adjacency matrix $A_{e}$ .
Output: Edge Rank value of each edge $er\in\textit{ER}$ .
Initialization: $P_{0}\leftarrow\frac{e}{n}$ .
Initialize the in-degree and out-degree relationship matrices.
for $i\in G_{e}$ , $i<\textit{length}(V_{e})$ do
    for $j\in\textit{neighbor}(i)$ , $j<\textit{length}(\textit{neighbor}(i))$ do
        $\textit{ER}_{ij}=\frac{1}{\textit{length}(\textit{neighbor}(i))}$
    end for
end for
$t\leftarrow 1$
repeat
    $\text{ER}\left(v_{e}^{i}\right)^{t+1}\Leftarrow\text{ER}\left(v_{e}^{i}\right)^{t}$ by using Eq. (16)
until $\left\|\textit{ER}^{t+1}-\textit{ER}^{t}\right\|_{1}\leqslant\epsilon$
for each edge $e_{e}\in E_{e}$ do
    $V\Leftarrow V\cup g^{-1}(e_{e})$
end for
for each node $v_{e}\in V_{e}$ do
    $E\Leftarrow E\cup f^{-1}(v_{e})$
    $\textit{ER}(e)\Leftarrow\textit{ER}(v_{e})$ (Map the importance $ER(v_{e})$ of node $v_{e}\in V_{e}$ in the edge graph to the edge $e\in E$ of the social network $G$ .)
end for
Sort the edges by $\textit{ER}(e)$ and choose the top-K (maximal or minimal importance) edges as the shortcuts, restricted to edges whose endpoints have no common neighbors.
return $\textit{ER}_{e}$ , shortcuts list
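The final ranking and filtering step can be sketched as below. The helper name `detect_shortcuts` and its `largest` flag are hypothetical; the ER values are the ones reported for the worked example in Table 2, and the filter keeps only edges whose endpoints share no neighbor.

```python
def detect_shortcuts(edges, er, top_k, largest=True):
    """Rank edges by ER and keep those whose endpoints share no neighbor."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    # Highest ER first when largest=True, lowest ER first otherwise.
    ranked = sorted(edges, key=lambda e: er[e], reverse=largest)
    shortcuts = []
    for u, v in ranked[:top_k]:
        if not (neighbors[u] & neighbors[v]):   # neighborhood overlap is 0
            shortcuts.append((u, v))
    return shortcuts

edges = [("a", "c"), ("a", "b"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]
# ER values of the worked example (Table 2), in the same order as `edges`.
er = dict(zip(edges, [0.1375, 0.1239, 0.1375, 0.2018, 0.1375, 0.1375, 0.1239]))
print(detect_shortcuts(edges, er, top_k=3))
```

In this small graph every edge except $(c,d)$ lies inside a triangle, so $(c,d)$ is the only edge that survives the filter regardless of `top_k`.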

Figure 7.

Examples of the importance of edges in social network.

In this example, the edge graph has only seven nodes. If $A$ is the disseminator of information, the information jumps to $B$ , $C$ , and $D$ with a probability of $\frac{1}{3}$ each, because $A$ has three outgoing edges. In general, if a node has $k$ outgoing edges, the probability of jumping along any one of them is $\frac{1}{k}$ . Node $D$ has no outgoing edges, so the probability of jumping from $D$ to $C$ is 0. The jump probabilities are usually expressed by a transition matrix. If $N$ denotes the number of nodes, the transition matrix $S$ is an $N\times N$ square matrix: if node $j$ has out-degree $k$ , then $S[i][j]=\frac{1}{k}$ for each node $i$ that $j$ points to, and $S[i][j]=0$ for all other nodes. The transition matrix corresponding to the above example is as follows:

$\displaystyle S=\left(\begin{array}[]{ccccccc}0&\frac{1}{2}&\frac{1}{3}&0&0&0&0\\ \frac{1}{3}&0&\frac{1}{3}&0&0&0&0\\ \frac{1}{3}&\frac{1}{2}&0&0&0&0&0\\ \frac{1}{3}&0&\frac{1}{3}&0&\frac{1}{3}&\frac{1}{3}&0\\ 0&0&0&0&0&\frac{1}{3}&\frac{1}{2}\\ 0&0&0&0&\frac{1}{3}&0&\frac{1}{2}\\ 0&0&0&0&\frac{1}{3}&\frac{1}{3}&0\\ \end{array}\right)$

In the initial condition, we suppose that every node has equal importance, $\frac{1}{N}$ , so the initial probability distribution is an $N$ -dimensional column vector $\textit{ER}_{0}$ whose entries are all $\frac{1}{N}$ . Multiplying the transition matrix $S$ by $\textit{ER}_{0}$ gives the probability distribution vector $\textit{ER}_{1}$ after the first step; the product of an $N\times N$ matrix and an $N\times 1$ vector is again an $N\times 1$ vector. The calculation of $\textit{ER}_{1}$ is as follows:

$\displaystyle\textit{ER}_{1}=\left(0.85\left[\begin{array}[]{ccccccc}0&\frac{1}{2}&\frac{1}{3}&0&0&0&0\\ \frac{1}{3}&0&\frac{1}{3}&0&0&0&0\\ \frac{1}{3}&\frac{1}{2}&0&0&0&0&0\\ \frac{1}{3}&0&\frac{1}{3}&0&\frac{1}{3}&\frac{1}{3}&0\\ 0&0&0&0&0&\frac{1}{3}&\frac{1}{2}\\ 0&0&0&0&\frac{1}{3}&0&\frac{1}{2}\\ 0&0&0&0&\frac{1}{3}&\frac{1}{3}&0\\ \end{array}\right]+0.15\times\frac{1}{7}\times\vec{e}\vec{e}^{T}\right)\left[\begin{array}[]{c}\frac{1}{7}\\ \frac{1}{7}\\ \frac{1}{7}\\ \frac{1}{7}\\ \frac{1}{7}\\ \frac{1}{7}\\ \frac{1}{7}\\ \end{array}\right]$

A non-zero entry $S[i][j]$ in matrix $S$ means that there is an edge pointing from $j$ to $i$ . Multiplying the first row of $S$ by $\textit{ER}_{0}$ accumulates the probability that flows from all nodes into node $A$ ; for node $A$ this value converges to 0.1375. After obtaining $\textit{ER}_{1}$ , we multiply the same matrix by $\textit{ER}_{1}$ to obtain $\textit{ER}_{2}$ , and so on. Iterating Eq. (16) in this way, ER finally converges; in the example above, $\textit{ER}=[0.1375,0.1239,0.1375,0.2018,0.1375,0.1375,0.1239]^{T}$ . The successive iterates are:

$\displaystyle\left[\begin{array}[]{c}0.1399\\ 0.1197\\ 0.1399\\ 0.2006\\ 0.1399\\ 0.1399\\ 0.1197\\ \end{array}\right]\rightarrow\left[\begin{array}[]{c}0.1363\\ 0.1251\\ 0.1363\\ 0.2044\\ 0.1363\\ 0.1363\\ 0.1251\\ \end{array}\right]\rightarrow\left[\begin{array}[]{c}0.1380\\ 0.1235\\ 0.1380\\ 0.2007\\ 0.1380\\ 0.1380\\ 0.1235\\ \end{array}\right]\rightarrow\left[\begin{array}[]{c}0.1374\\ 0.1240\\ 0.1374\\ 0.2022\\ 0.1374\\ 0.1374\\ 0.1240\\ \end{array}\right]\rightarrow\left[\begin{array}[]{c}0.1375\\ 0.1239\\ 0.1375\\ 0.2018\\ 0.1375\\ 0.1375\\ 0.1239\\ \end{array}\right]$

We obtain Table 2 from the above calculation. Clearly, edge $(c,d)$ has the highest ER value. This edge is responsible for information dissemination between the two communities and is an important bridge between them. Edge $(c,d)$ can therefore be seen as a shortcut between the two communities.

Table 2

Example $p_{e}$ of edges

Node	A	B	C	D	E	F	G
Edge	(a, c)	(a, b)	(b, c)	(c, d)	(d, e)	(d, f)	(e, f)
$p_{e}$	0.1375	0.1239	0.1375	0.2018	0.1375	0.1375	0.1239
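The iteration behind Table 2 can be sketched with a short power iteration for Eq. (16). The function `edge_rank` is a hypothetical helper, and one assumption goes beyond what the text states: a node without outgoing edges (node $D$ here) is assumed to spread its score uniformly over all nodes, a common PageRank convention. Under that assumption the sketch agrees with the iterates shown above to about three decimal places.

```python
import numpy as np

def edge_rank(A_e, alpha=0.85, eps=1e-10, max_iter=1000):
    """Power iteration for Eq. (16); A_e[i, j] = 1 means a directed edge i -> j."""
    A_e = np.asarray(A_e, dtype=float)
    n = A_e.shape[0]
    out_deg = A_e.sum(axis=1)
    dangling = out_deg == 0
    # Column-stochastic transition matrix: S[i, j] = 1 / out_deg[j] if j -> i.
    S = np.zeros((n, n))
    S[:, ~dangling] = (A_e[~dangling] / out_deg[~dangling][:, None]).T
    # Assumption: a node with no out-edges spreads its score uniformly.
    S[:, dangling] = 1.0 / n
    er = np.full(n, 1.0 / n)
    history = []
    for _ in range(max_iter):
        new = (1.0 - alpha) / n + alpha * S @ er
        history.append(new)
        converged = np.abs(new - er).sum() <= eps   # L1 stopping rule
        er = new
        if converged:
            break
    return er, history

# Edge graph of the worked example, nodes ordered A..G; D (index 3) has no out-edges.
out_links = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [],
             4: [3, 5, 6], 5: [3, 4, 6], 6: [4, 5]}
A_e = np.zeros((7, 7))
for i, targets in out_links.items():
    for j in targets:
        A_e[i, j] = 1.0
er, history = edge_rank(A_e)
print(np.round(history[0], 4))
print(np.round(er, 4))
```

The node with the largest converged score is $D$, which corresponds to edge $(c,d)$ of the social network graph.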

5. Experimental analysis

In this section, we adopt widely used benchmarks to demonstrate the performance of our proposed Edge Rank (ER) method on real-world social networks. We use the proposed ER method to detect shortcuts in large real networks.

5.1 Experimental evaluation

Precision represents the proportion of labelled shortcuts among the top-ranked edges of the sorted results. This study uses the real community labels of the nodes to infer which edges are shortcuts.

$\displaystyle\textit{precision(R)}=\frac{E_{lb}}{R}$ (17)

where $E_{lb}$ is the number of labelled shortcut edges among the first $R$ edges after sorting.

Mean average precision (MAP) takes the positions in the ranking into account. When two sorting algorithms have the same precision for finding the top-k shortcuts, the algorithm that ranks the shortcuts higher is better.

$\displaystyle\text{MAP}=\frac{1}{|G|}\sum_{j=1}^{|G|}\frac{1}{m_{j}}\sum_{k=1}^{m_{j}}\text{precision}\left(E_{jk}\right)$ (18)

where $G$ is the set of sub-graphs sampled from the current dataset, $m_{j}$ is the number of shortcuts that the sorting algorithm places in the top-k edges of the $j$ -th sub-graph $g_{j}\in G$ , and $E_{jk}$ is the set of ranked edges up to and including the position of the $k$ -th such shortcut.
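Both metrics can be sketched as follows. The function names and the toy rankings are illustrative, and the average-precision step follows the usual information-retrieval reading of Eq. (18), which is an assumption about the intended definition: precision is evaluated at the rank of each labelled shortcut that appears in the ranking.

```python
def precision_at(ranked, labelled, r):
    """Eq. (17): share of labelled shortcuts among the first r ranked edges."""
    return sum(1 for e in ranked[:r] if e in labelled) / r

def average_precision(ranked, labelled):
    """Mean of the precision taken at the rank of every labelled shortcut found."""
    hits = [pos for pos, e in enumerate(ranked, start=1) if e in labelled]
    if not hits:
        return 0.0
    return sum(precision_at(ranked, labelled, pos) for pos in hits) / len(hits)

def mean_average_precision(runs):
    """Eq. (18): runs is a list of (ranked_edges, labelled_shortcuts) pairs."""
    return sum(average_precision(r, l) for r, l in runs) / len(runs)

# Toy ranking of four edges; two of them are labelled shortcuts.
ranked = [(1, 2), (3, 4), (5, 6), (7, 8)]
labelled = {(1, 2), (5, 6)}
print(precision_at(ranked, labelled, 2))
print(average_precision(ranked, labelled))
print(mean_average_precision([(ranked, labelled), (ranked, {(3, 4)})]))
```

Two rankings with the same precision at a given cutoff can therefore still differ in MAP when one of them places the shortcuts nearer the top.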

5.2 Datasets

We use three real-world social network datasets, DBLP, YouTube, and Orkut [38], which cover different topics and are public social network datasets with community labels. The three datasets are described as follows:

DBLP[38] is a computer science bibliography that provides a comprehensive list of computer science research papers. If two authors jointly publish at least one paper, they can establish a co-author network. The name of the published journal or the location of the conference can be regarded as the label of a community, and the authors who publish articles in a certain journal or conference constitute this community. We extracted the DBLP dataset into three sub-datasets (DBLP-L, DBLP-M, and DBLP-S) according to their size.

YouTube [38] is a video-sharing website that includes social networks. In the YouTube social network, users establish online friendships with each other, and users can create groups that allow other users to join. The name of the user-defined group is treated as a community label.

Orkut[38] is a free online social network in which users can add friends to each other. Orkut also allows users to create a friend group, and other members can join the group. User-defined friend groups were considered as community labels.

Table 3
Summary of experimental datasets

Types of data	Nodes	Edges	Cross-edges	Mean degree
DBLP-M	1629	15379	261	17.8
Orkut	1544	17273	220	23.7
YouTube	1627	13103	85	14.3
DBLP-S	843	1819	37	4.3
DBLP-L	3150	14840	361	9.4

5.3 Baseline methods

• Undirected graph [39]: We rank the edges on an undirected edge graph. We treat the edges as bidirectional and then use PageRank to rank the nodes.
• PageRank [24]: We calculate the PageRank value of each node in the social network and use the resulting ordering of the nodes to determine the direction of the edges in the edge graph.
• HITS [40]: After the user enters a keyword, the algorithm calculates two values (hub score and authority score) for each returned matching page. These two scores are interdependent. The hub score of a page is the sum of the authority scores of all pages that its outgoing links point to. The authority score of a page is the sum of the hub scores of all pages whose links point to it.
• Neighborhood overlap [41]: This is an edge attribute that indicates whether one endpoint can conveniently reach the other through alternative paths in a social network, that is, how many neighbors the two endpoints share.

5.4 Performance analysis based on karate

We divided the karate dataset into two groups. The edges are marked as follows.

Shortcuts: (26, 24), (1, 12), (2, 31), (20, 34), (3, 10), (3, 28), (3, 29), (1, 32), (14, 34), (28, 25), (10, 34).

Cross-edges: (1, 32), (1, 9), (2, 31), (3, 9), (3, 10), (3, 28), (3, 29), (3, 33), (20, 34), (14, 34).

The edge ranking produced by our ER algorithm is shown in Fig. 8. The first five edges are (3, 14), (3, 9), (34, 20), (2, 20), and (2, 3). Nodes 3 and 20 play a connecting role between the two communities, so the edges around these two nodes are ranked higher. We then use the Jaccard index to select the edges whose endpoints have no common neighbor. The remaining edge, (20, 34), is what we call an important shortcut-like edge.

5.5 Shortcut detection

We sample six sub-graphs from each of the three real datasets. The top-20 edges perform well and reflect the relationship with shortcuts to a certain extent.

Given different K values, the values of MAP and precision differ across datasets. It can be observed that the top-ranked edges achieve better performance. Among them, the performance on the DBLP dataset was better than that on the other two datasets. This is because there are more shortcuts in the DBLP dataset, which can be better detected.

Table 4
Top-20 edges to predict shortcuts

Types of data	1	2	3	4	5	6	MAP	Precision
DBLP	0.738	0.347	0.824	0.456	0.210	0.322	0.483	0.650
Orkut	0.539	0.556	0.879	0.699	0.669	0.163	0.584	0.700
YouTube	0.501	0.190	0.467	0.511	0.373	0.155	0.366	0.517

Figure 8.

Karate dataset: The karate nodes, labeled in red and blue, represent the two communities. The red and blue edges are the edges within the communities, and the purple edges are the edges across the communities.

Figure 9.

Comparison of utilities of inferred edges.

5.5.1 Compare data sets of different sizes

To verify the effectiveness of the ER algorithm, we conducted experiments on datasets of different scales. DBLP-S has 843 nodes and 1819 edges. DBLP-L has 3247 nodes and 14840 edges. These two datasets are subsets of a large dataset with more than 30,000 nodes, sampled from the community set.

Figure 10.

Precision of shortcut detection in different data sets.

Figure 11.

MAP of shortcut detection in different data sets.

Figures 10 and 11 show the precision and MAP on the different datasets, respectively. The proposed ER method performs well on each dataset. All methods perform poorly on the Orkut dataset; we believe that this is because the network structure of this dataset is too dense and contains few shortcuts.

Using Table 4, we can obtain the nine edges with the smallest ER values: ((1, 7), 0.0138), ((2, 3), 0.0138), ((5, 6), 0.0138), ((1, 10), 0.0150), ((3, 4), 0.0161), ((1, 11), 0.0171), ((4, 14), 0.0172), ((10, 2), 0.0183), and ((7, 6), 0.0189). It can be seen that some of these edges are cross-community connecting edges, which play an important role in connecting the communities.

Figure 11 makes it more convenient to relate the ER values to the edges. The system first collects data from social media to form a social network graph. Subsequently, the HAM algorithm computes the $H$ values of the nodes and detects the communities. The ER algorithm takes the output of the HAM algorithm, including the community labels, as its input and then calculates the social influence values of the edges.

From this experiment, we can see that the effectiveness of our model is affected as the size of the dataset increases.

5.6 Information diffusion

$\phi$ can be calculated using Eq. (7), which can measure the information-dissemination ability of the edges.

Figure 12.

Comparison of utilities of inferred edges.

To verify whether the detected shortcuts are indeed more useful for information dissemination, we compared the utility of four sets of edges detected by different methods, namely, the shortcuts detected by ER, the Jaccard index, PageRank, and HITS. For convenience, the utility is calculated by adding a fixed number of detected shortcuts.

Figure 13.

Shortcuts in different methods: the shortcuts are represented by red lines. Node colors distinguish the different communities.

The comparison results for the three networks are presented in Fig. 12. When deleting the same number of detected shortcuts, the ER always obtains a higher utility value than neighborhood overlap. This verifies that shortcuts are more useful for information dissemination than ordinary links. The shortcuts detected by ER have higher utility in DBLP and YouTube, but slightly lower utility in Orkut. This shows that the size of the network may have an impact on the quality of the detected shortcuts.

We also compare with neighborhood overlap and find that the edges it deletes have little effect on the average shortest path in the network. Our interpretation is that the network contains some clustered structures, and these deleted edges do not play a role in network propagation.
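The effect of deleting an edge on the average shortest path can be sketched with a breadth-first search. This is a stand-in illustration, not the utility $\phi$ of Eq. (7), which is defined outside this section; the function name, the toy four-node cycle, and the choice to average only over reachable pairs are assumptions.

```python
from collections import deque

def average_shortest_path(nodes, edges):
    """Mean BFS distance over all reachable ordered pairs of distinct nodes."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    total, pairs = 0, 0
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1          # exclude the source itself
    return total / pairs if pairs else float("inf")

nodes = [0, 1, 2, 3]
cycle = [(0, 1), (1, 2), (2, 3), (0, 3)]
before = average_shortest_path(nodes, cycle)
after = average_shortest_path(nodes, [e for e in cycle if e != (0, 3)])
print(before, after)
```

In this toy cycle, removing the edge between nodes 0 and 3 raises the average from 4/3 to 5/3; an edge whose removal barely changes this value contributes little to propagation in the sense discussed above.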

5.7 Visualization

Finally, we visualized the shortcut detection results of the different methods. Figure 13 compares five methods on the DBLP1 dataset. We marked 7% of the sorted edges in red. Figure 13(a) shows the visualization result obtained by our model. Figure 13(b–e) shows the results of the other four approaches.

6. Conclusion and future work

We proposed an ER algorithm based on HAM to rank edges and found that the edges with higher ranks are related to shortcuts. To evaluate the proposed approach, we applied it to three real-world datasets. Experiments showed that our proposed ER algorithm can detect shortcut edges well. Furthermore, we verified the importance of these shortcut edges in the network by calculating the average shortest path in the network after deleting these edges. The identification of such edges can support public opinion detection.

In the future, we intend to introduce more features into our method, such as structural holes, to improve its performance. More evaluations and experiments will be conducted so that the method achieves better recognizability and adapts to more complex situations.

Acknowledgments

This work was supported by the National Natural Science Foundation (Grant Nos. 61872298, 61802316, and 61902324) and the Sichuan Regional Innovation Cooperation Project (Grant No. 2021YFQ008).

References

Easley

D.A.

and Kleinberg

J.M.

, Networks, Crowds, and Markets – Reasoning About a Highly Connected World, Cambridge University Press, 2010.

Chen

Zhang

Yang

and Li

, ibridge: Inferring bridge links that diffuse information across communities, Knowl Based Syst. 192 (2020), 105249.

and Zhou

, Link prediction in complex networks: A survey, Physica A: Statistical Mechanics and its Applications 390(6) (2011), 1150–1170.

Huang

and Zeng

D.D.

, A link prediction approach to anomalous email detection, In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Taipei, Taiwan, October 8–11, 2006, pp. 1131–1136.

Lei

and Ruan

, A novel link prediction algorithm for reconstructing protein-protein interaction networks by topological similarity, Bioinform 29(3) (2013), 355–364.

Sarukkai

, Link prediction and path analysis using markov chains, Comput Networks 33(1-6) (2000), 377–386.

Wang

Satuluri

and Parthasarathy

, Local probabilistic models for link prediction, In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM 2007), October 28–31, 2007, Omaha, Nebraska, USA, 2007, pp. 322–331.

Liu

Zhang

Q.-M.

and Zhou

, Link prediction in complex networks: A local nave bayes model, EPL (Europhysics Letters) 96(4) (Nov 2011), 48007.

Bhanodia

P.K.

Khamparia

and Pandey

, Supervised shift k-means based machine learning approach for link prediction using inherent structural properties of large online social network, Comput Intell 37(2) (2021), 660–677.

10.

Tao

Yan

and Lin

, Exploring evolution of dynamic networks via diachronic node embeddings, IEEE Trans Vis Comput Graph 26(7) (2020), 2387–2402.

11.

Zhu

Lin

Shi

Qiu

and Niu

, Recommending scientific paper via heterogeneous knowledge embedding based attentive recurrent neural networks, Knowl Based Syst 215 (2021), 106744.

12.

Karypis

and Kumar

, Multilevel k-way partitioning scheme for irregular graphs, J Parallel Distributed Comput 48(1) (1998), 96–129.

13.

Newman

M.E.J.

, Fast algorithm for detecting community structure in networks, Phys Rev E Stat Nonlin Soft Matter Phys 69(6 Pt 2) (2004), 066133.

14.

Fan

Wang

Shi

Lin

and Wang

, One2multi graph autoencoder for multi-view graph clustering, In WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20–24, 2020, ACM/IW3C2, 2020, pp. 3070–3076.

15.

Newman

M.E.J.

and Girvan

, Finding and evaluating community structure in networks, Physical Review E 69(2 Pt 2) (2004), 026113.

16.

Luo

Bian

Yan

Liu

Huan

and Zhang

, Local community detection in multiple networks, In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23–27, 2020, ACM, 2020, pp. 266–274.

17.

Zhu

and Ghahramani

, Learning from labels and unlabeled data with label propagation, Tech Report 3175(2004) (2002), 237–244.

18.

Luo

and Du

, Detecting community structure and structural hole spanner simultaneously by using graph convolutional network based auto-encoder, Neurocomputing 410 (2020), 138–150.

19.

Zhou

Luo

and Hu

, Detection of key figures in social networks by combining harmonic modularity with community structure-regulated network embedding, Inf Sci 570 (2021), 722–743.

20.

Cao

Shen

and Yu

P.S.

, Joint community and structural hole spanner detection via harmonic modularity, In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13–17, 2016, pp. 875–884.

21.

Gleich

D.F.

, Pagerank beyond the web, SIAM Rev 57(3) (2015), 321–363.

22.

Gupta

Goel

Lin

Sharma

Wang

and Zadeh

, WTF: the who to follow service at twitter, In 22nd International World Wide Web Conference, WWW ’13, Rio de Janeiro, Brazil, May 13–17, 2013, pp. 505–514.

23.

Senanayake

Piraveenan

and Zomaya

, The pagerank-index: Going beyond citation counts in quantifying scientific impact of researchers, PloS One 10(8) (2015), e0134794.

24.

Bahmani

Chowdhury

and Goel

, Fast incremental and personalized pagerank, Proc. VLDB Endow 4(3) (2010), 173–184.

25.

Zhu

and Li

, Distributed pagerank computation based on iterative aggregation-disaggregation methods, In Proceedings of the 2005 ACM CIKM International Conference on Information and Knowledge Management, Bremen, Germany, October 31–November 5, 2005, 2005, pp. 578–585.

26.

Berkhin

, Survey: A survey on pagerank computing, Internet Math 2(1) (2005), 73–120.

27.

Sargolzaei

and Soleymani

, Pagerank problem, survey and future research directions, In International Mathematical Forum, Volume 5, Citeseer, 2010, pp. 937–956.

28.

Carchiolo

Longheu

Malgeri

and Mangioni

, Gain the best reputation in trust networks, In Intelligent Distributed Computing V – Proceedings of the 5th International Symposium on Intelligent Distributed Computing – IDC 2011, Delft, The Netherlands - October 2011, volume 382, 2011, pp. 213–218.

29.

Carchiolo

Longheu

Malgeri

and Mangioni

, Trust assessment: a personalized, distributed, and secure approach, Concurr Comput Pract Exp 24(6) (2012), 605–617.

30.

Carchiolo

Longheu

Malgeri

and Mangioni

, Users’ attachment in trust networks: reputation vs. effort, Int J Bio Inspired Comput 5(4) (2013), 199–209.

31. Kamvar S.D., Schlosser M.T. and Garcia-Molina, The eigentrust algorithm for reputation management in P2P networks, In Proceedings of the Twelfth International World Wide Web Conference, WWW 2003, Budapest, Hungary, May 20–24, 2003, pp. 640–651.

32. Tian, Liu, Wang, Song, Fan and Wang, A novel page ranking algorithm based on triadic closure and hyperlink-induced topic search, Intell Data Anal 19(5) (2015), 1131–1149.

33. Ahn Y.Y., Bagrow J.P. and Lehmann, Link communities reveal multiscale complexity in networks, Nature 466(7307) (2010), 761.

34. Clauset, Moore and Newman M.E.J., Hierarchical structure and the prediction of missing links in networks, Nature 453(7191) (2008), 98.

35. Mensah D.N.A., Gao and Yang L.W., Approximation algorithm for shortest path in large social networks, Algorithms 13(2) (2020), 36.

36. Bakshy, Rosenn, Marlow and Adamic L.A., The role of social networks in information diffusion, In Proceedings of the 21st World Wide Web Conference 2012, WWW 2012, Lyon, France, April 16–20, 2012, pp. 519–528.

37. Bryan and Leise, The $25,000,000,000 eigenvector: The linear algebra behind google, SIAM Rev 48(3) (2006), 569–581.

38. Yang and Leskovec, Defining and evaluating network communities based on ground-truth, Knowl Inf Syst 42(1) (2015), 181–213.

39. Grolmusz, A note on the pagerank of undirected graphs, Inf Process Lett 115(6-8) (2015), 633–634.

40. and Duan, Influence model of paper citation networks with integrated pagerank and HITS, In: Shen, Barthès J.A., Luo, Shi and Zhang, editors, 24th IEEE International Conference on Computer Supported Cooperative Work in Design, CSCWD 2021, Dalian, China, May 5–7, 2021, pp. 1081–1086.

41. Choumane, Awada and Harkous, Core expansion: a new community detection algorithm based on neighborhood overlap, Soc Netw Anal Min 10(1) (2020), 30.
