Abstract
As social networks continue to expand, an increasing number of people prefer to use social networks to post their comments and express their feelings, and as a result, the information contained in social networks has grown explosively. The effective extraction of valuable information from social networks has attracted the attention of many researchers. It can mine hidden information from social networks and promote the development of social network structures. At present, many ranking node approaches, such as structural hole spanners and opinion leaders, are widely adopted to extract valuable information and knowledge. However, approaches for analyzing edge influences are seldom considered. In this study, we proposed an edge PageRank to mine shortcuts (these edges without direct mutual friends) that are located among communities and play an important role in the spread of public opinion. We first used a network-embedding algorithm to order the spanners and determine the direction of every edge. Then, we transferred the graphs of social networks into edge graphs according to the ordering. We considered the nodes and edges of the graphs of the social networks as edges and nodes of the edge graphs, respectively. Finally, we improved the PageRank algorithm on the edge graph to obtained the edge ranking and extracted the shortcuts of social networks.
The experimental results for five different sizes of social networks, such as email, YouTube, DBLP-L, DBLP-M, and DBLP-S, verify whether the inferred shortcut is indeed more useful for information dissemination, and the utility of three sets of edges inferred by different methods is compared, namely, the edge inferred by ER, the edge inferred by the Jaccard index. The ER approach improves by approximately 10%, 9.9%, and 8.3% on DBLP, YouTube, and Orkut. Our method is more effective than the edge ranked by the Jaccard index.
Introduction
Many complex network systems can be described as graphs, where nodes represent individuals and links represent interactions between individuals. As one of the most intensively studied networks, social networks play an important role in connecting people with each other and disseminating various types of information. The connection relationships between individuals construct graph edges. We usually divide social networks into communities using graph theories. These edges and nodes in the graphs of social networks are divided into intro-community edges, nodes, and cross-community edges, nodes. Many studies have focused on cross-community edges, nodes that play an important role in diffusing information among different communities. Many studies have focused on key cross-community nodes, such as structural hole spanner detection. However, few studies have focused on key cross-community edge detection, such as shortcut detection and bridges. According to Easley and Kleinberg[1], if two individuals of an edge do not have a direct common friend, the edge is called a bridge or shortcut in a broad sense. If we remove a shortcut, then there are other paths in which its length is more than 3 from an endpoint of the shortcut to its other endpoint. In Fig. 1, the edge (1, 2) constructing a shortcut is not only a path connecting between nodes 1 and 2. In addition to path
Shortcuts and structural hole spanners among communities C1, C2, and C3. The graph has three communities and three shortcuts. Each community has six nodes. The internal nodes in communities connect with each other, and there are three structural hole spanners and three shortcuts.
Shortcut detection is a typical link prediction task for social networks [3]. Link prediction, one of the most important tasks in social network analysis and mining, studies the formation of missing links or new links based on current and historical social networks. Link predictions are broadly used in some applications, such as recommendation[4], pre-warning systems[5], and biomedical discovery[5]. Researchers have extensively studied effective link prediction methods for different types of social networks and in different application scenarios. Some simple heuristics, such as common neighbors and the Katz index, work well in practice and are scalable to large social networks. Other early studies also achieved good performance, including those based on Markov chains [6], probability graphical methods[7, 8], and supervised learning[9]. Recently, studies have predicted links in dynamic social networks[10] and heterogeneous social networks[11].
The two endpoints of shortcuts directly touch two different communities in social networks. This feature can be used to obtain information that is originally far from you. We use these features to focus on finding shortcuts to social networks. In this study, we propose an approach to detect shortcuts in social networks. Our main contributions are as follows.
We use an improved Q-HAM to detect community and calculate the H value of each node. In contrast to other node rankings, such as PageRank, the nodes we detect are nodes that connect the community, rather than nodes with higher influence. To apply the ranking method of nodes to edges, we propose a transformed social network graph. We propose edge rank (ER), an edge quantification method. This method borrows from PageRank. The traditional PageRank algorithm calculates the influence of nodes, and we apply its improvement to edges; in other people’s research on the edge, we found that few people study the shortcut. We found relevant literature on weak ties and discovered the relationship between weak ties and shortcuts. We use a small network to illustrate how to quantify edges. Then, experiments on finding shortcuts were performed on three real datasets by comparing PageRank, hits, etc. Then, the flow of information dissemination was used to verify its effectiveness.
The remainder of this paper is organized as follows. Section 2 introduces related work. In Section 4, we propose an approach for shortcut detection. Section 5 discusses the experimental evaluation. In Section 6, we present our conclusions and directions for future research. In response to the shortcut problem in the social network, we studied earlier and the lack of corresponding achievements that can be used for reference. There are some related works that mainly include networking embedding for network representation, Q-HAM for determining cross-community edges by using community detection, and the PageRank algorithm for sorting edges.
Community detection
Community detection, also known as community discovery, is a technology used for social network aggregation behavior. Community detection is a method of network clustering. This can be understood as a collection of nodes with the same features.
Community detection is mainly based on the graph-theory division method. Karypis proposed a classic graph-partitioning community detection algorithm[12]. The core idea of the algorithm is to first divide the network randomly to obtain two initial communities, and then use a greedy algorithm to continuously exchange nodes in the two communities until the optimal partition is obtained. However, this method relies too much on the initial random partition, the detection results of different initial communities are different, and the algorithm complexity is high. Newman later proposed the concept of modularity, which has a significant impact on community detection. The researchers then proposed a series of community detection methods based on modularity. This makes the modularity-based method the most mainstream community detection method. Modularity is a metric used to evaluate the stability of network communities, and can also measure the quality of community detection results. For a well-divided community network, modularity can be expressed as
where the adjacency matrix
Algorithm based on the idea of aggregation. This method uses a bottom-up approach to discover communities. The main idea is to treat each node as a community at the beginning, examine all neighbors for each node, and then add the neighbor with the highest modularity to the community to which the node belongs until the modularity no longer changes. The most typical methods are the Newman fast algorithm, CNM algorithm, and so on[13]. However, these methods cannot control the size of the community, and the sizes of the divided communities are very uneven. Fan et al.[14] made the first attempt to employ a deep learning technique for attributed multi view graph clustering, and proposed a novel task-guided One2Multigraph auto-encoder clustering framework. Classification Algorithm This type of method uses a top-down approach to discover communities. The main idea is that at the beginning, all nodes are grouped into the same community, and then the edge decomposition communities with high betweenness are successively deleted until each node represents a community. The most typical method of this type is the GN algorithm proposed by Newman[15]. However, the time complexity of this type of method is too high, and large-scale networks are difficult to scale. Luo et al.[16] proposed a novel random-walk model. Optimization Algorithm This type of method assumes that the higher the modularity, the better the result of community detection. Therefore, the key to this type of method is to design an objective function for modularity by using the greedy algorithm, ant colony algorithm, simulated annealing algorithm, and optimize the function, and obtain the optimal modularity.
Raghavan et al.[17] proposed a semi-supervised learning community detection algorithm based on label propagation. The basic idea is to give nodes different labels at the beginning, and then use the label propagation algorithm to spread the labels and nodes with the same label from a community. In addition, Rosvall et al. introduced information theory for community detection, which can obtain high-quality community division results. Hu [5] developed a novel framework called HeGAN for HIN embedding. Luo et al.[18] proposed a deep learning framework to solve both community and SH detection, and Du et al.[19] improved the harmonic modularity algorithm based on Q-modularity gain, which can infer community partitions and rank SH nodes. We draw on the experience of predecessors and regard community detection and node ranking as a joint task.s
The main idea of harmonic modularity is that the community indicator vector of a node
For nodes whose neighbors are in the same community, the modulus of its community indicator vector
Solving the learning problem 3 is an NP-hard problem, so HAM introduces
where
PageRank was originally used to rank web pages on the Internet [21]. PageRank is essentially the application of the random walk model on the Markov chain: the nodes of the Markov chain are web pages, and the arc is the link that connects one page to another. Walker represents a general web surfer that moves from one page to another with a certain probability according to the network structure, occasionally “gets bored” and jumps to random nodes in the network. The steady-state probability vector of the random walk process saves the PageRank value of each node, which can be used to determine the global ranking.
Before describing in detail the normalization issues by showing possible problems, we briefly introduce the web ranking indicators that comprise the foundation for modern search engines. As mentioned earlier, the advantages and disadvantages of the PageRank algorithm were thoroughly analyzed. Although Pagerank was proposed a long time ago, it is still the backbone of multiple technologies, not limited to the web domain. For example, in [22], personalized PageRank is cited as an algorithm that may be used in Twitter’s “following” architecture. In [21], the author showed how the mathematical principles behind PageRank are used in a large number of applications that are not limited to web pages. In [23], another PageRank extension appeared as a temporary replacement for the publication h-index.
Other studies have either focused on providing shorter run times or PageRank’s adaptation to different fields. For example, Bahmani et al. [24] proposed a Monte Carlo technique to perform fast calculations based on random walk algorithms (such as PageRank). There is at least one distributed PageRank [25] version, which can better handle the ever-increasing number of pages to be ranked. One of the main disadvantages of the basic PageRank is that a large link matrix cannot be completely stored in the main memory. It is in speed changes. As I/O operations are slowed down, the distributed version of the algorithm solves this problem. Another method [24] introduced PageRank into the big data world using the MapReduce algorithm. Other types of optimization methods can be found in surveys [26, 27]. An important extension of PageRank, which runs on trusted network domains [28, 29, 30] is EigenTrust [31]. [32] proposed TrustRank algorithm. Because the application scenario involves a trust network, the entities involved are slightly changed: Web pages are replaced with network nodes, and links are replaced with arcs. However, the basic mathematics remained unchanged. However, none of these methods apply PageRank to the edges.We improved PageRank so that it can be applied on the edge.
Formalization
Community Detection: Community is an important feature of many social networks. Links within the same community are dense, whereas links between different communities are sparse. Given a network
Cross-link across non-overlapping communities. Let
Examples of edges across communities.
We define a utility function based on the average shortest path to quantify and verify the effectiveness of shortcuts in information dissemination.
The average shortest path distance
For large-scale networks, the impact of a single edge on information diffusion is too small to be calculated. Therefore, we define the utility function
where
In Eq. (7), the numerator represents an increment in the average shortest path distance in the network after the first
In a social network, shortcuts are often weak ties and strong ties usually form small friend circles or communities, while weak ties gradually connect these communities into large networks [1]. Easley and Kleinberg’s experiments indicate that strong connections require more time to maintain interactions among users. It has a crowding effect on social time, making a person’s social relation network smaller and smaller. Because of the lack of weak ties, such as shortcuts, the information among a group of good friends (strong ties) will not be able to go out of that small friend circle. In other words, the information transmission between two different communities is transmitted from one community to another by weak ties, and weak ties play a very important role in information transmission. Building a large social network needs to shortcut among social network communities, so the speed of obtaining information is faster. Bakshy[36] mentioned that people are more likely to share information posted by close friends who interact frequently. However, most people’s information comes from other people who do not often contact. It is important for us to obtain useful information by using weak ties, such as shortcuts. To find shortcuts in social networks, we propose a general framework for shortcut detection, as shown in Fig. 3
Construct graph: First, we construct the social network graph from social media, such as QQ, Microblog, WeChat, BBS, etc. For example, for Microblog, we consider the users who they post and repost as the nodes of the social network graph. When a user posts, the other users repost; then, we consider the relation between the posters and reposters as the edges of the social network graph.
Our proposed framework of shortcut detection. Q-HAM Function: Second, we use Q-HAM to process the social network structure to get values H of nodes. The value H of nodes can reflect the degree of harmony of the nodes. The H value of these nodes connected to different communities is higher, and the H value of these nodes connected to the same communities is lower. It can detect communities from a social network graph. Computing Transform Graph: Fourthly, we transform the social network graph into edge graphs by using the detected communities and Edge Rank (ER): Finally, we improve the PageRank algorithm to computing the ordering scores of nodes in the edge graph. The nodes in the edge graph represent the edges of the social network graph. Then, we rank the edges in the social network graph. We select the edges with top-k ordering scores in the social network graph as shortcuts.

In Fig. 3, Q-HAM, Transform Graph and ER are the three most important modules. We compute the harmonic of the nodes and detect communities using Q-HAM. We use the results of the H value and community to transform a social network graph into an edge graph in the transform graph module. Traditional PageRank can only sort nodes. In the ER module, an improved PageRank is suitable for computing the ordering score of every edge in the social network graph. The following is a detailed introduction to the three modules.
Communities are common in social networks. Compared with their connections (ties) among nodes of outside communities, the connections among nodes in the same community are denser than connections among nodes of outside communities.
Community detection algorithms usually maximize the modularity degrees as their objection. However, the maximization of the modularity degree cannot distinguish between the external and internal nodes. In addition, the vector representation between internal nodes and their neighbors in the community should be as harmonious as possible. The distribution vectors of the inter-community nodes are in harmony. The distribution vectors of the inter-community and outer-community nodes are diverse. We can order the nodes and detect communities. The objective functions of the HAM can balance the harmony and diversity of the distribution vectors.
ComSHAE [18] analyzed the similarity between HAM and random walk-based spectral clusters to improve the method by using HAM. To overcome the shortcomings of HAM, Luo et al. adopted the direct modularity increment (DMI) and indirect modularity increment (IMI) to improve the objective function of HAM [20] to measure the harmony and diversity between inter-community and outer-community nodes. Because this is an NP-hard problem, we used Eq. (8), instead of using Eq. (3). This achieves the same goal.
where
We adjust the value of H to minimize the objective function. DMI measures a reasonable degree of difference between a node and its neighbors.
where
IMI considers that the neighbors of node
where
The random walk normalized Laplacian
The optimal solution for Eq. (11) can be calculated by solving the eigenvector problem of the matrix.
The elements of
where
We find the optimal solution to calculate the eigenvector of matrix
We select a node
We summarize the above steps to get the algorithm as show:
In Algorithm 4.1, step 2
[H] Q-HAM.[1] Adjacency matrix
Example of social network graph.
The adjacency matrix is A, and the number of communities is two. The adjacency matrix
We combined the example in Fig. 4 to calculate the DMI values of the six nodes by using Eq. (9) are 9.02914483, 9.02914483, 8.57158047, 8.57158047, 9.02914483, 9.02914483. We combined the example in Fig. 4 to calculate the IMI values of the six nodes by using Eq. (10) are 9.80953963, 9.80953963, 9.44462166, 9.44462166, 9.80953963, 9.80953963.
Then transform the matrix Laplace by
We repeat the computation of the value
We obtain a vector corresponding to the community number for every node: 0, 0, 0, 1, 1, and 1.
At present, many ranking node research tasks, such as structural hole spanners and opinion leaders, are widely adopted to extract valuable information and knowledge. However, approaches for analyzing edge influences are seldom considered. In this paper, inspired by the idea of the PageRank algorithm, we propose an edge PageRank to mine shortcuts (these edges without direct mutual friends) that are located among communities and play an important role in the spread of public opinion. In the page rank algorithm, the information is disseminated in one direction indicated by the direct edges. We can compute the ranking scores of every node using its in-degrees and out-degrees. In the Q-HAM model, the centering vector of the community in the social network graph is the average vector of all node vectors. The H value of a node reflects the importance of a node in a social network. The indicator
The social network map into edge graph.
The example of edge direction.
Considering that
where
First, we reconstruct a new directed graph
Second, we use
Finally, we obtain an adjacency matrix
To sum up, we get the Algorithm 6 as shown:
[H] Social network graph converted to the graph edge graph[1] the social network graph
To illustrate the proposed algorithm better, we built a six-node social network graph. There are six nodes
Social network graph mapping into edge graph
Shortcuts often play an important role in the community. Single bridges between communities often have a large amount of information dissemination. However, communities are sometimes closely connected, and the connection between communities will show a small amount of information dissemination. Therefore, in social networks with more communities, we detect shortcuts with the most important or least importance.
To solve the ranking problem of important pages on the Internet, Brin and Larry Page proposed the PageRank algorithm[37] to evaluate the importance of web pages by using hyperlink relationships between pages. That is, the more linked pages, the more important are the linked pages. We use the idea of PageRank to sort the nodes in the edge graph and discover shortcuts in the social network graph, and compute the importance
Where
The Eq. (16) is an important foundation for finding shortcuts in a social network graph. It can be used to evaluate the importance of nodes in the edge graph of a social network graph. We denote the importance of node
The ER algorithm can efficiently and accurately identify the most influential edges in a social network graph to discover shortcuts in the social network graph.
Based on the above, we conclude the main idea of shortcut detection in the social network in the following steps:
First, we use adjacency matrix
Second, we restore the edge graph
Finally, we sort the edges using the ER value. The importance of every node’s edge graph becomes the importance of every edge in the social network graph. We consider that the shortcuts of social networks are the edges with the least importance or the most important. However, the top rankings are not necessarily shortcuts. We filter out this edge according to the definition of shortcut, which is no common neighbor node(neighborhood overlap is 0).
To sum up, we get the Algorithm 4.3 as show:
[H] Edge Rank[1] edge graph
Examples of the importance of edges in social network.
In this example, there are only seven nodes. If A is the disseminator of information, then the information will jump to B, C, and D with a probability of
In the initial condition, suppose that the probability of importance on each node is equal, that is,
We note that
We can obtain Table 2 from the above calculation process. Clearly, edges (c, d) have a higher ER value. This edge was responsible for information dissemination between the two communities. This edge is an important bridge between the two communities. Edge(c, d) can be seen as a shortcut between the two communities.
Example
In this section, we adopt a widely used comprehensive benchmark to demonstrate the performance of our proposed Edge Rank (ER) method on real-world social networks. We use the proposed ER method to detect shortcut discoveries in large real networks.
Experimental evaluation
Precision represents the proportion of structural nodes in the top nodes of the sorted results. This study uses the real community labels of the nodes to infer which edges are shortcuts.
where
Mean average precision (MAP) considers the importance of the ranking result position. When the two sorting algorithms have the same accuracy for finding the top-k shortcuts, the algorithm that arranges the edge sequence higher should be better.
We continuously use three real-world social network datasets, DBLP, YouTube, and Orkut[38], which are obtained from different topics, which are public social network datasets with community labels. The three datasets are described as follows:
DBLP[38] is a computer science bibliography that provides a comprehensive list of computer science research papers. If two authors jointly publish at least one paper, they can establish a co-author network. The name of the published journal or the location of the conference can be regarded as the label of a community, and the authors who publish articles in a certain journal or conference constitute this community. We extracted the DBLP dataset into three sub-datasets (DBLP-L, DBLP-M, and DBLP-S) according to their size.
YouTube [38] is a video-sharing website that includes social networks. In the YouTube social network, users establish online friendships with each other, and users can create groups that allow other users to join. The name of the user-defined group is treated as a community label.
Orkut[38] is a free online social network in which users can add friends to each other. Orkut also allows users to create a friend group, and other members can join the group. User-defined friend groups were considered as community labels.
Summary of experimental
Summary of experimental
Undirected graph[39]: We use undirected edge graph to rank the edge. We treat the edges as bidirectional and then use PageRank to rank the nodes. PageRank[24]: We calculate the PageRank value of each node in the social network. We used the ordering of nodes to determine the direction of the edges in the edge graph. Hits[40]: After the user enters the keyword, the algorithm calculates two values (Hub score and Authority Score) for the returned matching page. These two scores are interdependent and affect each other. The so-called pivot value refers to the sum of the authority scores of all export links on the page that point to the page. Authority scores refer to the sum of the hubs on the page where all the imported links are located. Neighborhood overlap[41]: It is a feature of the attributes of an edge when judging whether a point can pass through an edge to reach another point more conveniently in a social network.
We divided the karate dataset into two groups. The edges are marked as follows. shortcuts :(26, 24), (1, 12), (2, 31), (20, 34), (3, 10), (3, 28), (3, 29), (1, 32), (14, 34), (28, 25), (10, 34)
cross-edge: (1,32),(1,9),(2,31),(3,9),(3,10),(3,28),(3,29),(3,33),(20,34),(14,34).
According to our ER algorithm, the edge parameters are shown in Fig. 8. It can be seen that the first five edges are (3, 14) (3, 9) (34, 20) (2, 20) (2, 3). It is easy to figure out that nodes 3 and 20 play a connecting role in the two communities. Therefore, the edges around the two points were ranked higher. Then, we use the Jaccard index to filter out the edges that do not have a common node. We call this an important shortcut-like edge (20, 34).
Shortcut detection
We sample six sub-graphs from three real datasets, the top-20 edges have a good effect, which can reflect the relationship with shortcuts to a certain extent.
We given different K values, and the values of MAP and precision for different datasets are also different. It can be observed that the top numerical shortcuts can achieve better performance. Among them, the performance of the dataset in the DBLP datasets was better than that of the other two datasets. This is because there are more shortcuts in the DBLP dataset, which can be better detected.
Top-20 edge to predict shortcuts
Top-20 edge to predict shortcuts
Karate datasets: The karate nodes, labeled in red and blue colors, represent the two communities. The red and blue edges represent the edges within the communities, and the purple edges represent the edges between the cross-communities.
Comparison of utilities of inferred edges.
To verify the effectiveness of the ER algorithm, we conducted experiments on datasets of different scales. DBLP-S has 843 nodes and 1819 edges. DBLP-L has 3247 nodes and 14840 edges. These two datasets are a subset of a large dataset with more than 30,000 nodes sampled from the community set.
Precision of shortcut detection in different data sets.
MAP of shortcut detection in different data sets.
Figures 10 and 11 to control sequence show the precision and map under different data sets, respectively. It can be seen that the ER method we proposed has a good performance in each data set. Although the various methods under the Orkut data set perform poorly, we believe that the network structure of this data set is too dense and there are few shortcuts.
We can get the first 9 edges with the smallest edge ER value((1, 7), ‘0.0138’), ((2, 3), ‘0.0138’), ((5, 6), ‘0.0138’), ((1, 10), ‘0.0150’), ((3, 4), ‘0.0161’), ((1, 11), ‘0.0171’), ((4, 14), ‘0.0172’), ((10, 2), 1‘0.0183’), ((7, 6), ‘0.0189’) by using Table 4. It can be clearly seen that these edges shortcut some cross-community connecting edges. These edges play an important role in connecting the communities.
Figure 11 can be more convenient to open the ER value to the edge. The system first needs to collect data from social media to form a social network graph. Subsequently, the result of obtaining H values of nodes and detecting communities is obtained by the HAM algorithm. The ER algorithm obtains the community labels and uses it as the input of the HAM algorithm. The ER algorithm can then calculate the social influence values of the edges.
From this experiment, we can see that the effectiveness of our model will be affected as the dataset increases.
Comparison of utilities of inferred edges.
To verify whether the detected shortcuts are indeed more useful for information dissemination, we compared the utility of four sets of edges detected by different methods, namely, the shortcuts detected by ER, and the shortcuts detected by the Jaccard index, PageRank, hits, and ER. For convenience, the utility is calculated by adding a fixed number of detected shortcuts.
Shortcuts in different methods: the shortcuts are represented by read lines. Nodes distinguish different communities by different color.
The comparison results for the three networks are presented in Fig. 12. When deleting the same number of detected shortcuts, the ER always obtains a higher utility value than neighborhood overlap. This verifies that shortcuts are more useful for information dissemination than ordinary links. The shortcuts detected by ER have higher utility in DBLP and YouTube, but slightly lower utility in Orkut. This shows that the size of the network may have an impact on the quality of the detected shortcuts.
We compare the overlap of neighborhoods and find that the deleted edges have little effect on the average short path in the network. Our analysis should be that there are some clustered network structures in the network, and these deleted edges cannot play a role in network propagation.
Finally, we visualized the shortcut detection results of different methods. Figure 13 shows the comparison of five methods on DBLP1 dataset. We marked 7% of the sorted edges in red. Figure 13 represents the visualization result obtained by our model. Figure 13(b–e) shows the results of the other four approaches.
Conclusion and future work
We proposed an ER algorithm based on HAM to order edges and discover that some of these rules make the edges with higher ranks related to our shortcuts. To evaluate the proposed approach, we applied it to three real-world datasets. Experiments showed that our proposed ER algorithm can detect shortcut edges well. Furthermore, we verified the importance of these shortcut edges in the network, that is, calculating the average short path in the network by deleting these edges. The identification of edges can help in public opinion detection scientifically.
In the future, we intend to introduce more features into our method, such as adding structural holes for analysis to improve the performance of our method. More evaluations and experiments will be conducted to improve the suggestion that methods with better recognizability will adapt to more complex situations.
Footnotes
Acknowledgments
This work was supported by the National Natural Science Foundation (Grant Nos. 61872298, 61802316, and 61902324) and the Sichuan Regional Innovation Cooperation Project (Grant No. 2021YFQ008).
