Abstract
Real world graphs are massive in size and often prohibitively expensive to analyze. Of the possible solutions, sampling is extracting a representative subgraph from a large graph that faithfully represents the actual graph. The prior research has developed several sampling methods but the samples produced by these methods fail to match important properties of the original graph and work poorly in maintaining its topology. We observed that the existing methods do not explore the neighborhood of sampled nodes fairly and hence yield suboptimal samples. In this paper, we introduce a novel approach in which we keep a list of candidate nodes that is populated with all the neighbors of nodes that have been sampled so far. With this approach, we can balance the depth and breadth of graph exploration to produce better samples. We evaluate the effectiveness of our approach using several real world datasets and show that it surpasses the existing state-of-the-art approaches in maintaining the properties of the original graph and retaining its structure. We also calculate Kolmogorov-Smirnov Distance and Jensen-Shannon Distance for quantitative evaluation of our approach.
Introduction
We are surrounded by networks all around us including technological networks (e.g., the Internet, telephone networks), social networks (e.g., Facebook, Twitter), information networks (e.g., World Wide Web, citation networks), biological networks (e.g., biochemical networks, food webs) and many more. The network graphs or simply graphs, a formal representation of these networks, provide us a structural model that makes it possible to analyze and understand different properties of a network. By analyzing the graphs, we extract valuable information that could help us to make not only business-oriented decisions but also technology-oriented decisions. When it comes to the graph analysis, it would be simple if the graph size is small, but unfortunately real world graphs are too massive to efficiently manage and analyze. The real world graphs could have billions of nodes and edges and fully analyzing such graphs requires lots of resources in terms of required memory, computational power and the processing time, making it prohibitively expensive to study such enormous graphs.
Sampling small subgraphs from large graphs is one of the possible solutions, provided that the small subgraph is a good representation of the large graph. The sampled subgraph is supposed to maintain the properties of the graph (e.g., degree, clustering coefficient, path length etc.) and match the distributions of these properties as measured in the actual graph. Given a large graph G
The research community has proposed several sampling approaches in the past. These sampling approaches could be divided into two categories according to their primary goals. In the first category, the main goal is to estimate the properties of the graphs and therefore, the focus is on the quick exploration and estimation of some of the characteristics of those networks that are hard to analyze in whole or have a high rate of churn. The works in [3, 4, 5] fall in this category, where the online social networks were studied and explored. In the second category, the primary goal is to create a representative subgraph that matches the properties of the original graph. Sampled subgraphs can be used in place of large networks that are too costly to handle and analyze in full. The need to crawl large social networks and/or test network protocols is the main motivation behind this research. For example, Snowball Sampling [6] or Random Walks [7] are often used to collect data from large online social networks. However, previous approaches have had limited success in building subgraphs accurately and reproducing key properties of the actual graph especially in terms of the distribution of the various properties.
Our work falls in the second category and we focus on building a representative subgraph
We introduce the concept of Candidate List, i.e., a reservoir of candidate nodes to be included in the sample graph. We present a new framework for sampling large graphs. This framework provides us the foundations of using the Candidate List by tuning its two parameters. We also present theoretical analysis of this framework and conduct some experiments to highlight its important features. We propose two novel sampling algorithms based on our framework that contrast on the basis of the probability of selecting a node from the Candidate List. We present the key intuition behind them and examine their features. We compare our algorithms with the existing state-of-the-art algorithms using real world data sets and show that our algorithms outperform the existing algorithms. Lastly, we perform statistical evaluation of our algorithms using Root Mean Square Error, Kolmogorov-Smirnov distance and Jensen-Shannon distance.
The rest of the paper is organized as follows. We start with the related work in Section 2. In Section 3, we discuss graphs in general and their important properties. In Section 4, we present the graph sampling problem and categories of sampling methods. In Section 5, we present our approach to sampling in detail and the main intuition behind it. In Section 6, we formally present the List Sampling framework and propose two sampling algorithms. In Section 7, we present the evaluation criteria and ten real world datasets used for the evaluation. In Section 8, we study the List Sampling framework in detail, perform theoretical analysis and conduct some experiments to highlight its features. In Section 9, we evaluate our algorithms against two existing methods, perform statistical tests and show that our approach outperforms these existing methods. We conclude the paper in section 10 with some future work directions.
Graph sampling has been used in different flavors in many diverse fields of research including statistics, social science, data mining and machine learning, to name a few. In the past, the work was centered on graph compression and studying the properties of sampled networks. The work in [8, 9] focuses on using sampling to reduce the graph for better visualization. The work in [6, 10, 7] studies the properties of samples of complex networks produced by traditional sampling algorithms such as node sampling, edge sampling and random walk based sampling. The work in [10] shows that the sampled subgraphs of a scale free network are not scale free. Conversely, the work in [11] shows that the samples produced by traceroute sampling follows the power law. The work figures out the bias, in estimating the different properties of a graph, due to sampling.
A lot of research has been done on Internet measurement, i.e., measuring the topology of large-scale networks, such as Peer-to-Peer Networks, the World Wide Web and Online Social Networks. The work in [12] applied a variation of Random Walk called Re-Weighted Random Walk(RWRW) to sample Peer-to-Peer networks. The work in [5, 13] used Random Walks with jumps and multiple dependent Random Walks to sample and estimate large networks. In [14], the authors compared different Random Walk based techniques for sampling web pages. The interested readers are referred to the survey [15] that compares different methods for crawling large networks.
There has been little work on sampling large graphs and finding a representative graph to estimate the properties of the actual graph. Leskovec and Faloutsos analyze various sampling algorithms for large graphs and propose Forest Fire Sampling (FFS) [1]. FFS is a partial breadth first sampling method in which we pick a seed node at random and then burn a fraction of its outgoing edges along with the nodes on the other end. More recently, Nasreen et al. proposed Totally Induced Edge Sampling (TIES) in [2]. The primary difference between TIES and the random Edge Sampling (ES) is the graph induction step. In TIES, we augment all the existing edges between the sampled nodes by including other edges between the set of sampled nodes in addition to those sampled in the edge sampling step. For static graph sampling, FFS and TIES are considered state-of-the-art algorithms and produce better samples than their traditional counterparts. The authors in [16] analyze the biasness of Breadth First Sampling (BFS) and Depth First Samling (DFS) and introduce Random First Sampling (RFS) where we sample a node uniformly at random from the list of visited but unsampled nodes. However, the main focus of the paper is to find the minimum crawl size to estimate the properties of the graph. In [17], the authors propose Metropolis algorithms that refine the subgraphs produced by randomly selecting the nodes. The main idea is to replace the sampled nodes with other potentially good nodes to better match the properties of the actual graph. The work assumes that we have already estimated the properties of the actual graph which might not be a realistic assumption. Furthermore, the algorithms are computationally expensive as they need to search for possible good nodes and most importantly, the uniform random selection of nodes imposes limits on its practical use. The work in [18] provides a detailed study on the nature of biases in network sampling. The authors explored seven common sampling strategies including Breadth First Sampling, Depth First Sampling, Random Walk, Forest Fire Sampling, Degree Sampling, Sample Edge Count and Expansion Sampling. The work investigates connections between certain biases and graph representativeness. Interestingly, the authors point out that such bias could be beneficial for some applications.
The sampling methods discussed above work on undirected graphs but there are also a few methods that sample directed graphs. The authors in [19] transform a directed graph to an undirected graph by introducing reverse links and then apply Metropolis-Hastings Random Walk to sample the graph uniformly. Similarly, the work presented in [20] introduces backward edge traversals and degree-proportional jumps to work with random walks and random jumps to extract samples from directed graphs.
In structured data mining and machine learning, the focus has been on developing methods to extract small subgraphs and use them to learn models [21] and study complex network processes [22]. In data mining, domain knowledge plays an important role as discussed in [23] where the authors highlight the advantage of using domain knowledge within the discovery process. In [24] the authors develop a decision support expert system by mining a large dataset of a telecommunications operator. Interestingly, the authors use the domain knowledge in the prepossessing phase and extract a sample based on the area and activity of customers for further analysis. However, it is still an open question how domain knowledge can play some role (if any) in extracting good samples. In structured decision making, Influence Diagrams (IDs) graphically represent the relationships between different entities. Influence Diagrams are acyclic directed graphs modeling decision problems under uncertainty and are used as a tool for decision support systems design. The work in [25] presents a sampling algorithm for solving influence diagrams and reduces the problem of solving an ID by first transforming it into a Bayesian Network. The authors in [26] apply importance sampling [27] for selecting actions in Influence Diagrams. Fundamentally, graph sampling and sampling in Influence Diagrams have different goals. In graph sampling, we aim at producing a small representative graph that has the same properties as that of the original graph while sampling in Influence Diagrams is focused on selecting an optimal strategy or action that yields the highest expected gain.
Graphs
Formally, we represent a network as a graph
Graph properties
There is a broad range of measures, also called properties, which characterize a graph. Some properties are local to a node or an edge, e.g., local clustering coefficient, while others are global, e.g., assortativity of a graph. A number of graph properties have been explored by the research community [28], however, we limit ourselves to study the following six properties of a graph. Moreover, we consider only undirected graphs in this work.
where
where the terms
Global clustering coefficient gives an indication of the clustering in the whole network. It is often called transitivity.
In Statistics, sampling is the process of selecting a subset of individuals from a population of interest. Generally, the population is huge and through sampling we extract a subset of individuals that fairly represent the whole population and by studying the sample we may fairly generalize our results back to the population.
Graph sampling
Graph sampling (also called network sampling) is the technique to extract a subset of nodes and/or edges from the original graph. There is a variety of applications of graph sampling, e.g., visualize social networks [29], scale down big Internet graphs [30], graph sparsification [31] and network crawling [32, 33, 4]. The purpose of sampling depends on its application. In some cases, the whole graph is known in advance and sampling is performed to pick a smaller subgraph for quick visualization. In other cases, the graph is unknown and unbounded and the purpose of sampling is to explore the graph.
Formally, let
Categories of sampling methods
In this section, we discuss previous sampling techniques that could be broadly divided into three categories: Node Sampling, Edge Sampling and Traversal Based Sampling
Node sampling
In Node Sampling (NS), we select the nodes at random from the actual graph and then induce the edges among the sampled nodes. In the basic algorithm of node sampling, we select nodes uniformly at random from the actual graph to construct the sample node set
Edge sampling
Similar to selecting the nodes at random, we can also select edges at random. In classic Edge Sampling (ES), we focus on selecting the edges at random to form the edge set
Traversal based sampling
The basic idea of traversal based sampling is that we first select a seed node at random and then explore the nodes in its neighborhood. Breadth First Sampling (BFS) and Depth First Sampling (DFS) are two classic examples of traversal based sampling that explore the graph breadth-wise and depth-wise respectively. In Random Node Neighbor (RNN) sampling, we select a node uniformly at random and then include all of its neighbors along with the corresponding edges to the sample graph. Interestingly, RNN under-estimates the degree of nodes and if we include all the edges among the sampled nodes, it over-estimates the degree of nodes. Another example is Snowball Sampling (SS) [7] where we select a seed node uniformly at random and then perform partial breadth first search. In Random Walk (RW) sampling, we select a seed node at random and then perform a random walk on the graph. In [1], the authors propose the Forest Fire Sampling (FFS) method. FFS is a partial breadth first sampling method in which we pick a seed node at random and then burn a fraction of its outgoing edges along with the nodes on the other end. The fraction of nodes to be burned is selected from the geometric distribution with mean pf/(1-pf) where pf is the forward burning ratio with a recommended value of pf
Our approach to sampling
In this section, we present some basics of our approach, main intuition and the concepts of Candidate List and graph induction. Since we utilize the list of discovered but yet unselected nodes, we call our approach List Sampling (LS).
List sampling
The basic idea of List Sampling (LS) is to keep a list of nodes that have been discovered by the sampling algorithm but have not yet been sampled. This small reservoir of, discovered but yet unsampled, nodes provides us numerous possibilities of i) how we select the nodes and ii) how many nodes we select simultaneously from this pool of potential candidates. List Sampling is a framework that serves as a new approach to utilize the already discovered nodes in sampling instead of rediscovering the nodes repeatedly and adding them to the sample. In LS, we visit the already discovered nodes and also add the new nodes to the pool from the neighborhood of the discovered nodes.
Breadth First Sampling (BFS) and Depth First Sampling (DFS) are two classic approaches that utilize the discovered nodes in a predefined manner. In BFS, we add all the neighbors of the sampled node in a list and then visit the nodes in the list in a FIFO (first in first out) manner, i.e., sample the oldest node in the list. This FIFO approach samples the nodes close to the seed node and explores the breadth of the graph very well but compromises the depth of the graph and hence underestimates the diameter of the graph. The nodes sampled in BFS belong to a small region around the starting node and does not explore the graph in depth. Furthermore, it has been shown in the previous studies [6, 4, 16, 34] that BFS is biased to high degree nodes and this bias produces overestimated sample graphs. In DFS, we deepen the sampling to the leaf nodes in the original graph and then recursively backtrack to the inner nodes and then back to the leaf nodes in the graph. This back and forth sampling of leaf nodes samples the nodes far away from the seed node and explores the depth of the graph very well but compromises the breadth of the graph. The sampled nodes in DFS belong to a particular region far away from the seed node and do not produce good samples. In other words, if we think of a graph as a two dimensional plane, BFS explores the graph in breadth dimension and DFS explores it in depth dimension as shown in Fig. 1. In order to explore the breadth and depth of the graph, we need a two dimensional approach and list sampling provides a framework to such a two dimensional sampling of graphs.
List sampling vs. Breadth first vs. depth first sampling.
In List Sampling, we sample the nodes from the list of the discovered nodes in a random fashion instead of sampling the nodes in a predefined manner. This random first sampling traverses the graph in both breadth and depth dimensions as shown in Fig. 1 and removes any bias generated by the breadth or depth first sampling. It balances the depth and breadth of the sample graph and produces better samples than the previous approaches. Moreover, we also introduce the concept of multiple selection from the list of discovered nodes. Since the list of discovered nodes grows quickly, selecting multiple nodes simultaneously from the list could balance the two dimensional exploration of the graph by selecting nodes in both dimensions at a time and could help to find better samples with relatively smaller sample sizes.
List Sampling has two basic steps. The first step is the node selection step in which we select nodes to form the node set
List Sampling is based on the selection of nodes from a pool of possible candidates instead of selecting the nodes from the whole graph. We exploit two key observations. First, selecting the nodes in a predefined manner as in BFS or DFS confines the sampling to the area either close to the seed node or far away from it and results in overestimation or underestimation of some properties of the graph. Second, the unvisited neighborhood of the sampled nodes grows quickly but is visited slowly because we visit exactly one neighboring node at a time, this results in poor estimation of the topology of the graph. Finding the middle ground led us to introduce the concept of
A Candidate List (CL) is populated by adding all the direct neighbors of the selected nodes and depopulated when we select (and delete) a node from it to form the sample set
A small graph of 9 nodes.
Let’s suppose we sample the graph of Fig. 2 such that we select a node uniformly at random together with all of its neighbors. If we select node B, then
The interested readers are referred to [35] for understanding the role of the graph induction step in network sampling. The paper compares algorithms such as Random Walk and Forest Fire Sampling with and without induction. The authors conclude that a clear difference exists between the techniques with subgraph induction and without it. Generally, the techniques with the induction step improve the performance of the corresponding techniques without it.
List sampling framework
In this section we formally present two algorithms, under the title List Sampling (LS), that differ on the basis of how we select the nodes from the CL. In principle, LS provides us a framework with two key parameters. i) the probability, denoted as p, of a node to be selected from the CL ii) the number of nodes, denoted as k, sampled simultaneously from the CL. We represent the List Sampling framework as
The first parameter p, the probability of a node to be selected from the CL, offers us the flexibility of selecting a node with a certain probability from the pool of available nodes and this probability could be based on any characteristic of a node e.g., its degree. There are numerous possibilities to define the value of p, e.g., we can select nodes uniformly at random from the CL or we can select a node from the CL with a probability directly or inversely proportional to its degree. We call this parameter a biasing parameter because it biases our selection of nodes from the CL and we can either prefer high degree or low degree nodes for sampling.
The second parameter k, the number of nodes selected simultaneously from the CL, helps an algorithm to explore the neighborhood properly. We assume that if there are many potential candidates in the CL and we select only one node at a time, we cannot explore the neighborhood extensively. We call this parameter a multiple selection parameter. There are countless possibilities to select k (
We calculate k as the square root of the size of the Candidate List. We will discuss the intuition behind this value of k in the analysis section (Subsection 8.3).
Supposedly, the Candidate List is a good trade-off between sampling by randomly selecting the nodes and sampling by exploring the neighborhood. We have two key intuitions. First, selecting the nodes at random from the Candidate List simulates the same effect as that of selecting the nodes randomly from the graph and helps us in overcoming the degree bias of BFS and DFS. Second, since the Candidate List does not hold the neighboring nodes of the current node but of all the nodes sampled so far, therefore, when we select multiple nodes repeatedly from the CL, it is highly likely that we explore the neighborhood of each node accurately. That helps us in retaining the structure and connectedness of the graph. Intuitively, the graph induction step further strengthens the overall connectivity of the sampled graph and helps in maintaining its topology.
The simplest possible algorithm with List Sampling framework
Algorithm 1
In the first algorithm, we select nodes uniformly at random from the Candidate List (CL) without considering the degree of the nodes. Let us suppose that the CL contains r number of nodes, i.e., CL
The main intuition behind this algorithm is the belief that randomly selecting the nodes picks both low degree and high degree nodes with equal probability and hence their average degree comes out to be close to the actual degree of the graph. We select a seed node at random from the node set V of the actual graph and add all of its neighbors to the CL. We then select k number of nodes uniformly at random from the CL, add them to the sample set
In the second algorithm, we bias our selection to the low degree nodes when we select nodes from the Candidate List. The main intuition behind this algorithm is the observation that sampling, in general, is intrinsically biased to pick high degree nodes from the actual graph in one way or another. In this algorithm, the probability of a node being selected from the Candidate List of size r depends on the degree d of a node and its frequency f. The frequency f of a node tells us how many of its neighbors have already been sampled. If the frequency of r nodes in the CL is represented as
Where
Intuitively, we adopt the idea that a node should be given some privilege of increasing its probability depending on how many of its neighbors have been sampled so far. We call this algorithm List Sampling 2 (
All the algorithms LS0, LS1 and LS2 are presented in algorithm #1 as List Sampling Framework. Please note that LS0 and LS1 differ in the value of k, the number of nodes selected from the CL while LS1 and LS2 differ in the value of p, the probability to select the nodes from the CL. By comparing these algorithms with one another, we can see the effect of changing the value of p and k in
The worst case running time of List Sampling as given in algorithm #1 can be calculated as follows. Let N be the total number of nodes and M be the total number of edges in the original graph G. It takes O(N) time to calculate the normalization factor and O(N) time to find the normalized probabilities of nodes in the Candidate List. The while loop can run N number of times at maximum and therefore, the time to build the CL comes out to be of the order of
Algorithm #1 implements List Sampling in a naive way so that the reader could grasp the idea easily. However, a careful implementation with Probability Proportional to Size or Roulette Wheel Sampling can reduce the time significantly. In these implementations, we can use the fact that the CL could be sorted in O(logN) time and a binary search thereafter can select a node from the list in O(logN) time. Similarly, the issue of constantly updating the normalization factor and normalized probabilities could be handled with a data structure called Binary Indexed Tree where an array of cumulative sum form could be updated in O(logN) time. Based on these facts, we believe that a careful implementation of List Sampling would cost O(NlogN+M) time to extract a sample.
Evaluation criteria and data sets
The question, “how to measure the goodness of a sampling algorithm?”, has no single answer and there is no single criteria to evaluate the results of a sampling algorithm. However, the research community [1, 2] has used some properties e.g., degree along with some statistical tests to measure the performance of a sampling algorithm. In this paper we measure the performance of our algorithms by comparing six properties of sampled graphs with the original graph, draw the distributions of properties where possible and perform three statistical tests for quantitative measures.
Point statistics
A point statistic is a single value statistic that shows the value of a property at a single point. We vary the sampling fraction
For example, we measure the average degree (Eq. (1)) of the sample graph and the original graph and find the scaling ratio (Eq. (11)) of degree by dividing the average degree of the sample graph
A distribution is a multivalued statistic and shows thedistribution of a property in a graph. For example, the degree distribution shows the fraction of nodes that have degree greater than or less than a particular value. We find and plot the Empirical Cumulative Distribution Function (ECDF) of degree, clustering coefficient and path lengths of sample graphs at
[h] InputInput OutputOutput VarVariables InitInitialize
Original Graph G
Calculate the probability of nodes accordingly
num
List Sampling Framework
Distance measures for quantitative comparison
A good sample has properties that closely approximate the original graph. Therefore, the distance between the property in sample graph
Root Mean Square Error: We use the common measure for the quality of estimation by Root Mean Square Error (RMSE), given as
where
Kolmogorov Smirnov Distance: In statistics, the Kolmogorov Smirnov (KS) test is a nonparametric test used to compare a sample with the reference probability distribution. It quantifies a distance
where
Jensen-Shannon Distance: In probability theory, the Jensen-Shannon Divergence measures the similarity between two probability distributions. The Jensen-Shannon Divergence is based on Kullback-Leibler Divergence (KLD) but in opposition to KLD, the JS Divergence is symmetric and its square root is a true metric often referred to as Jensen-Shannon Distance (JSD). The JS Divergence is calculated as
where
Characteristics of data sets used in the experiments
In our experiments we use ten real networks available publically. We use one autonomous network Skitter [36], two collaboration networks DBLP [36] and CiteSeer [37] and seven social networks Gowalla [38], Amazon [38], FourSquare [37], YouTube [38], Lastfm [37], Hyves [39] and Flicker [36]. All the data sets used in this work are clean, not contaminated and there are no missing values in these datasets. Table 1 summarizes the characteristics of these networks.
Analysis of list sampling
In this section, we study LS analytically and compare it with some of the previous sampling methods in order to show the characteristics of LS that lead it to produce better samples than the previous approaches. In LS, a node is sampled in two steps. In the first step, it is added to the CL and in the second step it is picked from the CL. The first step, adding a node to the CL, depends on the probability that a neighbor of the node has been sampled and added to the sample set
Formally, let
where
Now we can re-write Eq. (15) as
where
Formally, if
The seed node
(a, b) fraction of nodes of degree d added to and picked from the Candidate List , (c, d) fraction of nodes of degree d added to the sample graph Gs.
For further analysis, we reconsider Eq. (15) that calculates the probability that a random node
In the above paragraph, we intuited that if
and
where
For comparison with TIES and FFS, we also find the fraction of nodes of degree d sampled and added to the sample graph
The value of
As discussed above, LS samples a node in two steps and this two step approach makes it superior to previous approaches like TIES and FFS. Nasreen et al. has shown in their work [2] that TIES favors high degree nodes. Basically, TIES is based on Edge Sampling (ES) and ES is inherently biased to select high degree nodes. In addition, TIES also performs the graph induction step in which it augments all the edges between the nodes in
In the next Subsections 8.1 through 8.5, we perform some experiments to supplement the above analysis and highlight different features of LS. We compare TIES and FFS (where required) with LS1 only for ease of explanation but both LS1 and LS2 show similar results.
Original vs. sampled degree in (a, b) Gowalla and (c, d) YouTube networks.
We illustrate the effect of selecting high degree nodes in the original graph through an experiment. We extract samples at
In contrast, LS1 picks a lesser number of high degree nodes, the two step sampling approach further reduces this bias and when this moderate overestimation is combined with the induction step, LS1 produces better samples than TIES. The results of this experiment shows that it is wise to overestimate the degree in the original graph but this overestimation should be moderate otherwise we will produce substandard samples. Since FFS does not perform induction and all the neighbors of a node in the original graph are not sampled, therefore, FFS underestimates the degree in the sample graph.
Sampled vs. induced edges in (a) Gowalla and (b) YouTube networks.
We know that both TIES and LS perform the induction step and induces additional edges between the nodes in
The graph shows that in TIES, on average, almost 90% edges in
Effect of value of k on the degree and clustering coefficient of sample graphs.
The second parameter
The graphs show that as we select more nodes from the CL simultaneously, the scaling ratio increases. The graphs also show that there is no magic value for k that might work for all data sets. We see that for Gowalla network
Size of the candidate list
One may argue that selecting k number of nodes simultaneously from the Candidate List and adding their neighbors to it could foster the growth of the CL and hence we need lots of space to store the CL. We intuit that i) after a few iterations the CL grows bit by bit because most of the nodes are rediscovered and ii) LS selects both high and low degree nodes from the CL, as opposed to BFS that favors high degree nodes, and therefore after every iteration, a manageable number of nodes are added to the CL.
Size of the Candidate List for all datasets, 
We conduct a simple experiment where we extract samples at
We performed time complexity analysis in Section 6.4 and concluded that the worst case running time of LS as implemented in algorithm#1 is
The experiment is performed using an Intel core i7 processor @ 3.40 GHz with 64 GB of main memory. We present the results in Table 2 and the time is calculated in minutes. As expected, TIES is the most time efficient method because it samples random edges in
Running time (in minutes) of different sampling algorithms
Running time (in minutes) of different sampling algorithms
In this section, we evaluate and compare List Sampling LS0, LS1 and LS2 with three algorithms, FFS [1], TIES [2] and Random Degree Node (RDN) sampling. It has already been shown in [1, 2] that these methods outperform the traditional algorithms like Node Sampling (NS), Edge Sampling (ES) and Random Walks (RW) etc., therefore, we compare List Sampling with FFS, TIES and RDN only.
The Forest Fire Sampling (FFS) is a partial breadth first sampling method in which we pick a seed node at random and then burn a fraction of its outgoing edges along with the nodes on the other end. The fraction of nodes to be burned is selected from the geometric distribution with mean pf/(1-pf) where pf is the forward burning ratio with a recommended value of pf
From the presentation point of view, we make two groups of these algorithms. In group A, we compare LS0, LS1, LS2 and FFS because all these algorithms apply traversal based methods to extract samples. In group B, we compare LS1, LS2, TIES and RDN. Here, we evaluate two versions of LS with two algorithms that randomly select nodes or edges from the graph by exposing the whole graph as opposed to LS that mines a portion of the graph to extract samples.
Point Statistics of all networks [Group A]: (D1–D10) Degree , (C1–C10) Clustering Coefficient, (P1–P10) Path Length.
Point Statistics of all networks [Group B]: (D1–D10) Degree, (C1–C10) Clustering Coefficient, (P1–P10) Path Length.
We vary the sampling fraction
RMSE values of degree, clustering coefficient and path length
RMSE values of degree, clustering coefficient and path length
In order to measure the farness of sampled and original graph properties, we calculate the Root Mean Square Error (RMSE) for degree, clustering coefficient and path length of point statistics presented in Figs 8 and 9 above. We present the results in Table 3 where a value is averaged over all sampling fractions. In the case of degree and clustering coefficient, on average, LS2 generates slightly less error than LS0 and LS1 while LS1 is a little better than LS0 and LS2 in estimating the path length. All three versions of LS perform considerably better than FFS, TIES and RDN. We see that the previous methods generate significant error in estimating the properties of the original graph.
Distributions of all networks at 
Distributions of all networks at 
We present the distributions of degree, clustering coefficient and path length in Figs 10 and 11 for all datasets for group A and B methods respectively. We plot the Empirical Cumulative Distribution Function (ECDF) at a 10% sampling fraction along with the distributions of the actual graphs. These distributions convey more information than point statistics. We selected
Kolmogorov-Smirnov (KS) Distance of all networks.
The KS Distance measures the maximum distance between the two distributions. We present the average KS Distance for all networks in Fig. 12 where a bar value is calculated by averaging over 10 readings and 5 sample sizes. The last bar set shows the average of all datasets. With a few exceptions, LS outperforms TIES, RDN and FFS in all the networks across all the metrics. The exceptions are; DBLP network in degree metric, Skitter network in clustering coefficient metric and CiteSeer, FourSquare and Hyves networks in path length metric. On average, LS shows lower KS distance than TIES, RDN and FFS in all the networks across all three metrics. We also see that LS2 gives a lower distance than LS0 and LS1 in degree and clustering coefficient while in path length it is the opposite case.
Jensen-Shannon (JS) Distance of all networks.
While the KS distance measures the maximum distance between the two distributions, the JS Divergence calculates the similarity measure between the two distributions across the entire range of values and its square root is known as the JS Distance. We compute the average JS Distance for all the networks. We present the results in Fig. 13 where a bar value is calculated by averaging over 10 readings and 5 sample sizes. The last bar set shows the average of all datasets. With only two exceptions, DBLP in degree and Hyves in path length, we see that the average JS Distance of LS is less than that of TIES, RDN and FFS in all datasets. The reason is that LS follows the distributions better than other methods and hence gives lower JS distance. On average, LS2 slightly performs better than LS0 and LS1 in degree and clustering coefficient metrics.
Effective diameter, assortativity and global clustering coefficient
While the properties of a graph such as degree, clustering coefficient and path length are local to a node, the properties such as diameter, assortativity and global clustering coefficient represent the overall structure of a graph. The 90% effective diameter tells us the diameter below which 90% of all shortest path lengths fall. Assortativity quantifies the tendency of nodes in a graph to connect to others that are similar in some way. It is also referred to as degree mixing and ranges from
90% Effective Diameter, Assortativity and Global Clustering Coefficient of all networks [Group A]: (E1–E10), 90% Effective Diameter, (A1–A10) Assortativity, (G1–G10) Global Clustering Coefficient.
90% Effective Diameter, Assortativity and Global Clustering Coefficient of all networks [Group B]: (E1–E10), 90% Effective Diameter, (A1–A10) Assortativity, (G1–G10) Global Clustering Coefficient.
The first two rows (E1–E10) of Figs 14 and 15 show the 90% effective diameter of all networks at different sampling fractions. We present it as the scaling ratio to its original value, i.e., sampled values divided by the original value of the network. We see that LS performs better than TIES, RDN and FFS in most of the networks. TIES and RDN stand second while FFS gives very high values of diameter. The second and third rows (A1–A10) of Figs 14 and 15 show the assortativity values and we see that LS is more consistent than TIES, RDN and FFS as these methods overestimate or underestimate the values in some networks. Although FFS also maintains assortativity mixing for some networks, it does not perform well in retaining other properties. TIES and RDN perform uniform edge and node sampling respectively by exposing the whole graph and hence they have more chances of picking nodes of varying degree from the graph and hence maintaining their mixing patterns but still fail in some networks. LS, on the other side, explores a limited region of a graph but the assortativity results show that it can still maintain the overall structure and mixing patterns of nodes well, proving its supremacy over the existing approaches. Similarly, in Figs 14 and 15 (G1–G10) that show the values of the global clustering coefficient of all the networks, we see that LS is better than TIES, RDN and FFS in retaining the ratio of triangles in the original graph. It means that LS not only preserves node properties e.g., degree, clustering coefficient but it also maintains the overall structure of the graph.
We calculate the Root Mean Square Error (RMSE) for a 90% effective diameter, assortativity and global clustering coefficient and present the results in Table 4 where a value is averaged over all sampling fractions. In case of the 90% effective diameter, all three versions of LS perform considerably better than FFS, TIES and RDN. In case of assortativity, all methods give close results while in estimating global clustering coefficient, LS and FFS are substantially closer and better than TIES and RDN.
RMSE values of 90% effective diameter, assortativity and gloabl clustering coefficient
RMSE values of 90% effective diameter, assortativity and gloabl clustering coefficient
The results show that LS outperforms state-of-the-art FFS, TIES and RDN algorithms in all the networks across all metrics and statistical tests. Our algorithms give good and consistent values at almost all sampling fractions in point statistics of degree, clustering coefficient and path length and closely estimate their distributions compared to FFS, TIES and RDN. In addition, the statistical tests performed for the quantitative evaluation of sampling algorithms also prove the supremacy of our sampling approach over the existing approaches across all the three metrics. Moreover, LS also performs better in maintaining the overall structure of the graph as shown by the assortativity and global clustering coefficient results. We summarize the results in the following points:
The three versions of List Sampling i.e., LS0, LS1 and LS2 extract good samples from the original graph. The samples extracted by these methods are both qualitatively and quantitatively superior than the samples produced by the previous methods discussed in this paper. Of the three versions of LS presented in this paper, it is very hard to prefer one over the other because all three methods produce reasonably close results. However, it seems that LS2 is marginally better than LS0 and LS1 in terms of degree and clustering coefficient measurements while LS1 seems to measure path lengths better than its two counterparts. For measuring assortativity and global clustering coefficient, we could not find a big difference in their performances. FFS is also a traversal based method and performs well in maintaining the overall structure of the graph as seen in assortativity and global clustering coefficient results. However, it underestimates the degree and clustering coefficient and overestimates the path lengths. Both TIES and RDN rely on randomly picking the edges/nodes from the graph and are biased to pick high degree nodes. The selection of high degree nodes leads them to produce overestimated samples in terms of degree. And the random selection results in high path lengths (although induction step compensates it and as a result they have better results than FFS). TIES and RDN perform reasonably good in estimating clustering coefficient and assortativity.
The current sampling methods fail to accurately match the distributional properties of the original graph and preserve its topology. Intuitively, we observe that the neighborhood of a node being sampled is not explored properly by the current methods and as a result some methods overestimate the properties of the original graph while others underestimate them. We find a middle ground by introducing the novel concept of a pool of candidate nodes and pick nodes from this reservoir instead of the whole graph and produce good samples. We present a framework with two parameters and by tuning these two parameters, we can design new algorithms. In principle, this framework provides us the basic infrastructure and guidelines for designing new sampling approaches with flexibility. We present two algorithms based on this framework and prove their efficacy over the current state-of-the-art methods by using real world data sets and demonstrate that our algorithms produce better samples than the previous sampling methods. We also perform statistical tests for the quantitative evaluation of our algorithms and show that our algorithms improve by a factor of 2x to 5x on average when compared to the existing methods.
In the future, we intend to extend this framework for directed and labeled graphs. One possibility of sampling directed graphs is to consider only the out-going edges when adding neighbors of a node to the candidate list. For example, if a node has three out-going and two in-coming edges, we add its three neighbors that are on the other end of the out-going links, to the candidate list and proceed this way to maintain its out-going degree distribution. A similar technique could be worked out for in-coming degree distribution. Another direction to work on in the future is to sample the dynamic and streaming graphs where new nodes/edges are constantly arriving in the graph. A possible solution could be looking at the new neighbors of a sampled node as they arrive in the network and sample them with a certain probability.
Footnotes
Acknowledgments
This research was supported by the ICT R&D program of MSIP/IITP, [R7117-16-0219, Development of Predictive Analysis Technology on Socio-Economics using Self-Evolving Agent-Based Simulation embedded with Incremental Machine Learning] and by Korea Institute of Science and Technology (KIST) under the project “Development of Tangible Social Media Technologies for Smart Aging”.
