List sampling for large graphs

Abstract

Real world graphs are massive in size and often prohibitively expensive to analyze. One possible solution is sampling, i.e., extracting a small subgraph that faithfully represents the original graph. Prior research has developed several sampling methods, but the samples produced by these methods fail to match important properties of the original graph and preserve its topology poorly. We observe that the existing methods do not explore the neighborhood of sampled nodes fairly and hence yield suboptimal samples. In this paper, we introduce a novel approach in which we keep a list of candidate nodes that is populated with all the neighbors of the nodes sampled so far. With this approach, we can balance the depth and breadth of graph exploration to produce better samples. We evaluate the effectiveness of our approach using several real world datasets and show that it surpasses the existing state-of-the-art approaches in maintaining the properties of the original graph and retaining its structure. We also calculate the Kolmogorov-Smirnov distance and the Jensen-Shannon distance for a quantitative evaluation of our approach.

Keywords

Graph sampling, big graphs, social network analysis

1. Introduction

Networks are all around us, including technological networks (e.g., the Internet, telephone networks), social networks (e.g., Facebook, Twitter), information networks (e.g., World Wide Web, citation networks), biological networks (e.g., biochemical networks, food webs) and many more. Network graphs, or simply graphs, are a formal representation of these networks and provide a structural model that makes it possible to analyze and understand different properties of a network. By analyzing these graphs, we extract valuable information that can inform both business-oriented and technology-oriented decisions. Graph analysis is simple when the graph is small, but real world graphs are too massive to manage and analyze efficiently. Real world graphs can have billions of nodes and edges, and fully analyzing them requires so much memory, computational power and processing time that studying such enormous graphs becomes prohibitively expensive.

Sampling small subgraphs from large graphs is one of the possible solutions, provided that the small subgraph is a good representation of the large graph. The sampled subgraph is supposed to maintain the properties of the graph (e.g., degree, clustering coefficient, path length etc.) and match the distributions of these properties as measured in the actual graph. Given a large graph $G=(V,E)$, a sampling method selects a subset of nodes ($V_{s}\subset V$) and edges ($E_{s}\subset E$) to form a subgraph $G_{s}=(V_{s},E_{s})$. In accordance with the previous work [1, 2], we assume that a good representative subgraph $G_{s}$ of graph $G$ is one that closely approximates different parameters of the graph (e.g., average degree, average clustering coefficient) and keeps the structure of $G_{s}$ close to that of $G$.

The research community has proposed several sampling approaches in the past. These sampling approaches could be divided into two categories according to their primary goals. In the first category, the main goal is to estimate the properties of the graphs and therefore, the focus is on the quick exploration and estimation of some of the characteristics of those networks that are hard to analyze in whole or have a high rate of churn. The works in [3, 4, 5] fall in this category, where the online social networks were studied and explored. In the second category, the primary goal is to create a representative subgraph that matches the properties of the original graph. Sampled subgraphs can be used in place of large networks that are too costly to handle and analyze in full. The need to crawl large social networks and/or test network protocols is the main motivation behind this research. For example, Snowball Sampling [6] or Random Walks [7] are often used to collect data from large online social networks. However, previous approaches have had limited success in building subgraphs accurately and reproducing key properties of the actual graph especially in terms of the distribution of the various properties.

Our work falls in the second category and we focus on building a representative subgraph $G_{s}$ from the original graph $G$ such that $G_{s}$ has characteristics very close to those of $G$. In Breadth First Search (BFS) and Depth First Search (DFS), we explore the neighboring nodes one by one in a predefined order, and when we apply these approaches to graph sampling, we either overestimate or underestimate the properties of the original graph. Intuitively, if the list of discovered but not yet selected nodes is available, we have considerable flexibility in i) how we select the nodes from this list and ii) how many nodes we select simultaneously. This intuition led us to introduce the novel idea of selecting the nodes to be included in the sample graph from a pool of potential candidates. In this paper, we present a set of algorithms that vary in how we select the nodes and how many nodes we select at a time from the list of discovered but not yet sampled nodes. Our algorithms estimate the characteristics of the graph more accurately and outperform the state-of-the-art sampling approaches. The main contributions of our work are:

• We introduce the concept of Candidate List, i.e., a reservoir of candidate nodes to be included in the sample graph.
• We present a new framework for sampling large graphs. This framework provides us the foundations of using the Candidate List by tuning its two parameters. We also present theoretical analysis of this framework and conduct some experiments to highlight its important features.
• We propose two novel sampling algorithms based on our framework that contrast on the basis of the probability of selecting a node from the Candidate List. We present the key intuition behind them and examine their features.
• We compare our algorithms with the existing state-of-the-art algorithms using real world data sets and show that our algorithms outperform the existing algorithms.
• Lastly, we perform statistical evaluation of our algorithms using Root Mean Square Error, Kolmogorov-Smirnov distance and Jensen-Shannon distance.

The rest of the paper is organized as follows. We start with the related work in Section 2. In Section 3, we discuss graphs in general and their important properties. In Section 4, we present the graph sampling problem and categories of sampling methods. In Section 5, we present our approach to sampling in detail and the main intuition behind it. In Section 6, we formally present the List Sampling framework and propose two sampling algorithms. In Section 7, we present the evaluation criteria and ten real world datasets used for the evaluation. In Section 8, we study the List Sampling framework in detail, perform theoretical analysis and conduct some experiments to highlight its features. In Section 9, we evaluate our algorithms against two existing methods, perform statistical tests and show that our approach outperforms these existing methods. We conclude the paper in Section 10 with some future work directions.

2. Related work

Graph sampling has been used in different flavors in many diverse fields of research including statistics, social science, data mining and machine learning, to name a few. In the past, the work was centered on graph compression and studying the properties of sampled networks. The work in [8, 9] focuses on using sampling to reduce the graph for better visualization. The work in [6, 10, 7] studies the properties of samples of complex networks produced by traditional sampling algorithms such as node sampling, edge sampling and random walk based sampling. The work in [10] shows that the sampled subgraphs of a scale free network are not scale free. Conversely, the work in [11] shows that the samples produced by traceroute sampling follow a power law. This line of work characterizes the bias that sampling introduces when estimating different properties of a graph.

A lot of research has been done on Internet measurement, i.e., measuring the topology of large-scale networks, such as Peer-to-Peer Networks, the World Wide Web and Online Social Networks. The work in [12] applied a variation of Random Walk called Re-Weighted Random Walk (RWRW) to sample Peer-to-Peer networks. The work in [5, 13] used Random Walks with jumps and multiple dependent Random Walks to sample and estimate large networks. In [14], the authors compared different Random Walk based techniques for sampling web pages. Interested readers are referred to the survey [15] that compares different methods for crawling large networks.

There has been little work on sampling large graphs and finding a representative graph to estimate the properties of the actual graph. Leskovec and Faloutsos analyze various sampling algorithms for large graphs and propose Forest Fire Sampling (FFS) [1]. FFS is a partial breadth first sampling method in which we pick a seed node at random and then burn a fraction of its outgoing edges along with the nodes on the other end. More recently, Nasreen et al. proposed Totally Induced Edge Sampling (TIES) in [2]. The primary difference between TIES and the random Edge Sampling (ES) is the graph induction step. In TIES, the edges selected in the edge sampling step are augmented with every other edge of the original graph that connects two sampled nodes. For static graph sampling, FFS and TIES are considered state-of-the-art algorithms and produce better samples than their traditional counterparts. The authors in [16] analyze the bias of Breadth First Sampling (BFS) and Depth First Sampling (DFS) and introduce Random First Sampling (RFS), where we sample a node uniformly at random from the list of visited but unsampled nodes. However, the main focus of that paper is to find the minimum crawl size needed to estimate the properties of the graph. In [17], the authors propose Metropolis algorithms that refine the subgraphs produced by randomly selecting the nodes. The main idea is to replace the sampled nodes with other potentially good nodes to better match the properties of the actual graph. The work assumes that we have already estimated the properties of the actual graph, which might not be a realistic assumption. Furthermore, the algorithms are computationally expensive as they need to search for possible good nodes and, most importantly, the uniform random selection of nodes limits their practical use. The work in [18] provides a detailed study on the nature of biases in network sampling. The authors explored seven common sampling strategies including Breadth First Sampling, Depth First Sampling, Random Walk, Forest Fire Sampling, Degree Sampling, Sample Edge Count and Expansion Sampling. The work investigates connections between certain biases and graph representativeness. Interestingly, the authors point out that such bias could be beneficial for some applications.

The sampling methods discussed above work on undirected graphs but there are also a few methods that sample directed graphs. The authors in [19] transform a directed graph to an undirected graph by introducing reverse links and then apply Metropolis-Hastings Random Walk to sample the graph uniformly. Similarly, the work presented in [20] introduces backward edge traversals and degree-proportional jumps to work with random walks and random jumps to extract samples from directed graphs.

In structured data mining and machine learning, the focus has been on developing methods to extract small subgraphs and use them to learn models [21] and study complex network processes [22]. In data mining, domain knowledge plays an important role as discussed in [23], where the authors highlight the advantage of using domain knowledge within the discovery process. In [24], the authors develop a decision support expert system by mining a large dataset of a telecommunications operator. Interestingly, the authors use the domain knowledge in the preprocessing phase and extract a sample based on the area and activity of customers for further analysis. However, it is still an open question whether, and how, domain knowledge can help in extracting good samples. In structured decision making, Influence Diagrams (IDs) graphically represent the relationships between different entities. Influence Diagrams are acyclic directed graphs modeling decision problems under uncertainty and are used as a tool for decision support systems design. The work in [25] presents a sampling algorithm for solving influence diagrams and reduces the problem of solving an ID by first transforming it into a Bayesian Network. The authors in [26] apply importance sampling [27] for selecting actions in Influence Diagrams. Fundamentally, graph sampling and sampling in Influence Diagrams have different goals. In graph sampling, we aim at producing a small representative graph that has the same properties as the original graph, while sampling in Influence Diagrams is focused on selecting an optimal strategy or action that yields the highest expected gain.

3. Graphs

Formally, we represent a network as a graph $G=(V,E)$ with a node (or vertex) set $V=\{v_{1},v_{2},v_{3},\ldots,v_{N}\}$ and an edge (or link) set $E=\{e_{1},e_{2},e_{3},\ldots,e_{M}\}$ such that $N=|V|$ is the total number of nodes (also called the size of the graph) and $M=|E|$ is the total number of edges in the graph.

3.1 Graph properties

There is a broad range of measures, also called properties, which characterize a graph. Some properties are local to a node or an edge, e.g., local clustering coefficient, while others are global, e.g., assortativity of a graph. A number of graph properties have been explored by the research community [28]. In this work, we limit ourselves to the following six properties of a graph and consider only undirected graphs.

Degree: The degree of a node in a graph is the number of edges connected to the node and the average degree of a graph is the sum of degrees of all the nodes in a graph divided by the size of the graph. We represent the degree of node $v$ as $d_{v}$ , therefore, the average degree of a graph, denoted by $d_{\textit{avg}}$ , is given by

$\displaystyle d_{\textit{avg}}=\frac{1}{N}\sum_{i=1}^{N}d_{i}$ (1)

Clustering coefficient: The clustering coefficient of a node defines the probability that two randomly selected neighbors of the node are connected. The local clustering coefficient of node $v$ , denoted by $c_{v}$ , is calculated as the ratio of the existing edges between its neighbors to all the links that could possibly exist between them, given by

$\displaystyle c_{v}=\frac{2*L_{v}}{d_{v}*(d_{v}-1)}$ (2)

where $L_{v}$ represents the number of existing links between neighbors of $v$ and $d_{v}*(d_{v}-1)/2$ gives the number of all possible links. The average clustering coefficient of a graph, denoted by $c_{\textit{avg}}$, is the average of the local clustering coefficients of all the $N$ nodes in the graph:

$\displaystyle c_{\textit{avg}}=\frac{1}{N}\sum_{i=1}^{N}c_{i}$ (3)
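Eqs. (1) to (3) can be illustrated with a minimal Python sketch. It assumes the graph is stored as a dictionary that maps each node to the set of its neighbors; the function names, the toy graph and the convention of assigning a coefficient of 0 to nodes with fewer than two neighbors are illustrative assumptions, not part of the paper.

```python
from itertools import combinations


def average_degree(adj):
    """Eq. (1): sum of the node degrees divided by the number of nodes N."""
    return sum(len(neighbors) for neighbors in adj.values()) / len(adj)


def local_clustering(adj, v):
    """Eq. (2): 2 * L_v / (d_v * (d_v - 1)).

    Assumption: a node with fewer than two neighbors gets coefficient 0,
    because Eq. (2) is undefined for d_v < 2.
    """
    d_v = len(adj[v])
    if d_v < 2:
        return 0.0
    # L_v: number of edges that exist between pairs of neighbors of v.
    l_v = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return 2 * l_v / (d_v * (d_v - 1))


def average_clustering(adj):
    """Eq. (3): mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# Hypothetical toy graph: triangle a-b-c with a pendant node d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(average_degree(toy))
print(average_clustering(toy))
```

In the toy graph, node c has three neighbors of which only one pair (a, b) is connected, so its local coefficient is 1/3, while the pendant node d contributes 0 to the average.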

Path length: A path in a graph represents a route to get from one node to another by traversing edges in a graph and the length of a path is the number of edges traversed. To calculate the average path length, we find the shortest path lengths from every node to all other nodes in the graph and average them over all ordered pairs of distinct nodes. Let the shortest path length between $v_{1}$ and $v_{2}$ be denoted by $p(v_{1},v_{2})$; then the average path length of the graph, denoted by $p_{\textit{avg}}$, is given by

$\displaystyle p_{\textit{avg}}=\frac{1}{N*(N-1)}\sum_{i\neq j}p(v_{i},v_{j})$ (4)
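As a hedged illustration of Eq. (4), the sketch below runs one breadth first search per source node on an unweighted graph stored as a dictionary of neighbor sets. The function names are hypothetical, and the sketch assumes a connected graph; it raises an error otherwise, since Eq. (4) does not say how unreachable pairs are handled.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def average_path_length(adj):
    """Eq. (4): sum of p(v_i, v_j) over ordered pairs i != j, over N*(N-1)."""
    n = len(adj)
    total = 0
    for source in adj:
        dist = bfs_distances(adj, source)
        if len(dist) != n:
            # Assumption of this sketch: the graph must be connected.
            raise ValueError("graph is not connected")
        total += sum(dist.values())
    return total / (n * (n - 1))


# Hypothetical toy graph: the path 1 - 2 - 3 - 4.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(average_path_length(path))
```

For the four-node path, the unordered pair distances are 1, 2, 3, 1, 2 and 1, so the ordered sum is 20 and the average is 20/12.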

90% effective diameter: The diameter of a graph is the length of the longest shortest path over all connected pairs of nodes, i.e., the shortest distance between the two most distant nodes in the graph. Since the diameter is sensitive to degenerate structures in the graph, the 90% effective diameter is considered a more robust quantity than the diameter. The 90% effective diameter, or 90-percentile effective diameter, is defined as the minimum distance within which 90% of all connected node pairs lie. Restricting the measure to the closest 90% of node pairs dampens the effect of any very long chains protruding from the network.
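The definition above can be sketched as follows. The sketch returns the smallest integer distance that covers at least a fraction q of the connected ordered node pairs; this integer variant (some tools interpolate between integer distances instead) and the function name are assumptions made for illustration.

```python
from collections import Counter, deque


def effective_diameter(adj, q=0.9):
    """Smallest distance d such that at least a fraction q of all connected
    ordered node pairs are at distance <= d (integer, non-interpolated)."""
    counts = Counter()  # counts[d] = number of ordered pairs at distance d
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        counts.update(d for d in dist.values() if d > 0)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("graph has no connected node pairs")
    covered = 0
    for d in sorted(counts):
        covered += counts[d]
        if covered >= q * total:
            return d


# Hypothetical toy graph: a chain of 11 nodes (0 - 1 - ... - 10).
chain = {i: {j for j in (i - 1, i + 1) if 0 <= j <= 10} for i in range(11)}
print(effective_diameter(chain))         # 90% effective diameter
print(effective_diameter(chain, q=1.0))  # plain diameter
```

On the chain, the few node pairs near the two ends account for the longest distances, so the 90% effective diameter is smaller than the diameter.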

Assortativity: Assortativity quantifies the tendency of nodes in a graph to connect to others that are similar in some way. In this paper, we use the degree of a node as a similarity measure and calculate the assortativity coefficient, denoted by A, as the Pearson correlation coefficient of degree between pairs of connected nodes, given by

$\displaystyle A=\frac{1}{\sigma_{q}^{2}}\left[\sum_{jk}jk(e_{j,k}-q_{j}q_{k})\right]$ (5)

where $q_{j}$ and $q_{k}$ are values of the distribution of the remaining degree, i.e., the number of edges leaving a node other than the one that connects the pair. The term $e_{j,k}$ refers to the joint probability distribution of the remaining degrees of the two nodes at either end of an edge, and $\sigma_{q}^{2}$, the variance of the distribution $q$, is the normalization factor (see [28] for details).
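Because the Pearson correlation is unchanged when a constant is subtracted from both variables, the coefficient can be computed from plain degrees rather than remaining degrees. The sketch below does this by listing the degree pair of every edge in both directions; the function name and the choice to return NaN when all degrees are equal (zero variance) are assumptions of this illustration.

```python
import math


def degree_assortativity(adj):
    """Pearson correlation of the degrees at the two ends of each edge."""
    xs, ys = [], []
    for u in adj:
        for v in adj[u]:
            # Each undirected edge is visited twice, once per direction.
            xs.append(len(adj[u]))
            ys.append(len(adj[v]))
    n = len(xs)
    if n == 0:
        return float("nan")
    mean = sum(xs) / n  # xs and ys have the same mean by symmetry
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        return float("nan")  # undefined when all end-point degrees are equal
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    return cov / var


# Hypothetical toy graphs: a star (hub 0 with four leaves) and a triangle.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(degree_assortativity(star))
print(math.isnan(degree_assortativity(triangle)))
```

The star is perfectly disassortative, since every edge joins the high degree hub to a degree-one leaf.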

Global clustering coefficient: The global clustering coefficient is based on triplets of nodes. A triplet consists of three nodes that are connected by either two (open triplet) or three (closed triplet) undirected ties. The global clustering coefficient (GCC) is the ratio of closed triplets to the total number of triplets (both open and closed) in a graph. We define GCC as

$\displaystyle\textit{GCC}=\frac{\textit{number of closed triplets}}{\textit{number of connected triplets of nodes}}$ (6)

Global clustering coefficient gives an indication of the clustering in the whole network. It is often called transitivity.
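A minimal sketch of Eq. (6), assuming a dictionary-of-sets graph: every node with degree d is the center of d(d-1)/2 connected triplets, and a triplet is closed when its two outer nodes are themselves connected. The function name and the convention of returning 0 for a graph without triplets are illustrative assumptions.

```python
from itertools import combinations


def global_clustering(adj):
    """Eq. (6): closed triplets divided by all connected triplets."""
    closed = 0
    triplets = 0
    for center, neighbors in adj.items():
        d = len(neighbors)
        triplets += d * (d - 1) // 2
        # A triplet centered here is closed if its two end nodes are linked.
        closed += sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return closed / triplets if triplets else 0.0


# Hypothetical toy graph: triangle a-b-c with a pendant node d attached to c.
toy_gcc = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(global_clustering(toy_gcc))
```

The toy graph has five connected triplets, three of which are closed (one per corner of the triangle), which differs from the average of the local coefficients of the same graph (7/12).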

4. Sampling

In statistics, sampling is the process of selecting a subset of individuals from a population of interest. Generally, the population is huge, so through sampling we extract a subset of individuals that fairly represents the whole population; by studying the sample, we can generalize our results back to the population.

4.1 Graph sampling

Graph sampling (also called network sampling) is the technique of extracting a subset of nodes and/or edges from the original graph. Graph sampling has a variety of applications, e.g., visualizing social networks [29], scaling down big Internet graphs [30], graph sparsification [31] and network crawling [32, 33, 4]. The purpose of sampling depends on its application. In some cases, the whole graph is known in advance and sampling is performed to pick a smaller subgraph for quick visualization. In other cases, the graph is unknown and unbounded and the purpose of sampling is to explore the graph.

Formally, let $G=(V,E)$ represent the graph where $V$ is the set of nodes and $E$ is the set of edges in the graph. We represent an edge $e\in E$ as a tuple $e(v_{i},v_{j})$ where $v_{i},v_{j}\in V$. Given a sampling fraction $\phi$, the goal is to extract a sample graph $G_{s}=(V_{s},E_{s})$, where $V_{s}\subset V$ and $E_{s}\subset E$ such that $|V_{s}|/|V|=\phi$, that preserves the properties of the actual graph $G$. The sample node set is $V_{s}=\{v_{1},v_{2},v_{3},\ldots,v_{n}\}$ such that $n=|V_{s}|$ is the size of $G_{s}$ and the sample edge set is $E_{s}=\{e_{1},e_{2},e_{3},\ldots,e_{m}\}$ such that $m=|E_{s}|$ gives the total number of edges in $G_{s}$.

4.2 Categories of sampling methods

In this section, we discuss previous sampling techniques, which can be broadly divided into three categories: Node Sampling, Edge Sampling and Traversal Based Sampling.

4.2.1 Node sampling

In Node Sampling (NS), we select the nodes at random from the actual graph and then induce the edges among the sampled nodes. In the basic algorithm of node sampling, we select nodes uniformly at random from the actual graph to construct the sample node set $V_{s}$, and the sampled graph is then induced on $V_{s}$ by adding all the edges between the nodes in $V_{s}$. Generally, NS under-estimates the degree of nodes and produces samples with a large fraction of zero degree nodes. The authors in [10] showed that NS does not retain the power law degree distribution. Lee et al. [6] showed that NS selects nodes of varying degree but fails to preserve the original level of connectivity. In Random Degree Node (RDN) sampling, the probability of selecting a node is proportional to its degree; hence, high degree nodes tend to be over-represented and the sample graph has a higher degree than the actual graph.
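The basic NS algorithm can be sketched as below, assuming a dictionary-of-sets graph whose node labels are sortable. The function name, the rounding of the sample size and the ring-shaped toy graph are illustrative assumptions.

```python
import random


def node_sampling(adj, phi, rng):
    """Select round(phi * N) nodes uniformly at random, then induce edges."""
    n_target = max(1, round(phi * len(adj)))
    nodes = set(rng.sample(sorted(adj), n_target))
    # Graph induction: keep every original edge with both ends in the sample.
    edges = {frozenset((u, v)) for u in nodes for v in adj[u] if v in nodes}
    return nodes, edges


# Hypothetical toy graph: a ring of 100 nodes, each with exactly 2 neighbors.
ring = {i: {(i - 1) % 100, (i + 1) % 100} for i in range(100)}
ns_nodes, ns_edges = node_sampling(ring, 0.2, random.Random(42))
isolated = [v for v in ns_nodes if not (ring[v] & ns_nodes)]
print(len(ns_nodes), len(ns_edges), len(isolated))
```

Every node of the ring has degree 2, yet a sampled node keeps a neighbor only if that neighbor was also drawn, which is how NS ends up with many zero degree nodes in sparse graphs.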

4.2.2 Edge sampling

Similar to selecting the nodes at random, we can also select edges at random. In classic Edge Sampling (ES), we select edges at random to form the edge set $E_{s}$ and include both end nodes of every sampled edge to form $V_{s}$. Classic Edge Sampling only includes the edges that are sampled and performs no graph induction step. Generally, Edge Sampling produces sparsely connected graphs whose path lengths are longer than those of the actual graph. In [2], the authors worked out a variation of ES called Totally Induced Edge Sampling (TIES). The primary difference between TIES and ES is the graph induction step. In TIES, the edges selected in the edge sampling step are augmented with every other edge of the original graph that connects two sampled nodes. TIES mostly produces good samples and outperforms most of the existing algorithms, but we will show in the evaluation section that for many graphs it creates high degree samples and the sampled distributions of some properties do not follow the actual distributions. It is worth pointing out that sampling methods like NS, ES and TIES rely on uniform selection of nodes and/or edges from the whole graph. However, in real networks that are not fully known in advance and have to be crawled, it is very hard or nearly infeasible to generate a statistically valid node set or edge set where the nodes and/or edges are selected uniformly at random. This imposes a serious limitation on these methods for practical applications.
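The difference between classic ES and TIES can be made concrete with a small sketch. It assumes the graph is given as a list of edge tuples, draws edges in a random order without replacement and stops once a target number of nodes is covered; these details and the function names are assumptions for illustration and may differ from the procedure in [2].

```python
import random


def edge_sampling(edges, n_target, rng):
    """Classic ES: draw edges in random order until n_target nodes are covered."""
    pool = list(edges)
    rng.shuffle(pool)
    nodes, sampled_edges = set(), set()
    for u, v in pool:
        if len(nodes) >= n_target:
            break
        nodes.update((u, v))
        sampled_edges.add(frozenset((u, v)))
    return nodes, sampled_edges


def ties(edges, n_target, rng):
    """Edge sampling followed by graph induction on the sampled node set."""
    nodes, _ = edge_sampling(edges, n_target, rng)
    induced = {frozenset((u, v)) for u, v in edges if u in nodes and v in nodes}
    return nodes, induced


# Hypothetical toy graph: the complete graph on 6 nodes.
k6 = [(u, v) for u in range(6) for v in range(u + 1, 6)]
es_nodes, es_edges = edge_sampling(k6, 4, random.Random(0))
ties_nodes, ties_edges = ties(k6, 4, random.Random(0))
print(len(es_edges), len(ties_edges))
```

With the same random seed both variants cover the same nodes, but the induced edge set always contains the sampled one and is strictly larger in this dense toy graph.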

4.2.3 Traversal based sampling

The basic idea of traversal based sampling is that we first select a seed node at random and then explore the nodes in its neighborhood. Breadth First Sampling (BFS) and Depth First Sampling (DFS) are two classic examples of traversal based sampling that explore the graph breadth-wise and depth-wise respectively. In Random Node Neighbor (RNN) sampling, we select a node uniformly at random and then include all of its neighbors along with the corresponding edges in the sample graph. Interestingly, RNN under-estimates the degree of nodes, but if we include all the edges among the sampled nodes, it over-estimates the degree of nodes. Another example is Snowball Sampling (SS) [7], where we select a seed node uniformly at random and then perform a partial breadth first search. In Random Walk (RW) sampling, we select a seed node at random and then perform a random walk on the graph. In [1], the authors propose the Forest Fire Sampling (FFS) method. FFS is a partial breadth first sampling method in which we pick a seed node at random and then burn a fraction of its outgoing edges along with the nodes on the other end. The number of nodes to be burned is drawn from a geometric distribution with mean $p_{f}/(1-p_{f})$, where $p_{f}$ is the forward burning ratio with a recommended value of $p_{f}=0.7$, which means that $0.7/0.3\approx 2.33$ nodes are burned on average. Generally, sampling by exploring the neighborhood is considered a state-of-the-art sampling technique and generates better samples than NS and ES, but algorithms such as FFS, RW and SS still lack accuracy.

5. Our approach to sampling

In this section, we present some basics of our approach, main intuition and the concepts of Candidate List and graph induction. Since we utilize the list of discovered but yet unselected nodes, we call our approach List Sampling (LS).

5.1 List sampling

The basic idea of List Sampling (LS) is to keep a list of nodes that have been discovered by the sampling algorithm but have not yet been sampled. This small reservoir of discovered but not yet sampled nodes gives us many possibilities for i) how we select the nodes and ii) how many nodes we select simultaneously from this pool of potential candidates. List Sampling is a framework that serves as a new approach to utilize the already discovered nodes in sampling instead of rediscovering the nodes repeatedly and adding them to the sample. In LS, we visit the already discovered nodes and also add new nodes to the pool from the neighborhood of the discovered nodes.

Breadth First Sampling (BFS) and Depth First Sampling (DFS) are two classic approaches that utilize the discovered nodes in a predefined manner. In BFS, we add all the neighbors of the sampled node to a list and then visit the nodes in the list in a FIFO (first in first out) manner, i.e., we sample the oldest node in the list. This FIFO approach samples the nodes close to the seed node and explores the breadth of the graph very well but compromises the depth of the graph and hence underestimates the diameter of the graph. The nodes sampled in BFS belong to a small region around the starting node and do not explore the graph in depth. Furthermore, it has been shown in previous studies [6, 4, 16, 34] that BFS is biased towards high degree nodes and that this bias produces samples that overestimate the degree. In DFS, we deepen the sampling to the leaf nodes in the original graph and then recursively backtrack to the inner nodes and then back to the leaf nodes in the graph. This back and forth sampling of leaf nodes samples the nodes far away from the seed node and explores the depth of the graph very well but compromises the breadth of the graph. The sampled nodes in DFS belong to a particular region far away from the seed node and do not produce good samples. In other words, if we think of a graph as a two dimensional plane, BFS explores the graph in the breadth dimension and DFS explores it in the depth dimension, as shown in Fig. 1. In order to explore both the breadth and the depth of the graph, we need a two dimensional approach, and List Sampling provides a framework for such two dimensional sampling of graphs.

Figure 1.

List sampling vs. Breadth first vs. depth first sampling.

In List Sampling, we sample the nodes from the list of the discovered nodes in a random fashion instead of sampling the nodes in a predefined manner. This random first sampling traverses the graph in both breadth and depth dimensions as shown in Fig. 1 and removes any bias generated by the breadth or depth first sampling. It balances the depth and breadth of the sample graph and produces better samples than the previous approaches. Moreover, we also introduce the concept of multiple selection from the list of discovered nodes. Since the list of discovered nodes grows quickly, selecting multiple nodes simultaneously from the list could balance the two dimensional exploration of the graph by selecting nodes in both dimensions at a time and could help to find better samples with relatively smaller sample sizes.
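The contrast between the three selection orders can be sketched with one traversal routine whose only difference is which discovered node is taken next: the oldest (FIFO, as in BFS), the newest (LIFO, a simplified stand-in for DFS) or a random one (as in List Sampling). The function name, the policy labels and the binary-tree toy graph are hypothetical.

```python
import random


def traverse(adj, seed, n, policy, rng=None):
    """Sample n nodes starting at seed; policy picks the next discovered node."""
    sampled = [seed]
    seen = {seed}
    frontier = []  # discovered but not yet sampled nodes, in discovery order

    def discover(u):
        for w in sorted(adj[u]):
            if w not in seen:
                seen.add(w)
                frontier.append(w)

    discover(seed)
    while frontier and len(sampled) < n:
        if policy == "fifo":      # oldest first, as in breadth first sampling
            index = 0
        elif policy == "lifo":    # newest first, a depth first style order
            index = len(frontier) - 1
        else:                     # random first, as in List Sampling
            index = rng.randrange(len(frontier))
        u = frontier.pop(index)
        sampled.append(u)
        discover(u)
    return sampled


# Hypothetical toy graph: complete binary tree with 15 nodes, root 0,
# where node i has children 2i + 1 and 2i + 2.
tree = {i: set() for i in range(15)}
for parent in range(7):
    for child in (2 * parent + 1, 2 * parent + 2):
        tree[parent].add(child)
        tree[child].add(parent)

print(traverse(tree, 0, 4, "fifo"))
print(traverse(tree, 0, 4, "lifo"))
print(traverse(tree, 0, 4, "random", random.Random(1)))
```

On the tree, the FIFO order stays in the top levels next to the root, whereas the LIFO order runs straight down one branch to a leaf; the random order can mix both behaviors.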

List Sampling has two basic steps. The first step is the node selection step in which we select nodes to form the node set $V_{s}$ . In the second step, called graph induction, we induce all the existing edges present in the original graph between the nodes of $V_{s}$ to form the edge set $E_{s}$ .

5.2 Node selection

List Sampling is based on the selection of nodes from a pool of possible candidates instead of selecting the nodes from the whole graph. We exploit two key observations. First, selecting the nodes in a predefined manner as in BFS or DFS confines the sampling to the area either close to the seed node or far away from it and results in overestimation or underestimation of some properties of the graph. Second, the unvisited neighborhood of the sampled nodes grows quickly but is visited slowly because we visit exactly one neighboring node at a time, which results in poor estimation of the topology of the graph. Finding the middle ground led us to introduce the concept of Candidate List, which holds the potential candidates from the neighborhood of the sampled nodes and from which we select nodes at random.

A Candidate List (CL) is populated by adding all the direct neighbors of the selected nodes and depopulated when we select (and delete) a node from it to add to the sample set $V_{s}$. Consider the graph shown in Fig. 2. Suppose we start sampling this graph and pick a node at random as the starting node. Let A be this seed node. We add all the immediate neighbors of A to the Candidate List (CL). At this stage, our sample set is $V_{s}=$ {A} and CL $=$ {B, C, D}. Next, we select a node from the CL, say node C, add it to $V_{s}$, remove it from the CL and add all of its direct neighbors to the CL such that we have $V_{s}=$ {A, C} and CL $=$ {B, D, E, F}. We repeat the process and this time we select node B from the CL such that we have $V_{s}=$ {A, C, B} and CL $=$ {D, E, F, H}. Note that since node E is already in the CL and node A is in $V_{s}$, we do not add them again. Any node that has already been sampled or is already a member of the CL is not added to the CL. In other words, there is no repetition in the CL and already sampled nodes cannot be added.
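The walk-through above can be replayed with a short sketch. The edges A-B, A-C, A-D, C-E, C-F, B-E, B-H and E-H follow the text; the remaining edges of the nine-node graph (D-G, F-I and G-I) are assumptions, since Fig. 2 is not reproduced here, and the helper name is hypothetical.

```python
# Adjacency of a nine-node graph. Edges touching A, B, C, E and H follow the
# text; the edges D-G, F-I and G-I are assumed for illustration only.
fig2 = {
    "A": {"B", "C", "D"},
    "B": {"A", "E", "H"},
    "C": {"A", "E", "F"},
    "D": {"A", "G"},
    "E": {"B", "C", "H"},
    "F": {"C", "I"},
    "G": {"D", "I"},
    "H": {"B", "E"},
    "I": {"F", "G"},
}


def add_to_sample(adj, sampled, candidates, node):
    """Move node into the sample and add its unseen neighbors to the CL."""
    if node in candidates:
        candidates.remove(node)
    sampled.append(node)
    for w in sorted(adj[node]):
        # Skip nodes that are already sampled or already in the CL.
        if w not in sampled and w not in candidates:
            candidates.append(w)


sampled, candidates = [], []
for choice in ["A", "C", "B"]:  # the selections made in the text
    add_to_sample(fig2, sampled, candidates, choice)
    print(sampled, candidates)
```

After the third step the sample is {A, C, B} and the Candidate List is {D, E, F, H}, matching the example: E is not duplicated and A is not re-added.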

Figure 2.

A small graph of 9 nodes.

5.3 Graph induction

Suppose we sample the graph of Fig. 2 by selecting a node uniformly at random together with all of its neighbors. If we select node B, then $V_{s}=$ {B, A, E, H} and $E_{s}=$ {e(B,A), e(B,E), e(B,H)}. Basically, we select the outgoing edges from B, and if there is any edge between the selected nodes, for example e(E,H), we do not add it to $E_{s}$ unless it is sampled by node E or H. However, in the graph induction step in this paper, we sample all the edges among the nodes of $V_{s}$. Therefore, if we apply induction on $V_{s}$ we have $E_{s}=$ {e(B,A), e(B,E), e(B,H), e(E,H)}. In principle, after selecting the nodes of $V_{s}$, we add each and every edge between these nodes that exists in the actual graph. This is called graph induction and plays an important role in retaining the structure of the graph.
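A minimal sketch of the induction step, assuming a dictionary-of-sets graph and representing each undirected edge as a frozenset of its two end nodes. Only the neighborhoods of A, B, E and H matter for this example; the rest of the adjacency is an assumed completion of Fig. 2, and the function name is hypothetical.

```python
def induce_edges(adj, nodes):
    """Return every edge of the original graph with both ends in nodes."""
    nodes = set(nodes)
    return {frozenset((u, v)) for u in nodes for v in adj[u] if v in nodes}


# Assumed completion of the nine-node graph of Fig. 2 (see the lead-in).
fig2_graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "E", "H"},
    "C": {"A", "E", "F"},
    "D": {"A", "G"},
    "E": {"B", "C", "H"},
    "F": {"C", "I"},
    "G": {"D", "I"},
    "H": {"B", "E"},
    "I": {"F", "G"},
}

v_s = {"B", "A", "E", "H"}
star_edges = {frozenset(("B", w)) for w in fig2_graph["B"]}  # edges sampled from B
induced = induce_edges(fig2_graph, v_s)
print(sorted("".join(sorted(e)) for e in induced - star_edges))
```

The only edge added by induction is the one between E and H, as in the example above.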

Interested readers are referred to [35] for the role of the graph induction step in network sampling. The paper compares algorithms such as Random Walk and Forest Fire Sampling with and without induction. The authors conclude that a clear difference exists between the techniques with subgraph induction and without it. Generally, the techniques with the induction step perform better than the corresponding techniques without it.

6. List sampling framework

In this section, we formally present two algorithms, under the title List Sampling (LS), that differ in how we select the nodes from the CL. In principle, LS provides a framework with two key parameters: i) the probability, denoted as p, of a node being selected from the CL and ii) the number of nodes, denoted as k, sampled simultaneously from the CL. We represent the List Sampling framework as $LS(p,k)$.

The first parameter p, the probability of a node to be selected from the CL, offers us the flexibility of selecting a node with a certain probability from the pool of available nodes and this probability could be based on any characteristic of a node e.g., its degree. There are numerous possibilities to define the value of p, e.g., we can select nodes uniformly at random from the CL or we can select a node from the CL with a probability directly or inversely proportional to its degree. We call this parameter a biasing parameter because it biases our selection of nodes from the CL and we can either prefer high degree or low degree nodes for sampling.
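The three options mentioned above (uniform, proportional to degree, inversely proportional to degree) can be sketched as selection weights over the Candidate List. The function names and bias labels are hypothetical, and the sketch assumes that the degree is the degree in the original graph, which the text does not specify.

```python
import random


def selection_weights(adj, candidates, bias="uniform"):
    """Unnormalized selection weight of each candidate node."""
    if bias == "uniform":
        return [1.0 for _ in candidates]
    if bias == "degree":          # prefer high degree nodes
        return [float(len(adj[c])) for c in candidates]
    if bias == "inverse_degree":  # prefer low degree nodes
        # Candidates were discovered through an edge, so their degree is >= 1.
        return [1.0 / len(adj[c]) for c in candidates]
    raise ValueError(f"unknown bias: {bias}")


def pick_candidate(adj, candidates, bias, rng):
    """Draw one node from the CL with probability proportional to its weight."""
    weights = selection_weights(adj, candidates, bias)
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical toy data: degrees are 2, 3, 2 and 2 for D, E, F and H.
toy_adj = {
    "D": {"A", "G"},
    "E": {"B", "C", "H"},
    "F": {"C", "I"},
    "H": {"B", "E"},
}
cl = ["D", "E", "F", "H"]
print(selection_weights(toy_adj, cl, "degree"))
print(pick_candidate(toy_adj, cl, "inverse_degree", random.Random(7)))
```

Under the degree bias, node E is one and a half times as likely to be picked as each of the others; under the inverse bias it is the least likely.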

The second parameter k, the number of nodes selected simultaneously from the CL, helps an algorithm to explore the neighborhood properly. We assume that if there are many potential candidates in the CL and we select only one node at a time, we cannot explore the neighborhood extensively. We call this parameter a multiple selection parameter. There are countless possibilities to select k ( $1\leqslant k\leqslant|CL|$ ) number of nodes from the CL but in this paper we discuss only one case where k is

$\displaystyle\,k=\lceil\sqrt{|CL|}\rceil$ (7)

We calculate k as the square root of the size of the Candidate List. We will discuss the intuition behind this value of k in the analysis section (Subsection 8.3).

We expect the Candidate List to offer a good trade-off between sampling by randomly selecting nodes and sampling by exploring the neighborhood. We have two key intuitions. First, selecting nodes at random from the Candidate List has an effect similar to selecting nodes at random from the graph and helps us overcome the degree bias of BFS and DFS. Second, because the Candidate List holds the neighbors of all the nodes sampled so far, not only those of the current node, repeatedly selecting multiple nodes from the CL makes it highly likely that we explore the neighborhood of each node thoroughly. This helps us retain the structure and connectedness of the graph. Intuitively, the graph induction step further strengthens the overall connectivity of the sampled graph and helps in maintaining its topology.

6.1 Base algorithm

The simplest possible algorithm within the List Sampling framework $LS(p,k)$ sets $k=1$ and selects nodes uniformly at random from the candidate list. In this algorithm, we select one node at a time from the CL and add all of its neighbors to the CL. The probability of selecting a node from the CL is uniform. With these values of p and k, this algorithm becomes identical to Random First Sampling [17], except that Random First Sampling does not perform the induction step and only includes the sampled edges. In contrast, we include all the edges that exist in the original graph among the nodes of $V_{s}$ to form the edge set $E_{s}$ . We call this base algorithm List Sampling Zero (LS0).

6.2 Algorithm 1

In the first algorithm, we select nodes uniformly at random from the Candidate List (CL) without considering the degree of the nodes. Let us suppose that the CL contains r number of nodes, i.e., CL $=\{v_{1},v_{2},v_{3},\ldots,v_{r}\}$ , then the probability p of a node to be selected from the CL is given by

$\displaystyle\,p=\frac{1}{r}$ (8)

The main intuition behind this algorithm is the belief that randomly selecting the nodes picks both low degree and high degree nodes with equal probability and hence their average degree comes out to be close to the actual degree of the graph. We select a seed node at random from the node set V of the actual graph and add all of its neighbors to the CL. We then select k number of nodes uniformly at random from the CL, add them to the sample set $V_{s}$ , remove them from the CL and add all of their neighbors to the CL. The algorithm repeats this step until the target fraction $\phi$ of sampled nodes is achieved. After this, the graph induction step is invoked on the sample set $V_{s}$ to form the edge set $E_{s}$ . We call this algorithm List Sampling 1 (LS1).
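The steps of LS1 can be sketched as follows. This is a minimal illustration rather than the reference implementation: the function name `list_sampling_uniform` and the adjacency dictionary `adj` (node to set of neighbors, with comparable node IDs) are assumptions, and two choices are ours and not stated in the text, namely capping k so that the sample does not exceed the target size and stopping early if the CL becomes empty.

```python
import math
import random


def list_sampling_uniform(adj, phi, seed=None):
    """Sketch of LS1: draw k = ceil(sqrt(|CL|)) nodes uniformly from the CL per round."""
    rng = random.Random(seed)
    target = max(1, math.ceil(phi * len(adj)))
    start = rng.choice(sorted(adj))            # seed node chosen at random from V
    sampled = {start}
    candidates = set(adj[start]) - sampled     # Candidate List (CL)
    while len(sampled) < target and candidates:
        k = math.ceil(math.sqrt(len(candidates)))
        k = min(k, target - len(sampled))      # assumption: do not overshoot the target
        picked = rng.sample(sorted(candidates), k)
        for v in picked:                       # move the picked nodes from CL to V_s
            sampled.add(v)
            candidates.discard(v)
        for v in picked:                       # add their unsampled neighbors to the CL
            candidates.update(n for n in adj[v] if n not in sampled)
    # Graph induction: keep every original edge between two sampled nodes.
    induced = {(u, v) for u in sampled for v in adj[u] if v in sampled and u < v}
    return sampled, induced
```

Setting k to 1 inside the loop instead of the square root turns the same sketch into LS0.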

6.3 Algorithm 2

In the second algorithm, we bias our selection to the low degree nodes when we select nodes from the Candidate List. The main intuition behind this algorithm is the observation that sampling, in general, is intrinsically biased to pick high degree nodes from the actual graph in one way or another. In this algorithm, the probability of a node being selected from the Candidate List of size r depends on the degree d of a node and its frequency f. The frequency f of a node tells us how many of its neighbors have already been sampled. If the frequency of r nodes in the CL is represented as $\{f_{1},f_{2},f_{3},...,f_{r}\}$ , the probability p of a node is calculated as

$\displaystyle\,p\propto\frac{1}{d}+\left(1-\left(1-\frac{1}{d}\right)^{f}\right)$ (9)

where $f_{i}$ is the number of times the node $n_{i}$ is added to the CL. We do not repeat nodes in the CL, but if a node is already present in the CL and is to be added again (as a node could be a neighbor of many nodes), we increase its frequency by one and recalculate its probability p using Eq. (9). Initially, the frequency of all nodes is set to zero, which reduces the probability of a node to the inverse of its degree, but as the neighbors of a node are sampled, its frequency and hence its probability increases. This equation biases our selection to the low degree nodes and also permits us to select high degree nodes in a good proportion, because intuitively we assume that the frequency of high degree nodes will increase gradually and hence their probability will also increase. The normalized probability $p$ of a node is given by

$\displaystyle\,p=\frac{\frac{1}{d}+\left(1-\left(1-\frac{1}{d}\right)^{f}\right)}{\sum\limits_{i=1}^{r}\left[\frac{1}{d_{i}}+\left(1-\left(1-\frac{1}{d_{i}}\right)^{f_{i}}\right)\right]}$ (10)

Intuitively, we adopt the idea that a node should be given some privilege of increasing its probability depending on how many of its neighbors have been sampled so far. We call this algorithm List Sampling 2 (LS2).
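To make Eqs. (9) and (10) concrete, the following sketch computes the normalized LS2 probabilities for the nodes of a small CL. The function name `ls2_probabilities` and the example degrees and frequencies are hypothetical; every degree is assumed to be at least one.

```python
def ls2_probabilities(degrees, frequencies):
    """Normalized selection probabilities for the nodes currently in the CL (Eq. 10)."""
    weights = [
        1.0 / d + (1.0 - (1.0 - 1.0 / d) ** f)   # unnormalized weight of Eq. (9)
        for d, f in zip(degrees, frequencies)
    ]
    total = sum(weights)                          # normalization factor
    return [w / total for w in weights]


# Hypothetical CL with three nodes: a low degree node, and a high degree
# node before and after four of its neighbors have been sampled.
probs = ls2_probabilities(degrees=[2, 50, 50], frequencies=[0, 0, 4])
```

In this example the low degree node receives the largest probability, and the high degree node with frequency 4 is more likely to be picked than the otherwise identical node with frequency 0, which is the behavior the equation is meant to produce.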

All three algorithms, LS0, LS1 and LS2, are presented in algorithm #1 as the List Sampling Framework. Please note that LS0 and LS1 differ in the value of k, the number of nodes selected from the CL, while LS1 and LS2 differ in the value of p, the probability of selecting nodes from the CL. By comparing these algorithms with one another, we can see the effect of changing the values of p and k in $LS(p,k)$ . Furthermore, these algorithms provide the foundation for understanding the List Sampling framework and open the possibility of designing new algorithms by changing the values of p and k.

6.4 Time complexity and implementation notes

The worst case running time of List Sampling as given in algorithm #1 can be calculated as follows. Let N be the total number of nodes and M be the total number of edges in the original graph G. It takes O(N) time to calculate the normalization factor and O(N) time to find the normalized probabilities of the nodes in the Candidate List. The while loop can run at most N times and therefore the time to build the CL is of the order of $O(N^{2})$ . Next, it takes O(N) time to pick a node from the CL, and selecting at most N nodes from the CL costs $O(N^{2})$ time. Therefore, the time complexity of the core of algorithm #1 is $O(N^{2})$ . The induction step is straightforward and takes O(M) in the worst case. Summing up, the run time complexity of List Sampling is of the order of $O(N^{2}+M)$ .

Algorithm #1 implements List Sampling in a naive way so that the reader can grasp the idea easily. However, a careful implementation with Probability Proportional to Size or Roulette Wheel Sampling can reduce the time significantly. In these implementations, a binary search over the cumulative probabilities of the CL can select a node in O(logN) time. Similarly, the cost of constantly updating the normalization factor and the normalized probabilities can be handled with a Binary Indexed Tree, a data structure in which an array of cumulative sums can be updated in O(logN) time. Based on these facts, we believe that a careful implementation of List Sampling would cost O(NlogN+M) time to extract a sample.
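As an illustration of the Binary Indexed Tree idea, the sketch below keeps one weight per CL slot and supports both a weight update and a weighted draw in O(logN) time. The class name `FenwickSampler` and its methods are hypothetical and are not part of algorithm #1; removing a node from the CL is modeled by adding the negative of its weight.

```python
import random


class FenwickSampler:
    """Binary Indexed Tree over node weights: O(log n) update and O(log n) draw."""

    def __init__(self, size):
        self.size = size
        self.tree = [0.0] * (size + 1)   # 1-based internal array

    def add(self, index, delta):
        """Add delta to the weight stored at 0-based position index."""
        i = index + 1
        while i <= self.size:
            self.tree[i] += delta
            i += i & (-i)

    def total(self):
        """Sum of all weights, i.e. the normalization factor."""
        i, s = self.size, 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def find(self, target):
        """Return the smallest 0-based index whose prefix sum exceeds target."""
        pos = 0
        step = 1 << self.size.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.size and self.tree[nxt] <= target:
                pos = nxt
                target -= self.tree[nxt]
            step >>= 1
        return pos

    def sample(self, rng=random):
        """Draw an index with probability proportional to its weight."""
        idx = self.find(rng.random() * self.total())
        return min(idx, self.size - 1)   # guard against floating point round-off
```

Because the tree stores unnormalized weights, the normalized probabilities never have to be recomputed explicitly; scaling the random number by `total()` has the same effect.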

7. Evaluation criteria and data sets

The question of how to measure the goodness of a sampling algorithm has no single answer, and there is no single criterion for evaluating the results of a sampling algorithm. However, the research community [1, 2] has used some properties, e.g., degree, along with some statistical tests to measure the performance of a sampling algorithm. In this paper we measure the performance of our algorithms by comparing six properties of the sampled graphs with those of the original graph, drawing the distributions of properties where possible and performing three statistical tests for quantitative measures.

7.1 Point statistics

A point statistic is a single value statistic that shows the value of a property at a single point. We vary the sampling fraction $\phi$ from 0.1 to 0.2 and plot the scaling ratio of a property $\Theta$ as the ratio of the value of that property in the sampled graph $\Theta_{S}$ to the value of that property in the actual graph $\Theta_{A}$ i.e.,

$\displaystyle\textit{Scaling Ratio}=\frac{\Theta_{S}}{\Theta_{A}}$ (11)

For example, we measure the average degree (Eq. (1)) of the sample graph and the original graph and find the scaling ratio (Eq. (11)) of degree by dividing the average degree of the sample graph $G_{s}$ at a sampling fraction $\phi$ by the average degree of the original graph G. Similarly, we find and plot the scaling ratio of the average clustering coefficient (Eq. (3)) and average path length (Eq. (4)). For Assortativity (Eq. (5)) and Global Clustering Coefficient (Eq. (6)) we do not calculate the scaling ratio; rather, we plot the measured values in the sample and original graphs at different values of $\phi$ .

7.2 Distributions

A distribution is a multivalued statistic and shows the distribution of a property in a graph. For example, the degree distribution shows the fraction of nodes that have degree greater than or less than a particular value. We find and plot the Empirical Cumulative Distribution Function (ECDF) of degree, clustering coefficient and path lengths of sample graphs at $\phi=$ 0.1.

Algorithm #1: List Sampling Framework

Input: Original graph $G=(V,E)$ , where V is the set of nodes and E is the set of edges, and sampling fraction $\phi$ .
Output: Sample graph $G_{s}=(V_{s},E_{s})$ , where $V_{s}\subset V$ is the set of sampled nodes and $E_{s}\subset E$ is the set of sampled edges.
Variables: Candidate List CL $=\{v_{1},v_{2},v_{3},...,v_{r}\}$ ; probability of nodes $p=\{p_{1},p_{2},p_{3},...,p_{r}\}$ ; normalization factor $=\sum\limits_{i=1}^{r}{p_{i}}$ ; normalized probability of nodes $\hat{p}=\{\hat{p}_{1},\hat{p}_{2},\hat{p}_{3},...,\hat{p}_{r}\}$ ; T, used to store the nodes temporarily.
Initialize: $V_{s}=\emptyset$ , $E_{s}=\emptyset$ , CL $=\emptyset$ , T $=\emptyset$

Node Selection:
    $V_{s}=\{v\}$   // seed node randomly chosen from V
    $T=\{v\}$
    while $|V_{s}|<\phi\times|V|$ do
        for $j=$ 1 : size(T) do
            Calculate the probability of nodes accordingly
            $CL=CL\cup\textsf{neighbors}(v_{j})$
            $T=T-v_{j}$
        end for
        $k=\lceil\sqrt{|CL|}\rceil$
        for $i=$ 1 : k do
            num $=$ random(0,1)   // uniformly at random
            for $r=$ 1 : $|CL|$ do
                if num $\leqslant\hat{p}_{r}$ then
                    $V_{s}=V_{s}\cup v_{r}$
                    $T=T\cup v_{r}$
                    $CL=CL-v_{r}$
                    break
                end if
                num $=$ num $-\hat{p}_{r}$
            end for
        end for
    end while

Graph Induction:
    for $i=$ 1 : $|E|$ do
        $e(v,w)=e_{i}$
        if $v\in V_{s}$ AND $w\in V_{s}$ then
            $E_{s}=E_{s}\cup\{e_{i}\}$
        end if
    end for

7.3 Distance measures for quantitative comparison

A good sample has properties that closely approximate those of the original graph. Therefore, the distance between a property in the sample graph $G_{s}$ and in the original graph G is used to evaluate the goodness of a sample quantitatively. Given the original graph G and sampled graph $G_{s}$ , we want to measure how far $G_{s}$ is from G. For scalar quantities, e.g., average degree and assortativity, we use Root Mean Square Error (RMSE). For vector quantities, i.e., distributions, we use two statistical tests, Kolmogorov-Smirnov (KS) Distance and Jensen-Shannon Distance (JSD), for quantitative evaluation of sampling algorithms.

Root Mean Square Error: We use the common measure for the quality of estimation by Root Mean Square Error (RMSE), given as

$\displaystyle\,\textit{RMSE}=\sqrt{\frac{1}{n}\sum_{1}^{n}(\Theta_{S}-\Theta_{A})^{2}}$ (12)

where $\Theta_{S}$ and $\Theta_{A}$ are sampled and original values respectively.

Kolmogorov Smirnov Distance: In statistics, the Kolmogorov Smirnov (KS) test is a nonparametric test used to compare a sample with the reference probability distribution. It quantifies a distance $D_{ks}$ between two Cumulative Distribution Functions (CDFs) given as

$\displaystyle\,D_{ks}=\textit{max}_{x}|F_{1}(x)-F_{2}(x)|$ (13)

where $F_{1}(x)$ and $F_{2}(x)$ are two CDFs. In short, the KS Distance measures the maximum distance between the two distributions.

Jensen-Shannon Distance: In probability theory, the Jensen-Shannon Divergence measures the similarity between two probability distributions. The Jensen-Shannon Divergence is based on Kullback-Leibler Divergence (KLD) but, unlike KLD, the JS Divergence is symmetric and its square root is a true metric, often referred to as Jensen-Shannon Distance (JSD). The JS Divergence is calculated as

$\displaystyle\,D_{JS}(P||Q)=\frac{1}{2}D_{KL}(P||M)+\frac{1}{2}D_{KL}(Q||M)$ (14)

where $D_{JS}$ and $D_{KL}$ are Jensen-Shannon and Kullback-Leibler Divergences respectively while P and Q are two Probability Distribution Functions (PDFs) and $M=\frac{1}{2}(P+Q)$
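The two distribution distances of Eqs. (13) and (14) can be computed with the short sketch below. The function names are hypothetical, the ECDF is evaluated in a simple quadratic way for readability, and the use of base 2 logarithms in the JS Divergence (which bounds the distance by 1) is our assumption, since the text does not fix the base.

```python
import math


def ks_distance(sample_a, sample_b):
    """Maximum gap between the empirical CDFs of two samples (Eq. 13)."""
    points = sorted(set(sample_a) | set(sample_b))
    n_a, n_b = len(sample_a), len(sample_b)
    best = 0.0
    for x in points:
        f_a = sum(1 for v in sample_a if v <= x) / n_a   # ECDF of sample_a at x
        f_b = sum(1 for v in sample_b if v <= x) / n_b   # ECDF of sample_b at x
        best = max(best, abs(f_a - f_b))
    return best


def js_distance(p, q):
    """Square root of the Jensen-Shannon Divergence (Eq. 14), log base 2 assumed."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))
```

For example, `ks_distance` would take the degree sequences of $G_{s}$ and G, while `js_distance` would take two probability vectors defined over the same bins.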

Table 1

Characteristics of data sets used in the experiments

Data set	Vertices	Edges	Average degree	Average clust. coeff.	Average path length	90% effective diameter	Assortativity	Global clust. coeff.
Gowalla	196,591	950,327	9.66	0.2367	4.62	5.68	$-$ 0.0292	0.0235
CiteSeer	227,320	814,314	7.16	0.6750	7.83	9.62	0.0696	0.4555
DBLP	317,080	1,049,866	6.62	0.6324	6.75	8.08	0.2665	0.1283
Amazon	334,863	925,872	5.529	0.3967	11.73	15.33	$-$ 0.0588	0.2050
FourSquare	639,014	3,214,986	10.06	0.1080	3.65	3.85	$-$ 0.2887	0.0016
YouTube	1,134,809	2,987,624	5.26	0.0808	5.55	6.50	$-$ 0.0369	0.0062
Lastfm	1,191,805	4,519,330	7.58	0.0726	4.58	4.95	$-$ 0.1359	0.0131
Hyves	1,402,673	2,777,419	3.96	0.0447	5.67	6.52	$-$ 0.0234	0.0015
Skitter	1,696,415	11,095,298	13.08	0.2581	5.04	5.94	$-$ 0.0814	0.0054
Flicker	1,715,255	15,550,782	18.13	0.1837	5.19	6.62	$-$ 0.01528	0.1120

7.4 Data sets

In our experiments we use ten publicly available real networks. We use one autonomous network, Skitter [36], two collaboration networks, DBLP [36] and CiteSeer [37], and seven social networks: Gowalla [38], Amazon [38], FourSquare [37], YouTube [38], Lastfm [37], Hyves [39] and Flicker [36]. All the data sets used in this work are clean and have no missing values. Table 1 summarizes the characteristics of these networks.

8. Analysis of list sampling

In this section, we study LS analytically and compare it with some of the previous sampling methods in order to show the characteristics of LS that lead it to produce better samples than the previous approaches. In LS, a node is sampled in two steps. In the first step, it is added to the CL and in the second step it is picked from the CL. The first step, adding a node to the CL, depends on the probability that a neighbor of the node has been sampled and added to the sample set $V_{s}$ . The second step, picking a node from the CL, applies only to the nodes that are available in the CL. This discussion shows that sampling a node in LS is a dependent event: a node can be sampled only if one of its neighbors has been sampled.

Formally, let $G=(V,E)$ be the original graph and $G_{s}=(V_{s},E_{s})$ be the sample graph such that $G_{s}\subset G$ . The probability that a node $u\in V$ is sampled and added to $V_{s}$ could be calculated as

$\displaystyle\,P_{u}=P_{u}^{+CL}*P_{u}^{-CL}$ (15)

where $P_{u}^{+CL}$ is the probability that $u$ is added to the CL and $P_{u}^{-CL}$ is the probability that it is picked from the CL. Now $P_{u}^{+CL}$ depends on the probability that a neighbor $v$ of $u$ is sampled. Therefore, $P_{u}^{+CL}=P_{v}$ and if $v\in V_{s}$ then $P_{u}^{+CL}=$ 1 otherwise $P_{u}^{+CL}=$ 0. Now the probability that node $v$ is sampled could be calculated as

$\displaystyle\,P_{v}=P_{v}^{+CL}*P_{v}^{-CL}$ (16)

Now we can re-write Eq. (15) as

$\displaystyle\,P_{u}=(P_{v}^{+CL}*P_{v}^{-CL})*P_{u}^{-CL}$ (17)

where $v$ is a neighbor of $u$ and $u,v\in V$ . We see that Eq. (17) is recursive in nature and in order to calculate the sampling probability of $u$ , we must first calculate the sampling probability of one of its neighbors $v$ and so on.

Formally, if $V_{s}=\{v_{1},v_{2},v_{3},\ldots,v_{n}\}$ such that $v_{1}$ is the seed node, $v_{2}$ is a neighbor of $v_{1}$ , $v_{3}$ is a neighbor of $v_{2}$ and so on, then the probability of sampling a node $v_{i}(1\leqslant i\leqslant n)$ could be given as

$\displaystyle\,P_{v_{i}}=(((P_{v_{1}}^{+CL}*P_{v_{1}}^{-CL})*P_{v_{2}}^{-CL})*P_{v_{3}}^{-CL})*\ldots*P_{v_{i}}^{-CL}$ (18)

The seed node $v_{1}$ is selected at random from V and added to the CL, therefore $P_{v_{1}}^{+CL}=$ 1, and being the only node in the CL, $P_{v_{1}}^{-CL}=1$ . We then add all of its neighbors to the CL. The probability that one of its neighbors $v_{2}$ is selected out of all of them is calculated using Eq. (8) in LS1 and using Eq. (10) in LS2, and this process continues until we have the required number of nodes in $V_{s}$ . It is worth mentioning that LS samples are connected and that multiple paths may exist in the original graph from the seed node to any node $v_{i}$ ; Eq. (18) gives the probability of sampling $v_{i}$ through one of these paths.

Figure 3.

(a, b) Fraction of nodes of degree d added to and picked from the Candidate List; (c, d) fraction of nodes of degree d added to the sample graph $G_{s}$ .

For further analysis, we reconsider Eq. (15), which gives the probability that a random node $u\in V$ is added to $V_{s}$ . The first part of this equation gives the probability that $u$ is added to the CL, and this probability depends on the degree $d_{u}$ of $u$ . If a node has a high degree, there is a greater chance that one of its neighbors is sampled. In other words, if $u,v\in V$ such that $d_{u}>d_{v}$ then $P_{u}^{+CL}>P_{v}^{+CL}$ . However, it is not trivial to prove such an inequality because once a node is added to the CL, the probability that it is picked from the CL may or may not depend on its degree (or any other characteristic), and until a node is sampled, its neighbors are not added to the CL. Intuitively, we assume that the first part $P_{u}^{+CL}$ of Eq. (15) is slightly biased to high degree nodes because having more neighbors implies a greater chance that one of the neighbors is sampled. The second part $P_{u}^{-CL}$ , on the other hand, can help to balance the sampling process. We can pick nodes at random (in LS1), sampling both high and low degree nodes fairly, or we can bias our selection to low degree nodes (in LS2), or we can design another probability to select nodes from the CL.

In the above paragraph, we intuited that if $u,v\in V$ such that $d_{u}>d_{v}$ then $P_{u}^{+CL}>P_{v}^{+CL}$ , i.e., the probability that a high degree node is added to the CL is greater than that of a low degree node. We perform a simple experiment on Gowalla and YouTube datasets and find the percentage of nodes of degree d added to and picked from the CL. Let $\bigvee_{d}$ be the number of nodes of degree d in the original graph G. Let ${\bigvee}^{+CL}_{d}$ be the number of such nodes added to the CL and $\bigvee^{-CL}_{d}$ be the number of such nodes picked from the CL. We define

$\displaystyle CL^{+}=\frac{\bigvee^{+CL}_{d}}{\bigvee_{d}}$

and

$\displaystyle CL^{-}=\frac{\bigvee^{-CL}_{d}}{\bigvee^{+CL}_{d}}$

where $CL^{+}$ gives us the fraction of nodes of degree d added to the CL and $CL^{-}$ tells us what fraction of nodes of degree d available in the CL were picked and added to the node set $V_{s}$ . We present the data in the form of bar graphs in Fig. 3 for both LS1 and LS2, where the bars represent the values of $CL^{+}$ and $CL^{-}$ in a certain range of d. The graphs clearly show that high degree nodes are added to the CL more frequently than low degree nodes, but when it comes to picking nodes from the CL, LS1 picks both high and low degree nodes almost uniformly, while LS2 slightly favors low degree nodes but also picks high degree nodes at a good ratio. For example, in LS1 $CL^{+}=$ 0.295 and $CL^{-}=$ 0.169 when $1\leqslant d\leqslant 5$ , and $CL^{+}=$ 0.681 and $CL^{-}=$ 0.191 when $6\leqslant d\leqslant 10$ (Fig. 3a, Gowalla network). We see a similar trend in LS2. This supports our intuition that the first part of Eq. (15) favors adding high degree nodes to the CL, while the second part helps to balance this bias and picks nodes of all degrees from the CL, thereby producing better samples than the previous approaches.
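The two ratios can be computed per degree range as in the following sketch. The function name and its arguments are hypothetical: `degree` maps each node to its degree in G, `added` is the set of nodes that were ever added to the CL, and `picked` is the set of nodes picked from the CL; returning 0 for an empty range is our convention.

```python
def candidate_list_fractions(degree, added, picked, lo, hi):
    """Return (CL+, CL-) for nodes whose original degree lies in [lo, hi]."""
    in_range = {v for v, d in degree.items() if lo <= d <= hi}
    n_total = len(in_range)               # nodes of this degree range in G
    n_added = len(in_range & added)       # of those, nodes added to the CL
    n_picked = len(in_range & picked)     # of those, nodes picked from the CL
    cl_plus = n_added / n_total if n_total else 0.0
    cl_minus = n_picked / n_added if n_added else 0.0
    return cl_plus, cl_minus


# Toy data: three nodes have degree between 1 and 5, two of them reached
# the CL and one of those two was picked.
toy_degree = {1: 2, 2: 3, 3: 4, 4: 9}
cl_plus, cl_minus = candidate_list_fractions(toy_degree, {1, 2, 4}, {1}, 1, 5)
```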

For comparison with TIES and FFS, we also find the fraction of nodes of degree d sampled and added to the sample graph $G_{s}$ . Let ${\bigvee}^{v\in G_{s}}_{d}$ be the number of nodes of degree d available in $G_{s}$ at the end of sampling. Please note that in this experiment the nodes in ${\bigvee}^{v\in G_{s}}_{d}$ belong to $G_{s}$ , i.e., $v\in G_{s}$ and we find their degree d in the original graph G (not in $G_{s}$ ). We define

$\displaystyle\Omega=\frac{\bigvee^{v\in G_{s}}_{d}}{\bigvee_{d}}$

The value of $\Omega$ tells us what fraction of nodes of degree d are picked from the original graph by a sampling method. The graphs in Fig. 3c and d show the values of $\Omega$ for the Gowalla and YouTube networks respectively. We see that both TIES and FFS are biased to pick high degree nodes from the original graph. For example, for the degree range $51\leqslant d\leqslant 100$ in Fig. 3c, $\Omega(\textit{LS1})=$ 0.312, $\Omega(\textit{LS2})=$ 0.187, $\Omega(\textit{TIES})=$ 0.621 and $\Omega(\textit{FFS})=$ 0.620. It is clear that both TIES and FFS are highly biased to high degree nodes compared to LS1 and LS2: in this degree range, TIES and FFS each pick approximately 2x and 3.3x the number of nodes picked by LS1 and LS2 respectively.

As discussed above, LS samples a node in two steps, and this two step approach gives it an advantage over previous approaches like TIES and FFS. Nasreen et al. have shown in their work [2] that TIES favors high degree nodes. Basically, TIES is based on Edge Sampling (ES), and ES is inherently biased to select high degree nodes. In addition, TIES also performs the graph induction step, in which it adds all the edges between the nodes in $V_{s}$ , leading it to overestimate the degree and clustering coefficient of a graph. In TIES, the probability of sampling a node is proportional to its degree; as a result, TIES picks high degree nodes in the original graph and then adds further edges in the induction step, which leads it to produce overestimated samples. Up to this point, LS is comparable to TIES because intuitively we also assume that the first part of Eq. (15) is biased to pick high degree nodes from the original graph. Interestingly, however, the second part of Eq. (15) offsets this bias, thereby producing better samples than TIES. FFS is essentially a partial BFS and, similar to BFS, it is also biased to sample high degree nodes in the original graph. However, FFS does not perform induction and, as a result, the degree of the sample graph is underestimated.

In the next Subsections 8.1 through 8.5, we perform some experiments to supplement the above analysis and highlight different features of LS. We compare TIES and FFS (where required) with LS1 only for ease of explanation but both LS1 and LS2 show similar results.

Figure 4.

Original vs. sampled degree in (a, b) Gowalla and (c, d) YouTube networks.

8.1 Original vs. sampled degree

We illustrate the effect of selecting high degree nodes in the original graph through an experiment. We extract samples at $\phi=$ 0.1 for the Gowalla and YouTube networks and draw the degree distribution of nodes in the original and sampled graphs in Fig. 4. The original degree of a node $v\in V_{s}$ means its degree in G and the sampled degree means its degree in $G_{s}$ . Figure 4a and c show the original degree of nodes in the Gowalla and YouTube networks respectively for TIES, FFS and LS1. We see that all the methods overestimate the degrees in the original graphs, meaning that all methods are biased to pick high degree nodes from the original graph. However, LS1 picks fewer high degree nodes than TIES and FFS (see Fig. 3c and d, which show the histogram of the values in Fig. 4a and c). When this overestimation is combined with the induction step, TIES produces overestimated samples, as shown in the corresponding sampled degree of nodes in Fig. 4b and d.

In contrast, LS1 picks fewer high degree nodes, the two step sampling approach further reduces this bias, and when this moderate overestimation is combined with the induction step, LS1 produces better samples than TIES. The results of this experiment show that it is wise to overestimate the degree in the original graph, but this overestimation should be moderate; otherwise we will produce substandard samples. Since FFS does not perform induction and not all the neighbors of a node in the original graph are sampled, FFS underestimates the degree in the sample graph.

Figure 5.

Sampled vs. induced edges in (a) Gowalla and (b) YouTube networks.

8.2 Sampled vs. induced edges

We know that both TIES and LS perform the induction step and induce additional edges between the nodes in $V_{s}$ other than those added during the sampling process. We can say that both TIES and LS have two types of edges: sampled edges $E_{sam}$ , i.e., the edges added to $E_{s}$ during the sampling process, and induced edges $E_{ind}$ , i.e., the edges added to $E_{s}$ during the induction step, so that $E_{s}=E_{sam}+E_{ind}$ . We conduct an experiment on the Gowalla and YouTube datasets and find the percentage of $E_{sam}$ and $E_{ind}$ in $E_{s}$ . We also calculate these values for BFS for comparison: since BFS is also biased to high degree nodes, it is interesting to show the effect of the induction step on BFS. The results are shown in Fig. 5.

The graph shows that in TIES, on average, almost 90% of the edges in $E_{s}$ are induced edges, and in BFS this number is approximately 85%. The reason is that both BFS and TIES sample high degree nodes, thereby increasing the chances that many neighbors of a high degree node are available in $G_{s}$ (having more neighbors means a greater chance that a neighboring node is present in $V_{s}$ ), so we induce many edges, resulting in samples that overestimate the degree and clustering coefficient. In contrast, LS1 picks both high and low degree nodes and induces 65% of the edges in $E_{s}$ on average, and hence can estimate the degree and clustering coefficient better than previous approaches.

Figure 6.

Effect of value of k on the degree and clustering coefficient of sample graphs.

8.3 Selecting multiple nodes from the candidate list

The second parameter $k$ in the List Sampling framework $LS(p,k)$ allows us to select multiple nodes simultaneously from the CL. In Section 6, we set the value of $k$ as the square root of the size of the CL as given in Eq. (7). In this section, we give the intuition behind this value of $k$ . We conduct an experiment on Gowalla and YouTube networks and extract samples at $\phi=0.1$ while varying the value of $k$ as the percentage of the size of the CL. We then calculate the scaling ratio of degree and clustering coefficient, i.e., the ratio of these values in $G_{s}$ to their counterparts in G. The results are presented in Fig. 6 as bar graphs with a 95% confidence interval.

The graphs show that as we select more nodes from the CL simultaneously, the scaling ratio increases. The graphs also show that there is no magic value of k that works for all data sets. We see that for the Gowalla network $k=0.2*|CL|$ seems to work better, while for the YouTube dataset $k=0.01*|CL|$ gives the best results. From these results and those from other datasets (not shown in the paper), we choose Eq. (7) to calculate the value of $k$ because this value gives better results for most of the data sets and does not overshoot the sampled values.

8.4 Size of the candidate list

One may argue that selecting k number of nodes simultaneously from the Candidate List and adding their neighbors to it could foster the growth of the CL and hence we need lots of space to store the CL. We intuit that i) after a few iterations the CL grows bit by bit because most of the nodes are rediscovered and ii) LS selects both high and low degree nodes from the CL, as opposed to BFS that favors high degree nodes, and therefore after every iteration, a manageable number of nodes are added to the CL.

Figure 7.

Size of the Candidate List for all datasets, $\phi=$ 0.1.

We conduct a simple experiment where we extract samples at $\phi=$ 0.1 for all 10 datasets using the LS1 algorithm. We calculate the size of the CL, i.e., the number of entries in the CL after a node is sampled and its neighbors are added to the CL. The results are presented in Fig. 7. The X-axis shows the number of sampled nodes, up to $|V_{s}|=\phi*|V|$ , as $V_{s}$ is populated one by one, and the Y-axis shows the size of the CL after adding the neighbors of a sampled node to the CL (we present it as a ratio, i.e., $|CL|$ / $|V|$ ). For comparison, we also calculate the size of the CL (or FIFO list) in BFS, as we implement BFS in the LS framework by using the CL as a FIFO list. In half of the datasets, BFS has more entries in its CL than LS because BFS picks higher degree nodes and hence adds more neighbors to the CL. We see that initially the CL grows quickly but its growth flattens as the number of sampled nodes increases. The reason is simple: we encounter already sampled or visited nodes and hence do not add many new entries to the CL. In LS1 we also store the degrees of the nodes (and also their frequencies in LS2), but this information needs only a few extra bytes per entry, and because the CL in LS1 contains significantly fewer entries than the graph has nodes, we expect that it would not overload the sampling job. As an illustration, suppose the node IDs are 128 bits (16 bytes) long and the degree and frequency need an extra 16 bytes (8 bytes each), so that a single entry in the CL requires 32 bytes of space; then a CL of 100K entries would need only about 3.2 MB of space.

8.5 Time complexity results

We performed the time complexity analysis in Section 6.4 and concluded that the worst case running time of LS as implemented in algorithm #1 is $O(N^{2}+M)$ , where N and M are the number of nodes and edges respectively in the original graph. In this section, we measure the time taken by different sampling methods to extract samples at a 10% sampling fraction. Please note that this time does not include the time to read the data sets. It is purely the time to extract a sample, i.e., the time to populate $V_{s}$ and $E_{s}$ . We implement LS0, LS1 and LS2 as presented in algorithm #1 and present the results of LS1 only. Similarly, we present the results of TIES only because the time complexities of TIES and RDN are almost the same: both pick edges/nodes at random in constant time $O(1)$ and then induce the edges.

The experiment is performed using an Intel Core i7 processor @ 3.40 GHz with 64 GB of main memory. We present the results in Table 2, with the time measured in minutes. As expected, TIES is the most time efficient method because it samples random edges in $O(1)$ time and then takes $O(M)$ for the induction step. FFS is a partial Breadth First Sampling method, so its time complexity is of the order of $O(N+M)$ and it stands next to TIES. LS takes more time than both TIES and FFS. However, the time taken by LS comes with the benefit of collecting good representative samples, and we argue that this benefit is worth more than extracting inaccurate samples in a short time.

Table 2
Running time (in minutes) of different sampling algorithms

Data set	TIES	FFS	LS1
Gowalla	0.796	5.161	6.104
CiteSeer	0.797	4.285	5.084
DBLP	1.443	9.026	10.045
Amazon	1.211	8.756	9.191
FourSquare	9.998	59.447	63.541
YouTube	13.932	102.141	123.667
Lastfm	22.411	154.142	194.275
Hyves	13.158	128.689	186.123
Skitter	63.834	495.145	585.592
Flicker	142.337	556.027	669.587

9. Experimental evaluation

In this section, we evaluate and compare List Sampling (LS0, LS1 and LS2) with three algorithms: FFS [1], TIES [2] and Random Degree Node (RDN) sampling. It has already been shown in [1, 2] that these methods outperform traditional algorithms such as Node Sampling (NS), Edge Sampling (ES) and Random Walks (RW); therefore, we compare List Sampling with FFS, TIES and RDN only.

Forest Fire Sampling (FFS) is a partial breadth first sampling method in which we pick a seed node at random and then burn a fraction of its outgoing edges along with the nodes on the other end. The number of nodes to be burned is drawn from a geometric distribution with mean $p_{f}/(1-p_{f})$, where $p_{f}$ is the forward burning ratio with a recommended value of $p_{f}=0.7$, which means that 2.33 nodes are burned on average. The fundamental difference between FFS and LS is that FFS burns the neighborhood of the current node only, whereas we keep on accumulating the nodes and may burn a node that was added to the list in a previous iteration. Totally Induced Edge Sampling (TIES) is a variation of Edge Sampling (ES), and the primary difference between TIES and ES is the graph induction step. In TIES, we populate the edge set $E_{s}$ by adding all the edges that exist in the original graph between the sampled nodes, in addition to those sampled in the edge sampling step. In Random Degree Node (RDN) sampling, the probability of sampling a node is proportional to its degree and hence we expect many high degree nodes in the sample graph.
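Two of the building blocks described above can be sketched in a few lines: the geometric draw that decides how many neighbors FFS burns, and the graph induction step used by TIES. This is a minimal illustration under our own assumptions (edges given as a list of node pairs, hypothetical function names `burn_count` and `induce_edges`); it is not the implementation used in [1] or [2].

```python
import random


def burn_count(pf: float, rng: random.Random) -> int:
    """Draw the number of neighbors to burn.

    Counts successes (probability pf each) before the first failure, which is a
    geometric distribution with mean pf / (1 - pf).
    """
    count = 0
    while rng.random() < pf:
        count += 1
    return count


def induce_edges(edges, sampled_nodes):
    """Graph induction: keep every original edge whose two endpoints were both sampled."""
    sampled = set(sampled_nodes)
    return [(u, v) for u, v in edges if u in sampled and v in sampled]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
draws = [burn_count(0.7, rng) for _ in range(100_000)]
print(sum(draws) / len(draws))  # close to 0.7 / 0.3 = 2.33

edges = [(1, 2), (2, 3), (3, 4), (1, 3)]
print(induce_edges(edges, {1, 2, 3}))  # [(1, 2), (2, 3), (1, 3)]
```

The induction step is what separates TIES from plain edge sampling, and its absence in FFS is the reason given later for the low clustering coefficients and long paths of FFS samples.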

For presentation purposes, we divide these algorithms into two groups. In group A, we compare LS0, LS1, LS2 and FFS because all these algorithms apply traversal based methods to extract samples. In group B, we compare LS1 and LS2 with TIES and RDN. Here, we evaluate two versions of LS against two algorithms that randomly select nodes or edges with access to the whole graph, as opposed to LS, which explores only a portion of the graph to extract samples.

Figure 8.

Point Statistics of all networks [Group A]: (D1–D10) Degree, (C1–C10) Clustering Coefficient, (P1–P10) Path Length.

Figure 9.

Point Statistics of all networks [Group B]: (D1–D10) Degree, (C1–C10) Clustering Coefficient, (P1–P10) Path Length.

9.1 Point statistics

We vary the sampling fraction $\phi$ from 0.1 to 0.2 and present the average of 10 readings with 95% confidence intervals in Figs 8 and 9. We calculate and plot the scaling ratio (Eq. (11)) of the average degree, average clustering coefficient and average path length of the sampled graphs as we vary the sampling fraction.
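As a rough illustration of how one plotted point could be obtained, the sketch below computes the scaling ratio as the sampled value divided by the original value and then averages 10 readings with a 95% confidence interval. The Student-t interval and the names `scaling_ratio` and `mean_with_ci` are our assumptions; the text does not state how the intervals were computed.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 9 degrees of freedom (10 readings).
T_CRIT_95_DF9 = 2.262


def scaling_ratio(sampled_value: float, original_value: float) -> float:
    """Ratio of a property measured on the sample to the same property of the original graph."""
    return sampled_value / original_value


def mean_with_ci(readings, t_crit=T_CRIT_95_DF9):
    """Return (mean, half-width of the confidence interval) for a list of readings."""
    mean = statistics.mean(readings)
    half_width = t_crit * statistics.stdev(readings) / math.sqrt(len(readings))
    return mean, half_width


# Hypothetical readings: average degree of 10 samples of a graph whose average degree is 10.0.
sampled_degrees = [9.1, 9.4, 8.8, 9.0, 9.6, 9.2, 8.9, 9.3, 9.5, 9.0]
ratios = [scaling_ratio(d, 10.0) for d in sampled_degrees]
mean, half_width = mean_with_ci(ratios)
print(f"scaling ratio = {mean:.3f} +/- {half_width:.3f}")
```

A scaling ratio of 1 means the sample reproduces the property exactly; values above or below 1 correspond to over- and underestimation respectively.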

Degree: The first two rows of Figs 8D1–D10 and 9D1–D10 show the scaling ratio of the average degree on the Y-axis against the sampling fraction on the X-axis for all the networks for group A and B sampling methods respectively. In group A, we see that LS0, LS1 and LS2 give very similar results. In some networks one version performs better than the others, but overall all three versions of LS closely match one another. FFS performs poorly and markedly underestimates the degree in all the data sets. In group B, we see that both TIES and RDN overestimate the degree of the sampled graphs in most of the networks, while both of our algorithms estimate the degree better than TIES and RDN in most of the networks at almost all the sampling fractions. Both TIES and RDN sample higher degree nodes more often, which is why their results are very similar to each other. We also see that in some networks LS underestimates the degree because of the value of k, as discussed in Section 8.3. In addition, our algorithms show comparatively consistent values at all sampling fractions, whereas TIES and RDN show either an increasing or a decreasing trend as the sampling fraction increases. It is worth mentioning that although the probability of selecting a node from the CL in LS2 is inversely proportional to its degree, the frequency factor helps to pick moderately high degree nodes; as a result, LS2 simulates the effect of uniform random selection of nodes from the CL and gives good results. This shows the flexibility of the LS framework in designing the probability of node selection from the CL.

Clustering Coefficient: The third and fourth rows of Figs 8C1–C10 and 9C1–C10 show the results of the average clustering coefficient for all data sets for group A and B methods respectively. In group A, the LS methods outperform FFS, as FFS always underestimates the clustering coefficient of the graph. One reason is that FFS does not perform the induction step and hence produces samples with a low clustering coefficient compared to the original graph. In group B, we see that in half of the networks TIES and RDN overestimate the clustering coefficient while LS underestimates it. However, in three networks, i.e., CiteSeer, DBLP and Amazon, all sampling methods underestimate it, with LS still performing slightly better. Intuitively, the induction step plays an important role in estimating the clustering coefficient, but because TIES and RDN pick high degree nodes they overestimate the values.

Path Length: The fifth and sixth rows of Figs 8P1–P10 and 9P1–P10 show the results of the average path length for all the data sets for group A and B methods respectively. In almost all data sets, LS gives very good results and outperforms FFS. FFS markedly overestimates the path length in all data sets: because it does not induce all the existing edges between the sampled nodes, it can miss many existing paths between nodes. TIES and RDN are comparably good in some networks and estimate the path length better than they estimate the degree and clustering coefficient.

Table 3
RMSE values of degree, clustering coefficient and path length

Metric	Sampling method	Gowalla	CiteSeer	DBLP	Amazon	FourSquare	YouTube	Lastfm	Hyves	Skitter	Flicker	Average
Average degree	FFS	3.201	1.972	1.791	1.344	3.565	1.442	2.342	0.765	4.857	7.448	2.873
Average degree	TIES	4.635	0.331	0.385	1.217	10.039	3.738	6.612	0.951	5.715	23.450	5.707
Average degree	RDN	4.724	0.402	0.358	1.724	11.187	4.175	2.375	0.835	6.211	28.952	6.094
Average degree	LS0	1.232	0.811	0.789	0.764	1.321	0.088	0.941	0.341	0.614	2.198	0.910
Average degree	LS1	0.908	0.858	0.513	0.764	0.552	0.164	0.909	0.095	1.554	1.942	0.826
Average degree	LS2	1.142	0.755	0.744	0.479	0.640	0.431	0.875	0.114	1.714	0.958	0.785
Average clustering coefficient	FFS	0.174	0.440	0.422	0.249	0.099	0.055	0.065	0.039	0.216	0.122	0.188
Average clustering coefficient	TIES	0.028	0.158	0.165	0.075	0.711	0.034	0.022	0.031	0.002	0.284	0.151
Average clustering coefficient	RDN	0.013	0.238	0.276	0.172	0.364	0.031	0.019	0.032	0.008	0.065	0.121
Average clustering coefficient	LS0	0.042	0.153	0.182	0.097	0.088	0.002	0.014	0.002	0.049	0.014	0.064
Average clustering coefficient	LS1	0.035	0.149	0.181	0.095	0.087	0.003	0.013	0.002	0.041	0.002	0.062
Average clustering coefficient	LS2	0.041	0.093	0.113	0.071	0.013	0.009	0.023	0.011	0.036	0.008	0.042
Average path length	FFS	3.411	7.081	5.646	9.831	1.812	2.919	3.623	4.396	4.370	2.184	4.528
Average path length	TIES	0.764	0.345	0.538	5.755	0.881	1.181	0.636	0.225	0.688	0.520	1.153
Average path length	RDN	0.953	0.483	0.685	6.876	1.023	1.478	0.624	0.284	0.605	1.661	1.467
Average path length	LS0	0.206	0.302	0.071	0.553	0.413	0.824	0.408	0.274	0.015	0.556	0.362
Average path length	LS1	0.225	0.339	0.098	0.505	0.350	0.806	0.206	0.267	0.053	0.302	0.315
Average path length	LS2	0.237	0.372	0.182	1.312	0.093	0.445	0.629	1.057	0.093	0.239	0.466

9.2 Root mean square error

To measure how far the properties of the sampled graphs are from those of the original graph, we calculate the Root Mean Square Error (RMSE) for the degree, clustering coefficient and path length point statistics presented in Figs 8 and 9. We present the results in Table 3, where each value is averaged over all sampling fractions. For degree and clustering coefficient, LS2 generates slightly less error on average than LS0 and LS1, while LS1 is a little better than LS0 and LS2 in estimating the path length. On average, all three versions of LS perform considerably better than FFS, TIES and RDN, which generate significant error in estimating the properties of the original graph.
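For reference, a minimal sketch of an RMSE computation of this kind is given below. It assumes that the error is taken between the sampled value at each sampling fraction and the single value of the original graph; the exact definition used for Table 3 is the one given with the evaluation criteria, and the function name `rmse` and the numbers in the example are hypothetical.

```python
import math


def rmse(sampled_values, original_value):
    """Root mean square error of sampled values against one reference value."""
    squared_errors = [(s - original_value) ** 2 for s in sampled_values]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical average degrees measured at five sampling fractions,
# for an original graph whose average degree is 10.0.
print(rmse([9.0, 9.5, 10.5, 11.0, 10.0], 10.0))
```

Because the errors are squared before averaging, a method that is far off at a single sampling fraction is penalized more than one that is slightly off at every fraction.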

Figure 10.

Distributions of all networks at $\phi=0.1$ [Group A]: (D1–D10) Degree, (C1–C10) Clustering Coefficient, (P1–P10) Path Length.

Figure 11.

Distributions of all networks at $\phi=0.1$ [Group B]: (D1–D10) Degree, (C1–C10) Clustering Coefficient, (P1–P10) Path Length.

9.3 Distributions

We present the distributions of degree, clustering coefficient and path length in Figs 10 and 11 for all datasets for group A and B methods respectively. We plot the Empirical Cumulative Distribution Function (ECDF) of each sample at a 10% sampling fraction along with the distribution of the actual graph. These distributions convey more information than point statistics. We selected $\phi=0.1$ to show the distributions because 10% is a reasonable sample size; other sampling fractions yield analogous results.
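An ECDF of the kind plotted in these figures can be built directly from the list of per-node values. The sketch below is illustrative only; the function name `ecdf` and the example degree sequence are hypothetical.

```python
from bisect import bisect_right


def ecdf(values):
    """Return a function F where F(x) is the fraction of values that are <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return F


degrees = [1, 2, 2, 3]  # hypothetical node degrees of a tiny graph
F = ecdf(degrees)
print(F(0), F(2), F(3))  # 0.0 0.75 1.0
```

Plotting F for the sample next to F for the original graph shows at a glance whether a method shifts mass toward low or high values, which a single average cannot reveal.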

Degree Distribution: The first two rows of Figs 10D1–D10 and 11D1–D10 show the degree distributions for all the networks for group A and B methods respectively. In group A, we see that in most of the networks LS closely follows the original distributions, whereas FFS underestimates them and its samples contain many low degree nodes. In group B, both TIES and RDN overestimate the distributions, which shows that they select many high degree nodes from the whole network. All versions of LS pick both high and low degree nodes in good proportion to each other and as a result closely follow the degree distribution of the original graph.

Clustering Coefficient Distribution: The third and fourth rows of Figs 10C1–C10 and 11C1–C10 show the ECDF of the clustering coefficient for all the datasets for group A and B methods respectively. LS follows the shape of the original distributions better than TIES, RDN and FFS. In two networks, FourSquare and Flicker, TIES and RDN perform poorly and fail to retain the structure of the original graph. FFS underestimates the distributions in all the graphs. FFS yields a high fraction of low degree nodes because it burns only 2.33 neighbors of a node on average, and it misses many edges between the sampled nodes because, unlike TIES, it does not add induced edges; hence it produces samples with a low clustering coefficient.

Path Length Distribution: The fifth and sixth rows of Figs 10P1–P10 and 11P1–P10 show the distributions of path length for all the networks for group A and B methods respectively. In almost all networks, LS, TIES and RDN perform comparably, though LS performs slightly better than the others, as we will see in the distance measures (Sections 9.4 and 9.5). In all the graphs, FFS overestimates the path length and collects samples that have a high fraction of long paths, because it samples only a fraction of the neighbors and does not induce edges, and therefore drops many existing paths.

Figure 12.

Kolmogorov-Smirnov (KS) Distance of all networks.

9.4 Kolmogorov-Smirnov distance

The KS Distance measures the maximum distance between two cumulative distributions. We present the average KS Distance for all networks in Fig. 12, where each bar value is calculated by averaging over 10 readings and 5 sample sizes. The last bar set shows the average over all datasets. With a few exceptions, LS outperforms TIES, RDN and FFS in all the networks across all the metrics. The exceptions are the DBLP network in the degree metric, the Skitter network in the clustering coefficient metric, and the CiteSeer, FourSquare and Hyves networks in the path length metric. Averaged over all datasets, LS shows a lower KS Distance than TIES, RDN and FFS across all three metrics. We also see that LS2 gives a lower distance than LS0 and LS1 in degree and clustering coefficient, while for path length the opposite is the case.
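The KS Distance between a sample and the original graph can be sketched as the largest gap between their two ECDFs, evaluated at every observed value. This is a plain two-sample KS statistic written for illustration; the function name `ks_distance` and the toy inputs are hypothetical.

```python
from bisect import bisect_right


def ks_distance(sample_a, sample_b):
    """Maximum absolute difference between the ECDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The ECDFs only change at observed values, so it suffices to check those points.
    points = set(a) | set(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


original_degrees = [1, 2, 3, 4]  # hypothetical degrees in the original graph
sampled_degrees = [3, 4, 5, 6]   # hypothetical degrees in a sample that overestimates
print(ks_distance(original_degrees, sampled_degrees))  # 0.5
```

A value of 0 means the two ECDFs coincide and a value of 1 means the two samples do not overlap at all.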

Figure 13.

Jensen-Shannon (JS) Distance of all networks.

9.5 Jensen-Shannon distance

While the KS Distance measures the maximum distance between the two distributions, the JS Divergence measures the similarity between the two distributions across the entire range of values; its square root is known as the JS Distance. We compute the average JS Distance for all the networks and present the results in Fig. 13, where each bar value is calculated by averaging over 10 readings and 5 sample sizes. The last bar set shows the average over all datasets. With only two exceptions, DBLP in degree and Hyves in path length, the average JS Distance of LS is less than that of TIES, RDN and FFS in all datasets. The reason is that LS follows the distributions better than the other methods and hence gives a lower JS Distance. On average, LS2 performs slightly better than LS0 and LS1 in the degree and clustering coefficient metrics.
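A minimal sketch of the JS Distance for discrete values such as node degrees is shown below. It assumes base-2 logarithms, which bound the distance between 0 and 1, and it treats each distinct value as its own bin; continuous properties such as the clustering coefficient would need to be binned first. The function name `js_distance` is hypothetical.

```python
import math
from collections import Counter


def js_distance(sample_a, sample_b):
    """Square root of the Jensen-Shannon divergence between two empirical distributions."""
    count_a, count_b = Counter(sample_a), Counter(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    divergence = 0.0
    for value in set(count_a) | set(count_b):
        p = count_a[value] / n_a  # probability of this value in sample_a
        q = count_b[value] / n_b  # probability of this value in sample_b
        m = 0.5 * (p + q)         # mixture distribution
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(divergence, 0.0))


print(js_distance([1, 1, 2, 3], [1, 1, 2, 3]))  # 0.0, identical distributions
print(js_distance([1, 1], [2, 2]))              # 1.0, disjoint distributions
```

Unlike the KS Distance, which reports only the single largest gap, this measure accumulates the differences over every value, so it also reflects many small deviations spread across the range.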

9.6 Effective diameter, assortativity and global clustering coefficient

While properties of a graph such as degree, clustering coefficient and path length are local to a node, properties such as diameter, assortativity and global clustering coefficient represent the overall structure of a graph. The 90% effective diameter is the distance below which 90% of all shortest path lengths fall. Assortativity quantifies the tendency of nodes in a graph to connect to others that are similar in some way. It is also referred to as degree mixing and ranges from $+1$ to $-1$, the network being assortative or disassortative respectively. The global clustering coefficient gives the ratio of closed triplets (triangles) to all connected triplets in a graph. We show the measured assortativity and global clustering coefficient values of $G_{s}$ with 95% confidence intervals and the actual values of $G$ in Figs 14 and 15 for all the networks for group A and B methods respectively.
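Two of these global properties are simple enough to sketch directly. The code below computes a non-interpolated 90% effective diameter from a list of shortest path lengths and the global clustering coefficient from an undirected adjacency map. Both functions, their names and the toy graph are illustrative assumptions; graph libraries may use slightly different conventions, for example interpolating the effective diameter between integer distances.

```python
from itertools import combinations


def effective_diameter(path_lengths, quantile=0.9):
    """Smallest distance d such that at least `quantile` of the path lengths are <= d."""
    ordered = sorted(path_lengths)
    if not ordered:
        raise ValueError("need at least one path length")
    needed = quantile * len(ordered)
    for covered, length in enumerate(ordered, start=1):
        if covered >= needed:
            return length
    return ordered[-1]


def global_clustering(adj):
    """Closed triplets divided by all connected triplets.

    `adj` maps each node to the set of its neighbors (undirected graph).
    """
    closed = 0
    triplets = 0
    for node, neighbors in adj.items():
        k = len(neighbors)
        triplets += k * (k - 1) // 2  # triplets centered on this node
        for u, v in combinations(neighbors, 2):
            if v in adj[u]:           # the two neighbors are themselves connected
                closed += 1
    return closed / triplets if triplets else 0.0


# A triangle (1, 2, 3) with one pendant node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(global_clustering(adj))  # 0.6
```

In the toy graph, the single triangle contributes three closed triplets out of five connected triplets, which gives 0.6.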

Figure 14.

90% Effective Diameter, Assortativity and Global Clustering Coefficient of all networks [Group A]: (E1–E10) 90% Effective Diameter, (A1–A10) Assortativity, (G1–G10) Global Clustering Coefficient.

Figure 15.

90% Effective Diameter, Assortativity and Global Clustering Coefficient of all networks [Group B]: (E1–E10) 90% Effective Diameter, (A1–A10) Assortativity, (G1–G10) Global Clustering Coefficient.

The first two rows (E1–E10) of Figs 14 and 15 show the 90% effective diameter of all networks at different sampling fractions. We present it as the scaling ratio to its original value, i.e., the sampled value divided by the original value of the network. We see that LS performs better than TIES, RDN and FFS in most of the networks. TIES and RDN stand second, while FFS gives very high values of the diameter. The third and fourth rows (A1–A10) of Figs 14 and 15 show the assortativity values, and we see that LS is more consistent than TIES, RDN and FFS, as these methods overestimate or underestimate the values in some networks. Although FFS also maintains assortativity mixing for some networks, it does not perform well in retaining other properties. TIES and RDN perform random edge and node sampling respectively with access to the whole graph; they therefore have more chances of picking nodes of varying degree and of maintaining their mixing patterns, but they still fail in some networks. LS, on the other hand, explores a limited region of a graph, yet the assortativity results show that it can still maintain the overall structure and mixing patterns of nodes well, which supports its advantage over the existing approaches. Similarly, in Figs 14 and 15 (G1–G10), which show the values of the global clustering coefficient of all the networks, we see that LS is better than TIES, RDN and FFS in retaining the ratio of triangles in the original graph. This means that LS not only preserves node properties, e.g., degree and clustering coefficient, but also maintains the overall structure of the graph.

9.7 Root mean square error

We calculate the Root Mean Square Error (RMSE) for the 90% effective diameter, assortativity and global clustering coefficient and present the results in Table 4, where each value is averaged over all sampling fractions. For the 90% effective diameter, all three versions of LS perform considerably better on average than FFS, TIES and RDN. For assortativity, all methods give close results, while in estimating the global clustering coefficient, LS and FFS are close to each other and substantially better than TIES and RDN.

Table 4
RMSE values of 90% effective diameter, assortativity and global clustering coefficient

Metric	Sampling method	Gowalla	CiteSeer	DBLP	Amazon	FourSquare	YouTube	Lastfm	Hyves	Skitter	Flicker	Average
90% effective diameter	FFS	4.409	9.558	7.087	13.285	2.851	4.162	4.666	5.982	5.691	5.990	6.368
90% effective diameter	TIES	0.933	0.331	0.381	11.478	0.742	1.363	0.471	0.096	0.991	3.368	2.015
90% effective diameter	RDN	1.158	0.482	0.622	11.447	0.806	1.687	0.493	0.121	1.567	3.545	2.193
90% effective diameter	LS0	0.221	0.402	0.135	0.991	0.118	0.884	1.023	0.529	0.571	1.646	0.652
90% effective diameter	LS1	0.315	0.474	0.117	0.914	0.111	0.883	0.502	0.641	0.038	1.546	0.554
90% effective diameter	LS2	0.390	0.590	0.249	2.005	0.041	0.326	0.972	1.675	0.238	1.516	0.801
Assortativity	FFS	0.009	0.041	0.151	0.021	0.069	0.009	0.011	0.005	0.027	0.011	0.036
Assortativity	TIES	0.007	0.289	0.193	0.081	0.099	0.001	0.044	0.006	0.017	0.024	0.076
Assortativity	RDN	0.011	0.278	0.218	0.069	0.118	0.002	0.046	0.008	0.069	0.073	0.089
Assortativity	LS0	0.012	0.029	0.004	0.011	0.036	0.015	0.005	0.013	0.015	0.047	0.019
Assortativity	LS1	0.011	0.025	0.028	0.008	0.044	0.009	0.007	0.010	0.015	0.021	0.018
Assortativity	LS2	0.011	0.036	0.064	0.007	0.046	0.027	0.013	0.025	0.009	0.006	0.024
Global clustering coefficient	FFS	0.005	0.257	0.046	0.054	0.001	0.002	0.009	0.001	0.006	0.091	0.047
Global clustering coefficient	TIES	0.044	0.243	0.368	0.095	0.003	0.014	0.021	0.017	0.015	0.213	0.103
Global clustering coefficient	RDN	0.052	0.243	0.408	0.068	0.002	0.015	0.023	0.017	0.067	0.299	0.119
Global clustering coefficient	LS0	0.009	0.089	0.171	0.073	0.001	0.003	0.008	0.001	0.002	0.017	0.037
Global clustering coefficient	LS1	0.012	0.084	0.166	0.071	0.001	0.002	0.003	0.002	0.001	0.007	0.035
Global clustering coefficient	LS2	0.026	0.092	0.134	0.041	0.001	0.007	0.003	0.006	0.006	0.012	0.033

9.8 Summary of results

The results show that LS outperforms the state-of-the-art FFS, TIES and RDN algorithms in most of the networks across all metrics and statistical tests. Our algorithms give good and consistent values at almost all sampling fractions in the point statistics of degree, clustering coefficient and path length, and estimate their distributions more closely than FFS, TIES and RDN. In addition, the statistical tests performed for the quantitative evaluation of the sampling algorithms also support the superiority of our sampling approach over the existing approaches across all three metrics. Moreover, LS also performs better in maintaining the overall structure of the graph, as shown by the assortativity and global clustering coefficient results. We summarize the results in the following points:

• The three versions of List Sampling, i.e., LS0, LS1 and LS2, extract good samples from the original graph. The samples extracted by these methods are both qualitatively and quantitatively superior to the samples produced by the previous methods discussed in this paper.
• Of the three versions of LS presented in this paper, it is very hard to prefer one over the others because all three methods produce reasonably close results. However, LS2 appears to be marginally better than LS0 and LS1 in terms of degree and clustering coefficient measurements, while LS1 appears to estimate path lengths better than its two counterparts. For assortativity and global clustering coefficient, we could not find a big difference in their performance.
• FFS is also a traversal based method and performs well in maintaining the overall structure of the graph, as seen in the assortativity and global clustering coefficient results. However, it underestimates the degree and clustering coefficient and overestimates the path lengths.
• Both TIES and RDN rely on randomly picking edges/nodes from the graph and are biased toward high degree nodes. The selection of high degree nodes leads them to produce samples that overestimate the degree. The random selection results in high path lengths, although the induction step compensates for this, so they obtain better results than FFS. TIES and RDN perform reasonably well in estimating the clustering coefficient and assortativity.

10. Conclusion and future work

The current sampling methods fail to accurately match the distributional properties of the original graph and to preserve its topology. We observe that the neighborhood of a node being sampled is not explored properly by the current methods, and as a result some methods overestimate the properties of the original graph while others underestimate them. We find a middle ground by introducing the novel concept of a pool of candidate nodes: we pick nodes from this pool instead of from the whole graph and thereby produce good samples. We present a framework with two parameters; by tuning these two parameters, we can design new algorithms. In principle, this framework provides the basic infrastructure and guidelines for designing new sampling approaches with flexibility. We present two algorithms based on this framework and demonstrate, using real world data sets, that they produce better samples than the current state-of-the-art methods. We also perform statistical tests for the quantitative evaluation of our algorithms and show that our algorithms improve by a factor of 2x to 5x on average when compared to the existing methods.

In the future, we intend to extend this framework to directed and labeled graphs. One possibility for sampling directed graphs is to consider only the out-going edges when adding the neighbors of a node to the candidate list. For example, if a node has three out-going and two in-coming edges, we add the three neighbors at the other end of its out-going links to the candidate list, and proceed this way to maintain the out-going degree distribution. A similar technique could be worked out for the in-coming degree distribution. Another direction for future work is to sample dynamic and streaming graphs, where new nodes/edges are constantly arriving. A possible solution could be to look at the new neighbors of a sampled node as they arrive in the network and sample them with a certain probability.
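The directed-graph idea can be illustrated with a small sketch of the candidate-list update. This is a speculative illustration of the proposal, not part of the evaluated algorithms: the function name `add_out_neighbors`, the dictionary-of-lists graph representation, and the choice to skip nodes that are already sampled or already in the list are all our assumptions.

```python
def add_out_neighbors(candidate_list, sampled, out_edges, node):
    """Append the out-neighbors of `node` to the candidate list.

    `out_edges` maps a node to the list of nodes its out-going edges point to.
    In-coming edges are ignored, and nodes that are already sampled or already
    in the candidate list are skipped (an assumption of this sketch).
    """
    for neighbor in out_edges.get(node, ()):
        if neighbor not in sampled and neighbor not in candidate_list:
            candidate_list.append(neighbor)
    return candidate_list


# Node "a" has three out-going edges (to b, c, d) and two in-coming edges (from x, y).
out_edges = {"a": ["b", "c", "d"], "x": ["a"], "y": ["a"]}
cl = add_out_neighbors([], {"a"}, out_edges, "a")
print(cl)  # ['b', 'c', 'd']
```

Only the three out-neighbors enter the candidate list, matching the example in the text; swapping `out_edges` for a map of in-coming edges would give the analogous variant for the in-coming degree distribution.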

Acknowledgments

This research was supported by the ICT R&D program of MSIP/IITP, [R7117-16-0219, Development of Predictive Analysis Technology on Socio-Economics using Self-Evolving Agent-Based Simulation embedded with Incremental Machine Learning] and by Korea Institute of Science and Technology (KIST) under the project “Development of Tangible Social Media Technologies for Smart Aging”.

References

1. Leskovec and Faloutsos, Sampling from large graphs, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, 2006, pp. 631–636.

2. Ahmed N.K., Neville and Kompella, Network sampling: From static to streaming graphs, ACM Transactions on Knowledge Discovery from Data 8(2) (2013), 7:1–7:56.

3. Ahn Y.-Y., Han, Kwak, Moon and Jeong, Analysis of topological characteristics of huge online social networking services, in: Proceedings of the 16th International Conference on World Wide Web, WWW ’07, 2007, pp. 835–844.

4. Gjoka, Kurant, Butts C.T. and Markopoulou, Walking in facebook: A case study of unbiased sampling of osns, in: INFOCOM, 2010 Proceedings IEEE, 2010, pp. 1–9.

5. Ribeiro and Towsley, Estimating and sampling graphs with multidimensional random walks, in: Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, IMC ’10, 2010, pp. 390–403.

6. Lee S.H., Kim P.-J. and Jeong, Statistical properties of sampled networks, Phys. Rev. E 73 (2006), 016102.

7. Yoon, Lee, Yook S.-H. and Kim, Statistical properties of sampled networks by random walks, Phys. Rev. E 75 (2007), 046114.

8. Gilbert A.C., Compressing network graphs, in: LinkKDD, 2004.

9. Rafiei and Curial, Effectively visualizing large networks through sampling, in: 16th IEEE Visualization Conference, VIS 2005, Minneapolis, MN, USA, 23–28 October 2005, pp. 375–382.

10. Stumpf M.P.H., Wiuf and May R.M., Subnets of scale-free networks are not scale-free: Sampling properties of networks, Proceedings of the National Academy of Sciences 102(12) (2005), 4221–4224.

11. Lakhina, Byers J.W., Crovella and Xie, Sampling biases in IP topology measurements, in: Proceedings of the 22nd Annual Joint Conference of the IEEE Computer and Communications Societies, Vol. 1, 2003, pp. 332–341.

12. Rasti A.H., Torkjazi, Rejaie, Duffield N.G., Willinger and Stutzbach, Respondent-driven sampling for characterizing unstructured overlays, in: INFOCOM, IEEE, 2009, pp. 2701–2705.

13. Avrachenkov, Ribeiro B.F. and Towsley D.F., Improving random walk estimation accuracy with uniform restarts, in: Vol. 6516 of Lecture Notes in Computer Science, WAW, R. Kumar and D. Sivakumar, eds, Springer, 2010, pp. 98–109.

14. Baykan, Henzinger M.R., Keller S.F., Castelberg S.D. and Kinzler, A comparison of techniques for sampling web pages, CoRR abs/0902.1604.

15. Gjoka, Kurant, Butts C.T. and Markopoulou, Practical recommendations on crawling online social networks, IEEE Journal on Selected Areas in Communications 29(9) (2011), 1872–1892.

16. Doerr and Blenn, Metric convergence in social network sampling, in: Proceedings of the 5th ACM Workshop on HotPlanet, HotPlanet ’13, 2013, pp. 45–50.

17. Hübler, Kriegel H.P., Borgwardt and Ghahramani, Metropolis algorithms for representative subgraph sampling, in: 2008 Eighth IEEE International Conference on Data Mining, 2008, pp. 283–292.

18. Maiya A.S. and Berger-Wolf T.Y., Benefits of bias: Towards better characterization of network sampling, in: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’11, 2011, pp. 105–113.

19. Bar-Yossef and Gurevich, Random sampling from a search engine’s index, J. ACM 55(5) (2008), 24:1–24:74.

20. Ribeiro, Wang, Murai and Towsley, Sampling directed graphs with random walks, in: 2012 Proceedings IEEE INFOCOM, 2012, pp. 1692–1700.

21. Friedman, Getoor, Koller and Pfeffer, Learning probabilistic relational models, in: Proceedings of the 16th International Joint Conference on Artificial Intelligence – Volume 2, IJCAI’99, 1999, pp. 1300–1307.

22. Bakshy, Rosenn, Marlow and Adamic, The role of social networks in information diffusion, in: Proceedings of the 21st International Conference on World Wide Web, WWW ’12, 2012, pp. 519–528.

23. Anand S.S., Bell D.A. and Hughes J.G., The role of domain knowledge in data mining, in: Proceedings of the Fourth International Conference on Information and Knowledge Management, CIKM ’95, 1995, pp. 37–43.

24. Kopanas, Avouris N.M. and Daskalaki, The role of domain knowledge in a large scale data mining project, in: Proceedings of the Second Hellenic Conference on AI: Methods and Applications of Artificial Intelligence, SETN ’02, 2002, pp. 288–299.

25. Garcia-sanchez and Druzdzel M.J., An efficient sampling algorithm for influence diagrams, in: Proceedings of the Second European Workshop on Probabilistic Graphical Models, Leiden, The Netherlands, 2004, pp. 97–104.

26. Ortiz L.E. and Kaelbling L.P., Sampling methods for action selection in influence diagrams, in: Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, 2000, pp. 378–385.

27. Geweke, Bayesian inference in econometric models using monte carlo integration, Econometrica 57(6) (1989), 1317–1339.

28. and Lau W.C., A survey and taxonomy of graph sampling, CoRR abs/1308.5865.

29. Kurant, Gjoka, Wang, Almquist Z.W., Butts C.T. and Markopoulou, Coarse-grained topology estimation via graph sampling, in: Proceedings of the 2012 ACM Workshop on Workshop on Online Social Networks, WOSN ’12, 2012, pp. 25–30.

30. Krishnamurthy, Faloutsos, Chrobak, Cui J.-H., Lao and Percus A.G., Sampling large internet topologies for simulation purposes, Comput. Netw. 51(15) (2007), 4284–4302.

31. Benczúr A.A. and Karger D.R., Randomized approximation schemes for cuts and flows in capacitated graphs, SIAM Journal on Computing 44(2) (2015), 290–319.

32. Nazi, Zhou, Thirumuruganathan, Zhang and Das, Walk, not wait: Faster sampling over online social networks, Proc. VLDB Endow. 8(6) (2015), 678–689.

33. and Li, Sampling online social networks by random walk, in: Proceedings of the First ACM International Workshop on Hot Topics on Interdisciplinary Social Networks Research, HotSocial ’12, 2012, pp. 33–40.

34. Kurant, Markopoulou and Thiran, On the bias of breadth first search (bfs) and of other graph sampling techniques, 2010, pp. 1–8.

35. Blagus, Subelj and Bajec, Empirical comparison of network sampling techniques, CoRR abs/1506.02449.

36. konect network dataset – KONECT, http://konect.uni-koblenz.de/.

37. Rossi R.A. and Ahmed N.K., The network data repository with interactive graph analytics and visualization, http://networkrepository.com.

38. Leskovec and Krevl, SNAP Datasets: Stanford large network dataset collection, http://snap.stanford.edu/data.

39. Zafarani and Liu, Social computing data repository at ASU, http://socialcomputing.asu.edu.
