New Survey Questions and Estimators for Network Clustering with Respondent-driven Sampling Data

Abstract

Respondent-driven sampling (RDS) is a popular method for sampling hard-to-survey populations that leverages social network connections through peer recruitment. Although RDS is most frequently applied to estimate the prevalence of infections and risk behaviors of interest to public health, such as HIV/AIDS or condom use, it is rarely used to draw inferences about the structural properties of social networks among such populations because it does not typically collect the necessary data. Drawing on recent advances in computer science, the authors introduce a set of data collection instruments and RDS estimators for network clustering, an important topological property that has been linked to a network’s potential for diffusion of information, disease, and health behaviors. The authors use simulations to explore how these estimators, originally developed for random walk samples of computer networks, perform when applied to respondent-driven samples with characteristics encountered in realistic field settings that depart from random walks. In particular, the authors explore the effects of multiple seeds, sampling without replacement versus with replacement, branching chains, imperfect response rates, preferential recruitment, and misreporting of ties. The authors find that clustering coefficient estimators retain desirable properties in respondent-driven samples. This work takes an important step toward calculating network characteristics using nontraditional sampling methods, and it expands the potential of RDS to tell researchers more about hidden populations and the social factors driving disease prevalence.

Keywords

respondent-driven sampling, RDS, social networks, clustering coefficient, small-world model, transitivity, triad, hidden populations, HIV/AIDS, sampling, estimation

1. Introduction

Researchers in many fields are interested in populations that cannot be sampled by conventional methods because they are rare, lack a sampling frame, or have members who are unwilling to participate in traditional survey protocols. Such groups, known as hidden populations (Heckathorn 1997), are often marginalized and at high risk for infections such as HIV/AIDS. Respondent-driven sampling (RDS) is a set of methods for sampling and making inferences about hidden populations that has proliferated throughout the social sciences and public health (Malekinejad et al. 2008; White et al. 2012). RDS uses a without-replacement “link-tracing” approach, similar to snowball sampling, whereby respondents attempt to recruit a limited number of their personal network contacts in the target population until the desired sample size is attained. RDS offers a popular, quick, cost-effective, and anonymous approach for sampling understudied groups such as the homeless, drug users, or commercial sex workers that claims to provide asymptotically unbiased estimates of the population mean under limited conditions (Salganik and Heckathorn 2004; Volz and Heckathorn 2008). There are many concerns about the statistical properties of estimators for RDS data (Crawford et al. 2015; Fisher and Merli 2014; Gile and Handcock 2010; Goel and Salganik 2010; Lu et al. 2012, 2013; McCreesh et al. 2012; Merli, Moody, Smith, et al. 2015; Tomas and Gile 2011; Verdery, Mouw, et al. 2015). However, the continued development of estimators, diagnostics, and reporting protocols for use with such data is beginning to address these concerns (Baraff, McCormick, and Raftery 2016; Crawford 2016; Gile 2011; Gile and Handcock 2011; Gile, Johnston, and Salganik 2015; Lu 2013; McCreesh et al. 2013; Nesterko and Blitzstein 2015; Verdery, Merli, et al. 2015; White et al. 2015; Yamanis et al. 2013), though more work is needed.

Most RDS studies focus on prevalence estimation (i.e., estimation of the population mean or proportion of a focal attribute such as condom use) and avoid making inferences about other relevant estimands. We focus on network structure and, in particular, clustering. The structure of both social and contact networks is a key component of the risk environment for members of hidden populations (Rhodes and Simic 2005) with important implications for disease transmission (Schneider et al. 2012; Morris et al. 2009) and health behaviors (Centola and Macy 2007). Highly clustered risk networks, such as sexual contact networks or shared needle networks, can lead to more redundant paths, making disease transmission more likely (Moody 2002) and altering the relationship between concurrency and epidemic potential (Moody and Benton 2016). Clustering can also have benefits. Highly clustered friendship networks lead to normative reinforcement, and they can increase individual likelihoods of engaging in and spreading health-promoting behaviors such as joining an Internet-based health forum (Centola 2010), adopting modern contraceptives (Kohler, Behrman, and Watkins 2001), abstaining from illicit drugs (Silverman et al. 2007), getting tested for HIV (Karim et al. 2008), or avoiding unprotected sex (Lippman et al. 2010). Normative reinforcement through clustering can also drive unhealthy behaviors, such as sexual concurrency (Yamanis et al. 2015).

Despite its sociological and epidemiological importance, few studies of hidden populations using RDS have directly examined network structure. This is by design. Because field implementations of RDS require that samples be conducted without replacement and with maximal anonymity, typical respondent-driven samples have limited opportunity to measure network structure beyond recruiter-recruit relationships. Some have proposed using RDS to measure homophily (Wejnert 2010), or the tendency for people with similar attributes to be tied (McPherson, Smith-Lovin, and Cook 2001), but these approaches are flawed (Crawford et al. 2015), and there have been few developments since. Others have fit exponential random graph models to RDS data (Gile and Handcock 2011; Merli, Moody, Smith, et al. 2015), but learning about networks themselves was not the primary purpose of these studies. The ability of RDS studies to estimate network structure is important, however, because without closer attention to network characteristics that influence risk behaviors and sexually transmitted infections, RDS studies will be unable to offer a comprehensive picture of the dynamics driving epidemic transmission or other network diffusion processes.

In this article we focus on the performance of recently developed estimators of network clustering that can be applied to RDS data. Work in computer science has proposed clustering estimators for data obtained via random walk sampling (RWS) (Hardiman and Katzir 2013), which is an alternative link-tracing sampling design more appropriate for computer networks than human populations. RDS procedures depart from RWS in several important ways that require new data collection protocols to estimate network characteristics of interest from RDS surveys of human populations, and which may call into question whether such estimators will have favorable statistical properties when used with RDS data. We review these discrepancies in detail throughout the article. In Section 2 we discuss measures of network clustering, introduce their estimation in network censuses versus samples, and review how RDS differs from RWS. Throughout Section 2, we focus on RDS data collection strategies that could inform clustering estimators, which leads us to introduce two alternative survey question approaches for RDS. In Section 3 we describe the empirical data and simulation methods we use to evaluate whether our proposed survey questions and estimators of network clustering are appropriate for RDS data, focusing on bias, sampling variance (SV), and total error. Section 4 contains results from these simulations. In Section 5 we discuss how our proposed survey questions perform in six empirical RDS surveys. In Section 6 we summarize the contributions of this article and lay out additional directions for this research. Our results indicate that the estimators maintain reasonable properties with RDS data and that the questions have good empirical properties. These findings lead us to suggest that researchers add clustering questions and estimators to RDS protocols to further explore network structure. We conclude by focusing on the potential benefits of clustering estimation with RDS data.

2. Background

2.1. Initial Notation

The notation that follows guides our discussion throughout the article. For illustrative purposes, we rely on Figure 1, which shows (1) a hypothetical population (i.e., nodes A through I), (2) the social network linking its members (solid lines connecting nodes), (3) a hypothetical time-ordered RWS link-tracing sample starting from node A (dashed, directed, and numbered lines), and (4) a table counting relevant nodal statistics (on the right). Note that item 3 refers to a random walk sample rather than a respondent-driven sample; in a respondent-driven sample, node E would be ineligible to be sampled a second time because RDS is conducted without replacement. Below, we review this and other differences between RWS and RDS that together call into question whether clustering estimators designed for RWS can be applied to RDS.

Figure 1.

Example network with hypothetical random walk sampling and components needed to calculate local and global clustering coefficients for the whole network.

We characterize a social network of $n$ people as a graph $G$ with nodes $V$ representing people and undirected edges $E$ representing social ties. In Figure 1, we label nodes A through I and represent edges as undirected solid lines. We discuss the time-ordered, directed random walk steps shown with dashed and numbered lines in Sections 2.4 and 2.5 below. We represent the graph as an $n \times n$ adjacency matrix, $A$ , whose elements, $a_{ij},$ are 1 if there is a tie (edge) from person $i$ to person $j$ (i.e., when $i \leftrightarrow j$ ) and 0 otherwise. For instance, there is an edge in Figure 1 between nodes B and C (but not between nodes A and B). We follow standard practices in the RWS and RDS literatures (Hardiman and Katzir 2013; Lovász 1993; Volz and Heckathorn 2008) and consider an undirected graph with one component (see Lu et al. 2013 for the performance of RDS in directed networks). Because the network is undirected, the adjacency matrix $A$ is symmetric, and $a_{ij} = a_{ji}$ for all $i = 1, \dots, n$ and $j = 1, \dots, n$ . We set the diagonal of $A$ to 0 (i.e., $a_{ii} = 0$ for all $i = 1, \dots, n$ ).

For convenience, we define $d_{i} = \sum_{j = 1}^{n} a_{ij} = \sum_{j = 1}^{n} a_{ji}$ as the degree of person $i$ , meaning how many ties $i$ has in the network (the row and column sums agree because $A$ is symmetric). In Figure 1, node A’s degree is 1 because he or she is linked to only one other node (E), while node B’s degree is 2 because he or she is linked to both E and C. In empirical RDS studies, researchers typically estimate degree by asking respondents questions such as “How many people do you know (you know their name and they know yours) who have exchanged sex for money in the past six months?” (World Health Organization 2013:147). Some have studied the effect of inaccurate degree reporting on RDS estimates (Lu 2013; Lu et al. 2012; Neely 2009), but we assume accurate degree reporting.

2.2. Clustering Coefficients

Watts and Strogatz (1998) introduced the clustering coefficient to characterize small-world networks (Milgram 1967). Small-world networks are (1) highly clustered, meaning that most ties between people appear in pockets of interconnection (see below), and (2) have short average path lengths, meaning that the minimum number of steps between network members is, on average, low (e.g., as embodied in the famous phrase “six degrees of separation”). Clustering coefficients measure the first criterion.

Watts and Strogatz (1998) originally proposed a global measure of the clustering coefficient, defined as

C_{GCC} = \frac{2 \sum_{i = 1}^{n} \sum_{j = 1}^{n} \sum_{k = 1}^{j - 1} a_{ij} a_{ik} a_{jk}}{\sum_{i = 1}^{n} d_{i} (d_{i} - 1)},

where i, j, and k index unique respondents (Hardiman and Katzir 2013; Newman, Strogatz, and Watts 2001; Watts and Strogatz 1998). The global clustering coefficient (GCC) summarizes the overall network clustering by dividing the count of triangles by the count of connected triplets, where triangles are defined as sets of three individuals ( $i$ , $j$ , and $k$ ) for whom cells $a_{ij}$ , $a_{ik}$ , and $a_{jk}$ in the adjacency matrix $A$ are all equal to 1, and connected triplets are defined as sets of three individuals ( $i$ , $j$ , and $k$ ), where cells $a_{ij}$ and $a_{ik}$ are equal to 1. Note that triplets are defined to avoid double counting so that person $i$ is a member of $\sum_{j = 1}^{n} \sum_{k = 1}^{j - 1} a_{ij} a_{ik}$ connected triplets and $\sum_{j = 1}^{n} \sum_{k = 1}^{j - 1} a_{ij} a_{ik} a_{jk}$ triangles. As such, triangles are a subset of connected triplets that are connected in cell $a_{jk}$ . A node’s number of connected triplets is a function of his or her degree; that is, node $i$ ’s number of connected triplets is $d_{i} (d_{i} - 1) / 2$ . The embedded table in Figure 1 holds triangle and connected triplet counts for each node. The GCC of this graph is $15 / 33 = 0.4545$ . It is important to note that equation (1) cannot be evaluated for most RDS studies without information on connections between unsampled peers. We introduce simple questions for RDS surveys that address this issue in Section 2.5 below.

Extensions to the clustering coefficient concept consider the average amount of clustering among each individual’s affiliates in the network. This second measure, the local clustering coefficient (LCC), is defined as

C_{LCC} = n^{- 1} \sum_{i = 1}^{n} \frac{2 \sum_{j = 1}^{n} \sum_{k = 1}^{j - 1} a_{ij} a_{ik} a_{jk}}{d_{i} (d_{i} - 1)} .

The LCC measures the average of each individual’s number of triangles divided by his or her connected triplets. In Figure 1, the LCC is obtained by first dividing triangles by connected triplets, then taking the average (when $d_{i} = 1$ , the value is set to 0). Thus, nodes A to C each contribute values of 0 to the LCC, while node D contributes a value of $0.111 = 1 / 1 * 1 / 9$ , node E contributes a value of $0.0278 = 4 / 16 * 1 / 9$ , and so on. This graph’s LCC is 0.5767. As with the GCC, the LCC cannot readily be evaluated for many respondent-driven samples. The key difference between the clustering coefficient measures is that the GCC captures the totality of network members’ experience, which may be dominated by low clustering among high-degree nodes, while the LCC captures the average experience of network members, where each person in the network is weighted equally.
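To make the two census definitions concrete, the following is a minimal sketch in Python, assuming the full set of undirected ties is observed. The function names and the four-node toy graph are hypothetical illustrations; the toy graph is not the network in Figure 1.

```python
from itertools import combinations


def build_adjacency(edges):
    """Map each node to the set of its neighbors (undirected, no self-ties)."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # mirrors a_ii = 0 on the diagonal of A
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def triangles_and_triplets(adj, i):
    """Count triangles and connected triplets centered on node i."""
    d = len(adj[i])
    triplets = d * (d - 1) // 2  # d_i (d_i - 1) / 2
    triangles = sum(1 for j, k in combinations(adj[i], 2) if k in adj[j])
    return triangles, triplets


def global_cc(adj):
    """GCC: total triangles over total connected triplets."""
    counts = [triangles_and_triplets(adj, i) for i in adj]
    total_triplets = sum(c for _, c in counts)
    return sum(t for t, _ in counts) / total_triplets if total_triplets else 0.0


def local_cc(adj):
    """LCC: mean of per-node triangle/triplet ratios (0 when no triplets)."""
    ratios = [t / c if c else 0.0
              for t, c in (triangles_and_triplets(adj, i) for i in adj)]
    return sum(ratios) / len(ratios)


# Hypothetical toy graph: a triangle A-B-C with a pendant node D tied to C.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
toy_adj = build_adjacency(toy_edges)
gcc = global_cc(toy_adj)  # 3 triangles / 5 triplets
lcc = local_cc(toy_adj)   # (1 + 1 + 1/3 + 0) / 4
```

In this toy graph the high-degree node C has the lowest local ratio, so the GCC (0.6) and the LCC (about 0.583) differ slightly, which illustrates how the two measures weight nodes differently.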

Although clustering coefficients are recent additions to the social networks literature, they resemble other important network characteristics, in particular, transitivity, ego network density, and measures of clustering from the exponential random graph modeling framework. We omit detailed discussion of these alternative measures for the sake of brevity.

2.3. Measuring Clustering in Network Censuses and Samples

The calculation of many network-level statistics, including the clustering coefficient, assumes that researchers measure the entire adjacency matrix, $A$ , in terms of cells (edges) and rows or columns (nodes). In Figure 1, it would be assumed that the researcher measured all ties (solid, undirected lines) and nodes (labeled A–I). Collecting such saturated network data is challenging (Smith 2012), however, and often impossible for populations without clearly defined institutional boundaries (such as schools). In other settings, either intentionally or not, researchers do not collect data on all network members (node missingness), do not measure all relevant ties linking network members (edge missingness), or both.

When researchers cannot conduct a census of the network, they often turn to samples. There are many approaches to collecting sampled network data, including randomly drawn samples (Krivitsky, Handcock, and Morris 2011; Marsden 1987; McPherson, Smith-Lovin, and Brashears 2006; Smith 2012) and numerous link-tracing approaches (Goodman 1961; Heckathorn 1997; Mouw and Verdery 2012; Volz and Heckathorn 2008). We focus on the latter.

2.4. Hardiman and Katzir Estimators

Hardiman and Katzir (2013) introduced estimators for the LCC and GCC that use data gathered in a random walk sample, such as that shown in Figure 1. Intuitively, for vertices $x_{1}, x_{2}, \dots, x_{r}$ sampled via RWS, they estimate clustering with the presence of a tie between the vertices before and after the focal vertex. Typical RDS studies do not ask about the existence of this tie, though some have (see Section 5 below and online Appendix B), and in Section 2.5 we propose two question formats for RDS studies to assess its existence. More formally, for a step $k$ in a random walk, $X$ , let $ϕ_{k}$ represent whether a tie is present between the vertex before $x_{k}$ (i.e., $x_{k - 1}$ ) and the vertex after $x_{k}$ (i.e., $x_{k + 1}$ ). In the random walk depicted in Figure 1, for instance, $ϕ_{k}$ would be 0 the first time node E is sampled because nodes A and H are unconnected, but it would be 1 the second time node E is sampled because nodes F and I are connected. That is, $ϕ_{k} = a_{(x_{k - 1}, x_{k + 1})}$ for each $2 \leq k \leq r - 1$ , where $a_{ij}$ is the cell in the $i^{th}$ row and the $j^{th}$ column of the adjacency matrix, as before. Importantly, $ϕ_{k}$ is not calculated for the first and last nodes of the walk, because the former has no recruiter and the latter no recruit.
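The definition of $ϕ_{k}$ can be sketched in a few lines of Python. The adjacency structure and walk below are hypothetical: they reuse only the ties the text describes for Figure 1 (A and H untied, F and I tied, both adjacent to E) and add an assumed H–F tie so that the walk is valid; they do not reproduce the full figure.

```python
def phi_along_walk(adj, walk):
    """Return phi_k for every interior step of the walk.

    phi_k is 1 when the vertex before x_k is tied to the vertex after x_k,
    and 0 otherwise. It is undefined for the first and last steps.
    """
    phis = []
    for k in range(1, len(walk) - 1):
        before, after = walk[k - 1], walk[k + 1]
        # A backtracking step (before == after) gives 0 because a_ii = 0.
        phis.append(1 if after in adj[before] else 0)
    return phis


# Hypothetical fragment; the H-F tie is an assumption made for illustration.
hypothetical_adj = {
    "A": {"E"},
    "E": {"A", "F", "H", "I"},
    "F": {"E", "H", "I"},
    "H": {"E", "F"},
    "I": {"E", "F"},
}
hypothetical_walk = ["A", "E", "H", "F", "E", "I"]
phis = phi_along_walk(hypothetical_adj, hypothetical_walk)
```

Here the first visit to E yields 0 (A and H are untied) and the second visit to E yields 1 (F and I are tied), matching the verbal example.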

Next for the LCC, define a weighted sum of the $ϕ$ values as $Φ_{l} = (\frac{1}{r - 2}) \sum_{k = 2}^{r - 1} ϕ_{k} (\frac{1}{d_{x_{k}} - 1})$ . In this case, $d_{x_{k}}$ represents the degree of the vertex $x_{k}$ in the random walk, and $r$ is the length of the random walk. Thus, $Φ_{l}$ is the average of whether the previous vertex in the random walk ( $x_{k - 1}$ ) and the vertex that follows in the random walk ( $x_{k + 1}$ ) were tied, with weights that shrink as the degree of the current vertex grows. In RWS on an undirected, unweighted graph, the probability of observing a given vertex is proportional to that vertex’s degree if the random walk is in the steady state, which is typically achieved if the walk is sufficiently long or started with steady-state probabilities (reviewed in greater depth in Verdery, Mouw, et al. 2015 and Lovász 1993), so degree-based weights offset the overrepresentation of high-degree vertices. We note that this finding cannot be assumed to hold for the finite, branching, without-replacement samples conducted in RDS and that future research may investigate alternative weighting schemes. Finally, let $Ψ_{l} = (1 / r) \sum_{k = 1}^{r} (1 / d_{x_{k}})$ , representing the mean of sampled vertices’ reciprocal degrees. Hardiman and Katzir (2013) defined an estimator of the LCC as

{\hat{c}}_{LCC} = \frac{Φ_{l}}{Ψ_{l}} = \frac{(\frac{1}{r - 2}) \sum_{k = 2}^{r - 1} ϕ_{k} (\frac{1}{d_{x_{k}} - 1})}{(\frac{1}{r}) \sum_{k = 1}^{r} (\frac{1}{d_{x_{k}}})} .

Hardiman and Katzir (2013) also developed an estimator of the GCC. Letting $Φ_{g} = (1 / (r - 2)) \sum_{k = 2}^{r - 1} ϕ_{k} d_{x_{k}}$ and $Ψ_{g} = (1 / r) \sum_{k = 1}^{r} (d_{x_{k}} - 1)$ , they suggested the following measure for the GCC:

{\hat{c}}_{GCC} = \frac{Φ_{g}}{Ψ_{g}} = \frac{(\frac{1}{r - 2}) \sum_{k = 2}^{r - 1} ϕ_{k} d_{x_{k}}}{(\frac{1}{r}) \sum_{k = 1}^{r} (d_{x_{k}} - 1)} .
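The two estimators can be sketched as follows, assuming a single walk with accurately reported degrees and with $ϕ_{k}$ read directly from a known adjacency structure. The function name, toy graph, and walk are hypothetical, and reading $Ψ_{g}$ as the mean of $d_{x_{k}} - 1$ is our interpretation of the formula.

```python
def hk_estimators(adj, walk):
    """Sketch of the Hardiman and Katzir LCC and GCC estimators for one walk."""
    r = len(walk)
    if r < 3:
        raise ValueError("need at least three steps to observe any phi_k")
    deg = [len(adj[x]) for x in walk]

    phi_l = 0.0
    phi_g = 0.0
    for k in range(1, r - 1):
        # phi_k = 1 only if the previous and next vertices are tied; in that
        # case x_k has at least two neighbors, so deg[k] - 1 is never zero.
        if walk[k + 1] in adj[walk[k - 1]]:
            phi_l += 1.0 / (deg[k] - 1)
            phi_g += deg[k]
    phi_l /= r - 2
    phi_g /= r - 2

    psi_l = sum(1.0 / d for d in deg) / r  # mean reciprocal degree
    psi_g = sum(d - 1 for d in deg) / r    # mean of (degree - 1)

    lcc_hat = phi_l / psi_l
    gcc_hat = phi_g / psi_g if psi_g else 0.0
    return lcc_hat, gcc_hat


# Hypothetical toy graph (triangle A-B-C plus pendant D) and a short walk.
toy_adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
toy_walk = ["A", "B", "C", "D", "C", "A", "B"]
lcc_hat, gcc_hat = hk_estimators(toy_adj, toy_walk)
```

With only seven steps the estimates (about 0.764 and 0.700) sit well away from the toy graph's census values, which is unsurprising given that the estimators' favorable properties are asymptotic.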

Hardiman and Katzir (2013) used both analytic proofs and simulation to show that their proposed estimators are asymptotically unbiased with minimal variance for large random walk samples and that they produce more consistent results at any given sample size than other approaches that query each sampled node’s full ego network (counting ego network reports in the sample size). Although RDS does not rely on simple random walks, researchers may wish to apply these estimators to respondent-driven samples. Section 2.5 discusses RDS departures from RWS with special attention to the empirical contexts in which RDS studies are conducted. Within it, we propose new survey questions that researchers could use to estimate clustering via the Hardiman and Katzir estimators. We examine how these questions perform in six empirical surveys in Section 5.

2.5. RDS Departures from RWS

The Hardiman and Katzir estimators cannot immediately be applied to RDS studies in the field, because they were developed for RWS, which differs considerably in core assumptions. Deviations of RDS from RWS have been shown in prior work to bias other estimators, such as that of the population mean (Gile 2011; Merli, Moody, Smith, et al. 2015; Tomas and Gile 2011) and SV (Verdery, Mouw, et al. 2015), so we should not expect that a naive application of Hardiman and Katzir’s (2013) clustering coefficient estimators will yield viable estimates from empirical respondent-driven samples.

Table 1 summarizes eight RDS departures from RWS that may affect clustering estimation. A random walk sample of a network begins with selecting a single “seed” node, typically with probability proportionate to the steady-state probability, $π_{i} = d_{i} / 2 m$ , where $d_{i}$ is the degree of node $i$ in the population, and $m = (1 / 2) \sum_{i} d_{i}$ is the number of edges in the population (Lovász 1993). By contrast, most RDS protocols recommend initiating the sample by identifying, often by convenience, 8 to 10 members of the hidden population who are willing to participate, have large personal networks with other members of the target population, and are diverse with respect to relevant focal attributes, such as years injecting drugs (World Health Organization 2013:71–82). A first consequence of this distinction is that random walk samples lead to a single chain in a network (as in the hypothetical chain depicted in Figure 1), whereas respondent-driven samples start from multiple points and yield multiple chains. A second consequence is that respondent-driven samples often exhibit seed dependence, whereas random walk samples do not (Gile and Handcock 2010).
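As an illustration of the RWS side of this comparison, the sketch below draws a seed with probability $π_{i} = d_{i} / 2 m$ and then walks with replacement and without branching. The function names, toy graph, and random seed are hypothetical choices made for the example.

```python
import random


def steady_state_probabilities(adj):
    """pi_i = d_i / 2m, where 2m is the sum of all degrees."""
    two_m = sum(len(nbrs) for nbrs in adj.values())
    return {v: len(nbrs) / two_m for v, nbrs in adj.items()}


def random_walk_sample(adj, r, rng):
    """Single-chain walk of length r, sampled with replacement."""
    pi = steady_state_probabilities(adj)
    nodes = list(pi)
    # One seed, drawn proportionally to its steady-state probability.
    current = rng.choices(nodes, weights=[pi[v] for v in nodes], k=1)[0]
    walk = [current]
    while len(walk) < r:
        # Move to a uniformly chosen neighbor; revisits are allowed.
        current = rng.choice(adj[current])
        walk.append(current)
    return walk


# Hypothetical toy graph stored as neighbor lists.
toy_adj = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
pi = steady_state_probabilities(toy_adj)
walk = random_walk_sample(toy_adj, r=50, rng=random.Random(2017))
```

Because the seed is already drawn from the steady-state distribution, every later step has the same marginal distribution, which is the property that convenience-selected RDS seeds lack.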

Table 1.

Comparison of Features of RWS and RDS

Feature	RWS	RDS
1. Number of seeds	One	Multiple
2. Seed selection	Proportional to steady state	Convenience
3. Branching	No	Yes
4. Replacement	Yes	No
5. Link-tracing efficacy	100%	<100%
6. Preferential recruitment	No, researcher controls	Yes, respondent controls
7. Sample size	Large (>10,000)	Small (<1,000)
8. Measurement of φ_k	Can be queried	Asked of respondent

Note: RDS = respondent-driven sampling; RWS = random walk sampling.

RWS and RDS also differ in their approach to tracing links. Random walk samples proceed without branching (i.e., having only one coupon), whereas respondent-driven samples almost always allow branching in practice through the distribution of two or three recruiting coupons to each respondent (Goel and Salganik 2009). Random walk samples are conducted with replacement, whereas RDS is conducted without replacement, which means that recruitment becomes competitive (Barash et al. 2016; Crawford 2016; Gile 2011; Gile and Handcock 2010; Heckathorn 1997). Other differences arise because RWS is researcher-driven (or algorithm-driven), while RDS is respondent-driven. In RDS, respondents must identify, approach, and successfully recruit peers, which can yield less than perfect link-tracing efficacy and introduce preferential recruitment (Merli, Moody, Smith, et al. 2015; Verdery, Merli, et al. 2015).

Sample size is another distinction because random walk samples are used in computer science or fields in which the cost of sampling additional individuals is low compared with RDS in human populations (Mouw and Verdery 2012). For instance, Hardiman and Katzir (2013) examined their estimators’ performance in four large networks with 1 percent samples of sizes $n = 9{,}780$, $n = 21{,}734$, $n = 30{,}724$, and $n = 48{,}440$. By contrast, Malekinejad et al. (2008) reported attained sample sizes for 63 RDS studies ranging from $n = 99$ to $n = 548$, with a median $n = 152$. A first consequence of smaller samples is that respondent-driven samples are more likely to contain finite sampling bias, even when assumptions are met, because the samples are too small for asymptotically unbiased RDS estimators to minimize bias. A second consequence of small respondent-driven samples is that they are likely to violate the RDS assumption that the sample is “in equilibrium,” a fact exacerbated by convenience sampling of seeds (Gile and Handcock 2010; Wejnert 2009). We note, however, that larger sample sizes have not been found to solve RDS’s core statistical problems (Verdery, Mouw, et al. 2015).

The final departure of RDS from RWS is anonymity, which pertains to the measurement of $ϕ_{k}$ , whether person $x$ ’s recruiter knows person $x$ ’s recruit. Unlike the situation in computer or online networks, for which it is comparatively easy to determine for each node $x_{k}$ in the random walk, whether the prior node, $x_{k - 1}$ , is tied to the subsequent node, $x_{k + 1}$ , this task is more challenging in a respondent-driven sample of a human population. One cannot seek $x_{k - 1}$ in a stored contact list of node $x_{k + 1}$ or otherwise backtrack the sample for direct measurement; rather, the existence of this tie must be elicited from respondents themselves during a period when the respondent is answering the survey, which can introduce measurement error and other challenges. The timing of recruitments and preservation of anonymity in RDS mean that (1) researchers cannot ask about recruitments that have not yet occurred (e.g., they cannot ask A whether he or she is tied to H in the random walk sample in Figure 1), and (2) researchers cannot divulge who recruited whom to respondents (e.g., they cannot tell H that A recruited E). The middle recruit is the only feasible person to ask about this tie’s existence in a respondent-driven sample (E in this example), although this requires E to report on a tie that exists between two of his alters and thus may introduce reporting error (a topic we examine below).

In many RDS surveys, a majority of respondents participate twice, once when they are recruited themselves (primary interview) and a second time when they return to the research site to collect additional incentives for successfully recruiting peers (secondary interview). Acknowledging this interview timing, we propose two questions researchers can ask RDS respondents to feasibly elicit information about potential ties between $x_{k - 1}$ and $x_{k + 1}$ :

A. In the secondary interview: “Does the person who gave you the coupon know the person who you gave the coupon to or vice versa?” (We refer to this from here on as the binary question format.)

B. In the primary or secondary interview: “What percent of people who you know in the population does the person who gave you the coupon know?” (We refer to this from here on as the percentage question format.)¹

The binary question format garners the exact information required by the Hardiman and Katzir estimators, but it relies on the accuracy of respondent reports about recruiter-recruit relationships. It can also be estimated only on a subset of sampled cases, as it cannot be asked until the secondary interview (after recruitment). The percentage question format differs from Hardiman and Katzir’s (2013) suggested approach, but it can be asked during either the main survey (of all respondents) or the follow-up interview (of the subset of respondents who recruit). If asked in both, researchers can check test-retest validity and potentially diagnose respondent comprehension problems. Of course, there are other possible ways to ask such questions in RDS surveys, but our proposed approaches are flexible in terms of implementation and preserve the desirable confidentiality of standard RDS studies.

3. Data and Methods

3.1. Approach

We begin by evaluating the performance of Hardiman and Katzir’s (2013) estimators applied to RDS through simulation methods. We aim to understand the effects of increasingly large departures from RWS, toward more realistic situations encountered within RDS data collection. To do this, we simulate data collection from underlying population social networks. It is notoriously difficult to obtain analytical results for RDS estimators, which is why many prior developments have tested proposed estimators through simulation. We test scenarios driven by data collection parameters to match how RDS departs from RWS, drawing 1,000 samples in each scenario. It is important to draw multiple samples per scenario to determine the estimators’ distributional properties (bias, SV, and total error). For each simulated sample, we calculate the Hardiman and Katzir LCC and GCC estimators implemented with both question formats we proposed. We compare these sample estimates with the parameters in the population social network (or as would be calculated in a census). After examining how Hardiman and Katzir’s estimators perform in simulations, we evaluate their feasibility in six empirical respondent-driven samples.
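The three distributional properties can be computed from a scenario's repeated estimates along the following lines. This is a sketch under stated assumptions: we take total error to be root mean squared error, which the text above does not define, and we use the population variance across draws so that squared error decomposes exactly into squared bias plus SV. The function name and toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def summarize_estimator(estimates, truth):
    """Bias, sampling variance, and RMSE of repeated sample estimates."""
    estimates = list(estimates)
    bias = mean(estimates) - truth
    sampling_variance = pvariance(estimates)  # variance across simulated draws
    # Assumption: total error measured as root mean squared error.
    rmse = sqrt(mean([(e - truth) ** 2 for e in estimates]))
    return {"bias": bias, "sv": sampling_variance, "rmse": rmse}


# Hypothetical estimates from four simulated samples; not results from the study.
toy_estimates = [0.50, 0.70, 0.60, 0.65]
summary = summarize_estimator(toy_estimates, truth=0.60)
```

In an actual run, `estimates` would hold the 1,000 values of one estimator within one scenario, and `truth` would be the corresponding census value for the population network.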

3.2. Data

We first simulate link-tracing samples from a hidden population social network of heterosexuals, sex workers, and injecting drug users at elevated risk for HIV/AIDS collected beginning in 1987 as part of the Project 90 study in Colorado Springs, Colorado (Klovdahl et al. 1994; Potterat et al. 2004; Rothenberg et al. 1995; Woodhouse et al. 1994). The project aimed to assess how network structure affected disease transmission, and, as such, the researchers sought to obtain a census of the hidden population and their links to one another. These data have been used in prior RDS assessments (Goel and Salganik 2010) and are made available to researchers through the Office of Population Research at Princeton University (2015). We focus on 4,111 individuals linked by 17,164 ties that remain in the network’s largest weakly connected component after dropping cases lacking valid attribute codes. Figure 2 shows the network linking members of this population, with nodes shaded by a key structuring variable (white/nonwhite). Whites make up 74.7 percent of network members, while 17.1 percent of ties cross race categories. Nodes of different races group together in different parts of the figure, but there are many cross-group links.

Figure 2.

Largest weakly connected component of Project 90 data set; nodes shaded by race (gray = white, black = nonwhite) and sized by degree. The network is displayed using the ForceAtlas2 algorithm, with no node overlap, in Gephi 0.9.

To understand how the Hardiman and Katzir estimators perform across a range of networks, we also examine additional networks from a data set of 100 Facebook networks collected in 2005, which have also been subject to intensive examinations in prior simulation evaluations of RDS (Mouw and Verdery 2012; Verdery, Mouw, et al. 2015). Importantly, because they were collected when Facebook was new and membership restricted to those with college e-mail addresses, researchers have argued that these networks represent realistic, offline social and interaction networks (Clouston et al. 2009; Traud et al. 2011; Traud, Mucha, and Porter 2012). We restrict analysis to 29 university networks in which the largest connected component of users with valid attribute codes contained between 5,000 and 10,000 nodes, size restrictions we put in place to avoid without-replacement sampling effects (Barash et al. 2016) and to maintain computational tractability. Table 2 provides summary statistics for the Project 90 network and the Facebook networks. The Project 90 network is smaller, less dense, more clustered, and less homophilous than the Facebook networks.

Table 2.

Summary Network Statistics for Data Sets Analyzed in This Study

Network	Nodes	Edges	Density	GCC	LCC	Cross-group Ties^a
Project 90	4,111	34,328	.002	.657	.348	.171
Facebook networks^b
Minimum	4,985	212,114	.004	.200	.135	.015
25th percentile	5,930	367,486	.008	.216	.152	.032
Median	6,877	503,939	.013	.231	.167	.038
75th percentile	7,840	705,501	.014	.241	.179	.054
Maximum	9,693	905,428	.017	.276	.199	.163

Note: GCC = global clustering coefficient; LCC = local clustering coefficient.

^a Cross-group ties refer to ties that cross white/nonwhite categories in Project 90 and ties that cross freshman/nonfreshman categories in the Facebook networks.

^b Statistics presented for the Facebook networks are computed separately; the largest network does not necessarily have the largest proportion of cross-group ties, for instance.

3.3. Scenarios

We provide a replication file for researchers interested in replicating and expanding our scenarios for the Project 90 network, which are publicly available data. In both data sets, we focus on five scenarios designed to test the bias, SV, and error of Hardiman and Katzir’s (2013) estimators when used with standard RDS protocols as opposed to simple RWS. Table 3 shows what key features we manipulate in each scenario. We first simulate collecting simple random walks (“RWS baseline”). These scenarios begin from a single seed selected with steady state probabilities, are conducted with replacement, do not branch, experience 100 percent link-tracing efficacy without preferential recruitment, and do not contain any measurement error for $ϕ_{k}$ .
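The RWS baseline described above can be illustrated with a minimal sketch. It is not the authors' replication code: the function name, the adjacency-dictionary input, and the toy graph are hypothetical, and the network is treated as undirected so that the steady-state seed probabilities are proportional to degree.

```python
import random

def random_walk_sample(adj, n_steps, rng):
    """Simple with-replacement random walk on an undirected graph.

    adj maps each node to a list of its neighbors (hypothetical input format).
    The seed is drawn with probability proportional to degree, which is the
    stationary distribution of a simple random walk on a connected,
    undirected graph.
    """
    nodes = list(adj)
    degrees = [len(adj[v]) for v in nodes]
    current = rng.choices(nodes, weights=degrees, k=1)[0]
    walk = [current]
    for _ in range(n_steps - 1):
        # One recruit per step (no branching); revisits are allowed.
        current = rng.choice(adj[current])
        walk.append(current)
    return walk

# Toy four-node graph, for illustration only.
toy = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
walk = random_walk_sample(toy, 400, random.Random(0))
```

Because the walk is conducted with replacement, the same node can appear many times in `walk`, which is one of the features the RDS scenarios relax.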

Table 3.

Parameters Used in Each Simulation Scenario

Scenario	Seeds	Selection	Replace	Branches	Efficacy (%)	Preferential	Error (%)
RWS baseline	1	Steady state	Yes	1	100	No	0
RDS baseline	10	Convenience	No	3	100	No	0
+imperfect (80% efficacy)	10	Convenience	No	3	80	No	0
+preferences (targeted recruitment)	10	Convenience	No	3	80	Yes	0
+misreporting (φ_k mismeasurement)	10	Convenience	No	3	80	Yes	10

Note: RDS = respondent-driven sampling; RWS = random walk sampling.

We then selectively relax parameters until the samples resemble the standard RDS protocol. We start with a scenario designed to mimic an ideal case of RDS constrained by the method’s actual implementation in the field (“RDS baseline”). The samples in this scenario begin from 10 seeds selected via convenience sampling (implemented as uniform random seed selection in the main text; in online Appendix A, we consider four other seed selection scenarios and find that they did not alter our results), are conducted without replacement (recruitment is competitive between respondents), and may branch up to three ways from each respondent (i.e., each respondent is simulated as having three coupons); respondents always approach and succeed in recruiting peers who have not already been sampled (i.e., 100 percent recruitment efficacy), selecting them at random among the sets of their friends who have not participated (no preferences); and respondents accurately report the items used to measure $ϕ_{k}$ (either the presence or absence of a tie between their recruiter and his or her recruit for the binary question format or the percentage of his or her potential recruits known by their recruiter for the percentage question format). This RDS baseline scenario subsumes the first four ways RDS departs from RWS, as listed in Table 1.

We next examine the fifth through seventh ways RDS departs from RWS. We look at how less than perfect recruitment efficacy affects estimates by considering a scenario in which only 80 percent of offered coupons are accepted by the targeted peer (“+imperfect”). We then test the effects of preferential recruitment (“+preferences”), modeling it as a case in which all respondents are half as likely to offer coupons to certain types of peers (to white peers in the Project 90 network and freshmen in the Facebook networks). Finally, we examine what happens when respondents misreport recruiter-recruit ties (“+misreporting”), that is, when $ϕ_{k}$ is measured with error. For the binary question format in which respondents report on the presence or absence of a tie between their recruiter and recruit, we subject each report to a 10 percent random chance of being misreported (ties reported as nonties or nonties reported as ties). For the percentage question format in which respondents report on the percentage of their network alters known by their recruiter, we randomly shift this number by up to ±10 percent from its true value (capping responses at 0 and 1).
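The two misreporting mechanisms can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the simulations: the function names are hypothetical, the shift is assumed to be uniform, and “±10 percent” is read as an additive shift of up to 0.10 on the 0 to 1 scale.

```python
import random

def misreport_binary(true_tie, rng, error_rate=0.10):
    """Return the reported tie status, flipped with probability error_rate."""
    return (not true_tie) if rng.random() < error_rate else bool(true_tie)

def misreport_percentage(true_share, rng, max_shift=0.10):
    """Shift a true share in [0, 1] by a uniform draw in
    [-max_shift, +max_shift] (assumed distribution), capped to [0, 1]."""
    shifted = true_share + rng.uniform(-max_shift, max_shift)
    return min(1.0, max(0.0, shifted))

rng = random.Random(0)
reported_tie = misreport_binary(True, rng)
reported_share = misreport_percentage(0.95, rng)
```

Capping matters mostly for respondents whose true share is near 0 or 1, where the error can only move the report toward the interior of the interval.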

In all simulated samples we assume respondents accurately report degree. Although sample size marks a key way in which RDS departs from RWS, we hold target sample sizes constant at 400, which is a small fraction of the population sizes we examine. We found that target sample sizes were attained in all scenarios, which reviews of RDS indicate happens frequently (Malekinejad et al. 2008).
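The recruitment process in the RDS scenarios can be sketched in a few lines. This is a simplified illustration, not the authors' simulation code: the function and argument names are hypothetical, recruiters are processed in first-in, first-out order, and a peer who refuses a coupon is assumed to remain eligible for later offers.

```python
import random
from collections import deque

def simulate_rds(adj, group, rng, n_seeds=10, coupons=3, target=400,
                 efficacy=0.8, disfavored=None, weight=0.5):
    """Without-replacement, branching link-tracing sample (illustrative).

    adj: node -> list of neighbors; group: node -> category label.
    Returns (recruit, recruiter) pairs; seeds have recruiter None.
    """
    seeds = rng.sample(list(adj), n_seeds)   # uniform random seed selection
    sampled = set(seeds)
    records = [(s, None) for s in seeds]
    queue = deque(seeds)
    while queue and len(records) < target:
        recruiter = queue.popleft()
        for _ in range(coupons):             # at most `coupons` offers each
            if len(records) >= target:
                break
            eligible = [v for v in adj[recruiter] if v not in sampled]
            if not eligible:
                break
            # Preferential recruitment: disfavored peers get reduced weight.
            weights = [weight if group[v] == disfavored else 1.0
                       for v in eligible]
            peer = rng.choices(eligible, weights=weights, k=1)[0]
            if rng.random() < efficacy:      # coupon accepted
                sampled.add(peer)
                records.append((peer, recruiter))
                queue.append(peer)
    return records

# Toy ring lattice with 200 nodes, for illustration only.
ring = {i: [(i + d) % 200 for d in (-2, -1, 1, 2)] for i in range(200)}
labels = {i: "x" if i % 2 else "y" for i in ring}
records = simulate_rds(ring, labels, random.Random(1), target=50,
                       disfavored="x")
```

Setting `efficacy=1.0` and `disfavored=None` corresponds to the RDS baseline; lowering `efficacy` and naming a disfavored group layer on the imperfect-efficacy and preferential-recruitment departures in turn.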

3.4. Measures

We measure the performance of Hardiman and Katzir’s (2013) clustering coefficient estimators with three indicators. For each of the question formats (binary or percentage) of each of the estimators (GCC or LCC) in each scenario, we calculate (1) their bias, defined as $bias = a^{-1} \sum_{i=1}^{a} (\hat{c}_{i} - C)$ , where $a$ is the number of simulated samples, $\hat{c}_{i}$ is the estimate from the $i$th sample, and $C$ is the population value; (2) their SV, defined as $SV = a^{-1} \sum_{i=1}^{a} (\hat{c}_{i} - a^{-1} \sum_{j=1}^{a} \hat{c}_{j})^{2}$ ; and (3) their root mean squared error (RMSE), defined as $RMSE = \sqrt{bias^{2} + SV}$ .
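The three indicators can be computed from a list of simulated estimates as in the sketch below. The function name and the example estimates are hypothetical; the population value 0.657 is the Project 90 GCC from Table 2.

```python
from math import sqrt

def performance(estimates, true_value):
    """Return (bias, SV, RMSE) for estimates from a simulated samples."""
    a = len(estimates)
    mean_est = sum(estimates) / a
    bias = sum(c - true_value for c in estimates) / a
    # SV uses the 1/a divisor, matching the definition in the text.
    sv = sum((c - mean_est) ** 2 for c in estimates) / a
    rmse = sqrt(bias ** 2 + sv)
    return bias, sv, rmse

# Three made-up estimates, for illustration only.
bias, sv, rmse = performance([0.6, 0.7, 0.8], 0.657)
```

With the 1/a divisor in SV, the quantity under the square root equals the mean squared deviation of the estimates from the population value, so RMSE here is the usual root mean squared error.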

4. Simulation Results

We first consider the distribution of estimates for both the GCC and LCC calculated via the binary and percentage question formats in the baseline RWS scenario on the Project 90 network. Figure 3 shows that both estimators, using either question format, exhibit minimal bias that arises because of finite sample sizes. The LCC estimator is less biased than the GCC estimator ( $GCC binary bias = 0.017$ ; $LCC binary bias = 0.009$ ; $GCC percent bias = 0.017$ ; $LCC percent bias = 0.008$ ). SV is approximately equivalent across estimators and question formats ( $GCC binary SV = 0.010$ ; $LCC binary SV = 0.008$ ; $GCC percent SV = 0.009$ ; $LCC percent SV = 0.007$ ). Considering both bias and SV simultaneously, we find that the LCC percentage estimator performs the best and that the percentage question form has slightly lower error ( $GCC binary RMSE = 0.102$ ; $LCC binary RMSE = 0.092$ ; $GCC percent RMSE = 0.097$ ; $LCC percent RMSE = 0.083$ ).
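As a rough consistency check, the RMSE definition from Section 3.4 can be applied to the rounded bias and SV values reported above. Because the inputs are rounded to three decimals, small discrepancies from the reported RMSEs are expected:

$$\sqrt{0.017^{2} + 0.010} = \sqrt{0.010289} \approx 0.101 \quad (\text{reported GCC binary RMSE: } 0.102)$$

$$\sqrt{0.008^{2} + 0.007} = \sqrt{0.007064} \approx 0.084 \quad (\text{reported LCC percent RMSE: } 0.083)$$

In both cases the squared bias (about 0.0003 and 0.0001, respectively) is small next to SV, so in the RWS baseline the RMSE is driven almost entirely by sampling variance.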

Figure 3.

Performance of Hardiman and Katzir estimators by estimator and question format in random walk sampling (RWS) on the Project 90 data set.

We next examine the distribution of estimates in realistic respondent-driven samples and what features of RDS lead to performance deterioration compared with the RWS baseline scenario. Figure 4 shows that in the Project 90 network the GCC estimated using the binary question format performs poorly in each of the RDS scenarios, underestimating the population parameter substantially (GCC binary bias by scenario is $RDS baseline = - 0.132$ , $+ imperfect = - 0.127$ , $+ preferences = - 0.130$ , and $+ misreporting = - 0.067$ ). Underestimation begins with the RDS baseline scenario and persists, which indicates that problems for this estimator arise from the use of multiple seeds, convenience seed selection, without-replacement design, and/or branching. Because we do not see comparable biases in the percentage format under these scenarios (GCC percentage bias by scenario is $RDS baseline = - 0.010$ , $+ imperfect = - 0.007$ , $+ preferences = - 0.008$ , and $+ misreporting = - 0.006$ ), we attribute this bias to the binary question format’s restrictions on effective sample size because this format is asked only of nonseed respondents who recruit others, while the percentage format can be asked of any nonseed sample participant.

Figure 4.

Performance of Hardiman and Katzir estimators by estimator and question format in random walk sampling (RWS) and respondent-driven sampling (RDS) scenarios on the Project 90 data set.

The LCC estimators perform well in Figure 4. The binary question format of the LCC slightly overestimates clustering (LCC binary bias by scenario is $RDS baseline = 0.039$ , $+ imperfect = 0.044$ , $+ preferences = 0.038$ , and $+ misreporting = 0.019$ ), while the percentage form slightly underestimates it (LCC percentage bias by scenario is $RDS baseline = - 0.019$ , $+ imperfect = - 0.016$ , $+ preferences = - 0.016$ , and $+ misreporting = - 0.015$ ).

Estimates obtained in all RDS scenarios in the Project 90 network exhibit low SV (ranging from 0.001 to 0.003), substantially lower than was found for the RWS scenarios. This result follows from the without-replacement design of RDS, which tends to yield lower SV than the with-replacement design of RWS. RMSEs in the worst case scenarios, which contain all RDS deviations from RWS that we examine, are lower than we found for the RWS baseline scenarios in all cases. In the +misreporting scenarios, RMSEs are $GCC binary RMSE = 0.076$ ; $LCC binary RMSE = 0.057$ ; $GCC percent RMSE = 0.034$ ; $LCC percent RMSE = 0.045$ .

We next turn to results in the Facebook networks. Table 4 shows how absolute values of bias (“absolute bias”) and RMSEs are distributed within these networks by estimator and question format in three focal scenarios (RWS baseline, RDS baseline, and RDS misreporting). We display these scenarios because the +imperfect and +preferences scenarios made little difference in the results. We do not show the low SV we found in all scenarios for the Facebook networks (a maximum of 0.004 across networks in any scenario). The estimators exhibit almost no bias in the RWS baseline scenarios, with a maximum that is substantially lower than was seen in the Project 90 network. The RWS baseline scenario also tends to produce much lower RMSEs in these networks than it did in the Project 90 network.

Table 4.

Distributions of Absolute Bias Statistics and RMSEs in the 29 Facebook Networks Studied by Scenario, Estimator, and Question Format

	Absolute Bias				RMSE
	GCC		LCC		GCC		LCC
	Binary	Percentage	Binary	Percentage	Binary	Percentage	Binary	Percentage
RWS baseline
Minimum	.000	.000	.000	.000	.019	.006	.040	.023
25th percentile	.000	.000	.001	.000	.022	.008	.046	.028
Median	.001	.000	.001	.001	.023	.008	.048	.030
75th percentile	.001	.000	.002	.001	.024	.009	.052	.032
Maximum	.002	.000	.005	.003	.027	.012	.058	.040
RDS baseline
Minimum	.012	.006	.002	.000	.025	.010	.051	.025
25th percentile	.019	.009	.010	.004	.031	.014	.055	.029
Median	.020	.011	.012	.007	.033	.016	.057	.033
75th percentile	.026	.013	.017	.009	.037	.020	.062	.038
Maximum	.041	.021	.025	.015	.051	.028	.074	.052
RDS misreporting
Minimum	.030	.002	.032	.001	.043	.009	.064	.025
25th percentile	.046	.006	.044	.003	.055	.012	.068	.028
Median	.050	.008	.048	.005	.057	.014	.071	.031
75th percentile	.054	.010	.051	.007	.061	.017	.074	.037
Maximum	.065	.016	.061	.014	.070	.024	.089	.055

Note: GCC = global clustering coefficient; LCC = local clustering coefficient; RDS = respondent-driven sampling; RMSE = root mean squared error; RWS = random walk sampling.

The RDS scenarios also yield lower bias in the Facebook networks than they did in the Project 90 network, with maximum observed values all lower in these networks. In terms of bias, the Facebook networks indicate that the binary measures are the most biased, with the LCC being less biased than the GCC. The Facebook networks also have lower RMSEs than the Project 90 network. In terms of RMSEs in the realistic RDS scenarios, results from the Facebook networks suggest that the percentage question format is preferable to the binary format and that the GCC is slightly preferred over the LCC after accounting for SV (recall that the LCC had lower bias). In total, median RMSEs observed in the RDS scenarios in the Facebook networks are only slightly larger than the median RMSEs obtained in the RWS baseline scenarios, which indicates that the clustering coefficient estimators maintain reasonable properties for application to respondent-driven samples.

5. Application of Data Collection Instruments in Six Empirical Surveys

We now discuss six empirical RDS surveys collected in diverse hidden populations in multiple countries by different research teams that asked respondents the types of questions needed to estimate network clustering. Two studies examined female sex workers in China, two examined people who inject drugs in the Philippines, one study examined people who inject drugs in Canada, and the last survey, which contained both of our proposed question formats, looked at vegetarians and vegans in Argentina. For the sake of brevity, we omit full descriptions of these studies in the main text but provide complete details in online Appendix B. We focus on the proportion of invalid item responses (“Invalid %”) in each survey across question formats, where we define invalid responses as cases in which respondents did not answer the question, gave responses of “don’t know,” or otherwise offered evidence that they did not understand or wish to answer the question. We also compare the mean values of valid responses (“Mean of valid”) between relevant survey pairs (comparing the two surveys in China with each other and the two surveys in the Philippines with each other), and within individuals who answered both types of questions in the survey in Argentina.

Table 5 summarizes the item response patterns in these empirical surveys. Respondents were much more likely to give invalid responses to the binary question format than to the percentage question format. More speculatively, we can make some claims about conceptual validity by examining the cross-site concordance in the means of valid responses within the two sets of paired surveys. For instance, the means of valid responses in the female sex worker surveys collected by overlapping research teams in two cities in China are moderate (23.2 percent to 42.3 percent), while means of valid responses for the two surveys of persons who inject drugs in Philippine cities are much higher (78.7 percent to 91.7 percent). We take these findings to indicate that the survey questions are measuring consistent phenomena. In addition, we find nearly identical means of valid responses between the two question formats implemented in the Argentina survey. Here, both the percentage and binary measures found raw clustering levels in the 30.1 percent to 32.0 percent range, and we determined that the respondent-specific average of binary format versus percentage format reports had a Spearman’s correlation of .445, while the item-specific reports with potentially multiple binary reports per respondent had a polyserial correlation of .376. These correlations suggest a reasonably high level of agreement between question formats, even in the face of large amounts of missing data. Taken together, these results indicate that the questions tap into valid concepts, but they add another reason that researchers should prioritize implementing the percentage question format: respondents seem more willing or able to answer it.

Table 5.

Summary of Item Response Rates for Clustering Questions in Empirical Surveys

Survey Location	Population	Format	Reports^a	Invalid %	Mean of Valid (%)
Shanghai, China	FSW	Percentage	515	.0	23.2
Liuzhou, China	FSW	Percentage	576	.5	42.3
Cebu, Philippines	PWID	Binary	380	14.2	78.7
Mandaue, Philippines	PWID	Binary	291	8.3	91.7
Ottawa, Canada	PWID	Percentage^b	364	11.5	67.0
La Plata, Argentina	Veg	Percentage	145	5.5	32.0
La Plata, Argentina	Veg	Binary	131	36.6	30.1

Note: FSW = female sex workers; PWID = persons who inject drugs; Veg = self-identifying vegetarians and vegans.

^a We refer to reports rather than sample size because some respondents report on multiple relationships for the binary questions.

^b The format used in the Ottawa study is an interaction grid in which respondents identify which peers know one another; see online Appendix A.

6. Discussion and Conclusion

Sociological interest in marginalized populations means that researchers often confront situations where traditional sampling methods cannot be used. In such examples, the peer-driven recruitment procedures of RDS yield large and diverse samples quickly and cheaply, while maintaining respondent anonymity, which is why researchers have used this method to sample hundreds of stigmatized, sensitive, and hidden groups. Prior methodological research on RDS has focused on its estimators of the population mean and avoided examining how it may reveal other interesting features of hidden populations of relevance to sociology and public health (with a few notable exceptions, such as Crawford 2016 and Wejnert 2010). This avoidance is strategic: practical considerations limit researchers’ ability to uncover many aspects of the underlying population social network. In this article, we propose new data collection protocols and estimators for RDS that allow researchers to examine clustering, a social network feature of broad interest. We began by considering estimators of network clustering developed in computer science for RWS and expanded their application to the case of human populations sampled with RDS, with careful attention to practical differences between RDS and RWS. We offer data collection protocols in the form of two different question formats RDS surveys could adopt in the field to estimate network clustering, and we study how these question formats perform under two clustering coefficient estimators in simulations as well as their implementation challenges in six empirical surveys.

Overall, we recommend that researchers using RDS surveys begin asking respondents the types of questions that would allow clustering coefficient estimation. Although RDS estimators of the population mean often fail in the face of unmet assumptions about sample recruitment (Gile and Handcock 2010; Goel and Salganik 2010; Lu et al. 2013; Lu et al. 2012; McCreesh et al. 2012; Merli, Moody, Smith, et al. 2015; Tomas and Gile 2011; Verdery, Mouw, et al. 2015), we find that the clustering coefficient estimators we studied perform well even when core RDS assumptions are violated. Considering the two question formats we proposed, we also find that the percentage question format can be asked of more respondents, yielded better results in a simulation study, and appeared to be better understood by respondents in empirical studies. The two clustering estimators perform similarly, but the GCC estimator had lower total errors than the LCC estimator in most networks we studied. However, the contribution of SV to RMSE drives this result, so researchers concerned about bias may prefer to stick to the LCC estimator, which we found tends to exhibit lower bias.

We hope that methods for estimating clustering coefficients from RDS data will spur additional substantive and methodological contributions. Substantively, clustering is a core property that distinguishes human social networks from random graphs (Watts and Strogatz 1998), and many researchers have posited that it plays a role in the transmission of diseases and the adoption of behaviors through networks (e.g., Centola 2010; Eguíluz and Klemm 2002). These theories consist of a set of structural hypotheses, where the structure of the entire network makes it more or less conducive to diffusion, and they have been supported by results from mathematical models and some experiments. For example, such models suggest that, ceteris paribus, moving from low to moderate clustering of the risk network increases transmission (Keeling and Eames 2005), but moving from moderate to high clustering does not change transmission substantially until very high levels, when the network becomes disconnected (Newman 2003). Using clustering coefficients from RDS data could allow researchers to confirm the insights of these mathematical models of network structure and disease diffusion with macro-comparative methods.² In this vein, for instance, researchers might compare a set of similar populations sampled with RDS over multiple time points to examine whether changes in clustering levels are associated with changes in the prevalence of infectious diseases, such as HIV/AIDS. Clustering in the social network may be associated with differences in risk behaviors such as unprotected sex at the individual level. Prior research has found that network clustering moderates effects of peer contraceptive users in the use of fertility control (Kohler et al. 2001) but that such normative reinforcement can also facilitate the spread of unhealthy behaviors (Yamanis et al. 2015). Previous studies of this topic have been limited to traditional survey populations, however, and the approaches developed in this article will enable researchers to test these hypotheses in a more diverse series of hidden populations.

In addition, estimators of network clustering can offer methodological improvements to RDS. A first methodological extension could provide additional data to inform variants of RDS mean estimators that use exponential random graph modeling and algorithmic simulation in an effort to obtain less biased, lower variance results (Gile and Handcock 2011). Currently, these approaches model clustering as a by-product of dyadic homophily, divorced from assessments of clustering levels in the population of interest. With empirical estimates of clustering, researchers using such algorithms could confirm the clustering coefficients produced in their models. Such information may enhance the realism of the model-based approach and increase confidence in its bias and variance reductions.

A second methodological contribution could allow researchers to test one of the most central but least often evaluated assumptions of RDS, that the network contains a “giant component” whereby the vast majority of people are reachable through chains of arbitrary length through the network ties (Volz and Heckathorn 2008). Using random graph methods from the physics and computer-science traditions that generate network structures from degree distributions and clustering coefficients (Newman et al. 2001; Heath and Parikh 2011), researchers may also be able to determine if they are sampling a network with “bottlenecks,” that is, a grouping in which there are few links between cohesive groups in the network, a feature that many in the RDS community link to poor estimate quality (Toledo et al. 2011). This would add to the emerging diagnostic toolkit being developed for RDS (Gile et al. 2015). A related extension of this approach could calculate the “structural risk” of a network sampled with RDS by applying percolation or other diffusion models to examine the size and speed of hypothetical epidemics spreading on the modeled network (Britton et al. 2008; Merli, Moody, Mendelsohn, et al. 2015), a potential early warning system of a given hidden population’s epidemic potential gathered directly from RDS.

Such extensions and future directions lie outside of the scope of the present research. However, we emphasize that we view the development of clustering estimators for RDS data as the beginning of a new line of inquiry about how estimates of the topological features of networks sampled with RDS can inform substantive and methodological interests. The benefits of estimating clustering in respondent-driven samples are large, and we encourage researchers to begin deploying survey questions needed for their calculation. In any case, further attention to the ability of RDS to tell us more about hidden populations than disease prevalence is an important next step for the literature to take.

Footnotes

Acknowledgements

We thank M. Giovanna Merli, Ann Jolly, and Anne DeLessio-Parson for providing information about aspects of the empirical cases we examine.

Funding

We acknowledge assistance provided by the Population Research Institute, which is supported by an infrastructure grant from the Eunice Kennedy Shriver National Institute of Child Health and Human Development (R24-HD041025), and from a seed grant provided by the Institute for CyberScience at Pennsylvania State University. Portions of this research were funded by National Center for Health Statistics grant 1R03SH000056-01 (Ashton M. Verdery, principal investigator).

Notes

Author Biographies

Ashton M. Verdery is an assistant professor of sociology and demography at Pennsylvania State University and an affiliate of the Population Research Institute, the Institute for CyberScience, and the Justice Center for Research. He holds a PhD in sociology from the University of North Carolina at Chapel Hill. His research focuses on social networks, quantitative methods, and population dynamics.

Jacob C. Fisher is a postdoctoral associate at Duke University. He holds a PhD in sociology and an MS in statistical science from Duke University. He specializes in social networks, quantitative methods, and computational social science.

Nalyn Siripong is a consultant for the East-West Center. She holds a PhD in epidemiology from the University of North Carolina at Chapel Hill and an MS in health economics from Chulalongkorn University. Her research focuses on injecting drug use, HIV, and social networks.

Kahina Abdesselam is a PhD candidate at the University of Ottawa, Faculty of Medicine, School of Epidemiology, Public Health and Preventive Medicine. She specializes in infectious disease and epidemiology, and she is currently an epidemiologist for the Public Health Agency of Canada.

Shawn Bauldry is an assistant professor of sociology at Purdue University. He holds a PhD in sociology and an MS in statistics from the University of North Carolina at Chapel Hill. His research focuses on the development of structural equation models, health disparities, and multigenerational processes.

References

Baraff

Aaron J.

McCormick

Tyler H.

Raftery

Adrian E.

2016. “Estimating Uncertainty in Respondent-driven Sampling Using a Tree Bootstrap Method.” Proceedings of the National Academy of Sciences 113(51):14668–73.

Barash

Vladimir D.

Cameron

Christopher J.

Spiller

Michael W.

Heckathorn

Douglas D.

2016. “Respondent-driven Sampling—Testing Assumptions: Sampling with Replacement.” Journal of Official Statistics 32(1):29–73.

Britton

Tom

Deijfen

Maria

Lagerås

Andreas N.

Lindholm

Mathias

. 2008. “Epidemics on Random Graphs with Tunable Clustering.” Journal of Applied Probability 45(3):743–56.

Centola

Damon

. 2010. “The Spread of Behavior in an Online Social Network Experiment.” Science 329(5996):1194–97.

Centola

Damon

Macy

Michael

. 2007. “Complex Contagions and the Weakness of Long Ties.” American Journal of Sociology 113 (3):702–34.

Clouston

S. P.

Verdery

A. M.

Amin

Sara

Gauthier

G. Robin

. 2009. “The Structure of Undergraduate Association Networks: A Quantitative Ethnography.” Connections 29(2):18–31.

Crawford

Forrest W.

2016. “The Graphical Structure of Respondent-driven Sampling.” Pp. 187–211 in Sociological Methodology, Vol. 46, edited by Alwin

Duane F.

Thousand Oaks, CA: Sage.

Crawford

Forrest W.

Aronow

Peter M.

Zeng

Jianghong

. 2015. “Identification of Homophily and Preferential Recruitment in Respondent-driven Sampling.” arXiv:1511.05397 [Stat], November. Retrieved June 10, 2017 (http://arxiv.org/abs/1511.05397).

Eguíluz

Víctor M.

Klemm

Konstantin

. 2002. “Epidemic Threshold in Structured Scale-free Networks.” Physical Review Letters 89(10):108701.

10.

Fisher

Jacob C.

Merli

M. Giovanna

. 2014. “Stickiness of Respondent-driven Sampling Recruitment Chains.” Network Science (02):298–301.

11.

Gile

Krista J.

2011. “Improved Inference for Respondent-driven Sampling Data with Application to HIV Prevalence Estimation.” Journal of the American Statistical Association 106(493):135–46.

12.

Gile

Krista J.

Handcock

Mark S.

2010. “Respondent-driven Sampling: An Assessment of Current Methodology.” Pp. 285–327 in Sociological Methodology, Vol. 40, edited by Liao

Tim Futing

. Thousand Oaks, CA: Sage.

13.

Gile

Krista J.

Handcock

Mark S.

2011. “Network Model-assisted Inference from Respondent-driven Sampling Data.”arXiv Preprint arXiv: 1108.0298.

14.

Gile

Krista J.

Johnston

Lisa G.

Salganik

Matthew J.

2015. “Diagnostics for Respondent-driven Sampling.” Journal of the Royal Statistical Society, Series A, Statistics in Society 178(1):241–69.

15.

Goel

Sharad

Salganik

Matthew J.

2009. “Respondent-driven Sampling as Markov Chain Monte Carlo.” Statistics in Medicine 28(17):2202–29.

16.

Goel

Sharad

Salganik

Matthew J.

2010. “Assessing Respondent-driven Sampling.” Proceedings of the National Academy of Sciences 107(15):6743–47.

17.

Goodman

Leo A.

1961. “Snowball Sampling.” Annals of Mathematical Statistics 32(1):148–70.

18.

Hardiman

Stephen J.

Katzir

Liran

. 2013. “Estimating Clustering Coefficients and Size of Social Networks via Random Walk.” Pp. 539–50 in Proceedings of the 22nd International Conference on World Wide Web. International World Wide Web Conferences Steering Committee.

19.

Heath

Lenwood S.

Parikh

Nidhi

. 2011. “Generating Random Graphs with Tunable Clustering Coefficients.” Physica A: Statistical Mechanics and Its Applications 390(23):4577–87.

20.

Heckathorn

Douglas D.

1997. “Respondent-driven Sampling: A New Approach to the Study of Hidden Populations.” Social Problems 44(2):174–99.

21.

Karim

Q. Abdool

Meyer-Weitz

Mboyi

Carrara

Mahlase

Frohlich

J. A.

Abdool Karim

S. S.

2008. “The Influence of AIDS Stigma and Discrimination and Social Cohesion on HIV Testing and Willingness to Disclose HIV in Rural KwaZulu-Natal, South Africa.” Global Public Health 3(4):351–65.

22.

Keeling

Matt J.

Eames

Ken T. D.

2005. “Networks and Epidemic Models.” Journal of the Royal Society Interface 2(4):295–307.

23.

Klovdahl

Alden S.

Potterat

John J.

Woodhouse

Donald E.

Muth

John B.

Muth

Stephen Q.

Darrow

William W.

1994. “Social Networks and Infectious Disease: The Colorado Springs Study.” Social Science and Medicine 38(1):79–88.

24.

Kohler

Hans-Peter

Behrman

Jere R.

Watkins

Susan C.

2001. “The Density of Social Networks and Fertility Decisions: Evidence from South Nyanza District, Kenya.” Demography 38(1):43–58.

25.

Krivitsky

Pavel N.

Handcock

Mark S.

Morris

Martina

. 2011. “Adjusting for Network Size and Composition Effects in Exponential-family Random Graph Models.” Statistical Methodology 8(4):319–39.

26.

Lippman

Sheri A.

Donini

Angela

Díaz

Juan

Chinaglia

Magda

Reingold

Arthur

Kerrigan

Deanna

. 2010. “Social-environmental Factors and Protective Sexual Behavior among Sex Workers: The Encontros Intervention in Brazil.” American Journal of Public Health 100(S1):S216–23.

27.

Lovász

László

. 1993. “Random Walks on Graphs: A Survey.” Combinatorics 2(1):1–46.

28.

Xin

. 2013. “Linked Ego Networks: Improving Estimate Reliability and Validity with Respondent-driven Sampling.” Social Networks 35(4):669–85.

29.

Xin

Bengtsson

Linus

Britton

Tom

Camitz

Martin

Jun Kim

Beom

Thorson

Anna

Liljeros

Fredrik

. 2012. “The Sensitivity of Respondent-driven Sampling.” Journal of the Royal Statistical Society, Series A, Statistics in Society 175(1):191–216.

30.

Xin

Malmros

Jens

Liljeros

Fredrik

Britton

Tom

. 2013. “Respondent-driven Sampling on Directed Networks.” Electronic Journal of Statistics 7:292–322.

31.

Malekinejad

Mohsen

Johnston

Lisa Grazina

Kendall

Carl

Kerr

Ligia Regina Franco Sansigolo

Rifkin

Marina Raven

Rutherford

George W.

2008. “Using Respondent-driven Sampling Methodology for HIV Biological and Behavioral Surveillance in International Settings: A Systematic Review.” AIDS and Behavior 12(1):105–30.

32.

Marsden

Peter V.

1987. “Core Discussion Networks of Americans.” American Sociological Review 52(1):122–31.

33. McCreesh, Nicky, Andrew Copas, Janet Seeley, Lisa G. Johnston, Pam Sonnenberg, Richard J. Hayes, Simon D. W. Frost, and Richard G. White. 2013. “Respondent-driven Sampling: Determinants of Recruitment and a Method to Improve Point Estimation.” PLoS ONE 8(10):e78402.

34. McCreesh, Nicky, Simon Frost, Janet Seeley, Joseph Katongole, Matilda Ndagire Tarsh, Richard Ndunguse, Fatima Jichi, Natasha L. Lunel, Dermot Maher, and Lisa G. Johnston. 2012. “Evaluation of Respondent-driven Sampling.” Epidemiology 23(1):138.

35. McPherson, Miller, Lynn Smith-Lovin, and Matthew E. Brashears. 2006. “Social Isolation in America: Changes in Core Discussion Networks over Two Decades.” American Sociological Review 71(3):353–75.

36. McPherson, Miller, Lynn Smith-Lovin, and James M. Cook. 2001. “Birds of a Feather: Homophily in Social Networks.” Annual Review of Sociology 27:415–44.

37. Merli, M. Giovanna, James Moody, Joshua Mendelsohn, and Robin Gauthier. 2015. “Sexual Mixing in Shanghai: Are Heterosexual Contact Patterns Compatible with an HIV/AIDS Epidemic?” Demography 52(3):919–42.

38. Merli, M. Giovanna, James Moody, Jeffrey Smith, Jing, Sharon Weir, and Xiangsheng Chen. 2015. “Challenges to Recruiting Population Representative Samples of Female Sex Workers in China Using Respondent-driven Sampling.” Social Science and Medicine 125(1):79–93.

39. Milgram, Stanley. 1967. “The Small World Problem.” Psychology Today 2(1):60–67.

40. Moody, James. 2002. “The Importance of Relationship Timing for Diffusion.” Social Forces 81(1):25–56.

41. Moody, James, and Richard A. Benton. 2016. “Interdependent Effects of Cohesion and Concurrency for Epidemic Potential.” Annals of Epidemiology 26(4):241–48.

42. Morris, Martina, Ann E. Kurth, Deven T. Hamilton, James Moody, and Steve Wakefield. 2009. “Concurrent Partnerships and HIV Prevalence Disparities by Race: Linking Science and Public Health Practice.” American Journal of Public Health 99(6):1023–31.

43. Mouw, Ted, and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Pp. 206–50 in Sociological Methodology, Vol. 42, edited by Tim Futing Liao. Thousand Oaks, CA: Sage.

44. Neely, William Whipple. 2009. “Statistical Theory for Respondent-driven Sampling.” PhD dissertation, University of Wisconsin–Madison. Retrieved June 10, 2017 (http://search.proquest.com.libproxy.lib.unc.edu/pqdtglobal/docview/305033289/abstract/96BB2CDA89994EB2PQ/1).

45. Nesterko, Sergiy, and Joseph Blitzstein. 2015. “Bias-variance and Breadth-depth Tradeoffs in Respondent-driven Sampling.” Journal of Statistical Computation and Simulation 85(1):89–102.

46. Newman, Mark E. J. 2003. “Properties of Highly Clustered Networks.” Physical Review E 68(2):026121.

47. Newman, Mark E. J., Steven H. Strogatz, and Duncan J. Watts. 2001. “Random Graphs with Arbitrary Degree Distributions and Their Applications.” Physical Review E 64(2):026118.

48. Office of Population Research, Princeton University. 2015. “Project 90: Partial Data.” Retrieved December 11, 2015 (http://opr.princeton.edu/archive/p90/).

49. Potterat, J. J., D. E. Woodhouse, S. Q. Muth, Rothenberg, W. W. Darrow, A. S. Klovdahl, and J. B. Muth. 2004. “Network Dynamism: History and Lessons of the Colorado Springs Study.” Pp. 87–114 in Network Epidemiology: A Handbook for Survey Design and Data Collection, edited by Morris. New York: Oxford University Press.

50. Rhodes, Tim, and Milena Simic. 2005. “Transition and the HIV Risk Environment.” BMJ 331(7510):220–23.

51. Rothenberg, Richard B., Donald E. Woodhouse, John J. Potterat, Stephen Q. Muth, William W. Darrow, and Alden S. Klovdahl. 1995. “Social Networks in Disease Transmission: The Colorado Springs Study.” NIDA Research Monograph 151:3–19.

52. Salganik, Matthew J., and Douglas D. Heckathorn. 2004. “Sampling and Estimation in Hidden Populations Using Respondent-driven Sampling.” Pp. 193–240 in Sociological Methodology, Vol. 34, edited by Ross M. Stolzenberg. Boston: Blackwell.

53. Schneider, John A., Benjamin Cornwell, David Ostrow, Stuart Michaels, Phil Schumm, Edward O. Laumann, and Samuel Friedman. 2012. “Network Mixing and Network Influences Most Linked to HIV Infection and Risk Behavior in the HIV Epidemic Among Black Men Who Have Sex with Men.” American Journal of Public Health 103(1):e28–36.

54. Silverman, Kenneth, Conrad J. Wong, Mick Needham, Karly N. Diemer, Todd Knealing, Darlene Crone-Todd, Michael Fingerhood, Paul Nuzzo, and Kenneth Kolodner. 2007. “A Randomized Trial of Employment-based Reinforcement of Cocaine Abstinence in Injection Drug Users.” Journal of Applied Behavior Analysis 40(3):387.

55. Smith, Jeffrey A. 2012. “Macrostructure from Microstructure: Generating Whole Systems from Ego Networks.” Pp. 155–205 in Sociological Methodology, Vol. 42, edited by Tim Futing Liao. Thousand Oaks, CA: Sage.

56. Toledo, Lidiane, Claudia T. Codeco, Neilane Bertoni, Elizabeth Albuquerque, Monica Malta, and Francisco I. Bastos. 2011. “Putting Respondent-driven Sampling on the Map: Insights from Rio de Janeiro, Brazil.” Journal of Acquired Immune Deficiency Syndromes 57(August):S136–43.

57. Tomas, Amber, and Krista J. Gile. 2011. “The Effect of Differential Recruitment, Non-response and Non-recruitment on Estimators for Respondent-driven Sampling.” Electronic Journal of Statistics 5:899–934.

58. Traud, Amanda L., Eric D. Kelsic, Peter J. Mucha, and Mason A. Porter. 2011. “Comparing Community Structure to Characteristics in Online Collegiate Social Networks.” SIAM Review 53(3):526–43.

59. Traud, Amanda L., Peter J. Mucha, and Mason A. Porter. 2012. “Social Structure of Facebook Networks.” Physica A: Statistical Mechanics and Its Applications 391(16):4165–80.

60. Verdery, Ashton M., M. Giovanna Merli, James Moody, Jeffrey A. Smith, and Jacob C. Fisher. 2015. “Respondent-driven Sampling Estimators Under Real and Theoretical Recruitment Conditions of Female Sex Workers in China.” Epidemiology 26(5):661–65.

61. Verdery, Ashton M., Ted Mouw, Shawn Bauldry, and Peter J. Mucha. 2015. “Network Structure and Biased Variance Estimation in Respondent Driven Sampling.” PLoS ONE 10(12):e0145296.

62. Volz, Erik, and Douglas D. Heckathorn. 2008. “Probability Based Estimation Theory for Respondent Driven Sampling.” Journal of Official Statistics 24(1):79.

63. Watts, Duncan J., and Steven H. Strogatz. 1998. “Collective Dynamics of ‘Small-world’ Networks.” Nature 393(6684):440–42.

64. Wejnert, Cyprian. 2009. “An Empirical Test of Respondent-driven Sampling: Point Estimates, Variance, Degree Measures, and Out-of-equilibrium Data.” Pp. 73–116 in Sociological Methodology, Vol. 39, edited by Xie. Hoboken, NJ: Wiley-Blackwell.

65. Wejnert, Cyprian. 2010. “Social Network Analysis with Respondent-driven Sampling Data: A Study of Racial Integration on Campus.” Social Networks 32(2):112–24.

66. White, Richard G., Avi J. Hakim, Matthew J. Salganik, Michael W. Spiller, Lisa G. Johnston, Ligia Kerr, Carl Kendall, et al. 2015. “Strengthening the Reporting of Observational Studies in Epidemiology for Respondent-driven Sampling Studies: ‘STROBE-RDS’ Statement.” Journal of Clinical Epidemiology 68(12):1463–71.

67. White, Richard G., Amy Lansky, Sharad Goel, David Wilson, Wolfgang Hladik, Avi Hakim, and Simon D. W. Frost. 2012. “Respondent Driven Sampling—Where We Are and Where Should We Be Going?” Sexually Transmitted Infections 88(6):397–99.

68. Woodhouse, Donald E., Richard B. Rothenberg, John J. Potterat, William W. Darrow, Stephen Q. Muth, Alden S. Klovdahl, Helen P. Zimmerman, Helen L. Rogers, Tammy S. Maldonado, and John B. Muth. 1994. “Mapping a Social Network of Heterosexuals at High Risk for HIV Infection.” AIDS 8(9):1331–36.

69. World Health Organization. 2013. “Introduction to HIV/AIDS and Sexually Transmitted Infection Surveillance: Module 4: Introduction to Respondent Driven Sampling.” Geneva, Switzerland: World Health Organization. Retrieved June 10, 2017 (http://www.who.int/iris/handle/10665/116864).

70. Yamanis, Thespina J., Jacob C. Fisher, James W. Moody, and Lusajo J. Kajula. 2015. “Young Men’s Social Network Characteristics and Associations with Sexual Partnership Concurrency in Tanzania.” AIDS and Behavior 20(6):1244–55.

71. Yamanis, Thespina J., M. Giovanna Merli, William Whipple Neely, Felicia Feng Tian, James Moody, Xiaowen, and Ersheng Gao. 2013. “An Empirical Analysis of the Impact of Recruitment Patterns on RDS Estimates among a Socially Ordered Population of Female Sex Workers in China.” Sociological Methods and Research 42(3):392–425.
