Abstract
Respondent-driven sampling (RDS) is a popular method for sampling hard-to-survey populations that leverages social network connections through peer recruitment. Although RDS is most frequently applied to estimate the prevalence of infections and risk behaviors of interest to public health, such as HIV/AIDS or condom use, it is rarely used to draw inferences about the structural properties of social networks among such populations because it does not typically collect the necessary data. Drawing on recent advances in computer science, the authors introduce a set of data collection instruments and RDS estimators for network clustering, an important topological property that has been linked to a network’s potential for diffusion of information, disease, and health behaviors. The authors use simulations to explore how these estimators, originally developed for random walk samples of computer networks, perform when applied to respondent-driven samples with characteristics encountered in realistic field settings that depart from random walks. In particular, the authors explore the effects of multiple seeds, sampling without replacement rather than with replacement, branching chains, imperfect response rates, preferential recruitment, and misreporting of ties. The authors find that clustering coefficient estimators retain desirable properties in respondent-driven samples. This work takes an important step toward calculating network characteristics using nontraditional sampling methods, and it expands the potential of RDS to tell researchers more about hidden populations and the social factors driving disease prevalence.
1. Introduction
Researchers in many fields are interested in populations that cannot be sampled by conventional methods because they are rare, lack a sampling frame, or have members who are unwilling to participate in traditional survey protocols. Such groups, known as hidden populations (Heckathorn 1997), are often marginalized and at high risk for infections such as HIV/AIDS. Respondent-driven sampling (RDS) is a set of methods for sampling and making inferences about hidden populations that has proliferated throughout the social sciences and public health (Malekinejad et al. 2008; White et al. 2012). RDS uses a without-replacement “link-tracing” approach, similar to snowball sampling, whereby respondents attempt to recruit a limited number of their personal network contacts in the target population until the desired sample size is attained. RDS offers a popular, quick, cost-effective, and anonymous approach for sampling understudied groups such as the homeless, drug users, or commercial sex workers that claims to provide asymptotically unbiased estimates of the population mean under limited conditions (Salganik and Heckathorn 2004; Volz and Heckathorn 2008). There are many concerns about the statistical properties of estimators for RDS data (Crawford et al. 2015; Fisher and Merli 2014; Gile and Handcock 2010; Goel and Salganik 2010; Lu et al. 2012, 2013; McCreesh et al. 2012; Merli, Moody, Smith, et al. 2015; Tomas and Gile 2011; Verdery, Mouw, et al. 2015). However, the continued development of estimators, diagnostics, and reporting protocols for use with such data is beginning to address these concerns (Baraff, McCormick, and Raftery 2016; Crawford 2016; Gile 2011; Gile and Handcock 2011; Gile, Johnston, and Salganik 2015; Lu 2013; McCreesh et al. 2013; Nesterko and Blitzstein 2015; Verdery, Merli, et al. 2015; White et al. 2015; Yamanis et al. 2013), though more work is needed.
Most RDS studies focus on prevalence estimation (i.e., estimation of the population mean or proportion of a focal attribute such as condom use) and avoid making inferences about other relevant estimands. We focus on network structure and, in particular, clustering. The structure of both social and contact networks is a key component of the risk environment for members of hidden populations (Rhodes and Simic 2005) with important implications for disease transmission (Schneider et al. 2012; Morris et al. 2009) and health behaviors (Centola and Macy 2007). Highly clustered risk networks, such as sexual contact networks or shared needle networks, can lead to more redundant paths, making disease transmission more likely (Moody 2002) and altering the relationship between concurrency and epidemic potential (Moody and Benton 2016). Clustering can also have benefits. Highly clustered friendship networks lead to normative reinforcement, and they can increase individual likelihoods of engaging in and spreading health-promoting behaviors such as joining an Internet-based health forum (Centola 2010), adopting modern contraceptives (Kohler, Behrman, and Watkins 2001), abstaining from illicit drugs (Silverman et al. 2007), getting tested for HIV (Karim et al. 2008), or avoiding unprotected sex (Lippman et al. 2010). Normative reinforcement through clustering can also drive unhealthy behaviors, such as sexual concurrency (Yamanis et al. 2015).
Despite its sociological and epidemiological importance, few studies of hidden populations using RDS have directly examined network structure. This is by design. Because field implementations of RDS require that samples be conducted without replacement and with maximal anonymity, typical respondent-driven samples have limited opportunity to measure network structure beyond recruiter-recruit relationships. Some have proposed using RDS to measure homophily (Wejnert 2010), or the tendency for people with similar attributes to be tied (McPherson, Smith-Lovin, and Cook 2001), but these approaches are flawed (Crawford et al. 2015), and there have been few developments since. Others have fit exponential random graph models to RDS data (Gile and Handcock 2011; Merli, Moody, Smith, et al. 2015), but learning about networks themselves was not the primary purpose of these studies. The ability of RDS studies to estimate network structure is important, however, because without closer attention to network characteristics that influence risk behaviors and sexually transmitted infections, RDS studies will be unable to offer a comprehensive picture of the dynamics driving epidemic transmission or other network diffusion processes.
In this article we focus on the performance of recently developed estimators of network clustering that can be applied to RDS data. Work in computer science has proposed clustering estimators for data obtained via random walk sampling (RWS) (Hardiman and Katzir 2013), which is an alternative link-tracing sampling design more appropriate for computer networks than human populations. RDS procedures depart from RWS in several important ways that require new data collection protocols to estimate network characteristics of interest from RDS surveys of human populations, and which may call into question whether such estimators will have favorable statistical properties when used with RDS data. We review these discrepancies in detail throughout the article. In Section 2 we discuss measures of network clustering, introduce their estimation in network censuses versus samples, and review how RDS differs from RWS. Throughout Section 2, we focus on RDS data collection strategies that could inform clustering estimators, which leads us to introduce two alternative survey question approaches for RDS. In Section 3 we describe the empirical data and simulation methods we use to evaluate whether our proposed survey questions and estimators of network clustering are appropriate for RDS data, focusing on bias, sampling variance (SV), and total error. Section 4 contains results from these simulations. In Section 5 we discuss how our proposed survey questions perform in six empirical RDS surveys. In Section 6 we summarize the contributions of this article and lay out additional directions for this research. Our results indicate that the estimators maintain reasonable properties with RDS data and that the questions have good empirical properties. These findings lead us to suggest that researchers add clustering questions and estimators to RDS protocols to further explore network structure. We conclude by focusing on the potential benefits of clustering estimation with RDS data.
2. Background
2.1. Initial Notation
The notation that follows guides our discussion throughout the article. For illustrative purposes, we rely on Figure 1, which shows (1) a hypothetical population (i.e., nodes A through I), (2) the social network linking its members (solid lines connecting nodes), (3) a hypothetical time-ordered RWS link-tracing sample starting from node A (dashed, directed, and numbered lines), and (4) a table counting relevant nodal statistics (on the right). Note that item 3 refers to a random walk sample rather than a respondent-driven sample; in a respondent-driven sample, node E would be ineligible to be sampled a second time because RDS is conducted without replacement. Below, we review this and other differences between RWS and RDS that together call into question whether clustering estimators designed for RWS can be applied to RDS.

Figure 1. Example network with hypothetical random walk sampling and components needed to calculate local and global clustering coefficients for the whole network.
We characterize a social network of
For convenience, we define
2.2. Clustering Coefficients
Watts and Strogatz (1998) introduced the clustering coefficient to characterize small-world networks (Milgram 1967). Small-world networks are (1) highly clustered, meaning that most ties between people appear in pockets of interconnection (see below), and (2) have short average path lengths, meaning that the minimum number of steps between network members is, on average, low (e.g., as embodied in the famous phrase “six degrees of separation”). Clustering coefficients measure the first criterion.
Watts and Strogatz (1998) originally proposed a global measure of the clustering coefficient, defined as
where i, j, and k index unique respondents (Hardiman and Katzir 2013; Newman, Strogatz, and Watts 2001; Watts and Strogatz 1998). The global clustering coefficient (GCC) summarizes the overall network clustering by dividing the count of triangles by the count of connected triplets, where triangles are defined as sets of three individuals (
Extensions to the clustering coefficient concept consider the average amount of clustering among each individual’s affiliates in the network. This second measure, the local clustering coefficient (LCC), is defined as
The LCC measures the average of each individual’s number of triangles divided by his or her connected triplets. In Figure 1, the LCC is obtained by first dividing triangles by connected triplets, then taking the average (when
Although clustering coefficients are recent additions to the social networks literature, they resemble other important network characteristics, in particular, transitivity, ego network density, and measures of clustering from the exponential random graph modeling framework. We omit detailed discussion of these alternative measures for the sake of brevity.
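To make the two coefficients in this section concrete, the sketch below computes both from a complete (census) network. It is an illustration under assumptions rather than the authors' code: the function name `clustering_coefficients`, the dictionary-of-sets network representation, and the four-node example network are all hypothetical, and nodes with fewer than two ties are assigned a local coefficient of zero, which is one common convention and may differ from the one used in the article.

```python
from itertools import combinations

def clustering_coefficients(adj):
    """Return (gcc, lcc) for an undirected network.

    adj maps each node to the set of its neighbors (assumed symmetric).
    """
    closed_total = 0     # connected triplets whose end nodes are also tied
    triplet_total = 0    # all connected triplets, counted at their center node
    local_values = []
    for node, neighbors in adj.items():
        degree = len(neighbors)
        triplets = degree * (degree - 1) // 2
        # A pair of neighbors that are tied to each other closes a triangle.
        closed = sum(1 for u, w in combinations(sorted(neighbors), 2)
                     if w in adj[u])
        closed_total += closed
        triplet_total += triplets
        # Assumption: nodes with degree below 2 contribute a local value of 0.
        local_values.append(closed / triplets if triplets else 0.0)
    gcc = closed_total / triplet_total if triplet_total else 0.0
    lcc = sum(local_values) / len(local_values)
    return gcc, lcc

# Hypothetical example: a triangle A-B-C with a pendant node D tied to C.
example = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
gcc, lcc = clustering_coefficients(example)
```

Each triangle is counted once at each of its three corners, so the ratio of the two totals matches the usual "three times triangles over connected triplets" form of the global coefficient.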
2.3. Measuring Clustering in Network Censuses and Samples
The calculation of many network-level statistics, including the clustering coefficient, assumes that researchers measure the entire adjacency matrix,
When researchers cannot conduct a census of the network, they often turn to samples. There are many approaches to collecting sampled network data, including randomly drawn samples (Krivitsky, Handcock, and Morris 2011; Marsden 1987; McPherson, Smith-Lovin, and Brashears 2006; Smith 2012) and numerous link-tracing approaches (Goodman 1961; Heckathorn 1997; Mouw and Verdery 2012; Volz and Heckathorn 2008). We focus on the latter.
2.4. Hardiman and Katzir Estimators
Hardiman and Katzir (2013) introduced estimators for the LCC and GCC that use data gathered in a random walk sample, such as that shown in Figure 1. Intuitively, for vertices
Next for the LCC, define a weighted sum of the
Hardiman and Katzir (2013) also developed an estimator of the GCC. Letting
Hardiman and Katzir (2013) used both analytic proofs and simulation to show that their proposed estimators are asymptotically unbiased with minimal variance for large random walk samples and that they produce more consistent results at any given sample size than other approaches that query each sampled node’s full ego network (counting ego network reports in the sample size). Although RDS does not rely on simple random walks, researchers may wish to apply these estimators to respondent-driven samples. Section 2.5 discusses RDS departures from RWS with special attention to the empirical contexts in which RDS studies are conducted. Within it, we propose new survey questions that researchers could use to estimate clustering via the Hardiman and Katzir estimators. We examine how these questions perform in six empirical surveys in Section 5.
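The following sketch shows how ratio estimators of this kind can be computed from a single random walk. Because the formulas are not legible in this version of the text, the normalizations used here follow the form in which the Hardiman and Katzir (2013) estimators are usually stated and should be read as an assumption, not as a transcription of the article's equations. The function name `hk_estimates`, its arguments, and the example walk are hypothetical.

```python
def hk_estimates(walk, degree, tied):
    """Estimate (lcc, gcc) from one with-replacement random walk.

    walk:   list of visited nodes in sampling order (revisits allowed).
    degree: dict mapping node -> degree (assumed accurately reported).
    tied:   function (u, w) -> True if u and w share a tie.
    """
    r = len(walk)
    if r < 3:
        raise ValueError("the walk needs at least three steps")
    phi_l = 0.0
    phi_g = 0.0
    for k in range(1, r - 1):
        prev_node, ego, next_node = walk[k - 1], walk[k], walk[k + 1]
        # Indicator: the nodes sampled just before and just after ego are tied.
        if prev_node != next_node and tied(prev_node, next_node):
            d = degree[ego]          # d >= 2 here, so d - 1 is never zero
            phi_l += 1.0 / (d - 1)
            phi_g += d
    phi_l /= (r - 2)
    phi_g /= (r - 2)
    # Denominators correct for the walk oversampling high-degree nodes.
    psi_l = sum(1.0 / degree[v] for v in walk) / r
    psi_g = sum(degree[v] - 1 for v in walk) / r
    lcc_hat = phi_l / psi_l
    gcc_hat = phi_g / psi_g if psi_g else 0.0
    return lcc_hat, gcc_hat

# Hypothetical example: triangle A-B-C with pendant node D tied to C.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
deg = {v: len(nbrs) for v, nbrs in adj.items()}
walk = ["A", "B", "C", "D", "C", "A"]
lcc_hat, gcc_hat = hk_estimates(walk, deg, lambda u, w: w in adj[u])
```

Only two pieces of information per sampled node are needed: its degree and whether its predecessor and successor in the walk are tied. That is why a single recruiter-recruit question can, in principle, stand in for the full ego network.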
2.5. RDS Departures from RWS
The Hardiman and Katzir estimators cannot immediately be applied to RDS studies in the field, because they were developed for RWS, which differs considerably in core assumptions. Deviations of RDS from RWS have been shown in prior work to bias other estimators, such as that of the population mean (Gile 2011; Merli, Moody, Smith, et al. 2015; Tomas and Gile 2011) and SV (Verdery, Mouw, et al. 2015), so we should not expect that a naive application of Hardiman and Katzir’s (2013) clustering coefficient estimators will yield viable estimates from empirical respondent-driven samples.
Table 1 summarizes eight RDS departures from RWS that may affect clustering estimation. A random walk sample of a network begins with selecting a single “seed” node, typically with probability proportionate to the steady-state probability,
Table 1. Comparison of Features of RWS and RDS
Note: RDS = respondent-driven sampling; RWS = random walk sampling.
RWS and RDS also differ in their approach to tracing links. Random walk samples proceed without branching (i.e., having only one coupon), whereas respondent-driven samples almost always allow branching in practice through the distribution of two or three recruiting coupons to each respondent (Goel and Salganik 2009). Random walk samples are conducted with replacement, whereas RDS is conducted without replacement, which means that recruitment becomes competitive (Barash et al. 2016; Crawford 2016; Gile 2011; Gile and Handcock 2010; Heckathorn 1997). Other differences arise because RWS is researcher-driven (or algorithm-driven), while RDS is respondent-driven. In RDS, respondents must identify, approach, and successfully recruit peers, which can yield less than perfect link-tracing efficacy and introduce preferential recruitment (Merli, Moody, Smith, et al. 2015; Verdery, Merli, et al. 2015).
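As a point of reference for these contrasts, a simple random walk sample can be sketched in a few lines. This is a minimal illustration, assuming an undirected network stored as a dictionary of neighbor sets; the function name `random_walk_sample` is hypothetical. The seed is drawn with probability proportional to degree, which is the steady-state distribution of a simple random walk on an undirected network.

```python
import random

def random_walk_sample(adj, length, rng=None):
    """Draw a non-branching, with-replacement random walk of a given length.

    adj maps each node to the set of its neighbors (undirected network).
    """
    rng = rng or random.Random()
    nodes = sorted(adj)
    # Single seed drawn from the steady state: probability proportional to degree.
    current = rng.choices(nodes, weights=[len(adj[v]) for v in nodes], k=1)[0]
    walk = [current]
    while len(walk) < length:
        # One "coupon" per step, and previously visited nodes stay eligible.
        current = rng.choice(sorted(adj[current]))
        walk.append(current)
    return walk

# Hypothetical example: triangle A-B-C with pendant node D tied to C.
network = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
sample = random_walk_sample(network, 50, random.Random(1))
```

Because the walk is conducted with replacement, the same node can appear many times in `sample`, which cannot happen in a respondent-driven sample.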
Sample size is another distinction because random walk samples are used in computer science or fields in which the cost of sampling additional individuals is low compared with RDS in human populations (Mouw and Verdery 2012). For instance, Hardiman and Katzir (2013) examined their estimators’ performance in four large networks with 1 percent sample of sizes
The final departure of RDS from RWS is anonymity, which pertains to the measurement of
In many RDS surveys, a majority of respondents participate twice, once when they are recruited themselves (primary interview) and a second time when they return to the research site to collect additional incentives for successfully recruiting peers (secondary interview). Acknowledging this interview timing, we propose two questions researchers can ask RDS respondents to feasibly elicit information about potential ties between
A. In the secondary interview: “Does the person who gave you the coupon know the person who you gave the coupon to or vice versa?” (We refer to this from here on as the binary question format.)
B. In the primary or secondary interview: “What percent of people who you know in the population does the person who gave you the coupon know?” (We refer to this from here on as the percentage question format.)
The binary question format garners the exact information required by the Hardiman and Katzir estimators, but it relies on the accuracy of respondent reports about recruiter-recruit relationships. It can also be asked of only a subset of sampled cases, because it cannot be posed until the secondary interview (after recruitment). The percentage question format differs from Hardiman and Katzir’s (2013) suggested approach, but it can be asked during either the main survey (of all respondents) or the follow-up interview (of the subset of respondents who recruit). If asked in both, researchers can check test-retest validity and potentially diagnose respondent comprehension problems. Of course, there are other possible ways to ask such questions in RDS surveys, but our proposed approaches are flexible in terms of implementation and preserve the desirable confidentiality of standard RDS studies.
3. Data and Methods
3.1. Approach
We begin by evaluating the performance of Hardiman and Katzir’s (2013) estimators applied to RDS through simulation methods. We aim to understand the effects of increasingly large departures from RWS, toward more realistic situations encountered within RDS data collection. To do this, we simulate data collection from underlying population social networks. It is notoriously difficult to obtain analytical results for RDS estimators, which is why many prior developments have tested proposed estimators through simulation. We test scenarios driven by data collection parameters to match how RDS departs from RWS, drawing 1,000 samples in each scenario. It is important to draw multiple samples per scenario to determine the estimators’ distributional properties (bias, SV, and total error). For each simulated sample, we calculate the Hardiman and Katzir LCC and GCC estimators implemented with both question formats we proposed. We compare these sample estimates with the parameters in the population social network (or as would be calculated in a census). After examining how Hardiman and Katzir’s estimators perform in simulations, we evaluate their feasibility in six empirical respondent-driven samples.
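The distributional properties named above can be summarized from the replicate estimates in each scenario. The sketch below assumes the standard definitions (bias as the mean estimate minus the population value, SV as the variance of the estimates across replicates, and total error as the root mean squared error); the function name `performance` and the example values are hypothetical, and the article's exact formulas may differ in detail.

```python
import math
import statistics

def performance(estimates, truth):
    """Return (bias, sampling_variance, rmse) for replicate estimates."""
    bias = statistics.fmean(estimates) - truth
    # Population-style variance, so that rmse**2 == bias**2 + variance exactly.
    sampling_variance = statistics.pvariance(estimates)
    squared_errors = [(e - truth) ** 2 for e in estimates]
    rmse = math.sqrt(statistics.fmean(squared_errors))
    return bias, sampling_variance, rmse

# Hypothetical replicate estimates of a clustering coefficient whose
# population value is 0.45.
bias, sv, rmse = performance([0.40, 0.50, 0.60], truth=0.45)
```

The decomposition of squared total error into squared bias plus SV is what allows an estimator with somewhat higher bias to still achieve lower total error when its SV is small.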
3.2. Data
We first simulate link-tracing samples from a hidden population social network of heterosexuals, sex workers, and injecting drug users at elevated risk for HIV/AIDS collected beginning in 1987 as part of the Project 90 study in Colorado Springs, Colorado (Klovdahl et al. 1994; Potterat et al. 2004; Rothenberg et al. 1995; Woodhouse et al. 1994). The project aimed to assess how network structure affected disease transmission, and, as such, the researchers sought to obtain a census of the hidden population and their links to one another. These data have been used in prior RDS assessments (Goel and Salganik 2010) and are made available to researchers through the Office of Population Research at Princeton University (2015). We focus on 4,111 individuals linked by 17,164 ties that remain in the network’s largest weakly connected component after dropping cases lacking valid attribute codes. Figure 2 shows the network linking members of this population, with nodes shaded by a key structuring variable (white/nonwhite). Whites make up 74.7 percent of network members, while 17.1 percent of ties cross race categories. Nodes of different races group together in different parts of the figure, but there are many cross-group links.

Figure 2. Largest weakly connected component of Project 90 data set; nodes shaded by race (gray = white, black = nonwhite) and sized by degree. The network is displayed using the ForceAtlas2 algorithm, with no node overlap, in Gephi 0.9.
To understand how the Hardiman and Katzir estimators perform across a range of networks, we also examine additional networks from a data set of 100 Facebook networks collected in 2005, which have also been examined intensively in prior simulation evaluations of RDS (Mouw and Verdery 2012; Verdery, Mouw, et al. 2015). Importantly, because they were collected when Facebook was new and membership was restricted to those with college e-mail addresses, researchers have argued that these networks represent realistic, offline social and interaction networks (Clouston et al. 2009; Traud et al. 2011; Traud, Mucha, and Porter 2012). We restrict analysis to 29 university networks in which the largest connected component of users with valid attribute codes contained between 5,000 and 10,000 nodes, size restrictions we put in place to avoid without-replacement sampling effects (Barash et al. 2016) and to maintain computational tractability. Table 2 provides summary statistics for the Project 90 network and the Facebook networks. The Project 90 network is smaller, less dense, more clustered, and less homophilous than the Facebook networks.
Table 2. Summary Network Statistics for Data Sets Analyzed in This Study
Note: GCC = global clustering coefficient; LCC = local clustering coefficient.
Cross-group ties refer to ties that cross white/nonwhite categories in Project 90 and ties that cross freshman/nonfreshman categories in the Facebook networks.
Statistics presented for the Facebook networks are computed separately; the largest network does not necessarily have the largest proportion of cross-group ties, for instance.
3.3. Scenarios
We provide a replication file for researchers interested in replicating and expanding our scenarios for the Project 90 network, whose data are publicly available. In both data sets, we focus on five scenarios designed to test the bias, SV, and error of Hardiman and Katzir’s (2013) estimators when used with standard RDS protocols as opposed to simple RWS. Table 3 shows which key features we manipulate in each scenario. We first simulate collecting simple random walks (“RWS baseline”). These scenarios begin from a single seed selected with steady-state probabilities, are conducted with replacement, do not branch, experience 100 percent link-tracing efficacy without preferential recruitment, and do not contain any measurement error for
Table 3. Parameters Used in Each Simulation Scenario
Note: RDS = respondent-driven sampling; RWS = random walk sampling.
We then selectively relax parameters until the samples resemble the standard RDS protocol. We start with a scenario designed to mimic an ideal case of RDS constrained by the method’s actual implementation in the field (“RDS baseline”). The samples in this scenario begin from 10 seeds selected via convenience sampling (implemented as uniform random seed selection in the main text; in online Appendix A, we consider four other seed selection scenarios and find that they did not alter our results), are conducted without replacement (recruitment is competitive between respondents), and may branch up to three ways from each respondent (i.e., each respondent is simulated as having three coupons); respondents always approach and succeed in recruiting peers who have not already been sampled (i.e., 100 percent recruitment efficacy), selecting them at random among the sets of their friends who have not participated (no preferences); and respondents accurately report the items used to measure
We next examine the fifth through seventh ways RDS departs from RWS. We look at how less than perfect recruitment efficacy affects estimates by considering a scenario in which only 80 percent of offered coupons are accepted by the targeted peer (“+less than 100% efficacy”). We then test the effects of preferential recruitment (“+preferential recruitment”), modeling it as a case in which all respondents are half as likely to offer coupons to certain types of peers (to white peers in the Project 90 network and freshmen in the Facebook networks). Finally, we examine what happens when respondents misreport recruiter-recruit ties (“+
In all simulated samples we assume respondents accurately report degree. Although sample size marks a key way in which RDS departs from RWS, we hold target sample sizes constant at 400, which is a small fraction of the population sizes we examine. We found that target sample sizes were attained in all scenarios, which reviews of RDS indicate happens frequently (Malekinejad et al. 2008).
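The recruitment process that these scenarios vary can be sketched as a small simulation. This is an illustrative sketch under stated assumptions, not the authors' replication code: the function name `simulate_rds`, its parameters, and the example network are hypothetical, seeds are drawn uniformly at random, disfavored peers are half as likely to be offered a coupon, and a refused coupon may be offered again by the same respondent, a modeling choice the article does not specify.

```python
import random
from collections import deque

def simulate_rds(adj, n_seeds=10, coupons=3, efficacy=1.0, target=400,
                 disfavored=frozenset(), rng=None):
    """Simulate one branching, without-replacement recruitment process.

    adj maps each node to the set of its neighbors (undirected network).
    Returns a list of (respondent, recruiter) pairs in sampling order;
    seeds have recruiter None.
    """
    rng = rng or random.Random()
    seeds = rng.sample(sorted(adj), n_seeds)      # uniform seed selection
    sampled = set(seeds)
    order = [(seed, None) for seed in seeds]
    queue = deque(seeds)
    while queue and len(order) < target:
        ego = queue.popleft()
        for _ in range(coupons):                  # branching: several coupons
            if len(order) >= target:
                break
            # Without replacement: only peers not yet sampled are eligible.
            eligible = sorted(adj[ego] - sampled)
            if not eligible:
                break
            # Preferential recruitment: disfavored peers get half the weight.
            weights = [0.5 if peer in disfavored else 1.0 for peer in eligible]
            peer = rng.choices(eligible, weights=weights, k=1)[0]
            if rng.random() < efficacy:           # coupon accepted
                sampled.add(peer)
                order.append((peer, ego))
                queue.append(peer)
    return order

# Hypothetical example: 30 people who all know one another.
complete = {i: {j for j in range(30) if j != i} for i in range(30)}
chain = simulate_rds(complete, n_seeds=2, coupons=3, efficacy=1.0,
                     target=20, rng=random.Random(0))
```

Setting `n_seeds=1`, `coupons=1`, and allowing resampling would move this process back toward the random walk baseline, while `efficacy=0.8` and a nonempty `disfavored` set correspond to the later scenarios.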
3.4. Measures
We measure the performance of Hardiman and Katzir’s (2013) clustering coefficient estimators with three indicators. For each of the question formats (binary or percentage) of each of the estimators (GCC or LCC) in each scenario, we calculate (1) their bias, defined as
4. Simulation Results
We first consider the distribution of estimates for both the GCC and LCC calculated via the binary and percentage question formats in the baseline RWS scenario on the Project 90 network. Figure 3 shows that both estimators, using either question format, exhibit minimal bias that arises because of finite sample sizes. The LCC estimator is less biased than the GCC estimator (

Figure 3. Performance of Hardiman and Katzir estimators by estimator and question format in random walk sampling (RWS) on the Project 90 data set.
We next examine the distribution of estimates in realistic respondent-driven samples and what features of RDS lead to performance deterioration compared with the RWS baseline scenario. Figure 4 shows that in the Project 90 network the GCC estimated using the binary question format performs poorly in each of the RDS scenarios, underestimating the population parameter substantially (GCC binary bias by scenario is

Figure 4. Performance of Hardiman and Katzir estimators by estimator and question format in random walk sampling (RWS) and respondent-driven sampling (RDS) scenarios on the Project 90 data set.
The LCC estimators perform well in Figure 4. The binary question format of the LCC slightly overestimates clustering (LCC binary bias by scenario is
Estimates obtained in all RDS scenarios in the Project 90 network exhibit low SV (ranging from 0.001 to 0.003), substantially lower than was found for the RWS scenarios. This result follows from the without-replacement design of RDS, which tends to yield lower SV than the with-replacement design of RWS. RMSEs in the worst-case scenarios, which contain all RDS deviations from RWS that we examine, are lower than those we found for the RWS baseline scenarios in all cases. In the +misreporting scenarios, RMSEs are
We next turn to results in the Facebook networks. Table 4 shows how absolute values of bias (“absolute bias”) and RMSEs are distributed within these networks by estimator and question format in three focal scenarios (RWS baseline, RDS baseline, and RDS misreporting). We display these scenarios because the +imperfect and +preferences scenarios made little difference in the results. We do not show the low SV we found in all scenarios for the Facebook networks (a maximum of 0.004 across networks in any scenario). The estimators exhibit almost no bias in the RWS baseline scenarios, with a maximum that is substantially lower than was seen in the Project 90 network. The RWS baseline scenario also tends to produce much lower RMSEs in these networks than it did in the Project 90 network.
Table 4. Distributions of Absolute Bias Statistics and RMSEs in the 29 Facebook Networks Studied by Scenario, Estimator, and Question Format
Note: GCC = global clustering coefficient; LCC = local clustering coefficient; RDS = respondent-driven sampling; RMSE = root mean squared error; RWS = random walk sampling.
The RDS scenarios also yield lower bias in the Facebook networks than they did in the Project 90 network, with maximum observed values all lower in these networks. In terms of bias, the Facebook networks indicate that the binary measures are the most biased, with the LCC being less biased than the GCC. The Facebook networks also have lower RMSEs than the Project 90 network. In terms of RMSEs in the realistic RDS scenarios, results from the Facebook networks suggest that the percentage question format is preferable to the binary format and that the GCC is slightly preferred over the LCC after accounting for SV (recall that the LCC had lower bias). In total, median RMSEs observed in the RDS scenarios in the Facebook networks are only slightly larger than the median RMSEs obtained in the RWS baseline scenarios, which indicates that the clustering coefficient estimators maintain reasonable properties for application to respondent-driven samples.
5. Application of Data Collection Instruments in Six Empirical Surveys
We now discuss six empirical RDS surveys collected in diverse hidden populations in multiple countries by different research teams that asked respondents the types of questions needed to estimate network clustering. Two studies examined female sex workers in China, two examined people who inject drugs in the Philippines, one study examined people who inject drugs in Canada, and the last survey, which contained both of our proposed question formats, looked at vegetarians and vegans in Argentina. For the sake of brevity, we omit full descriptions of these studies in the main text but provide complete details in online Appendix B. We focus on the proportion of invalid item responses (“Invalid %”) in each survey across question formats, where we define invalid responses as cases in which respondents did not answer the question, gave responses of “don’t know,” or otherwise offered evidence that they did not understand or wish to answer the question. We also compare the mean values of valid responses (“Mean of valid”) between relevant survey pairs (comparing the two surveys in China with each other and the two surveys in the Philippines with each other), and within individuals who answered both types of questions in the survey in Argentina.
Table 5 summarizes the item response patterns in these empirical surveys. Respondents were much more likely to give invalid responses to the binary question format than to the percentage question format. More speculatively, we can make some claims about conceptual validity by examining the cross-site concordance in the means of valid responses within the two sets of paired surveys. For instance, the means of valid responses in the female sex worker surveys collected by overlapping research teams in two cities in China are moderate (23.2 percent to 42.3 percent), while means of valid responses for the two surveys of persons who inject drugs in Philippine cities are much higher (78.7 percent to 91.7 percent). We take these findings to indicate that the survey questions are measuring consistent phenomena. In addition, we find nearly identical means of valid responses between the two question formats implemented in the Argentina survey. Here, both the percentage and binary measures found raw clustering levels in the 30.1 percent to 32.0 percent range, and we determined that the respondent-specific average of binary format versus percentage format reports had a Spearman’s correlation of .445, while the item-specific reports with potentially multiple binary reports per respondent had a polyserial correlation of .376. These correlations suggest a reasonably high level of agreement between question formats, even in the face of large amounts of missing data. Taken together, these results indicate that the questions tap into valid concepts, but they add another reason that researchers should prioritize implementing the percentage question format: respondents seem more willing or able to answer it.
Table 5. Summary of Item Response Rates for Clustering Questions in Empirical Surveys
Note: FSW = female sex workers; PWID = persons who inject drugs; Veg = self-identifying vegetarians and vegans.
We refer to reports rather than sample size because some respondents report on multiple relationships for the binary questions.
The format used in the Ottawa study is an interaction grid in which respondents identify which peers know one another; see online Appendix A.
6. Discussion and Conclusion
Sociological interest in marginalized populations means that researchers often confront situations where traditional sampling methods cannot be used. In such examples, the peer-driven recruitment procedures of RDS yield large and diverse samples quickly and cheaply, while maintaining respondent anonymity, which is why researchers have used this method to sample hundreds of stigmatized, sensitive, and hidden groups. Prior methodological research on RDS has focused on its estimators of the population mean and avoided examining how it may reveal other interesting features of hidden populations of relevance to sociology and public health (with a few notable exceptions, such as Crawford 2016 and Wejnert 2010). This avoidance is strategic: practical considerations limit researchers’ ability to uncover many aspects of the underlying population social network. In this article, we propose new data collection protocols and estimators for RDS that allow researchers to examine clustering, a social network feature of broad interest. We began by considering estimators of network clustering developed in computer science for RWS and expanded their application to the case of human populations sampled with RDS, with careful attention to practical differences between RDS and RWS. We offer data collection protocols in the form of two different question formats RDS surveys could adopt in the field to estimate network clustering, and we study how these question formats perform under two clustering coefficient estimators in simulations as well as their implementation challenges in six empirical surveys.
Overall, we recommend that researchers using RDS surveys begin asking respondents the types of questions that would allow clustering coefficient estimation. Although RDS estimators of the population mean often fail in the face of unmet assumptions about sample recruitment (Gile and Handcock 2010; Goel and Salganik 2010; Lu et al. 2012, 2013; McCreesh et al. 2012; Merli, Moody, Smith, et al. 2015; Tomas and Gile 2011; Verdery, Mouw, et al. 2015), we find that the clustering coefficient estimators we studied perform well even when core RDS assumptions are violated. Considering the two question formats we proposed, we also find that the percentage question format can be asked of more respondents, and that it yielded better results in the simulation study and appeared to be better understood by respondents in the empirical studies. The two clustering estimators performed similarly, but the GCC estimator had lower total errors than the LCC estimator in most networks we studied. However, the contribution of SV to RMSE drives this result, so researchers concerned about bias may prefer the LCC estimator, which we found tends to exhibit lower bias.
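The trade-off between total error and bias can be made concrete with a small numerical sketch. It assumes that SV denotes sampling variance and that RMSE follows the usual decomposition, RMSE = sqrt(bias^2 + SV). The two estimators and their values below are hypothetical, chosen only for illustration; they are not results from our simulations.

```python
import math


def rmse(bias, sampling_variance):
    """Root mean squared error from its two components: sqrt(bias^2 + variance)."""
    return math.sqrt(bias ** 2 + sampling_variance)


# Hypothetical values, chosen only to illustrate the trade-off.
estimator_a = {"bias": 0.03, "sv": 0.0004}  # more biased, less variable
estimator_b = {"bias": 0.01, "sv": 0.0016}  # less biased, more variable

rmse_a = rmse(estimator_a["bias"], estimator_a["sv"])
rmse_b = rmse(estimator_b["bias"], estimator_b["sv"])

# Estimator A has the lower total error even though its bias is larger,
# because its smaller sampling variance dominates the comparison.
print(f"A: bias={estimator_a['bias']:.3f}, RMSE={rmse_a:.4f}")
print(f"B: bias={estimator_b['bias']:.3f}, RMSE={rmse_b:.4f}")
```

In this sketch the estimator with the lower RMSE is not the one with the lower bias, which is the pattern that motivates choosing an estimator according to whether bias or total error matters more for the application.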
We hope that methods for estimating clustering coefficients from RDS data will spur additional substantive and methodological contributions. Substantively, clustering is a core property that distinguishes human social networks from random graphs (Watts and Strogatz 1998), and many researchers have posited that it plays a role in the transmission of diseases and the adoption of behaviors through networks (e.g., Centola 2010; Eguíluz and Klemm 2002). These theories consist of a set of structural hypotheses, where the structure of the entire network makes it more or less conducive to diffusion, and they have been supported by results from mathematical models and some experiments. For example, such models suggest that, ceteris paribus, moving from low to moderate clustering of the risk network increases transmission (Keeling and Eames 2005), but moving from moderate to high clustering does not change transmission substantially until very high levels, when the network becomes disconnected (Newman 2003). Using clustering coefficients from RDS data could allow researchers to confirm the insights of these mathematical models of network structure and disease diffusion with macro-comparative methods. In this vein, for instance, researchers might compare a set of similar populations sampled with RDS over multiple time points to examine whether changes in clustering levels are associated with changes in the prevalence of infectious diseases, such as HIV/AIDS. Clustering in the social network may also be associated with differences in risk behaviors, such as unprotected sex, at the individual level. Prior research has found that network clustering moderates the effects of peers who use contraception on the use of fertility control (Kohler et al. 2001) but that such normative reinforcement can also facilitate the spread of unhealthy behaviors (Yamanis et al. 2015). Previous studies of this topic have been limited to traditional survey populations, however, and the approaches developed in this article will enable researchers to test these hypotheses in a more diverse series of hidden populations.
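For readers less familiar with the two population quantities involved, the following sketch computes an average local clustering coefficient and a global clustering coefficient (transitivity) on a fully observed toy network. It assumes the standard graph-theoretic definitions and assigns a local coefficient of zero to nodes with fewer than two ties; the network and function names are hypothetical. It does not implement the RDS estimators, which must recover these quantities from sampled data.

```python
from itertools import combinations


def closed_pairs(adj, node):
    """Number of pairs of a node's neighbors that are tied to one another."""
    return sum(1 for u, v in combinations(sorted(adj[node]), 2) if v in adj[u])


def local_clustering(adj, node):
    """Share of a node's neighbor pairs that are themselves tied."""
    k = len(adj[node])
    if k < 2:
        return 0.0  # convention assumed here for nodes with fewer than two ties
    return closed_pairs(adj, node) / (k * (k - 1) / 2)


def average_local_clustering(adj):
    """Mean of the local coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


def global_clustering(adj):
    """Closed connected triples divided by all connected triples."""
    triples = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    closed = sum(closed_pairs(adj, n) for n in adj)
    return closed / triples if triples else 0.0


# Hypothetical toy network: a triangle (A, B, C) with two pendants on C.
toy = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D", "E"},
    "D": {"C"},
    "E": {"C"},
}

avg_local = average_local_clustering(toy)
transitivity = global_clustering(toy)
print(f"average local = {avg_local:.3f}, global = {transitivity:.3f}")
```

The two coefficients differ on the same network because the average local coefficient weights every node equally, whereas the global coefficient weights nodes by the number of connected triples they anchor, so high-degree nodes such as C count for more.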
In addition, estimators of network clustering can offer methodological improvements to RDS. A first methodological extension could provide additional data to inform variants of RDS mean estimators that use exponential random graph modeling and algorithmic simulation in an effort to obtain less biased, lower variance results (Gile and Handcock 2011). Currently, these approaches model clustering as a by-product of dyadic homophily, divorced from assessments of clustering levels in the population of interest. With empirical estimates of clustering, researchers using such algorithms could confirm the clustering coefficients produced in their models. Such information may enhance the realism of the model-based approach and increase confidence in its bias and variance reductions.
A second methodological contribution could allow researchers to test one of the most central but least often evaluated assumptions of RDS, that the network contains a “giant component” whereby the vast majority of people are reachable from one another through chains of network ties of arbitrary length (Volz and Heckathorn 2008). Using random graph methods from the physics and computer science traditions that generate network structures from degree distributions and clustering coefficients (Heath and Parikh 2011; Newman et al. 2001), researchers may also be able to determine if they are sampling a network with “bottlenecks,” that is, a structure in which there are few links between otherwise cohesive groups, a feature that many in the RDS community link to poor estimate quality (Toledo et al. 2011). This would add to the emerging diagnostic toolkit being developed for RDS (Gile et al. 2015). A related extension of this approach could calculate the “structural risk” of a network sampled with RDS by applying percolation or other diffusion models to examine the size and speed of hypothetical epidemics spreading on the modeled network (Britton et al. 2008; Merli, Moody, Mendelsohn, et al. 2015), a potential early warning system of a given hidden population’s epidemic potential gathered directly from RDS.
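The kind of check envisioned here can be sketched on a toy network. The example below measures the share of nodes in the largest connected component and shows how that share falls when a single bridging tie is removed, which is one simple way to think about a bottleneck. The edge list and function names are hypothetical, and the sketch assumes the modeled network has already been generated; it does not implement the random graph generators cited above.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) ties."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def component_sizes(adj):
    """Sizes of connected components, largest first (breadth-first search)."""
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def giant_component_share(adj):
    """Fraction of all nodes that sit in the largest component."""
    sizes = component_sizes(adj)
    return sizes[0] / sum(sizes)


# Hypothetical network: two triangles joined by one bridge, plus a separate dyad.
triangles = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
bridge = [(3, 4)]
dyad = [(7, 8)]

share_with_bridge = giant_component_share(build_adjacency(triangles + bridge + dyad))
share_without_bridge = giant_component_share(build_adjacency(triangles + dyad))
print(share_with_bridge, share_without_bridge)
```

In this toy case, a single tie holds the largest component together, so removing it halves the share of the population reachable from any starting point in that component.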
Such extensions and future directions lie outside the scope of the present research. However, we emphasize that we view the development of clustering estimators for RDS data as the beginning of a new line of inquiry about how estimates of the topological features of networks sampled with RDS can inform substantive and methodological interests. The benefits of estimating clustering in respondent-driven samples are large, and we encourage researchers to begin deploying the survey questions needed for their calculation. In any case, further attention to the ability of RDS to tell us more about hidden populations than disease prevalence is an important next step for the literature to take.
Footnotes
Acknowledgements
We thank M. Giovanna Merli, Ann Jolly, and Anne DeLessio-Parson for providing information about aspects of the empirical cases we examine.
Funding
We acknowledge assistance provided by the Population Research Institute, which is supported by an infrastructure grant from the Eunice Kennedy Shriver National Institute of Child Health and Human Development (R24-HD041025), and from a seed grant provided by the Institute for CyberScience at Pennsylvania State University. Portions of this research were funded by National Center for Health Statistics grant 1R03SH000056-01 (Ashton M. Verdery, principal investigator).
