Abstract
Respondent-driven sampling (RDS) is a chain-referral method for sampling members of hidden or hard-to-reach populations, such as sex workers, homeless people, or drug users, via their social networks. Most methodological work on RDS has focused on inference of population means under the assumption that subjects’ network degree determines their probability of being sampled. Criticism of existing estimators is usually focused on missing data: the underlying network is only partially observed, so it is difficult to determine correct sampling probabilities. In this article, the author shows that data collected in ordinary RDS studies contain information about the structure of the respondents’ social network. The author constructs a continuous-time model of RDS recruitment that incorporates the time series of recruitment events, the pattern of coupon use, and the network degrees of sampled subjects. Together, the observed data and the recruitment model place a well-defined probability distribution on the recruitment-induced subgraph of respondents. The author shows that this distribution can be interpreted as an exponential random graph model and develops a computationally efficient method for estimating the hidden graph. The author validates the method using simulated data and applies the technique to an RDS study of injection drug users in St. Petersburg, Russia.
Keywords
1. Introduction
Hidden populations, such as drug users, men who have sex with men, sex workers, or homeless people, are often subjected to social stigma or criminalization. Learning about these populations can be challenging for sociologists, epidemiologists, and public health researchers, because potential subjects may fear exposure or prosecution. Several survey techniques have been developed for sampling from hidden populations, including social link tracing and snowball designs (Goodman 1961; Thompson and Frank 2000). Respondent-driven sampling (RDS) is a common survey method for hidden or hard-to-reach populations for which no convenient sampling frame exists (Broadhead et al. 1998; Heckathorn 1997). In RDS, study participants recruit members of their social network who are also members of the hidden population. Starting with a set of “seeds,” participants are given a fixed number of coupons tagged with a unique code. Participants then recruit members of their social network by giving them coupons. The recipient of the coupon redeems it at the study site (or over the phone, online, etc.), is interviewed, and receives coupons to recruit others. A dual incentive encourages recruitment: subjects receive a small reward for participating in the study and for each new subject they recruit. Subjects cannot be recruited more than once, and only a small number of coupons are given to new participants, to prevent the local network from being saturated with coupons or the emergence of a secondary market for coupons. To safeguard the privacy of subjects not participating in the study, subjects do not reveal the identities of their social contacts to researchers. The only network information typically reported by subjects is their network degree, the number of social contacts who are also members of the study population.
Although RDS is an effective procedure for recruiting members of a hidden population, estimation of population characteristics from data obtained by RDS is controversial. Most methodological work on RDS assumes that the recruitment process takes place in a hidden social network connecting members of the study population. With the understanding that the structure of this hidden network likely affects individual subjects’ likelihood of being recruited, many researchers have sought to determine sampling probabilities for design-based estimation of population means (e.g., human immunodeficiency virus [HIV] infection prevalence). Salganik and Heckathorn (2004) constructed a model of the recruitment process in which subjects receive only one coupon and can be recruited infinitely many times. They modeled the recruitment as a random walk with replacement on the hidden population social network. When this walk is at “equilibrium,” they argued that the probability that a given subject is sampled is proportional to his or her network degree. Salganik and Heckathorn and Volz and Heckathorn (2008) proposed a Horvitz-Thompson type estimator for the population mean, in which observations are weighted by the inverse of the subject’s degree. Aronow and Crawford (2015) clarified the conditions under which this estimator has good statistical properties, and Gile (2011) derived a related estimator whereby sampling is without replacement.
Unfortunately, the characterization of the RDS recruitment process as a sampling design, whereby sampling probability is a function of network degree alone, suffers from some fundamental flaws. First, RDS recruitment is always without replacement, because subjects cannot be recruited more than once; second, a without-replacement random walk on a network is never at equilibrium with respect to its probability of sampling particular subjects—once a subject is recruited, he or she can never be visited by the recruitment process again; and third, if the recruitment process operates on the social network connecting the sampled individuals and seeds are not chosen at random, the network structure itself determines the probability that a given person will be reached by the recruitment chain. Indeed, for a given sample size n on a fixed population network, any potential subject whose minimum path length to a seed is greater than n has sampling probability
It is difficult to determine the correct sampling probabilities for recruited subjects under RDS because the underlying social network is only partially observed (Gile and Handcock 2010, 2015). The unobserved links between recruited subjects, and between recruited and unrecruited population members, constitute missing data in RDS studies. Characterization of the network on which the sampling process takes place is therefore a major methodological frontier in research on estimation from RDS (Handcock and Gile 2010). Remarkably, a typical RDS study reveals a great deal of information about the network of respondents: the observed degrees, recruitment chain, and patterns of coupon allocation and depletion are all readily available and provide valuable information about the local structure of the population network. Insight into the information content of data from RDS studies would clarify exactly which network and population properties researchers can hope to estimate, and which they cannot, in real-world studies. In particular, a better understanding of the network on which RDS recruitment operates could facilitate computation of marginal sampling probabilities similar to those calculated by Gile and Handcock (2015) for use in Horvitz-Thompson-type estimators for population means (e.g., Volz and Heckathorn 2008). Alternatively, specification of a probability model for dependence between trait values of vertices that share an edge in G may allow regression approaches to population estimation and adjustment for dependence in outcomes induced by the network structure (e.g., Bastos et al. 2012). An estimate of the subnetwork of respondents in an RDS study could also be used to estimate the size of the target population in a manner analogous to the “network scale-up” population size estimator (Bernard et al. 2010; Feehan and Salganik 2014; Killworth et al. 1998).
In addition to its statistical uses for population-level inference, the subnetwork of respondents is of inherent sociological and epidemiological interest. The network connecting sampled subjects reveals social links between participants and possible avenues for transmission of ideas, behaviors, practices, or infectious agents. Comprehensive sociometric mapping can be difficult and costly in hidden populations, and many researchers have attempted to estimate epidemiological properties of recruited individuals’ networks from recruitment data obtained by RDS (e.g., Cepeda et al. 2011; Li et al. 2011; Liu et al. 2009; Stein, Steenbergen, Buskens, et al. 2014; Stein, Steenbergen, Chanyasanha, et al. 2014). The ability to estimate features of the subnetwork of respondents in an RDS study would place sociological and epidemiological inquiries about the local network onto firmer theoretical and methodological ground.
In other areas of network theory, researchers have made progress in reconstructing networks from partial observation. When links are missing, some techniques assume that subjects with similar traits are likely to be connected (Atchade 2011; Leskovec, Huttenlocher, and Kleinberg 2010; Lü and Zhou 2011; Koskinen et al. 2013). When vertices, edges, or egocentric networks are sampled, several authors have proposed ways of estimating global network properties (Bliss, Danforth, and Dodds 2014; Goyal, Blitzstein, and de Gruttola 2014; Smith 2012) or when vertices can be observed more than once (Frank and Snijders 1994; Yan and Gregory 2013). Sometimes dynamic or random processes can reveal structural information about networks (Kramer et al. 2009; Shandilya and Timme 2011; Linderman and Adams 2014). Gile (2011) and Gile and Handcock (2015) presented methods for random graph model-assisted inference of the degree distribution from RDS, but they still assume that sampling probability is a function of network degree alone.
In this article, we show how to use data from RDS studies to probabilistically reconstruct the social network of respondents. We first define the observed data under RDS and construct a realistic continuous-time model of the RDS recruitment process on a graph. The model is a simple and natural formalization of the RDS recruitment procedure initially defined by Heckathorn (1997). Interrecruitment waiting times carry information about the network edges linking recruiters to unsampled individuals at each moment in time. We combine this timing information, knowledge of who recruited whom, who had coupons at which times, and the network degrees of recruited subjects to place a well-defined probability distribution on the structure of the recruitment-induced subgraph. A fundamental result of this article is that under simple and realistic assumptions, the likelihood of the recruitment process on a hidden graph can be interpreted as an exponential random graph model (ERGM). We describe a technique for jointly estimating the recruitment-induced subgraph and recruitment rate. An important feature of the algorithm is a computationally efficient method to calculate the likelihood of the recruitment-induced subgraph. We validate the proposed technique using simulated and real networks and apply it to an RDS study of injection drug users in St. Petersburg, Russia. We conclude with a new perspective on the information content of data from RDS studies.
2. Definitions and Assumptions
We begin by stating definitions and assumptions to ensure that the graph inference problem is well posed. (We use the terms graph and network interchangeably.) The first is implicit in the foundational work on RDS and guarantees that the objects under study exist (Heckathorn 1997; Salganik and Heckathorn 2004; Volz and Heckathorn 2008).
Assumption (A-1): The hidden population exists and has finite size N. The social network connecting members of the hidden population is an undirected graph
Members of the hidden population are vertices in V. A vertex is recruited if it is known to the study. A vertex is a recruiter if it has at least one coupon and at least one unrecruited neighbor; a susceptible vertex is unrecruited and has at least one neighbor who is a recruiter. A susceptible edge connects a recruiter and a susceptible vertex, and recruitments can take place only across susceptible edges. A recruited vertex cannot be recruited again. At the moment it is recruited, a vertex is endowed with a non-negative number of coupons it may use to recruit its susceptible neighbors. Every recruitment reduces the number of coupons held by the recruiter by one. When all the coupons belonging to a recruiter vertex are depleted, the vertex is no longer a recruiter, and any edges incident to it are no longer susceptible. Seeds are recruited vertices chosen from the entire population of vertices by some mechanism, not necessarily random, usually at the beginning of the study. Seeds are not considered to have been recruited by any other vertex.
Definition 1 (Recruitment-induced Subgraph): The recruitment-induced subgraph is
Definition 2 (Recruitment Graph): The directed recruitment graph is
Because subjects cannot be recruited more than once,
Definition 3 (Coupon Matrix): Let
The RDS recruitment process reveals only some of this information to researchers.
Assumption (A-2): The observed data consist of
In particular, researchers do not observe the recruitment-induced subgraph

Example of unobserved and observed data in respondent-driven sampling (RDS). The true hidden population network is G. One seed is chosen (the vertex marked 1), and the RDS recruitment proceeds with each recruited vertex receiving two coupons. The directed recruitment graph
We now state three assumptions about the behavior of recruiters and their knowledge of the recruitment status of their neighbors.
Assumption (A-3): Vertices become recruiters immediately upon entering the study and receiving one or more coupons. They remain recruiters until their coupons or susceptible neighbors are depleted, whichever happens first.
Assumption (A-4): When a susceptible neighbor j of a recruiter i is recruited by any recruiter, the edge connecting i and j is immediately no longer susceptible.
By assumption (A-4), recruitment is competitive: the first recruiter to recruit a given susceptible vertex immediately removes it from the pool of susceptibles. Finally, we specify a parametric waiting time distribution for the time it takes for a recruiter to recruit a susceptible neighbor.
Assumption (A-5): The time to recruitment along an edge connecting a recruiter to a susceptible neighbor has exponential distribution with rate
By assumption (A-5), waiting times to recruitment along susceptible edges are independent and elapse concurrently in continuous time, so recruitment is simultaneous. Together, assumptions (A-3), (A-4), and (A-5) place a well-defined probability distribution on the recruitment-induced subgraph of respondents.
2.1. Consequences of the Waiting Time Assumption
The results below follow directly from assumption (A-5). Let R be the set of recruiters with coupons and let S be the set of susceptible vertices at a certain moment in the recruitment process. Let
Proposition 1: Given that the recruiter u recruits one of its susceptible neighbors
Proposition 2: The waiting time to the next recruitment of any susceptible vertex is distributed as
Proofs of propositions 1 and 2 are given in the online Appendix. Intuitively, proposition 2 means that the new recruited vertex is chosen with probability proportional to the number of edges along which it can be recruited. These results formalize the consequences of simultaneous and competitive recruitment in continuous time.
Interestingly, assumptions (A-3) to (A-5) and the resulting recruitment probability differ starkly from the recruitment dynamics used in simulations by other researchers to test the performance of estimators for RDS. Gile and Handcock (2010) simulated the RDS recruitment process by first choosing seeds, after which “subsequent sample waves were selected without-replacement by sampling up to two nodes at random from among the unsampled alters of each sampled node” (p. 303). This leads to a brief corollary establishing the difference between these approaches.
Corollary 1: Assumptions (A-3) to (A-5) (simultaneous and competitive recruitment) result in different conditional recruitment probabilities than the RDS recruitment implementation of Gile and Handcock (2010).
A proof is given in the online Appendix. The process defined by Gile and Handcock (2010) requires that recruiters “take turns.” This approach implicitly requires that recruiters have knowledge about the behavior of other recruiters—even those to whom they are not connected in the network. This process induces a different distribution on the susceptible degree, and hence on the overall degree, of the new recruit than the model described in assumptions (A-3) to (A-5) of this article. Most existing methods for population inference from RDS data depend intimately on the degree distribution of recruited vertices (e.g., Gile 2011; Salganik and Heckathorn 2004; Volz and Heckathorn 2008), so it is important to highlight scenarios when methods for simulation of recruitment dynamics differ.
3. Likelihood of the Recruitment Time Series
Proposition 2 shows that under assumptions (A-3) to (A-5), the rate of recruitment is proportional to the number of susceptible edges. Given a realization of the recruitment-induced subgraph
Definition 4 (Compatibility): An estimated subgraph
The vertices in the estimated subgraph are identical to the set of recruited vertices:
All directed recruitment edges are represented as undirected edges: for each
The number of edges in
Let
Let

Examples of matrices used to calculate the recruitment time series likelihood. At top left is the recruitment graph
Proposition 3: Under assumptions (A-1) to (A-5), the likelihood of the recruitment time series is
where
is a vector whose elements are the number of susceptible edges just before each recruitment event.
A proof is given in the online Appendix. As before, the rate of recruitment is proportional to the number of susceptible edges, and proposition 3 generalizes proposition 2 by providing an explicit expression for the number of susceptible edges at each step, taking coupons into account and allowing for seeds to be added at any time.
Although equation (1) is the likelihood of the recruitment time series
4. Reconstructing the Recruitment-induced Subgraph
Together, the compatibility conditions (definition 4) and proposition 3 make possible simultaneous estimation of the recruitment-induced subgraph
where
Computationally efficient Monte Carlo sampling of
When only a single “most likely” subgraph
5. Validation by Simulation
In simulation studies, reasonably accurate reconstruction of the recruitment-induced subgraph
6. Application
The HIV epidemic in St. Petersburg, Russia, is concentrated in people who inject drugs (PWID). At least 12,000 people are registered as drug users, but the number of current PWID is likely much higher (Heimer and White 2010). Injection drug use is highly stigmatized in the Russian Federation, and criminal penalties for drug possession can be severe. PWID suffer from high rates of HIV infection and may lack access to treatment and health-related educational resources (Niccolai et al. 2010, 2011).
As part of a study to assess perceived barriers to use of HIV prevention and treatment services,

Raw data from a respondent-drivien sampling (RDS) sample of
Participation in the study was limited to current injection drug users over the age of 21 years who had injected within the previous four weeks. Subjects’ status as PWID was verified either by inspection of arms for injection marks or explanation of drug preparation. Subjects received a voucher with a value of about US$20 for being interviewed and a secondary reward with value about US$10 for recruiting another eligible subject. Following their interview, each subject received three coupons, and no subject could be recruited more than once. Informed consent was obtained from all participants, and the study was approved by the Yale University and Stellit (St. Petersburg) institutional review boards.
Figure 4 shows the observed data

Raw respondent-driven sampling data
A subject’s minimum degree was defined as the number of undirected edges incident to that subject in the recruitment graph
Construction of
Overall recruitment of participants in this study was rapid: the mean time between interviews was
Figure 5 shows the MAP estimate of the adjacency matrix for all 813 sampled subjects (left) and inset submatrix (right). Recruitment edges appear in gray. The apparent bands in the adjacency matrix represent high-degree individuals with many nonrecruitment edges. Probabilistic assignment of these edge ends to other recruited individuals depends on the timing of recruitments of other subjects. The blocklike structure evident in this adjacency matrix may indicate subnetworks of highly connected individuals. Subjects recruited nearby in time may be more likely to know one another, even if they are not linked by a recruitment edge.

Maximum a posteriori (MAP) estimate of
7. Discussion
Nearly every paper on statistical methods for RDS data states or assumes a version of assumption (A-1): the social network connecting members of the hidden population exists and determines the sampling probabilities. But because this network is only partially observed in real-world RDS studies, assumption (A-1) is usually disregarded in the formulation of statistical estimators. Instead, researchers usually make the simplifying assumption that sampling probability is proportional to degree and does not otherwise depend on subjects’ location in the network. This simplification is justified by a thought experiment in which the rules of the game are altered: subjects can be recruited infinitely many times, each subject receives only one coupon, and this process continues for an infinitely long time (Goel and Salganik 2009; Salganik and Heckathorn 2004; Volz and Heckathorn 2008).
In this paper, we have embraced assumption (A-1) and its natural consequence: RDS recruitment happens across edges in the network connecting members of the hidden population. This point of view emphasizes that RDS is more like a stochastic spreading process on a hidden network than a survey sampling method. We define a simple continuous-time model for RDS recruitment on a hidden population graph using the kind of data obtained by every RDS study. The model results in sensible nonuniform conditional recruitment probabilities: the next subject is recruited with probability proportional to the number of edges he or she shares with recruiters (proposition 2), not their total network degree. Combining this model with the observed data from an RDS study allows joint estimation of the recruitment-induced subgraph
This approach yields two computational benefits. First, the time required to evaluate the likelihood via proposition 3 is a function of the sample size n alone, and it does not depend on the population size N, which is likely to be much larger. In particular, we never simulate unobserved portions of the population network G; the ERGM (equation 1) specifies a probability model for the recruitment-induced subgraph
Our approach is unique because it uses all the available data
Historically, there have been two major statistical objections to RDS as a survey design for inference of population quantities. First, sampling probabilities cannot be computed directly from the observed data without additional assumptions (Gile 2011; Gile and Handcock 2010). Second, there may be statistical dependence between the traits of a given subject and his or her neighbors (particularly their recruiter) in the network (Fisher and Merli 2014; Heckathorn 1997, 2002; Tomas and Gile 2011). This dependency might be due to homophily—the tendency for people to form social ties with others similar to themselves—or preferential recruitment of certain types of people, conditional on existing social ties. Clearly, the network structure local to the seeds and recruitment chain encodes the sampling probabilities and the statistical dependencies between subjects’ attributes. This leads us to the conclusion that a fundamental obstacle to principled statistical inference for RDS is missing data: in RDS, not all network neighbors of a vertex i are observed, either because they remain unsampled, or because the recruitment graph
Although the network may be of interest for sociological reasons, it can also be viewed as a nuisance parameter when population attributes are of primary interest. Marginalizing (integrating) over the recruitment-induced subgraph
In conclusion, we offer a mixed message about the prospect for statistically rigorous analysis of data from real-world RDS studies. First, current estimators for population characteristics depend on assumptions that bear little similarity to RDS recruitment processes on social networks, and they do not use all the available data. This may account for their poor performance in empirical studies. Second, and more optimistically, data from RDS studies contain far more information about the social network connecting respondents than has been acknowledged. Estimation of population-level characteristics should therefore proceed from knowledge about the network of sampled subjects: The subgraph
Footnotes
Acknowledgements
I am especially indebted to Edward H. Kaplan, Robert Heimer, Peter M. Aronow, and Leonid Chindelevitch for providing detailed comments on the manuscript. I also thank Yakir Berchenko, Russel Barbour, Alexander Bazazi, Lin Chen, Krista Gile, Mark Handcock, Olga Levina, Aleksandr Sirotkin, Edward White, Jiacheng Wu, Alexei Zelenev, and Li Zeng for valuable suggestions and discussion.
Funding
This work was supported by National Institutes of Health (NIH) grant KL2 TR000140, National Institute of Mental Health grant P30 MH062294, the Yale Center for Clinical Investigation, and the Yale Center for Interdisciplinary Research on AIDS. The Project 90 data were obtained from the Office of Population Research at Princeton University (
). The RDS data presented in the application are from the Influences on HIV Prevalence and Service Access among IDUs in Russia and Estonia study, funded by NIH/National Institute on Drug Abuse grant 1R01DA029888 to Robert Heimer and Anneli Uusküla (co–principal investigators). I made use of the Yale University Biomedical High Performance Computing Center, funded by NIH grant RR029676-01.
Author Biography
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
