Abstract
Sub-studies addressing specific research questions may require selecting a random sample from different eligibility groups of the World Trade Center Health Registry (WTCHR) to estimate a population mean. For such situations, Singh (1988) proposed two two-stage sampling strategies for overlapping clusters, and Osahan (1997) proposed a third. These three strategies are compared in terms of their bias and efficiency using the WTCHR's overlapping eligibility groups.
Introduction
The terrorist events of September 11, 2001 (9/11) in New York City led to almost 3,000 deaths, including more than 2,200 civilians, 343 firefighters, and 60 police officers. Tens of thousands of lower Manhattan building occupants, residents, and school children were evacuated and had their lives and livelihoods disrupted for months to years afterward. The rescue, recovery, and clean-up effort involved city, state, and federal agency employees as well as contracted workers and volunteers from all 50 states (Farfel et al., 2008).
The World Trade Center Health Registry (WTCHR) is the largest post-disaster registry in U.S. history (Brackbill et al., 2009). Its goals are to document short- and long-term 9/11-related physical and mental health impacts and gaps in care, connect individuals to care, and inform 9/11 health care policy and response planning for future disasters.
The WTCHR comprises enrollees from four overlapping eligibility groups (rescue and recovery workers; building occupants and passersby; residents; school students/staff). Sub-studies addressing specific research questions may necessitate selecting a random sample from different WTCHR eligibility groups. However, many enrollees are members of multiple eligibility groups, so when random samples are selected separately from two eligibility groups, some sampled units may belong to both groups or also to a third eligibility group.
This overlapping eligibility complicates the construction of estimators. To minimize estimator bias and variability, either a sampled person should belong to only one group, or a statistical weight that accounts for the number of eligibility groups the person belongs to should be applied. Using this technique, efficient estimators can be constructed from WTCHR samples even when the overlap between eligibility groups is substantial.
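The weighting idea can be illustrated with a minimal Python sketch. It assumes the simplest form of weight, 1/F, where F is the number of eligibility groups a person belongs to; this is a generic multiplicity adjustment and is not claimed to be the exact estimator of Singh (1988) or Osahan (1997). The group names, person labels, and ages are hypothetical toy data.

```python
from collections import Counter

def multiplicity_counts(groups):
    """Return F[j]: the number of groups that contain unit j."""
    counts = Counter()
    for members in groups.values():
        counts.update(set(members))
    return counts

def weighted_group_totals(groups, y):
    """Return Z[g]: the sum of y[j] / F[j] over the members j of group g."""
    F = multiplicity_counts(groups)
    return {g: sum(y[j] / F[j] for j in set(members))
            for g, members in groups.items()}

# Hypothetical toy data: three overlapping groups and an age for each person.
groups = {
    "workers": ["a", "b", "c"],
    "occupants": ["b", "c", "d", "e"],
    "residents": ["c", "e", "f"],
}
age = {"a": 30, "b": 40, "c": 50, "d": 20, "e": 60, "f": 45}

Z = weighted_group_totals(groups, age)
N = len(age)  # number of distinct persons, assumed known
mean_from_weights = sum(Z.values()) / N
plain_mean = sum(age.values()) / N
```

Because each person's value is split evenly across the groups that contain that person, the weighted group totals add up to the population total, so the two means agree even though persons b, c, and e are listed in more than one group.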
Overlapping clusters are common in many types of epidemiological investigations. Special sampling strategies can be used to accommodate overlapping clusters so that the resulting estimates remain statistically valid. Such strategies have been used in regional epidemiological surveys for infectious diseases such as tuberculosis (TB) (Osahan, 1997). These improved strategies were also used to study tuberculosis transmission in correctional facilities (Bellin et al., 1993), in schools (Braden et al., 1995), in shelters (CDC, 1991), in an airplane (Driver et al., 1994), in a church (Dutt et al., 1995) and in a bar (Kline et al., 1995).
These sampling strategies are equally applicable to household clusters in ecological surveys formed around factories that burn coal and emit polycyclic aromatic hydrocarbons (PAHs) (Menzie et al., 1992). Overlapping clusters are also important in market research, for example when an advertising company wishes to identify the range of TV stations in order to determine the size of their clusters of viewers. Many other situations can benefit from their application.
The objective of this study is to examine the relative efficiencies of three sampling strategies as applied to the WTCHR population. The variances of the unbiased estimators under Probabilities Proportional to Size (PPS) and Probabilities Proportional to Size Without Repetitions of second-stage units (PPSW) are compared with the mean square error of the biased estimator under Simple Random Sampling (SRS). The suitability of the WTCHR dataset for dual sampling methodology has yet to be explored (Singh et al., 2014).
Geographic distribution of Lower Manhattan registrants’ primary residence by Zip Code in New York City on September 11, 2001 (Farfel et al., 2008).
The WTCHR was created in July 2002 as a collaborative effort between the Agency for Toxic Substances and Disease Registry (ATSDR) and the New York City Department of Health and Mental Hygiene (NYC DOHMH). Development of eligibility criteria (Farfel et al., 2008) took into account proximity by time and place to the WTC attack, acute exposure to the dust and debris cloud that resulted from the collapse of the towers, and chronic exposure to dust, smoke and fumes in the vicinity of the WTC site through the end of the recovery phase in June 2002. The WTCHR was approved by the institutional review boards of NYC DOHMH and U.S. Centers for Disease Control and Prevention.
Initial data collection
Of the 71,437 baseline interviews conducted from September 2003 through November 2004 (2–3 years post-9/11) with people who voluntarily enrolled in the WTCHR, 67,527 (95%) were completed using computer-assisted telephone interviewing (CATI) and the remaining 3,910 were completed using in-person, computer-assisted personal interviewing (CAPI).
Overlapping eligibility groups
All enrollees were categorized into four broad overlapping categories known as eligibility groups:
(1) rescue and recovery workers and volunteers; (2) building occupants, passersby, and people in transit; (3) residents south of Canal Street; and (4) school students and staff.
Figure 1 is a map of lower Manhattan showing the WTC site (Farfel et al., 2008). The largest group of WTCHR enrollees included people present in lower Manhattan near the WTC site on the morning of 9/11 (
The number of persons eligible for the Registry was estimated to be approximately 409,000, of whom 71,437 (17.4%) enrolled (Murphy et al., 2007). Outreach and multilingual media campaigns encouraged enrollment through a toll-free number or Web site (classified as “self-identified”). Lists of persons potentially exposed were provided by entities such as employers and governmental agencies (classified as “list-identified”). Final enrollment was 70% self-identified and 30% list-identified (Brackbill et al., 2009). The percentage of list-identified enrollees also varied across eligibility groups, ranging from 14% among students to 37% among workers.
Sampling strategies in overlapping clusters
Cluster sampling is sometimes used in surveys to avoid constructing a list of sampling units (a sampling frame), which can be challenging to develop. In contrast, a frame of clusters is economical to generate and is often readily available. Developing a list of households for a state in the USA may be challenging, but lists are available for smaller units such as cities, counties and zip codes. Similarly, it may not be possible to list all of the customers of a chain of stores in an area, but it would be possible to obtain such a list from each store and treat the stores as clusters. In such situations, cluster sampling is beneficial. Clusters are formed either before selecting the sample (CBS) or after selecting the sample (CAS). In the two practical situations discussed above, the CBS system of cluster formation is applicable, and clusters may be overlapping or non-overlapping. For non-overlapping clusters, a number of standard cluster sampling methodologies are available in the literature (Cochran, 1977).
For overlapping clusters, however, the literature is limited (Sethi, 1965; Goel & Singh, 1977; Agarwal & Singh, 1982; Amdekar, 1985; Singh, 1988; Osahan, 1997). Singh (1988) developed two sampling strategies in overlapping clusters in the CBS system for estimating the population mean, assuming known population size. An extension of this method was undertaken by Tracy and Osahan (1994).
In the first strategy of Singh (1988), SRS, clusters were selected with equal probabilities and second-stage units were taken with equal probabilities.
In the second strategy, PPS, clusters were selected with probabilities proportional to the size of the cluster and second-stage units were taken with equal probabilities.
In the improved sampling strategy proposed by Osahan (1997), PPSW, clusters were selected with probabilities proportional to the size of the cluster and, for the second-stage units, a single big sub-sample was selected, so there was no chance of repetition of second-stage units.
The above approach was expected to provide a better estimator than Singh's (1988). The comparison was also simpler and more direct, whereas Singh's approach required the supporting evidence given by Hansen and Hurwitz (1943). The proposed strategy was discussed in the context of a tuberculosis outbreak but is equally applicable to other situations where sampling is required from a population consisting of overlapping clusters.
The two sampling strategies of Singh (1988) and the third of Osahan (1997) are two-stage sampling strategies that can be efficiently applied to a population composed of overlapping clusters (first-stage units). The goal of the present investigation is to empirically compare the efficiency of these three sampling strategies in overlapping eligibility groups in the WTCHR population.
Let
Assume that these
equality holds for non-overlapping clusters, a situation where no unit is common to more than one cluster. But in fact, a unit may be associated with more than one cluster and let Fj be the frequency of j-th unit occurring in
Define
where
From (2.3) and (2.7) of Singh (1988), the mean square error (MSE, because the estimator is biased) of the first strategy and the variance of the second strategy are given as
and
where
and
In the first and second strategies, clusters are selected by simple random sampling with replacement (SRSWR) and by probability proportional to size with replacement (PPSWR), respectively. In both strategies, the second-stage units are selected by simple random sampling without replacement (SRSWOR).
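The selection mechanics of these two strategies can be sketched in Python as follows. This is an illustrative sketch of sample selection only, not of the estimators; the function names, the fixed second-stage sample size m per selected cluster, and the toy clusters are assumptions made for the example.

```python
import random

def select_clusters(sizes, n, pps, rng):
    """First stage: draw n clusters with replacement.

    pps=False gives equal probabilities (SRSWR); pps=True gives
    probabilities proportional to cluster size (PPSWR).
    """
    names = list(sizes)
    weights = [sizes[c] for c in names] if pps else None
    return rng.choices(names, weights=weights, k=n)

def two_stage_sample(clusters, n, m, pps, seed=0):
    """Return (cluster, units) pairs; units are drawn by SRSWOR within a cluster."""
    rng = random.Random(seed)
    sizes = {c: len(units) for c, units in clusters.items()}
    drawn = select_clusters(sizes, n, pps, rng)
    # rng.sample draws without replacement inside the selected cluster.
    return [(c, rng.sample(clusters[c], m)) for c in drawn]

# Hypothetical overlapping clusters: units 30-49 are in A and B, 100-129 in B and C.
clusters = {
    "A": list(range(0, 50)),
    "B": list(range(30, 130)),
    "C": list(range(100, 400)),
}
sample = two_stage_sample(clusters, n=4, m=10, pps=True)
```

Within one draw no unit repeats, but because clusters are drawn with replacement and overlap with one another, the same unit can still appear more than once in the combined sample.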
The two-stage sampling strategy proposed by Osahan (1997) consists of the following steps:
Let
If i-th cluster is selected
This strategy leads to a fully without replacement sample, i.e., there is no chance of repetition in the ultimate sample of units.
The variance of the estimator is given as
In Singh’s (1988) latter strategy,
Eligibility Group Characteristics, World Trade Center Health Registry, 2003–2004,
Overlap of three major enrollment groups in the WTC Health Registry (its overlap is described in text and Table 1).
Using data from the World Trade Center Health Registry as described above, we dropped the student and staff group and categorized 71,228 enrollees (including students and staff who qualified) into three main categories, as follows:
(1) rescue and recovery workers and volunteers; (2) building occupants, passersby, and people in transit south of Chambers Street; and (3) residents south of Canal Street.
These three large groups were considered in this study because an enrollee may be associated with more than one eligibility group.
Twenty-six percent of enrollees met more than one eligibility criterion. The greatest overlap was among building occupants, passersby, and people in transit who were also either rescue/recovery workers (
The sizes of the eligibility groups were 14,665 for residents, 30,655 for rescue and recovery workers and volunteers, and 43,487 for building occupants, passersby, and people in transit. For the present illustration, only these three large eligibility groups were used to compare the three sampling strategies.
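As a quick arithmetic check on these figures, the short Python sketch below adds up the three group sizes and compares the total with the 71,228 distinct enrollees. The variable names are illustrative, and the difference computed here is only a count of extra group memberships; it is not presented as the overlap measure used in the study.

```python
# Group sizes and number of distinct enrollees as reported in the text.
group_sizes = {
    "residents": 14_665,
    "rescue_recovery": 30_655,
    "occupants_passersby_transit": 43_487,
}
unique_enrollees = 71_228

# Total memberships counts a person once for every group he or she is in.
total_memberships = sum(group_sizes.values())             # 88,807
# Memberships beyond one per person arise only from overlap.
extra_memberships = total_memberships - unique_enrollees  # 17,579
```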
The overlap among eligibility groups in this population is shown in Fig. 2. The percentage of overlap (OL) was calculated as
To illustrate, we estimated 'age on 9/11' (rounded to the nearest integer) under the three sampling strategies in overlapping eligibility groups. Missing values were imputed with the average age, separately within each group, for 88 residents, 82 rescue and recovery workers and volunteers, and 257 building occupants, passersby, and people in transit. Table 1 compares the two sampling strategies of Singh (1988), SRS and PPS, with the PPSW strategy of Osahan (1997).
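A minimal Python sketch of group-wise mean imputation is given below. It assumes missing ages are coded as None and that each group is imputed from its own observed values; the function name and the toy ages are hypothetical, and the handling of enrollees who belong to more than one group is not addressed here.

```python
def impute_group_mean(ages):
    """Replace None entries with the mean of the observed ages in the list."""
    observed = [a for a in ages if a is not None]
    mean_age = sum(observed) / len(observed)
    return [mean_age if a is None else a for a in ages]

# Hypothetical toy data: None marks a missing age on 9/11.
ages_by_group = {
    "residents": [34, None, 52],
    "workers": [41, 29, None, 38],
}
imputed = {g: impute_group_mean(a) for g, a in ages_by_group.items()}
```

Mean imputation leaves each group's mean unchanged but reduces its variance slightly, which matters little when only a few hundred values out of tens of thousands are missing.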
Some other important parameters like WTCHR eligibility group size, sample sizes taken, and means and variances of Zi’s are also shown in the table to shed light on their computations. A second set of comparisons formed by selecting samples of 0.5% from the population of the same three main eligibility groups is also shown.
The percent relative efficiency (RE) of the compared strategies varied from 100.03 to 182.89. PPSW was nearly 83% more efficient at generating a stable sample for estimation of age on 9/11 in the three overlapping eligibility groups as compared to SRS. The WTCHR population of the three main overlapping eligibility groups is exceptionally large (
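The link between a percent relative efficiency of 182.89 and the statement "nearly 83% more efficient" can be shown with a small Python sketch. It assumes the common convention that RE is 100 times the ratio of the reference strategy's variance (or MSE) to that of the compared strategy; the function names and the example variances are hypothetical.

```python
def percent_relative_efficiency(var_reference, var_compared):
    """RE (%) of the compared strategy relative to the reference strategy."""
    return 100.0 * var_reference / var_compared

def percent_gain(re_percent):
    """Efficiency gain over the reference, in percentage points above 100."""
    return re_percent - 100.0

# Reported upper end of the RE range in the text.
gain = percent_gain(182.89)  # about 82.89, i.e. "nearly 83%"

# Hypothetical variances, only to show how an RE value is formed.
example_re = percent_relative_efficiency(var_reference=2.0, var_compared=1.0)
```

Under this convention, an RE close to 100 (such as the reported 100.03) means the two strategies are practically equivalent.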
The present empirical investigation is unique in its nature, as it is the only one that considers the real overlap in eligibility groups of the WTCHR population. In earlier studies, very small populations were constructed to compare variances of the sampling strategies. For example, Tracy and Osahan (1994) created two hypothetical populations of
There is a realistic limitation of unknown population size in such studies dealing with overlapping clusters, though this is not the case in the present study where population size is known in advance.
Acknowledgments
I would like to thank the World Trade Center Health Registry, NYC Department of Health and Mental Hygiene, for allowing me to use the WTCHR Wave 1 dataset to prepare this manuscript and to reprint Fig. (Farfel et al., 2008). I also wish to thank Cheryl Stein, Robert Brackbill, Mark Farfel, Sharon Perlman and Charon Gwynn for their critical review. The comments of the referees on an earlier version were most helpful in improving this manuscript.
Conflict of interest
The author has no conflicts of interest to disclose.
