Abstract
Mixed methods studies of human disease that combine surveillance, biomarker, and qualitative data can help elucidate what drives epidemiological trends. Viral genetic data are rarely coupled with other types of data due to legal and ethical concerns about patient privacy. We developed a novel approach to integrate phylogenetic and qualitative methods in order to better target HIV prevention efforts. The overall aim of our mixed methods study was to characterize HIV transmission clusters. We combined surveillance data with HIV genomic data to identify cases whose viruses share enough similarities to suggest a recent common source of infection or participation in linked transmission chains. Cases were recruited through a multi-phase process to obtain consent for recruitment to semi-structured interviews. Through linkage of viral genetic sequences with epidemiological data, we identified individuals in large transmission clusters, which then served as a sampling frame for the interviews. In this article, we describe the multi-phase process and the limitations and challenges encountered. Our approach contributes to the mixed methods research field by demonstrating that phylogenetic analysis and surveillance data can be harnessed to generate a sampling frame for subsequent qualitative data collection, using an explanatory sequential design. The process we developed also respected protections of patient confidentiality. The novel method we devised may offer an opportunity to implement a sampling frame that allows for the recruitment and interview of individuals in high-transmission clusters to better understand what contributes to spread of other infectious diseases, including COVID-19.
In the health sciences, “mixed methods” research has long been at least an implicit part of many researchers’ practice (Andrew & Halcomb, 2009), but the language used to describe the mixing, and the particular methods mixed, have been highly variable (Creswell et al., 2011). Critical engagement with mixed methods as a formal approach to research in the field is growing (Creswell, 2009), and is increasingly of interest to the U.S. National Institutes of Health (Creswell, 2010; Plano Clark, 2010). As a major funder of HIV research, U.S. National Institutes of Health priorities affect the field, and quantitative and qualitative components are now often incorporated within larger research projects, including clinical trials. A famous example is the VOICE trial, which tested multiple pharmaceutical methods of preventing HIV in women. VOICE failed to produce evidence of effectiveness for any of the study arms despite good self-reported adherence during the trial; when drug concentration levels in participants’ blood were measured, however, results suggested widespread nonadherence (Marrazzo et al., 2015). A qualitative posttrial substudy was conducted to explore participants’ reasons for overreporting adherence (Montgomery et al., 2017).
The drug concentration measurements in the VOICE study are an example of biomarkers, which can be defined simply as measurable indicators of the physiological state of an organism. Oft-used biomarkers in HIV-related research include infection status (HIV or other sexually transmitted infections) and drug levels (e.g., to assess medication adherence, toxicity). In mixed methods studies, these are frequently coupled with surveys or semistructured interviews (Alary et al., 2015; Koester et al., 2015). One potentially useful biomarker that has not been frequently incorporated in mixed methods research is viral genetic data. Previously, processing times and cost to obtain such data were often prohibitive, but technological advances now make the use of this particular type of biomarker more feasible than ever before. The availability of such data, however, raises new kinds of questions and concerns in terms of research design and ethics. As such, the incorporation of genetic data into mixed methods studies requires careful consideration and, potentially, new kinds of partnerships and research practices.
This article documents the novel approach we developed to mix the analysis of genetic data with qualitative methods, in order to explore whether this could offer greater opportunities for targeting HIV prevention. We adopted an explanatory sequential design (Fetters et al., 2013), in which previously collected viral genetic samples were subjected to phylogenetic analysis, and the findings informed sampling for semistructured interviews from within so-called “transmission clusters” (more on this below). The two phases were connected, or integrated, not only at the point of selecting and recruiting participants for the qualitative component (Ivankova et al., 2006), but in other dimensions as well (e.g., team dimension, as our team was intentionally chosen to represent various disciplines; at the theoretical dimension through framing that employed the social ecological model; Fetters & Molina-Azorin, 2017). While use of the explanatory sequential design in HIV-related mixed methods projects was common at the time this research was conceived, to the best of our knowledge, no previously published studies mixed the particular methods we employed, in the way we devised. Indeed, phylogenetic analysis had not typically been mixed with other methods at all. Even today, publications from studies that bring phylogenetic analysis together with other methods do not discuss their study design using mixed methods terminology.
The overall aim of our mixed methods study was to characterize HIV transmission clusters in San Francisco. With this article, our goal is multifold. First, we aim to paint a general picture of the kinds of challenges we feel would be relevant to any mixed methods study incorporating similar biomarkers (e.g., construction of workable, multiparty partnerships, respect for participant privacy, feasibility of recruitment). Second, we offer detailed information on the method itself so that other researchers can attempt to replicate or adapt it to ethically use sensitive genetic data in a mixed methods approach. Third, we hope to encourage more scholarly conversations regarding the use of mixed methods in phylogenetics studies. In what follows, we outline the potential value of our novel approach, offer brief background on phylogenetics and its use in HIV research (including previous, mixed methods work), explain how our approach is different, and identify two main challenges it faced from the outset (ethical implementation and recruitment feasibility). Then, we detail the methods and provide recruitment results, along with a brief example of thematic findings from one cluster to better illustrate how the method works. In the Discussion section, we highlight surprises, lessons learned, and reflect briefly on the potential applicability of the method to research on other infections, including SARS-CoV-2.
Contribution to HIV Prevention Research
Before detailing our mixed methods technique and how it differs from previous work, we note why we believe devising a novel approach is worth the effort—that is, how we imagine it would contribute to HIV prevention research. In general, the hope is that this particular mixed method approach would produce data that, when integrated, would achieve an “expansive” fit (Fetters et al., 2013), enriching our understanding of the dynamics that structure HIV transmission within particular clusters by revealing previously unknown patterns among cluster members. Those findings could then help enhance the impact of prevention efforts. For instance, though it was a decade ago (Grant et al., 2010) that science demonstrated the effectiveness of HIV preexposure prophylaxis (PrEP—a daily pill taken to prevent HIV), this is still far from universal knowledge. If our research found that a large proportion of people in a particular cluster had never heard of PrEP, this could indicate that public health campaigns are missing important segments of their intended audience. On the other hand, a finding that certain locations are implicated in transmission clusters would suggest targeted prevention strategies, such as offering testing services and prevention materials at such epidemiological “hotspots,” could be of value. Finally, if homelessness or substance abuse formed the behavioral context of rapidly growing clusters, then a narrow focus on providing PrEP or HIV testing would be unlikely to be ideally effective. In short, by more fully characterizing transmission clusters, our approach could identify key drivers of the epidemic and help direct HIV prevention resources to their most efficacious use.
Brief Background on HIV and Viral Genetic Data
As noted above, in general, technological advances have made genetic analysis faster and cheaper, and therefore more feasible to incorporate in mixed methods studies. Regarding HIV specifically, there is another contributor to this feasibility. Since 2016, U.S. guidelines for HIV treatment have recommended antiretroviral drug resistance testing at time of diagnosis (Günthard et al., 2016), and this includes analyzing viral genetic sequences (i.e., the order of the virus’s genetic information). Thus, greater availability of HIV-related biomarkers is a by-product of appropriate treatment for the virus. However, even before this, viral sequences had been used in phylogenetic analysis, which seeks to understand the evolutionary development and diversification of the virus. Often the goal has been to identify transmission clusters, defined as groups of cases whose viruses share enough similarities to suggest a recent common source of infection or participation in linked chains of transmission. Epidemiological linkage data are displayed in phylogenetic “trees” (a diagram that shows the evolutionary interrelations of a group of genes from viruses derived from a common source), to visualize the transmission clusters. This analytical tool has long been used in HIV research (Leitner et al., 1996), though only relatively recently have researchers begun to explore incorporating sociodemographic information as a way to seek new possibilities for intervention, most specifically at the transmission cluster level.
Utility of Phylogenetics During the COVID-19 Pandemic
The current COVID-19 pandemic also demonstrates the value of phylogenetics in both the research and public health arenas. Viral sequencing of SARS-CoV-2 has been used to explore possible origins and reservoirs for the novel coronavirus (e.g., Rothan & Byrareddy, 2020; Shereen et al., 2020; Zhang et al., 2020), as well as its evolution in people (e.g., Stefanelli et al., 2020). Genomic sequencing of SARS-CoV-2 led to the recognition of the emergence of variants that are potentially more transmissible and cause more severe disease, such as B.1.1.7 initially identified in the United Kingdom (Rambaut et al., 2020), B.1.351 initially identified in South Africa (Tegally et al., 2020) and B.1.1.28 initially identified in Brazil (Candido et al., 2020). Phylogenetic analysis has also been used to characterize COVID-19 clusters, superspreader events and transmission dynamics, both retrospectively (J. F.-W. Chan et al., 2020; Deng et al., 2020; Gonzalez-Reiche et al., 2020; Lemieux et al., 2021) and prospectively (Meredith et al., 2020).
Phylogenetic Analysis in Mixed Methods Research
As noted previously, the push to combine phylogenetic data sets with clinical, epidemiological, and behavioral information is recent (Delva et al., 2016; Frost & Pillay, 2015; German et al., 2017). This has previously been pursued in two basic ways. In one approach, researchers typically access two existing pools of data, which may have accumulated over long periods of time: viral samples, and information drawn from sources like HIV case surveillance, contact tracing/partner notification, or clinical records. After conducting phylogenetic analysis on the viral samples and producing hypothesized transmission clusters, researchers “tag” sequences in the clusters with additional details about the individuals, for example, gender, race/ethnicity, exposure category, number of contacts reported, results of other lab work (Avila et al., 2014; P. A. Chan et al., 2015; Dennis et al., 2015; Jovanović et al., 2019; Levy et al., 2011; Lin et al., 2013; Little et al., 2014; Oster et al., 2011; Pasquale et al., 2016; Poon et al., 2015; Wertheim et al., 2017). Because interest in HIV transmission clusters often drives the research, social and demographic information are usually only added for clustered individuals. The limitation of this approach is that, as researchers do not generally attempt to relocate cluster members, only previously collected data can be added. In essence, this approach entails retrospectively building a bridge between two existing data sources that may not have been envisioned as complementary at the time of their creation. Although none of the studies cited here used this terminology, they might technically be described as using an explanatory sequential design (Fetters et al., 2013). The quantitative/phylogenetic analysis did occur first, and “qualitative data collection and analysis” followed, but only in the sense that qualitative data were retrieved after clusters were produced; scant “analysis” (if any) was performed on the nongenetic data separately.
An alternate, more prospective approach has sometimes been taken in research aiming to reconstruct and characterize specific local transmission networks. For these studies, participants recruited from particular locales consent to contributing two types of data at roughly the same time: viral samples for genotyping and demographic and risk behavior data, most often via computer-assisted self-interviews or questionnaires (Dennis et al., 2018; Grabowski et al., 2014; Jafa et al., 2009; Kharsany et al., 2014; Kiwuwa-Muyingo et al., 2017; Lee et al., 2009; Robertson et al., 2014). In contrast to the first group of studies discussed, this research is characterized by a convergent design, with the explicit intention of merging (Fetters et al., 2013) the two data sets. Using this approach, more (and more varied) nongenetic data were collected from research participants and hence were available to subsequently add to transmission clusters; however, sampling for the qualitative data collection was not driven by cluster membership because clusters had not been identified at the time data collection took place. A final detail bears mention about these previous efforts to incorporate phylogenetic analysis in a mixed methods approach: we could only locate one prior mixed method HIV transmission cluster research study that included semistructured interviews in the study design (Robertson et al., 2014). In this study, however, sampling for those interviews was not guided by cluster membership.
As methodologists will appreciate, these subtleties are significant because in any study, the type of data collected and the way participants are sampled have profound implications for the types of questions the resulting data set is capable of answering. Because the aim of our study was to discover whether deeply contextualized social aspects of HIV transmission clusters might reveal new opportunities for HIV prevention, two elements of research design were crucial: (1) that our sample for qualitative data collection be recruited with reference to cluster membership, and (2) that we collect data capable of sufficiently exploring cluster members’ experience. Hence, we developed an approach that uses HIV transmission clusters as a sampling frame for semistructured interviews to study HIV transmission clusters in San Francisco. (We also included any patients with new infections and a few patients not from a cluster to keep interviewers blinded to whether a particular participant was in a cluster; the same recruitment procedures were used for these patients, but their data are not included in this article.) Our approach combines two elements of previous work in a novel way: it pairs the sampling by cluster frequently employed in the first group of studies mentioned above with the strategy of collecting additional data directly from research participants used in the second group. Our use of semistructured interviewing is also novel, allowing a more holistic and in-depth exploration of participants’ perceptions and experiences than is possible through questionnaires or other structured methods (Arnold & Lane, 2011). Findings from thematic analysis of interview data by cluster will be reported in a separate manuscript.
Recruitment Challenges Using Surveillance Data
This method raised two important questions, acknowledged from the outset. First, would enough cluster members be locatable/contactable to render this approach feasible? Second, could we devise ethical ways to identify and recruit potential interviewees (especially regarding respect for privacy)? The latter concern has increasingly been raised with reference to phylogenetic research (Coltart et al., 2018; Mutenherwa et al., 2019; Schairer et al., 2017). This article details the procedures we developed, and answers these two questions.
Method
Overview
Collaborating Investigator Institutions
This study is a collaborative effort between the University of California San Francisco (UCSF) and the San Francisco Department of Public Health (SFDPH). The AIDS Research Institute (ARI)–UCSF Laboratory of Clinical Virology, which conducts HIV-1 drug resistance testing for publicly funded and community-based clinics in San Francisco, provided viral sequence data exclusively for the purposes of phylogenetic analysis for this study.
Identifying and Selecting Clusters
The parameters used for identifying transmission clusters have been described previously (O’Keefe et al., 2021; Truong et al., 2015). In brief, HIV transmission cluster membership was based on full protease and partial reverse transcriptase sequences. Transmission clusters were defined as having Shimodaira–Hasegawa node support greater than 0.90 and mean pairwise genetic distance less than 0.03 substitutions per site. Sequences from transmission clusters were matched to the San Francisco HIV/AIDS Case Registry to obtain data on demographic and risk characteristics. Purposeful selection of transmission clusters for possible inclusion in the qualitative component of the study was conducted between 2015 and 2017.
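To make the two thresholds concrete, the following is a minimal Python sketch of how they might be applied to a single candidate clade. It assumes that the node support value and the within-clade pairwise distances have already been produced by phylogenetic software; the function name, argument names, and example values are hypothetical and are not drawn from the study's pipeline.

```python
from statistics import mean

# Thresholds as stated in the text.
MIN_NODE_SUPPORT = 0.90    # Shimodaira-Hasegawa node support must exceed this
MAX_MEAN_DISTANCE = 0.03   # mean pairwise distance (substitutions per site) must be below this


def is_transmission_cluster(node_support, pairwise_distances):
    """Return True if a candidate clade meets both cluster criteria.

    node_support: support value for the clade's node (0 to 1).
    pairwise_distances: genetic distance for every pair of sequences
        in the clade, in substitutions per site.
    """
    if not pairwise_distances:
        return False  # a clade needs at least two sequences to form a pair
    return (node_support > MIN_NODE_SUPPORT
            and mean(pairwise_distances) < MAX_MEAN_DISTANCE)


# Hypothetical values, for illustration only.
print(is_transmission_cluster(0.97, [0.010, 0.020, 0.015]))  # well supported, closely related
print(is_transmission_cluster(0.97, [0.040, 0.050, 0.045]))  # well supported, too divergent
print(is_transmission_cluster(0.85, [0.010, 0.020, 0.015]))  # closely related, weakly supported
```

Both comparisons are written as strict inequalities because the text specifies "greater than" and "less than"; how boundary values were handled in practice is not stated.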
Recruitment
A multiphase recruitment process followed, focusing on individuals who had contributed sequences to selected clusters. The process respected individual patient confidentiality while allowing the SFDPH-based recruiter (hereafter, “recruiter”) and qualitative team at UCSF (hereafter, UCSF) to jointly manage recruitment and ensure that potential participants met the following eligibility criteria: at least 16 years of age; cognitive capacity to provide informed consent; and ability to understand and respond to interview questions in English (because translating the lengthy qualitative interview guide and survey into the multiple major foreign languages spoken by San Francisco residents was not feasible), commute to the interview location, and complete a 2-hour interview. The UCSF team scheduled and conducted the interviews. Given the extensive communication and potentially sensitive information involved in this research, special attention was devoted to the design of study-specific data-transfer architecture, which included a searchable, deidentified database to track recruitment efforts, and a separately stored spreadsheet to transfer names and contact information of consenting participants (described in detail next). This research was approved by the UCSF Institutional Review Board.
Detailed Methods
We designed the procedures below to identify and select clusters for recruitment (Phases 1-2), and to recruit and interview selected individuals from those clusters (Phases 3-5), as shown in Figure 1. A critical aspect of our collaboration was that SFDPH utilized existing patient contact information to work directly with the potential participants, thereby safeguarding their anonymity until specific consent to release their names and contact information to UCSF could be obtained.

Figure 1. Phases of sample selection and participant recruitment for qualitative interviews conducted by UCSF and SFDPH, San Francisco, 2015-2017.
Phase 1: Identifying Initial Clusters
SFDPH-based personnel analyzed viral genetic data provided by the ARI–UCSF Laboratory of Clinical Virology to identify HIV transmission clusters. The SFDPH analyst annotated each sequence within the transmission clusters. This provided a unique but anonymous identifier for each sequence and for each contributing individual. The latter was necessary because roughly one quarter of patients had contributed multiple sequences to the data set; we included all sequences (rather than only the earliest) in the phylogenetic analysis. Additional information about each sequence included mode of transmission and gender (see Table 1). Annotated, anonymized clusters were uploaded to a shared, secure online platform, hosted by UCSF, to which the SFDPH and UCSF study teams both had access.
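The annotation step can be pictured with a short Python sketch. It assumes a simple input of one (cluster label, registry key) pair per sequence; the identifier formats, function names, and example records are hypothetical, and the study's actual annotation procedure may have differed. The point illustrated is that separate sequence-level and individual-level identifiers allow unique individuals to be counted without exposing who they are.

```python
from collections import defaultdict
from itertools import count


def annotate(records):
    """Attach anonymous sequence and individual identifiers.

    records: iterable of (cluster_label, registry_key) pairs, one per
        sequence. registry_key stands in for whatever the data holder
        uses internally to recognise a patient; it is never emitted.
    Returns a list of dicts that contain no registry keys.
    """
    person_ids = {}            # registry_key -> anonymous individual id
    next_person = count(1)
    annotated = []
    for seq_number, (cluster, registry_key) in enumerate(records, start=1):
        if registry_key not in person_ids:
            person_ids[registry_key] = f"P{next(next_person):04d}"
        annotated.append({
            "cluster": cluster,
            "sequence_id": f"S{seq_number:05d}",
            "individual_id": person_ids[registry_key],
        })
    return annotated


def unique_individuals_per_cluster(annotated):
    """Count distinct individuals (not sequences) in each cluster."""
    members = defaultdict(set)
    for row in annotated:
        members[row["cluster"]].add(row["individual_id"])
    return {cluster: len(ids) for cluster, ids in members.items()}


# Hypothetical input: one person contributed two sequences to cluster "X".
rows = annotate([("X", "key-1"), ("X", "key-2"), ("X", "key-1"), ("Y", "key-3")])
print(unique_individuals_per_cluster(rows))
```

Sequential identifiers are used here only for readability; any scheme that cannot be traced back to the registry by the receiving team would serve the same purpose.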
Table 1. Cluster Characteristics and Sample Size During Phases 1 and 2, and Data Collection for Qualitative Interviews, San Francisco, 2015-2017.
Note. MSM = men who have sex with men; PWID = person who injects drugs; MSM-PWID = MSM who inject drugs; HET = heterosexual; NIR = no identified risk; NRR = no risk reported; CW = cis-women; TW = transwomen; PC = persons of color.
a Number of matched individuals less the number deceased, lost to follow-up, or out of jurisdiction. b Number and percentage are out of the total number of matched individuals in the cluster. c Number and percentage are out of the total number of recruitable individuals in the cluster. d Contained one sequence that was unmatchable (out of SFDPH jurisdiction), hence not counted.
UCSF reviewed the data seeking the largest clusters for possible inclusion in the qualitative component, and determining how many clusters to select for recruitment. Clusters composed of more than 10 sequences were selected. Several factors contributed to the determination of this minimum cluster size. First, our objective was to interview a meaningful proportion (20%-40%) of persons in all selected clusters and we assumed the participation rate for interviews would not exceed 50%. Second, it was anticipated that some persons in clusters would be deceased, lost to follow-up (LTFU), or fall outside of SFDPH’s jurisdiction (OOJ), and therefore be ineligible to participate. These considerations collectively suggested that very small clusters would be unlikely to yield sufficient interviews.
Furthermore, recognizing the possibility of multiple contributions from individual patients, among clusters with more than 10 sequences, those with the greatest number of unique individuals were assigned the highest priority for recruitment. UCSF uploaded the list of selected clusters to the secure online platform.
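The selection logic and the reasoning behind the minimum cluster size can be sketched in Python as follows. The size cutoff (more than 10 sequences), the ranking by unique individuals, and the 50% ceiling on participation come from the text; the function names, the example clusters, and the recruitable fraction used in the illustration are hypothetical.

```python
def prioritize_clusters(clusters, min_sequences=10):
    """Keep clusters with more than `min_sequences` sequences, ordered so
    that clusters with the most unique individuals come first.

    clusters: list of dicts with keys "label", "n_sequences", "n_individuals".
    """
    eligible = [c for c in clusters if c["n_sequences"] > min_sequences]
    return sorted(eligible, key=lambda c: c["n_individuals"], reverse=True)


def interview_ceiling(n_individuals, recruitable_fraction, participation_rate=0.5):
    """Rough upper bound on interviews a cluster could yield.

    recruitable_fraction: assumed share of individuals who are not deceased,
        lost to follow-up, or out of jurisdiction (illustrative input).
    participation_rate: assumed not to exceed 0.5, as stated in the text.
    """
    return n_individuals * recruitable_fraction * participation_rate


# Hypothetical clusters, for illustration only.
candidates = [
    {"label": "small", "n_sequences": 6, "n_individuals": 5},
    {"label": "large-many-repeats", "n_sequences": 20, "n_individuals": 11},
    {"label": "large-few-repeats", "n_sequences": 16, "n_individuals": 15},
]
ranked = prioritize_clusters(candidates)
print([c["label"] for c in ranked])

# A five-person cluster with an assumed 60% recruitable fraction could
# yield at most one or two interviews, too few to characterize a cluster.
print(interview_ceiling(5, recruitable_fraction=0.6))
```

Ranking by individuals rather than sequences matters because a cluster inflated by repeat sequences from the same patients offers fewer people to interview than its size suggests.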
Phase 2: Selecting Clusters
SFDPH-based personnel accessed the HIV/AIDS Case Registry to obtain additional demographic and risk data on the cases in clusters prioritized for recruitment. Data included sample collection date, HIV diagnosis date, age at diagnosis (given as a range), race/ethnicity, and whether the individual was not “recruitable” due to being deceased, LTFU, or OOJ. The revised data set was uploaded into the secure online platform.
Although all cases were still anonymized at this stage, these additional data allowed UCSF to review the overall diversity of the pool of possible participants, as well as determine the number of recruitable individuals in each cluster. Using this information, UCSF generated a final recruitment spreadsheet and uploaded it to the secure online platform.
Phase 3: Identifying and Informing Potential Study Participants
The SFDPH’s HIV/AIDS Case Registry contains all the information needed to contact the selected patients directly. Nevertheless, the investigators deemed it prudent to start the recruitment process by first notifying medical providers about the selected patients. Hearing about the study first from providers or their office staff would help minimize the impression of direct intrusion by the study into the patients’ private medical circumstances. The procedure would also allow providers to advise the recruiter to refrain from pursuing any patient who should not be approached.
This recruitment process called for the SFDPH-based recruiter to use the HIV/AIDS Case Registry data to identify the medical provider of each individual selected for recruitment and contact the provider through telephone or email using approved scripts/materials and following privacy procedures compliant with the Health Insurance Portability and Accountability Act of 1996. The recruiter described the study to the provider and either requested that the provider inform the patient about the opportunity to participate or obtained the provider’s assent to have the recruiter approach the patient directly. Providers who opted to inform their patients would discuss the study with their patients, ascertain their interest in learning more about the study, and relay the response back to the recruiter. Patients who wished to receive further information about the study would verbally consent to contact by the recruiter. In cases where providers assented to have the recruiter directly contact and introduce the study to their patients, the recruiter used the contact information already available to the recruiter in the registry.
Phase 4: Initial Consent to Be Contacted by UCSF
The recruiter called all selected patients who had been approached by their providers or whose providers assented to the contact to explain the study in greater detail, specifically indicating that the UCSF study involved a 2-hour, recorded interview that required the recruiter to pass patients’ contact information to UCSF. Interested patients were asked to consent to the release of their name and contact information to the UCSF team. Only then was contact information for consenting participants added to the recruitment spreadsheet on the secure online platform. The recruiter also alerted the UCSF interviewers immediately by email that new contact information had been added to the recruitment spreadsheet.
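The consent gate described in this phase can be illustrated with a toy Python model. It is only a sketch of the principle that identifying details enter the shared spreadsheet after consent is recorded, never before; the class, method names, status labels, and example entries are hypothetical and do not describe the actual platform used by the study teams.

```python
class RecruitmentTracker:
    """Toy model of the two stores described in the text.

    status:   deidentified recruitment log keyed by anonymous study ID.
    contacts: names and contact details, filled in only after consent.
    """

    def __init__(self):
        self.status = {}
        self.contacts = {}

    def log(self, study_id, phase, outcome):
        """Record the latest recruitment outcome without any identifiers."""
        self.status[study_id] = {"phase": phase, "outcome": outcome}

    def release_contact(self, study_id, name, phone):
        """Add identifying details, but only if consent has been logged."""
        record = self.status.get(study_id)
        if not record or record["outcome"] != "consented_to_release":
            raise PermissionError("no recorded consent to release contact details")
        self.contacts[study_id] = {"name": name, "phone": phone}


# Hypothetical usage with made-up identifiers and contact details.
tracker = RecruitmentTracker()
tracker.log("P0001", phase=4, outcome="declined")
tracker.log("P0002", phase=4, outcome="consented_to_release")
tracker.release_contact("P0002", "Example Name", "555-0100")
print(sorted(tracker.contacts))
```

Keeping the deidentified log and the contact store separate means the interviewing team can follow recruitment progress for every case while seeing identities only for those who agreed to be contacted.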
Phase 5: Study Recruitment and Scheduling
The UCSF study team accessed the secure online spreadsheet to obtain names and contact information for potential interviewees as they became available. UCSF interviewers contacted potential participants to answer any remaining questions and schedule qualitative interviews.
Qualitative Data Collection
Interviews took place in a private location and were conducted by one of two trained qualitative researchers on the study team. The interviews lasted between 1 and 2 hours, and were audio recorded. They employed a narrative approach to the initial, semistructured portion (to enhance interviewees’ ability to account for their lived experience) and a structured survey at the end (to ensure a small number of standard data points were covered with each participant). Topics covered included prediagnosis and postdiagnosis conceptions of risk, prevention strategies, sexual practices, and patterns of drug use, as well as the context around seroconversion, HIV diagnosis, and antiretroviral treatment. Participants received an incentive of $75 as compensation for time and travel.
Results
Phase 1: Identifying Initial Clusters
The study’s initial data set contained 7,253 viral sequences, from which our analysis identified 1,392 clusters. Nine of those clusters contained more than 10 sequences. UCSF deemed the largest eight clusters likely sufficient to satisfy needed sample size. These eight clusters ranged in size from 12 to 29 sequences, contributed by 9 to 20 unique individuals who could be matched to records in the case registry (again, clusters with the greatest number of unique individuals were prioritized for recruitment). Of the 104 individuals in the identified clusters, 83 were cis-men, 14 were cis-women, and 7 were transwomen. In four of the clusters, the mode of HIV transmission reported for the majority of members was sex among men who have sex with men (MSM). In one cluster, transmission was largely among MSM and people who inject drugs (PWID); three clusters contained mixed HIV transmission modes. Collectively, less than 60% of the transmission for cases within the chosen eight clusters was attributed to MSM. More details on the clusters selected for recruitment are shown in Table 1.
Phase 2: Selecting Clusters
Cluster members known to be deceased, LTFU or OOJ were removed. The remaining clusters contained between 7 and 10 individuals for whom recruitment efforts could be undertaken. On average, 53% of the sequences in clusters translated to recruitable individuals. To provide one detailed example, the second-largest cluster (B in Table 1) contained 17 sequences. Two patients had each contributed two sequences, and one sequence was classified as OOJ, leaving 14 matched individuals. Of these, three were LTFU and two more were OOJ. Thus, the recruitment list for that cluster contained nine individuals (9/17 = 53%). Overall, persons of color comprised two thirds of the pool of potential participants. The finalized recruitment list included 68 individuals.
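The Cluster B example can be retraced step by step. The short Python sketch below uses only the counts given in the text; the variable names are ours.

```python
# Cluster B, using the counts reported in the text.
sequences = 17
repeat_sequences = 2          # two patients each contributed one additional sequence
unmatched_ooj_sequences = 1   # one sequence could not be matched (out of jurisdiction)

matched_individuals = sequences - repeat_sequences - unmatched_ooj_sequences

lost_to_follow_up = 3
out_of_jurisdiction = 2
recruitable = matched_individuals - lost_to_follow_up - out_of_jurisdiction

share_of_sequences = recruitable / sequences
print(matched_individuals, recruitable, f"{share_of_sequences:.0%}")
```

The percentage is expressed relative to sequences rather than individuals, which is why a cluster with repeat contributors shrinks more sharply than its headcount alone would suggest.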
Phase 3: Identifying and Informing Potential Study Participants
A review of medical records by the recruiter revealed that five of the 68 possible participants were ineligible because they had relocated outside the study area, were non-English speaking, were incarcerated, or had severe mental health issues. Initial study letters or emails were sent to the medical providers of the remaining 63 individuals. It was often necessary to contact multiple providers per case. On average, outreach attempts were made to 1.74 providers per potential participant (range 1-5). The average number of providers contacted for individuals who completed interviews was 1.67 (range 1-3). Of the 110 providers contacted in the course of recruitment, half practiced in a public hospital setting (n = 55) and nearly one-third practiced in public clinics (n = 32). The remaining providers were in private practice at a large health maintenance organization (n = 10) or another location (n = 6), or at the Veterans Administration (n = 4).
One public clinic providing care for two patients being recruited declined to participate in the study. In addition, one medical provider did not respond to outreach; we interpreted this as a passive decline. Through discussions between the recruiter and providers who did respond, an additional 17 participants were found to be ineligible, predominantly due to clinicians’ assessment of patients’ cognitive function or mental health, or having dropped out of care (LTFU). In 35 cases, providers authorized the recruiter to contact patients directly about the research. In eight cases, providers or other personnel at the practice location approached the patient about the study; four of these patients consented to be contacted by the recruiter and four declined.
Phase 4: Initial Consent to Be Contacted by UCSF
The study recruiter was able to contact all but one of the 39 individuals who entered this phase. Of those contacted, five were ineligible (because they had relocated outside study area, or were non-English speaking), five declined to participate (not interested, too busy), and 28 consented to release contact information to UCSF. No identifying information about any patient was ever shared with UCSF before a patient provided verbal consent to release their contact information, a primary aim for ethical recruitment of our participants.
Phase 5: Study Recruitment and Scheduling
UCSF attempted to contact 28 potential participants; two were unreachable or unresponsive. Of the 26 potential participants reached by UCSF, two declined (relocated out of the study area; no longer interested) and 24 were scheduled for an interview at UCSF.
Interview Data Collection
The 24 scheduled participants were successfully interviewed. Overall, an average of 35% of recruitable participants in each selected cluster was interviewed. Interviewees included 20 men (including 14 MSM), three cis-women, and one transwoman; nearly two thirds were persons of color. The number and characteristics of individuals interviewed, organized by clusters, are displayed in Table 1. Thematic analysis of interview data (Ryan & Bernard, 2003) is underway. As the qualitative data were collected with the intent of “merging” them with the phylogenetic trees, this integration will occur after thematic analysis is complete. Narratives from each cluster will be analyzed for areas of convergence and divergence, and trends will be compared across clusters. More detail on data analysis and the findings produced will be reported in future manuscripts.
Findings by Cluster
As noted, our hope is that integrating our phylogenetic and qualitative data produces an expansion in our understanding of the dynamics structuring transmission clusters. Though full thematic findings of the interviews are beyond the scope of this article, and integration is not yet complete, a brief example may help demonstrate the “value added” of the method detailed here. Cluster B (see Table 1) contained 17 sequences, contributed by 14 unique individuals, nine of whom were available to recruit. This was an ethnically diverse cluster, composed of Black, Hispanic, Native American, White, and multiracial individuals. In terms of modes of transmission, cases had been attributed to PWID, MSM, a combination MSM-PWID (used when an individual reports both MSM and injection drug use), and heterosexual transmission. There were cis-men, cis-women, and transgender women in the cluster (the database counts transwomen as MSM). This is what was known about Cluster B from phylogenetic analysis and surveillance data.
Unfortunately, during recruitment, four of the nine potential participants were found to have been lost to medical follow-up or to suffer from mental health issues their physicians considered serious enough to preclude their participation; one had been incarcerated; and an additional individual declined participation. Thus, we were able to interview only three cluster members. We spoke with a Black cis-woman infected through heterosexual transmission, a multiracial cis-male infected through MSM, and a Black transgender woman infected through MSM-PWID (we’ll call them Renee, Jordan, and Bella, respectively). Given the multiple ways these participants’ cases appear to differ, one might puzzle over the dynamics linking the cluster. The qualitative findings from these interviews, however, add more context to the phylogenetic assessment, illustrating how surveillance data can be correct and still insufficient to understand the full story behind chains of transmission.
Renee did report believing she had contracted HIV from a male partner, but their joint substance use, which she characterized as addiction, was the crucial context that she considered the real source of her risk. She noted that this partner “was cheatin’ on me with other people, so I was faithful but they weren’t.” When asked to confirm that this had been a stable partner, she responded, “If you call it that. I mean it’s—[pause] we were on drugs, we’re all doin’ drugs. I mean, what stability is that?” She also intimated that her substance use had influenced the types of partners she had: “And now when I’m clean, and I look back and I see some of these people—What the hell was I thinkin’? Look at him! What was I thinking?”
In an analogous way, while most sexual experiences Jordan discussed during his interview involved cis-women, he did report behavior that would have been classified as MSM in surveillance data (see Truong et al., 2019, for a discussion of whether such classifications enhance or complicate our understanding of the epidemic). However, this happened only under very particular circumstances: during his own period of substance abuse, specifically of methamphetamine, he related having sexual contact with transgender women. Although he also used crack, he explicitly differentiated the effect the two drugs had on his sexual behavior:
I can get weird on drugs . . . the drugs will make you [do] whatever . . . I’ve lowered my standards with crack, but I’ve never done some weird—hung out with—let a gay dude touch my penis so I could get a hit—or something like that . . . [but with meth I had] sex with some trannies in there. I actually lived with a chick for a little while who had the full thing, the full—you know, she had tools. So I wouldn’t call her my girlfriend, but I did have sex with her many times, ten to fifteen times. I lived with her for a couple months. But then it just was getting too weird.
Finally, like Renee and Jordan, in her interview Bella reported behavior that accorded with the surveillance data associated with her case. Unlike them, the surveillance data captured what she considered the most significant elements of her risk fairly well (with the important exception that she did not consider herself a man but was still classified as one, through use of the category MSM). Bella had gone through a period of substance abuse, during which she performed transactional sex and had other sexual relationships, very often with cis-men who identified as heterosexual. She explained, “Then I started shootin’ up [meth] and then the rush was even different. So then everything become sexual, everything.” Bella believed she knew which of her partners she’d acquired the infection from: “I risked myself, messin’ around with somebody that I knew that messed around with somebody who had [HIV], and he told me he put on a condom and he didn’t.” However, she also recognized that this had not been her only potential exposure. Like Renee, she looked back and lamented the encounters she’d had with “scrungy guys” she wouldn’t choose as partners when sober, but “when I was high, I didn’t care . . . [a]bout messin’ around with guys, sleepin’ around, suckin’ dick, doin’ whatever, I didn’t care. ‘Cause I was high. So it didn’t matter.”
For these three participants, then, drug dependency and sex were intertwined as HIV risk factors, even though this was not apparent from the surveillance data attached to each case. Their stories also overlapped in other ways. All had experienced periods of housing instability and homelessness, and one San Francisco neighborhood served as an important backdrop in all three cases. In the case of Cluster B, merging semistructured interview data with phylogenetics and surveillance data offers a previously unseen glimpse into the processes by which diverse people came together in particular ways, at a specific time and place. This is the context necessary to understand the dynamics of HIV exposure and acquisition within this heterogeneous transmission cluster.
Discussion
We describe the process of integrating phylogenetic and qualitative approaches with the aim of achieving greater insight into HIV transmission clusters in San Francisco, and through these to improve HIV prevention. The procedures we developed linked viral genetic sequences with epidemiological data to identify individuals in large transmission clusters, which then served as a sampling frame for semistructured interviews. This approach required collaboration across multiple entities responsible for maintaining different and sensitive data sets. Sharing sensitive patient data with UCSF investigators without prior patient consent was not possible given legal protection of surveillance data, Health Insurance Portability and Accountability Act of 1996 regulations, and standard ethical practice. Hence, our procedures called for sharing only deidentified data across organizations until patients gave specific verbal consent to the recruiter to release contact information to UCSF. Data sharing, with or without identifiers, was always conducted through secure data systems. Recruitment data show that we could develop and implement feasible procedures while protecting patient confidentiality.
Approaching Providers First
The nature of this research, focused as it was on transmission clusters, required participation of specific individuals, making it infeasible to rely on recruitment methods often used for in-depth interviews, for example, study fliers or online banner ads. After all, even patients who believe they know how they acquired the virus likely do not know how many other transmission events that source might have been involved in. Hence, surveillance and genetic data offered the only route to identify patients in large transmission clusters. Given that these are sensitive data sets of private patient medical information, we opted to reach possible participants by approaching their providers first through our SFDPH collaborators. This approach allowed providers to decline the recruitment of any individual for whom they considered participation would be detrimental and provided potential participants an additional layer of protection. Provider responsiveness, however, was the most significant barrier to contacting patients. To address this challenge, we modified recruitment procedures after the study was under way to allow providers to assent to direct contact of patients by the recruiter, a change suggested by providers themselves. This procedure facilitated contact with some patients, though an important barrier remained. Since we chose to obtain the assent for direct outreach on an opt-in basis only, it could not be used to recruit patients whose providers did not respond to our outreach. A similar procedure on an opt-out basis, that is, alerting providers that a particular patient had been selected for recruitment and would receive direct outreach from recruiters if the provider failed to respond within a given time frame, might have reduced this barrier, but whether it could be ethically employed requires careful consideration.
Fortunately, once the recruiter was able to contact potential participants, the large majority consented both to share the contact information with UCSF and to participate in the study. Moreover, most who agreed to be interviewed completed the interview.
Cluster Size Consequences
Our experience with these procedures showed that even when starting from clusters with the largest number of sequences, the number of recruitable individuals per cluster quickly shrank, whether due to multiple sequences per person or to removal of OOJ and LTFU cases. While such culling, along with the more limited determination of ineligibility through medical records review, greatly reduced cluster sizes, these procedures ensured that the recruiter conducted time-intensive provider outreach only for potential participants with a realistic chance of participation.
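The way a cluster shrinks from sequences to unique individuals to recruitable individuals can be illustrated with a short sketch. Everything in it is hypothetical: the toy records, the cluster labels, the "active" status label, and the `summarize` helper are assumptions made for illustration and do not reproduce the study's data or software.

```python
from collections import defaultdict

# Hypothetical toy records, one row per viral sequence:
# (cluster_id, person_id, status). A person may contribute several sequences.
# "OOJ" and "LTFU" mirror the abbreviations used in the text; "active" is an
# assumed label for everyone else.
sequences = [
    ("X", "p1", "active"),
    ("X", "p1", "active"),  # second sequence from the same person
    ("X", "p2", "LTFU"),
    ("X", "p3", "active"),
    ("X", "p4", "OOJ"),
    ("Y", "p5", "active"),
    ("Y", "p5", "active"),  # second sequence from the same person
    ("Y", "p6", "active"),
]

EXCLUDED = {"OOJ", "LTFU"}  # statuses removed before provider outreach


def summarize(sequences):
    """Count sequences, unique individuals, and recruitable individuals per cluster."""
    n_sequences = defaultdict(int)
    status_by_person = defaultdict(dict)
    for cluster, person, status in sequences:
        n_sequences[cluster] += 1
        status_by_person[cluster][person] = status  # deduplicate by person
    summary = {}
    for cluster, people in status_by_person.items():
        recruitable = [p for p, s in people.items() if s not in EXCLUDED]
        summary[cluster] = {
            "sequences": n_sequences[cluster],
            "individuals": len(people),
            "recruitable": len(recruitable),
        }
    return summary


summary = summarize(sequences)
for cluster, counts in sorted(summary.items()):
    print(cluster, counts)
```

In this toy example, a cluster of five sequences reduces to four individuals and then to two recruitable individuals, the same pattern of attrition described above.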
The relatively small size and diversity of the largest clusters in our data set were unexpected. Based on findings of other studies, we had predicted the largest clusters would include upwards of 30 sequences (Brenner et al., 2011; Lewis et al., 2008). As reported, no cluster in our data set contained more than 29 sequences or 20 individuals. Furthermore, as this study focused on HIV transmission clusters in San Francisco, where 74% of people living with HIV are MSM, 60% are White, and 92% are male, one might expect the largest clusters to be composed primarily of White MSM (SFDPH, 2016). However, in the large clusters from which we recruited interviewees, MSM as mode of transmission was underrepresented while people of color and cis-women were overrepresented, compared with people known to be living with HIV in San Francisco (SFDPH, 2016). Transwomen were represented in roughly the proportion one would expect, based on surveillance data. The underrepresentation and overrepresentation of certain key populations within these large clusters raise questions about local HIV transmission chains that the larger phylogenetic study will be well-positioned to address. One aspect of our study to consider is that sequence data came from a laboratory that provides services for publicly funded facilities. Hence, sequences from patients obtaining care at private facilities were not included. If White MSM are more likely to access care at private facilities, this may partly explain the difference between the sociodemographic composition of individuals contributing viral sequences and the characteristics of San Francisco residents living with HIV.
Contributions of Clusters to HIV Prevention
The narratives provided in the interviews will attest to whether the contexts in which individuals acquire HIV share similarities within clusters, and whether the qualitative data we gathered can be used to inform HIV prevention interventions. We provided a brief example, based on interviews with three members of a highly diverse cluster, which illustrated the way cis-gender men and women and transwomen were linked through similar behavior/mental health conditions (substance use/addiction), although this was not apparent from the surveillance data attached to each case and resulted in different modes of transmission. We believe this suggests that, at a minimum, qualitative data can usefully “thicken” (à la Geertz, 1973), the “description” of the HIV epidemic offered by phylogenetic trees.
To be fair, finding a relationship between drug use and sex (transactional and not) is, of course, not novel in the HIV literature. Neither is the increased HIV risk experienced by the unhoused. But here we would like to make two points. First, just because the mixed method approach documented here did not uncover new correlations or practices in this one, brief example does not mean it can’t. Second, findings need not be novel to be valuable if the knowledge can be applied. A nuanced understanding of the context surrounding HIV transmission events (even if that context echoes previous research findings) can inform prevention planning by public health officials and help prioritize efforts when resources are constrained.
Limitations of Study
Attention to characteristics of potential research participants is always required, especially when conducting purposive sampling. Working from transmission clusters, however, requires even greater reflexivity (German et al., 2017). Researchers must be cognizant of who is contributing viral sequence specimens to the data set, and who is not. As noted above, the source for viral sequence samples analyzed in this study may have affected the sociodemographic composition of the clusters we identified. Likewise, individuals with undiagnosed HIV, as well as people living with HIV who have not had resistance testing done, did not contribute to our data set. To the extent that any of the above dispositions are more likely among certain types of people than others, selection bias may be in effect. It is essential to bear in mind this potential bias, because all such excluded individuals could have been involved in the actual transmission chains that produced the clusters depicted in our phylogenetic tree, even though these cases were not in our data set and therefore could not be recruited.
In addition, although all patients were included in the cluster data, the interview data will be biased by the absence of cases classified as LTFU or OOJ, which were removed from our recruitment pool; these individuals may have moved away or simply dropped out of care. Similarly, the exclusion of cases that did not meet eligibility criteria will contribute to bias. No patient was excluded for failing to meet the minimum age criterion of 16 years. However, the selected clusters included a few non-English speakers, as well as a few patients with insufficient mental capacity to provide informed consent and complete an interview. Unfortunately, the contexts in which these individuals practiced HIV prevention and acquired HIV remain outside our understanding of transmission within the clusters we studied.
Thus, we will need to be mindful that the stories told in interviews conducted from our sample cannot be considered “the whole story” of a cluster. This circumstance, however, is often the case with qualitative research, which does not typically claim to be exhaustive and generalizable (Pope & Mays, 1995; Tracy, 2010). In that sense, the limitations of this sampling method may not differ significantly from more traditional qualitative approaches. That being noted, the technical expertise and resources required to employ this sampling method are considerable, and the conditions under which sampling from clusters is practicable, for example, in places where there is a high enough sampling fraction (Grabowski et al., 2018), are narrower than those for, for instance, convenience sampling (Volz et al., 2012).
Conclusion
The procedures described here take up the challenge of pairing phylogenetics with other kinds of data in a way that differs from that used in research to date. As described above, some previous studies have made excellent use of existing data, combining phylogenetic analysis with information from epidemiological surveillance or contact tracing while other studies have taken a valuable prospective approach. However, the methods used in previous research meant that the broader circumstances surrounding the infection of individuals within particular transmission clusters could not be explored. In contrast, our method samples from transmission clusters and collects data spanning temporal, topical, and emotional breadth through semistructured interviewing. This difference has important ramifications, as already noted, for the kinds of questions the resulting data set can answer. It also, however, has ethical implications. As Coltart et al. (2018) noted in their review of ethical issues in phylogenetic research on HIV, these analyses can inadvertently create the perception that certain groups are responsible for spreading the virus, and may lead to their (further) stigmatization, especially if structural factors are omitted from analysis. They further indicated, however, that researching “structural factors and their effect on HIV transmission risk can decrease the blame mentality and create an alternative understanding of how to reduce HIV transmission and which individuals or groups are most at risk and why” (Coltart et al., 2018, p. e659).
Analysis of the collected qualitative data will enable assessment of cluster-level patterns with reference to precisely this structural context, and thus may contribute to improved tailoring of prevention and treatment efforts among populations at risk for or diagnosed with HIV. This approach also has relevance beyond our study. Sampling from transmission clusters can be done for many types of subsequent data collection (i.e., not only semistructured interviews), which may expand the ways those clusters can be characterized and understood. Alternatively, the combination of sampling and qualitative data collection could be incorporated in research on infectious diseases other than HIV where greater knowledge of transmission context could be helpful (e.g., hepatitis C, tuberculosis; Grabowski et al., 2018).
The proliferation of genetic data raises the question of whether the novel method detailed here could be used to learn more about the context of human-to-human transmission, particularly in fast-growing clusters associated with COVID-19 outbreaks. At the moment, the answer appears to be “maybe,” largely because debate is ongoing among the scientific community about the best way to conduct phylogenetic analysis on COVID-19 samples, and what the results mean (Forster et al., 2020; Mavian et al., 2020). As consensus emerges on the phylogenetic methods, the demand for testing will likely have generated a large number of viral sequences for analysis, making it possible to identify clusters quickly. The unfortunately high number of infections will be associated with a significant amount of surveillance data, as states and localities ramp up contact tracing efforts (Fortin, 2020). As with HIV, however, the data collected will not provide the nuanced context surrounding infection that a semistructured interview can elicit. Thus, the method we devised may offer an opportunity for researchers to carefully implement a sampling frame that will allow them to recruit and interview individuals in high-transmission clusters to better understand what contributes to spread. However, it will still be necessary to attend to issues like the ones outlined above. While COVID-19 may be less (or at least differently) stigmatized than HIV and other sexually transmitted infections, both survivors of the virus (Maslin Nir, 2020) and people of Asian descent are reporting negative, marginalizing reactions from others (Tavernise & Oppel, 2020). Viral genetic data are personal, private health information; therefore, concerns about privacy are still salient. Research partnerships like the one described for this study are likely to be necessary, and laws and ethical considerations already in place will need to be observed when using surveillance data and contacting patients.
Overall, we would concur with the voices encouraging collaboration to speed results and urging great care be taken with research findings (Chookajorn, 2020).
Contributions to the Field of Mixed Methods Research
One contribution to the field is the demonstration that phylogenetic analysis and surveillance data can be harnessed to generate a sampling frame for subsequent qualitative data collection, using an explanatory sequential design. The novel aspect of our approach is primarily that semistructured interviews with cluster members were planned and collected prospectively. This is in contrast to previous studies that had access only to data collected for other reasons, and/or that are typically very “coarse” and closed-ended. Another contribution of relevance to mixed methods investigators with interest in phylogenetics is that we provided a detailed account of how well each step in our approach worked, including the limitations and challenges encountered. We offered concrete lessons learned, highlighting certain steps in our approach that could be improved on, such as engagement of providers. A third contribution is simply to apply the constructs and terminology of the mixed methods community to research that includes phylogenetic analysis (both our own and previous studies). Our literature review suggested that these scholarly communities are rarely in dialog; this may be partly because they speak such specialized languages (e.g., “explanatory sequential design” and “Shimodaira–Hasegawa node support”). Particularly in the context of the COVID-19 pandemic, we hope to have encouraged other researchers to consider experimenting with sampling from clusters as a systematic approach to achieve a more nuanced understanding of viral transmission networks.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by NIH R01 MH096642 (PI: H.M. Truong).
