Abstract
Few empirical studies exist to guide researchers in determining the number of focus groups necessary for a research study. The analyses described here provide foundational evidence to help researchers in this regard. We conducted a thematic analysis of 40 focus groups on health-seeking behaviors of African American men in Durham, North Carolina. Our analyses revealed that more than 80% of all themes were discoverable within two to three focus groups, and 90% were discoverable within three to six focus groups. Three focus groups were also enough to identify all of the most prevalent themes within the data set. These empirically based findings suggest focus group sample sizes that differ from many of the “rule of thumb” recommendations in the existing literature. We discuss the relative generalizability of our findings to other study contexts, and we highlight some methodological questions about adequate sample sizes for focus group research.
Introduction
Focus groups are commonly used across a wide range of research disciplines, including health sciences, marketing, communications, and virtually all fields of behavioral and social sciences. Despite this widespread use, little attention has been given to methodological aspects of the technique, beyond the formation and conduct of focus groups (Agar and MacDonald 1995; Carey 1995; Duggleby 2005; Kaplowitz and Hoehn 2001). One of the largest gaps in the focus group literature concerns the question of sample size: How many focus groups are needed in a study to adequately address a research objective?
Seventy years after the method was introduced, researchers must still rely on rules of thumb and personal judgment when deciding how many focus groups to include in a study. Similarly, reviewers and funders have little basis for determining whether a study includes an adequate number of focus groups. The analyses we present in this article begin to address the need for empirically based recommendations on focus group sample size.
Focus Group Sample Size Recommendations
We reviewed textbooks that cover general qualitative research or focus group methodology (bibliography available in Online Appendix). Of the 62 books reviewed, 42 provided no guidance on the number of focus groups needed for a study (many did not even cover the topic of focus groups), six recommended saturation as the key determinant, 10 provided some form of numeric recommendation, and four mentioned saturation and included a number. Focus group sample size recommendations from these books and a number of journal articles range from as few as two focus groups per study to more than 40 (Fern 1982; Greenbaum 1997; Kitzinger and Barbour 2001; Krueger and Casey 2015; Morgan 1996; Powell and Single 1996; Vaughn et al. 1996). One commonly cited guideline is that focus group research requires at least two groups for each defining demographic characteristic (Barbour 2007; Carey 1995; Knodel 1993; Krueger and Casey 2015; Morgan 1988; Ulin et al. 2005). None of these recommendations are supported by empirical data.
Nonprobability Sampling and Saturation
Many published works cite “theoretical saturation” as the primary method for determining nonprobability sample sizes in qualitative research (e.g., Bluff 1997; Byrne 2001; Fossey et al. 2002; Guest et al. 2006; Morse 1995; Sandelowski 1995). Theoretical saturation was first defined by Glaser and Strauss (1967:61) as the point at which “no additional data are being found whereby the [researcher] can develop properties of the category.” Glaser and Strauss’s definition, however, is intended for a grounded theory approach, in which theoretical models are developed and constantly compared against incoming data. For grounded theorists, theoretical saturation refers to the point at which the theoretical model being developed stabilizes. Because not all qualitative data analysis employs a grounded theory approach, Guest and colleagues (2006:65) and others use the broader term “data saturation,” which is defined as “the point in data collection and analysis when new information produces little or no change to the codebook.”
The concept of saturation has become the gold standard by which sample sizes for qualitative inquiry, including focus groups, are determined (Guest et al. 2006; Guest and MacQueen 2008). However, using the concept as a practical reference for estimating sample sizes is problematic because most research requires such estimation before a study is implemented (Charmaz 2014; Cheek 2000). By definition, saturation can be determined only during or after data analysis.
Some scholars have addressed this issue using mathematical modeling or statistical methods. Romney et al. (1986) used factor analysis and mathematical modeling (cultural consensus theory) to show that the confidence level associated with obtaining accurate information is a positive function of the degree of consensus about a particular topic and the number of interviewees. They calculated that as few as four individuals can render accurate information with a high confidence level (.999) if they possess a high degree of knowledge with respect to the domain of inquiry. Galvin (2015) reviewed and statistically analyzed—using binomial logic—54 qualitative studies and identified a similar trend. Like the work of Romney and colleagues, Galvin’s analysis illustrates that the probability of identifying a theme is a function of the breadth of its presence within a study population and the number of individuals in a sample. Based on his analysis, the probability of identifying a concept (theme) among a sample of six individuals is greater than 99% if that concept is shared among 55% of the larger study population.
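The binomial logic described above can be illustrated with a short calculation. The sketch below is not Galvin's own procedure. It assumes that each participant independently holds a theme with probability equal to the theme's prevalence in the population, and the function name `prob_theme_observed` is hypothetical.

```python
def prob_theme_observed(prevalence: float, sample_size: int) -> float:
    """Probability that at least one of `sample_size` participants holds a theme.

    Assumption (ours, for illustration): participants are independent draws,
    each holding the theme with probability `prevalence`.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be between 0 and 1")
    if sample_size < 0:
        raise ValueError("sample_size must be non-negative")
    # Complement of the event that nobody in the sample holds the theme.
    return 1.0 - (1.0 - prevalence) ** sample_size


# Example from the text: a theme shared by 55% of the population, six participants.
p_six = prob_theme_observed(0.55, 6)
print(f"{p_six:.4f}")  # 0.9917, i.e., greater than 99%
```

Under this simple model, the probability of missing a theme shrinks geometrically with each added participant, which is why small samples can be sufficient for widely shared themes.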
Employing this same logic, Fugard and Potts (2015) developed a quantitative tool to estimate sample sizes for thematic analyses of qualitative data. Their calculation incorporates (1) the estimated prevalence of a theme within the population; (2) the number of desired instances of that theme; and (3) the desired power for a study. Their tool estimates, for example, that to have 80% power to detect two instances of a theme with a 10% prevalence in a population, 29 participants would be required.
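The three inputs listed above can be combined in a simple binomial search for the smallest adequate sample. The sketch below is not Fugard and Potts's published tool. It assumes that the number of participants expressing a theme follows a binomial distribution, and the function names `prob_at_least` and `min_sample_size` are hypothetical.

```python
from math import comb


def prob_at_least(instances: int, n: int, prevalence: float) -> float:
    """P(X >= instances) for X ~ Binomial(n, prevalence)."""
    below = sum(
        comb(n, i) * prevalence**i * (1.0 - prevalence) ** (n - i)
        for i in range(instances)
    )
    return 1.0 - below


def min_sample_size(prevalence: float, instances: int, power: float,
                    max_n: int = 10_000) -> int:
    """Smallest n for which P(X >= instances) reaches the desired power."""
    for n in range(instances, max_n + 1):
        if prob_at_least(instances, n, prevalence) >= power:
            return n
    raise ValueError("no sample size up to max_n reaches the desired power")


# Example from the text: 10% prevalence, two instances, 80% power.
n_required = min_sample_size(prevalence=0.10, instances=2, power=0.80)
print(n_required)  # 29 under these binomial assumptions
```

With 28 participants the same calculation gives a power of roughly 78%, so 29 is the first sample size that crosses the 80% threshold in this model.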
These theoretical models and tools provide general guidance for determining sample sizes for qualitative studies. As with any theoretical model, however, they are limited by their inherent assumptions and absence of empirical data. One of the first studies to bridge this gap and systematically examine saturation using raw data was conducted by Morgan et al. (2002). In individual interviews about environmental risks, Morgan and colleagues found that the first five to six interviews produced the majority of new data, and little new information was gained as the sample size approached 20 interviews. Across all four of their interview data sets, approximately 80–92% of concepts were identified within the first 10 interviews. This finding is consistent with the results of a stepwise inductive thematic analysis of 60 in-depth interviews (10 successive batches of six interviews) among female sex workers in West Africa (Guest et al. 2006). Of the 114 themes identified in the entire data set, 80 (70%) turned up in the first six interviews and 100 themes (92%) were identified within the first 12 interviews (Figure 1).

Figure 1. Code creation during analysis of 60 individual interviews (adapted from Guest et al. 2006).
Results from two empirical studies reported by Francis et al. (2010) are congruent with earlier findings (Guest et al. 2006; Morgan et al. 2002). In one study, Francis and colleagues interviewed 14 doctors to identify their beliefs about managing patients with upper respiratory tract infections. In another study, they interviewed 17 blood relatives of people with Paget’s disease of the bone to explore the acceptability of a potential genetic screening service. For both studies, the researchers operationalized saturation as the point, after conducting 10 interviews, when three additional interviews yielded no new themes. Based on this definition, the authors did not reach saturation within the first study among 14 doctors but did reach saturation after 17 interviews in the second study. Due to the small sample sizes, estimating the proportion of themes identified at distinct points in their analysis is difficult. However, in both of the studies, the vast majority of themes were identified within five to six interviews—the point at which the curve begins to sharply flatten out.
Research by Hagaman and Wutich (2017) suggests that cross-cultural research contexts may require more interviews to reach saturation than the aforementioned studies. These authors conducted a study exploring water norms and knowledge across four sites in four countries. They found that fewer than 16 interviews were enough to identify common themes from each of the individual sites, but that 20–40 interviews were needed to identify meta-themes that cut across all sites.
We found only two studies that empirically assessed saturation within focus groups. Coenen et al. (2012) compared two approaches for analyzing focus groups and individual interviews: an inductive (open coding) approach in which themes and codes were generated from the data and a deductive approach in which existing codes—from the International Classification of Functioning, Disability and Health (ICF)—were applied to the data. The authors operationalized saturation as the point during data collection and analysis at which linking the concepts of two consecutive focus groups or individual interviews revealed no additional second-level categories. Saturation was reached after performing five focus groups and eight individual interviews using the inductive approach and five focus groups and eight individual interviews using the deductive approach. In a study intended to validate the ICF—and using the same method of assessing saturation as Coenen et al. (2012)—Kirchberger et al. (2009) found that saturation was reached after conducting eight focus groups. A summary of these two data-driven studies, as well as those based on individual interviews, is presented in Table 1.
Table 1. Summary of Saturation Findings from Empirical Studies.
Our study builds on the existing literature by focusing explicitly on inductive thematic analysis within a study specifically designed to assess saturation. The saturation-focused objectives of the research allowed us to justify a large sample size and keep the procedures and context surrounding data collection and analysis consistent to enhance methodological rigor.
Methods
This research is part of a larger study funded by the Patient-Centered Outcomes Research Institute, which supports methodological investigations on engaging patients in health and health-care research (see www.pcori.org). We conducted two prestudy focus group discussions among members of the African American community in Durham, North Carolina, to solicit their opinions on the health issues most in need of research within their community. We chose health-seeking behavior among African American men as the topic of the focus group discussions (substantive findings from the study are presented elsewhere), based on the funder’s objectives and the community’s feedback. The study was reviewed and approved by FHI 360’s Protection of Human Subjects Committee. Oral informed consent was obtained from each participant before initiation of data collection. Each study participant was provided an incentive of US$40.
Data Collection
Eligibility criteria included being a (self-identified) African American/black man, aged 25–64, and a resident of Durham, North Carolina. The study team recruited participants through Craigslist, through flyers posted in public areas, and through peer recruitment. For each focus group, we attempted to include eight individuals—the modal recommendation for group size in the literature—which also kept the focus group size consistent across each data collection event. We overrecruited for each focus group to offset any no-shows.
A draft of the focus group instrument was pretested among five men from the target population and revised based on their responses. The final instrument contained 13 open-ended questions. Between January and May 2013, we conducted 40 focus groups—a number within the upper range of recommendations in the literature, enabling us to reach the data saturation point with confidence. One researcher facilitated all of the discussions. She followed the instrument structure consistently and probed responses to questions, but she did not introduce any information learned in previous focus groups as one typically would in inductive qualitative research. This was done to treat each focus group as a “new” and unique event to facilitate the methods objectives of the analysis. All focus group discussions were audio recorded.
Data Analysis
The recordings were transcribed verbatim, using a transcription protocol developed for thematic analyses (McLellan et al. 2003). After verifying the accuracy of the transcripts, two members of the study team (including the data collector) created a content-driven, thematic codebook using an iterative process (MacQueen et al. 2008). Both analysts independently reviewed each focus group transcript, identified new themes, and created new codes corresponding to those themes.
All codes were defined and agreed upon using the template outlined by MacQueen et al. (2008). Analysts used NVivo 10 (QSR 2012) to apply codes to the transcripts (coding of text segments was not mutually exclusive). After each transcript was coded, analysts discussed emergent themes and compared code application. All coding discrepancies were resolved through discussion to create a consensus-coded file, and the codebook was revised to add new codes or reflect changes to code definitions. Changes to the codebook were logged after the analysis of each transcript. Analysts were not blinded to the study’s purpose, since we had no a priori hypotheses regarding saturation.
We assessed data saturation using a methodology similar to that employed by Guest et al. (2006). We documented the incremental progression of theme identification (i.e., codebook content) by examining the codebook log to determine when each code was identified. We then documented code frequencies (i.e., the number of focus groups in which a code was identified at least once). We tallied the number of codes created across all 40 transcripts and determined the point when 80% of all thematic codes had been identified. We also analyzed the data based on a 90% metric to provide a more robust measure of saturation.
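The saturation measure described above can be sketched in a few lines. This is an illustration rather than the procedure we ran in NVivo. It assumes that each focus group is represented as the set of codes applied to its transcript, listed in order of analysis. The function names and the toy data are hypothetical.

```python
def new_codes_per_group(groups):
    """Number of codes that appear for the first time in each focus group."""
    seen = set()
    counts = []
    for codes in groups:
        fresh = set(codes) - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts


def saturation_point(groups, percent):
    """First focus group (1-indexed) at which `percent`% of all codes are identified."""
    counts = new_codes_per_group(groups)
    total = sum(counts)
    if total == 0:
        raise ValueError("no codes were applied to any group")
    running = 0
    for index, count in enumerate(counts, start=1):
        running += count
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return index
    raise ValueError("percent must not exceed 100")


# Hypothetical toy data: five focus groups and ten codes (not study data).
toy_groups = [
    {"A", "B", "C", "D", "E", "F"},
    {"A", "G", "H"},
    {"B", "I"},
    {"C"},
    {"J"},
]
print(new_codes_per_group(toy_groups))   # [6, 2, 1, 0, 1]
print(saturation_point(toy_groups, 80))  # 2
print(saturation_point(toy_groups, 90))  # 3
```

In this toy example, eight of the ten codes have appeared by the second group and nine by the third, so the 80% and 90% marks fall at groups two and three.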
To investigate the possibility of a temporal bias in our analysis, we randomly ordered the 40 focus groups 10 times using an online random sequence generator. We replicated the saturation analyses described above for each of the 10 randomly ordered data sets, using matrix queries of code application to indicate when new codes appeared, and then averaged the saturation measures.
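The reordering check described above can be sketched as follows. The sketch substitutes Python's seeded `random` module for the online random sequence generator we used, so its orderings will not match ours. The function names, the seed, and the toy data are hypothetical.

```python
import random
from statistics import mean


def saturation_point(groups, percent):
    """First focus group (1-indexed) at which `percent`% of all codes are identified."""
    total = len(set().union(*groups))
    seen = set()
    for index, codes in enumerate(groups, start=1):
        seen |= set(codes)
        if len(seen) * 100 >= percent * total:
            return index
    raise ValueError("percent must not exceed 100")


def reordered_saturation(groups, percent, n_orderings=10, seed=0):
    """Mean, minimum, and maximum saturation point across random orderings."""
    rng = random.Random(seed)  # seeded so that the sketch is reproducible
    points = []
    for _ in range(n_orderings):
        shuffled = list(groups)
        rng.shuffle(shuffled)
        points.append(saturation_point(shuffled, percent))
    return mean(points), min(points), max(points)


# Hypothetical toy data: five focus groups and ten codes (not study data).
toy_groups = [
    {"A", "B", "C", "D", "E", "F"},
    {"A", "G", "H"},
    {"B", "I"},
    {"C"},
    {"J"},
]
avg_point, low_point, high_point = reordered_saturation(toy_groups, 80)
print(avg_point, low_point, high_point)
```

Reporting the range alongside the mean shows how much the saturation point depends on the order in which groups happen to be analyzed.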
To determine whether the themes identified early in the analysis were the most salient, we used code frequencies as a proxy measure of salience, and grouped the results into terciles—low, medium, and high frequency. We then determined the percentage of high-frequency codes that were identified in the earlier stages of analysis. We did this for the original data set and all 10 randomly ordered data sets.
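The salience check described above can be sketched as follows. The sketch ranks codes by the number of focus groups in which they appear and splits the ranking into three near-equal parts. Ties in real data can make the terciles unequal in size, and the tie-breaking rule used here (alphabetical) is an assumption for illustration. The function names and the toy data are hypothetical.

```python
from collections import Counter


def code_frequencies(groups):
    """Number of focus groups in which each code was applied at least once."""
    freq = Counter()
    for codes in groups:
        freq.update(set(codes))
    return freq


def frequency_terciles(freq):
    """Split codes into high, medium, and low frequency terciles."""
    # Rank by descending frequency; break ties alphabetically (assumption).
    ranked = sorted(freq, key=lambda code: (-freq[code], code))
    n = len(ranked)
    cut_high, cut_medium = n // 3, (2 * n) // 3
    return {
        "high": ranked[:cut_high],
        "medium": ranked[cut_high:cut_medium],
        "low": ranked[cut_medium:],
    }


def share_identified_early(groups, codes, first_k):
    """Fraction of `codes` that appear within the first `first_k` focus groups."""
    if not codes:
        raise ValueError("codes must not be empty")
    early = set()
    for group_codes in groups[:first_k]:
        early |= set(group_codes)
    return sum(code in early for code in codes) / len(codes)


# Hypothetical toy data: four focus groups and six codes (not study data).
toy_groups = [
    {"A", "B", "C"},
    {"A", "B", "D"},
    {"A", "C", "E"},
    {"A", "B", "F"},
]
toy_terciles = frequency_terciles(code_frequencies(toy_groups))
print(toy_terciles["high"])                                         # ['A', 'B']
print(share_identified_early(toy_groups, toy_terciles["high"], 1))  # 1.0
print(share_identified_early(toy_groups, toy_terciles["low"], 1))   # 0.0
```

In the toy data, both high-frequency codes appear in the first group while neither low-frequency code does, which mirrors the kind of pattern this analysis is designed to detect.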
Results
The focus group discussions lasted, on average, 1 hour and 50 minutes. Each focus group had six to eight individuals (only one group had six participants), with a mean of 7.75 individuals per group.
Sample
Of the 390 individuals who expressed interest in study participation, 49 did not meet the eligibility criteria. Eleven eligible men consented and were scheduled but did not attend the focus group. Twenty men consented, showed up as scheduled, and received an incentive but did not take part because the cap of eight per group had been met. Our final sample included 310 men.
Study Participants
The median age of the men in our sample was 49 years (Table 2). More than 90% of the men had completed high school, 75% were unemployed, and 78% had an annual household income of less than US$20,000. Although 70% of the men had seen a physician within the past year, 61% were uninsured.
Table 2. Participant Characteristics.
a. Although our total sample population was 310 men, the N differs for each variable because of different response rates.
Codebook Development and Changes
We developed 94 content-driven codes (refer to the codebook and code frequencies online), and the codebook remained relatively stable during analysis. Only 15 changes were made: 11 code definitions were expanded to be more inclusive (example below), two code names were revised to reflect expanded definitions, and two codes were deleted. All but one of these 15 changes were made early in the analysis between the second and fifth focus groups.
Data Saturation Points
Almost two-thirds of the 94 content codes were generated from the first focus group (Figure 2). Seventy-nine (84%) of the 94 codes were created and applied during the analysis of the first three focus groups. The 90% saturation mark was reached after six focus groups. With a few exceptions, the identification of new themes and the corresponding creation of codes followed a negative curvilinear pattern, with the slope rapidly approaching a plateau by the fourth focus group.

Figure 2. Code creation over time, in chronological order of data collection (N = 40 focus groups).
This pattern persisted, regardless of the order in which we analyzed the data. Our analysis of the 10 randomly ordered data sets revealed that the majority of codes—ranging from 76 (81%) to 85 (90%)—were applied within the first three focus groups. When we averaged the number of new codes applied in each focus group across the 10 randomly ordered data sets, the same negative curvilinear pattern that we had observed in the chronologically ordered data set emerged (Figure 3). On average, 80.3 codes (85%) were applied to the first three focus group transcripts and 85.6 codes (91%) to the first four.

Figure 3. Application of new codes averaged across 10 randomly ordered data sets (N = 40 focus groups).
Combining the chronologically and randomly ordered data sets, the average number of focus groups required to reach 80% saturation was 2.7 (range two to three groups). To reach 90%, the average number of groups required was 4.3, with a range of three to six groups (Table 3).
Table 3. Saturation Points among Data Sets.
a. The first four focus groups in these data sets included 89.4% of all codes, just short of the 90% metric.
Code Frequency and Salience
After dividing the 94 codes into terciles—high (n = 31), medium (n = 33), and low frequency (n = 30)—we identified where in the data set the high-frequency codes first appeared. In the original data set, which was sequenced to match the order in which the data were collected and the codes were developed, we found that all of the high- and medium-frequency codes were developed during the analysis of the first three focus groups. The same pattern held in all 10 of the randomly ordered data sets.
Discussion
Our data suggest that a sample size of two to three focus groups will likely capture at least 80% of themes on a topic—including those most broadly shared—in a study with a relatively homogeneous population using a semistructured guide. As few as three to six focus groups are likely to identify 90% of the themes. To assess the generalizability of these findings, we place them in the context of five general factors that can affect the rate at which saturation is approached: (1) degree of instrument structure; (2) sample homogeneity; (3) complexity of the study topic; (4) study purpose; and (5) analyst categorization style (Guest 2015).
Degree of Instrument Structure
Our study used one moderator and a semistructured instrument in which the initial (open-ended) questions were asked verbatim of all groups and in the same order. For qualitative inquiry that does not employ a guide or scripted questions, saturation is likely to require more focus groups than in our study. Conversely, more structured elicitation techniques, such as free listing and pile sorting (Weller and Romney 1988), would likely reach saturation sooner.
Sample Homogeneity
Although our sample had some demographic heterogeneity, most participants were similar with respect to education, employment, household income, and health insurance status. More focus groups may be needed to reach saturation as the heterogeneity of the sample population increases.
Topic Complexity
Our study topic could be classified as falling between simple and moderately complex. We asked the participants about their opinions of and experience with health and the health-care system. On the one hand, the range of health issues and health-care coverage is relatively large (compared to, say, discussing a tangible consumer product), and the subject matter can be technical. On the other hand, this is a simple topic in that participants are relaying their personal experiences and discussing cultural norms. More focus groups may be needed to reach saturation as the topic becomes more complex or more abstract in nature.
Study Purpose
The purpose of our study was to identify salient themes on a topic. This is a common objective in applied research, so our findings (at least relative to this parameter) should be generalizable to many other focus group studies. Our data may not be generalizable, however, to analyses that seek highly granular themes or aim to determine the maximum variability of participants’ responses. Similarly, our findings may not be relevant to research that is more interpretative or does not generate quantifiable data such as theme frequencies.
Analyst Categorization Style
Analysts vary in the level of granularity at which they code data, a phenomenon known as the “lumper–splitter problem” (Guest et al. 2012). Neither of our two coders had a history of coding at either extreme of the granularity continuum. Their analysis of 40 focus groups generated 94 codes that, based on our experience, are indicative of neither lumping broadly nor splitting narrowly. Studies that use analysts with coding styles at the extremes of the granularity continuum, or analysts who are instructed to code at a very specific or broad level, will likely find that saturation rates vary accordingly.
In addition to these factors, we might assume that the size of the focus group influences the amount and quality of the data generated. For reasons of consistency, we attempted to limit the size of the groups in our study to eight participants. Although this is the modal size recommended in the literature, focus group sizes often range between six and 12 individuals. This may temper the generalizability of our findings to studies with significantly smaller or larger groups.
Another factor, related to group size, that can potentially affect saturation is the degree of interpersonal dynamics within a group. A few highly disruptive or vocal participants, for example, can reduce the variability of responses within a focus group. This is why good moderating skills are essential to the reliability and validity of focus group research. The influence of group dynamics, however, is likely to average out over time.
Conclusion
Our analyses revealed that, within our data set, more than 80% of all themes were discoverable within two to three focus groups and 90% of themes could be discovered within three to six focus groups. Also, we were able to identify the most prevalent themes within our data set with only three focus groups. These findings build the evidence base about adequate sample sizes for focus group research. At the same time, we recognize that our findings may not apply to all contexts of focus group research. Many questions remain. How does the degree of heterogeneity within a focus group, the complexity of a topic, or the size of a focus group affect the saturation rate and the nature of the data generated? How much does the amount of structure in the data collection process affect saturation? Additionally, with the rapid increase in Internet access and bandwidth over the past decade, qualitative research is experiencing a growth in remote data collection techniques. This raises the question of whether these findings will hold true for other modalities of collecting focus group data, such as synchronous and asynchronous online approaches. We hope that these data will serve as a foundation for the work of future scholars, and we encourage researchers to examine data saturation in other contexts.
Footnotes
Acknowledgments
The authors are grateful to Annette Carrington Johnson (North Carolina Central University) for her invaluable help in getting this project off the ground and ensuring its smooth completion. We also owe our thanks to Jamilah Taylor who helped coordinate the implementation of this study. Finally, we wish to thank the men of Durham who took part in the focus groups for their enthusiastic and candid participation.
Authors’ Note
All views expressed in this article are solely those of the authors.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Funding for this study was provided by the Patient-Centered Outcomes Research Institute (PCORI), through grant #1IP2PI000395-01.
