Abstract
Background:
Foodborne illness is a continuing public health problem in the United States. Although outbreak-associated illnesses represent a fraction of all foodborne illnesses, foodborne outbreak investigations provide critical information on the pathogens, foods, and food-pathogen pairs causing illness. Therefore, identification of a food source in an outbreak investigation is key to impacting food safety.
Objective:
The objective of this study was to systematically identify outbreak-associated case demographic and outbreak characteristics that are predictive of food sources using Shiga toxin–producing Escherichia coli (STEC) outbreaks reported to the Centers for Disease Control and Prevention (CDC) from 1998 to 2014 with a single ingredient identified.
Materials and Methods:
Differences between STEC food sources by all candidate predictors were assessed univariately. Multinomial logistic regression was used to build a prediction model, which was internally validated using a split-sample approach.
Results:
There were 206 single-ingredient STEC outbreaks reported to CDC, including 125 (61%) beef outbreaks, 30 (14%) dairy outbreaks, and 51 (25%) vegetable outbreaks. The model differentiated food sources, with an overall sensitivity of 80% in the derivation set and 61% in the validation set.
Conclusions:
This study demonstrates the feasibility of a tool that public health professionals can use to rule out food sources during hypothesis generation in foodborne outbreak investigations, improving efficiency while complementing existing methods.
Introduction
Several recent initiatives, including the Council to Improve Foodborne Outbreak Response (CIFOR), the Integrated Food Safety Centers of Excellence, and the Foodborne Diseases Centers for Outbreak Response Enhancement (FoodCORE), aim to improve the quality of outbreak investigations nationwide by providing public health professionals at the local, state, and federal levels with guidelines, tools, and resources to aid outbreak investigations (CDC, 2010, 2012; CIFOR, 2014). Examples of existing resources include guidelines for model practices during outbreak investigations (CIFOR, 2014), a national hypothesis-generating questionnaire (CDC, 2013), pathogen symptom profiles (Hall et al., 2001; Turcios et al., 2006; Hedberg et al., 2008; Domínguez et al., 2010), and population-level food consumption data (CDC, 2007; OPHD, 2010). Using data from foodborne illness outbreak reports in the United States, our goal was to systematically identify factors predictive of outbreak food sources and to develop a tool for investigators to use for hypothesis generation. This is a novel, data-driven approach to hypothesis generation in foodborne illness outbreak investigations.
For this study, the focus was Shiga toxin–producing Escherichia coli (STEC) outbreaks. STEC is one of the most common foodborne illness pathogens, causing ∼6% of confirmed, single-etiology outbreaks (only norovirus and Salmonella cause more outbreaks). Most STEC outbreaks are attributed to beef, followed by leafy vegetables (Gould et al., 2013a). There are many heterogeneous STEC serogroups that cause human gastrointestinal illness, the most common of which is O157:H7 (Gould et al., 2013b).
Materials and Methods
Data source
National data on reported foodborne STEC outbreaks from 1998 to 2014 were available from the Centers for Disease Control and Prevention's (CDC) Foodborne Outbreak Surveillance System. The Electronic Foodborne Outbreak Reporting System (eFORS) collected data on foodborne and waterborne enteric disease outbreaks from 1998 to 2008. In 2009, the National Outbreak Reporting System (NORS) replaced eFORS and expanded to collect data on foodborne, waterborne, person-to-person, animal contact, environmental contamination, and undetermined transmission routes. These passive surveillance systems receive reports from state, local, and territorial health agencies using a standard form (CDC, 2009). Outbreaks reported to the CDC were extracted on August 27, 2015 using the following qualifications: foodborne mode of transmission, finalized report, onset year 1998–2014, and STEC etiology. CDC provided data in a Microsoft Access database. Relational data were merged into a single, flat file using SAS 9.4.
Prediction model
Using STEC outbreaks with a single ingredient identified, a multinomial logistic regression model was developed to predict food sources. The model was developed using a random subset of outbreaks and validated on the remaining outbreaks.
Food source categories
The Interagency Food Safety Analytics Collaboration (IFSAC) Food Categorization Scheme was used to identify food source categories (Supplementary Fig. S1; Supplementary Data are available online at
Demographic and outbreak predictors
Outbreak-associated case demographic predictors included percentage female (percentage of cases in an outbreak who were women) and age (percentage of cases in an outbreak aged <5, 5–19, 20–49, and ≥50 years). Outbreak predictors included the number of cases (both laboratory confirmed and epidemiologically linked), percentage hospitalized, multistate outbreak (i.e., cases that occurred in multiple states), exposure setting (private or non-private), season, outbreak duration (number of days between the date of illness onset for first and last reported case), and serogroup (O157:H7 or non-O157:H7). For exposure setting, a private establishment was defined as a location not subject to inspection, where a non-food worker would prepare and serve food (e.g., “private home”). All other settings were non-private, defined as locations subject to inspection, where a food worker would prepare and serve food (e.g., a restaurant or facility). Season was based on onset date of the first case, and it was categorized as fall (September–November), spring (March–May), summer (June–August), or winter (December–February).
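The derived predictors defined above can be sketched in a few lines. This is a minimal illustration, not the authors' SAS code: the function names (`season_of`, `outbreak_duration_days`, `percentage`) are hypothetical, and only the season boundaries and the duration definition are taken from the text.

```python
from datetime import date

# Season categories as defined in the text, keyed by month of first-case onset.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}


def season_of(first_onset: date) -> str:
    """Season of an outbreak, based on the onset date of the first case."""
    return SEASON_BY_MONTH[first_onset.month]


def outbreak_duration_days(first_onset: date, last_onset: date) -> int:
    """Days between illness onset of the first and last reported case."""
    return (last_onset - first_onset).days


def percentage(count: int, total: int) -> float:
    """Share of cases with a characteristic (e.g., female, hospitalized)."""
    return 100.0 * count / total


# Hypothetical outbreak: 12 cases, 8 of them female, onsets 3-17 October.
example_season = season_of(date(2012, 10, 3))
example_duration = outbreak_duration_days(date(2012, 10, 3), date(2012, 10, 17))
example_pct_female = percentage(8, 12)
```

The example outbreak is invented for illustration and does not correspond to any record in the surveillance data.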
Statistical analysis
A split-sample approach was used for internal validation. The dataset was randomly divided into a derivation set (70%) and a validation set (30%). The association between each candidate predictor and food source category was assessed in univariate comparisons in the derivation set. For continuous predictors that met parametric assumptions, an analysis of variance (ANOVA) test was used to compare differences by foods. For non-normal continuous predictors, a Kruskal–Wallis test was used. For categorical predictors, a Pearson chi-squared (χ²) test was used. A Fisher's exact test was used for categorical predictors with small cell sizes.
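A random 70/30 split of this kind might look as follows. This is only a sketch under stated assumptions: the function name `split_sample`, the seed, and the rounding rule are hypothetical, and the text does not describe how the authors performed the randomization in SAS.

```python
import random


def split_sample(outbreak_ids, derivation_fraction=0.70, seed=0):
    """Randomly partition outbreak IDs into derivation and validation sets."""
    ids = list(outbreak_ids)
    rng = random.Random(seed)  # fixed seed (an assumption) so the split is reproducible
    rng.shuffle(ids)
    n_derivation = round(len(ids) * derivation_fraction)
    return ids[:n_derivation], ids[n_derivation:]


# 206 outbreaks were available for modeling; the IDs here are placeholders.
derivation, validation = split_sample(range(206))
```

With this rounding rule the sketch yields 144 and 62 outbreaks, whereas the Results report 145 and 61, so the authors' procedure evidently differed in detail.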
Candidate predictors with more complete data (<20% missingness) and univariate significance (p < 0.10) were included in multinomial logistic regression analysis, which is an extension of binary logistic regression for multi-category outcomes (Biesheuvel et al., 2008; Barnes et al., 2013; Ge et al., 2013). A backwards method was used to select final predictors in the derivation set, selecting a minimal number of predictors that maintained adequate classification accuracy. The model predicted the probability of each food source (beef, dairy, vegetable), such that the predicted probabilities for each outbreak added to 100%.
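To illustrate why the predicted probabilities for each outbreak sum to 100%, the sketch below applies the generalized logit transformation to a set of linear predictors. Beef is taken as the reference category, which is an assumption consistent with the dairy-versus-beef and vegetable-versus-beef odds ratios reported in the Results. The numeric log-odds are made up and are not the fitted coefficients.

```python
import math


def predicted_probabilities(log_odds_vs_reference, reference="beef"):
    """Convert log-odds against a reference category into probabilities.

    log_odds_vs_reference maps each non-reference food to its linear
    predictor (intercept plus coefficients times predictor values).
    """
    scores = {reference: 0.0, **log_odds_vs_reference}
    denominator = sum(math.exp(v) for v in scores.values())
    return {food: math.exp(v) / denominator for food, v in scores.items()}


# Hypothetical linear predictors for one outbreak (not fitted values).
probs = predicted_probabilities({"dairy": -0.5, "vegetable": 1.2})
```

Because every probability shares the same denominator, the three values always sum to 1 regardless of the coefficients.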
The model was scored to the validation set, which did not impact univariate analysis or multivariable model building, and was evaluated based on diagnostic classification accuracy in both the derivation and validation sets. The model was also scored to both the entire dataset and random subsets (30%, 60%, and 90% of the total dataset). Trends for each predictor from type 3 analysis of effects based on the Wald χ² test were determined. Maximum likelihood estimates were obtained, along with odds ratio estimates with 95% Wald confidence limits. The predicted food was determined by the highest predicted probability for each outbreak based on the model. Predicted probabilities were plotted in triangle plots using the TRIPLOT macro for SAS (Graham and Midgley, 2000; Friendly, 2009; Barnes et al., 2013).
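The classification rule (the predicted food is the one with the highest predicted probability) and the first-choice and first-or-second-choice accuracy summaries can be sketched as below. The function names and the four example outbreaks are hypothetical and serve only to show the computation.

```python
def rank_foods(probs):
    """Foods ordered from highest to lowest predicted probability."""
    return sorted(probs, key=probs.get, reverse=True)


def classification_summary(outbreaks):
    """Fraction correct by first choice and by first-or-second choice.

    outbreaks is a list of (actual_food, predicted_probabilities) pairs.
    """
    first = sum(1 for actual, p in outbreaks if rank_foods(p)[0] == actual)
    top_two = sum(1 for actual, p in outbreaks if actual in rank_foods(p)[:2])
    n = len(outbreaks)
    return first / n, top_two / n


# Invented outbreaks, for illustration only.
example_outbreaks = [
    ("beef", {"beef": 0.70, "dairy": 0.20, "vegetable": 0.10}),
    ("dairy", {"beef": 0.48, "dairy": 0.47, "vegetable": 0.05}),
    ("vegetable", {"beef": 0.20, "dairy": 0.10, "vegetable": 0.70}),
    ("dairy", {"beef": 0.50, "dairy": 0.10, "vegetable": 0.40}),
]
first_choice_accuracy, top_two_accuracy = classification_summary(example_outbreaks)
```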
All analyses were performed in SAS, version 9.4. Multinomial logistic regression was performed using the generalized logit (glogit) link function in SAS PROC LOGISTIC.
Results
STEC outbreaks
From 1998 to 2014, 470 STEC outbreaks were reported. Of these, 153 (33%) did not identify a food and were excluded (Fig. 1). In addition, 80 complex food outbreaks were excluded, which comprised 25% of the remaining 317 outbreaks with an identified food source. Of the 237 outbreaks with a single ingredient identified, 125 (53%) were beef, 44 (19%) were vegetable row crops, 30 (13%) were dairy, and 7 (3%) were sprouts. Vegetable row crops and sprouts were combined into a single “vegetables” category. The remaining 31 outbreaks represented a heterogeneous group of foods with rare STEC exposure and were excluded (Supplementary Table S1).

Figure 1. Outbreaks reported to eFORS and NORS, 1998–2014. eFORS, Electronic Foodborne Outbreak Reporting System; NORS, National Outbreak Reporting System.
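The exclusion steps reported above can be checked with simple arithmetic. The variable names are illustrative; every count comes from the text.

```python
reported = 470
no_food_identified = 153
complex_food = 80
other_rare_foods = 31

with_food = reported - no_food_identified               # outbreaks with an identified food
single_ingredient = with_food - complex_food            # outbreaks with a single ingredient
modeled = single_ingredient - other_rare_foods          # beef, dairy, and vegetable outbreaks

# Vegetable row crops (44) and sprouts (7) were combined into "vegetables".
by_food = {"beef": 125, "vegetable": 44 + 7, "dairy": 30}

print(with_food, single_ingredient, modeled, sum(by_food.values()))
```

The three categories sum to the 206 outbreaks used for modeling, matching the total given in the Abstract.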
Univariate analysis
Of the 145 beef, dairy, and vegetable outbreaks in the derivation dataset, the median percentage female was higher for vegetables (64%) than for beef and dairy (50%) (Table 1). The median percentage aged <5 years was the highest for dairy (22%), whereas those aged 5–19 and 20–49 years were the highest for dairy (50%) and vegetables (44%), respectively. The median number of cases was the highest for vegetables (n = 18), followed by beef (n = 9), and dairy (n = 5). Dairy (70%) and beef (52%) were more often reported in a private setting, whereas vegetables were more likely reported in a non-private setting (82%). The proportion of multistate outbreaks was the highest for vegetable (56%), followed by beef (36%), and dairy (10%). Seasonal trends were noted, with more vegetable outbreaks in fall (44%) and more beef in summer (40%). Outbreaks with the non-O157:H7 serogroup were the most common for dairy (25%), followed by vegetables (18%), and beef (4%) (Supplementary Table S2). The percentage of outbreaks with incomplete data was not significantly different between food sources (Table 1).
Data presented as median (IQR), p-value from Kruskal–Wallis.
Incomplete data assessed independently of continuous variables, presented as proportion (%), p-value from chi-square.
Presented as proportion (%), p-value from chi-square.
Presented as proportion (%), Fisher's Exact.
Prediction model
There were 116 outbreaks in the final model after 29 outbreaks with incomplete data had been excluded. Final predictors were percentage female, number of cases, exposure setting, multistate outbreak, season, and serogroup. The model correctly classified 56 of 64 beef outbreaks (sensitivity 88%, specificity 71%); 9 of 16 dairy outbreaks (sensitivity 56%, specificity 98%); and 28 of 36 vegetable outbreaks (sensitivity 78%, specificity 93%) (Table 2). The predicted probabilities for beef outbreaks clustered in the “beef” apex of the plot (Fig. 2a), indicating a high predicted probability of beef for actual beef outbreaks. Similarly, vegetable outbreaks clustered in the “leafy” apex, indicating a high predicted probability of vegetables, with some dispersion between the beef and vegetable outbreaks. Dairy outbreaks were more dispersed between the “dairy” and “beef” apexes. Overall, 80% of outbreaks in the derivation set were correctly classified by the model's first choice, and 98% were classified by the model's first or second choice. Odds ratios for dairy versus beef outbreaks were significant and greater than 1.0 for number of cases, private setting, spring, and serogroup (Table 3). Odds ratios for leafy versus beef outbreaks were significant and greater than 1.0 for percentage female, number of cases, multistate outbreak, and serogroup.

Figure 2. Predicted probability of beef, dairy, and leafy vegetable food sources as a function of actual food sources. The three-way predicted probabilities of beef, dairy, and leafy vegetable food sources in the derivation set
Bold values indicate the number of outbreaks correctly classified by the prediction model for each food source.
Covariates presented by increments of 10%.
There were 49 outbreaks in the withheld validation set scored to the model after 12 outbreaks with incomplete data had been excluded. The model correctly classified 20 of 29 beef outbreaks (sensitivity 69%, specificity 60%), 3 of 10 dairy outbreaks (sensitivity 30%, specificity 90%), and 7 of 10 vegetable outbreaks (sensitivity 70%, specificity 82%). The distribution of predicted probabilities is shown in Figure 2b. Overall, 61% of outbreaks were correctly classified by the model's first choice, and 96% of outbreaks were classified by the model's first or second choice. The model was also scored to the entire dataset (n = 206) and subsets of the entire dataset (30%, 60%, 90%). The sensitivity and specificity in the data subsets were comparable to those in the original derivation set and were not impacted by sample size.
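The per-food sensitivities and the overall first-choice accuracy follow directly from the counts reported for the derivation and validation sets, as the sketch below shows. Specificity is not recomputed here because it requires the full confusion matrix, which is not reproduced in the text. The function names are illustrative.

```python
# (correctly classified, total) per actual food source, as reported in the text.
derivation_counts = {"beef": (56, 64), "dairy": (9, 16), "vegetable": (28, 36)}
validation_counts = {"beef": (20, 29), "dairy": (3, 10), "vegetable": (7, 10)}


def sensitivity(correct, total):
    """Proportion of outbreaks of a given food that the model classified correctly."""
    return correct / total


def overall_accuracy(counts):
    """Proportion of all outbreaks correctly classified by the model's first choice."""
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return correct / total


for label, counts in (("derivation", derivation_counts), ("validation", validation_counts)):
    per_food = {food: sensitivity(c, t) for food, (c, t) in counts.items()}
    print(label, per_food, overall_accuracy(counts))
```

The overall values work out to 93/116 and 30/49, in line with the 80% and 61% reported.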
Discussion
This study systematically identified factors predictive of food sources in STEC outbreaks reported to the CDC. Factors predictive of three major food sources (beef, vegetables, and dairy) included case demographic and outbreak characteristics. These factors were used to build and validate a prediction model to estimate the probability of each major food source for a given STEC outbreak. This study provides the groundwork for a predictive tool that investigators can use during hypothesis generation in foodborne outbreak investigations and could be applied to other foodborne pathogens.
Gender and age distributions differed between food sources in STEC outbreaks. Vegetable STEC outbreaks had the highest median percentage female, whereas there was no significant gender difference in beef and dairy outbreaks. Vegetable STEC outbreaks also had the lowest median percentage of children and adolescents, whereas the percentage of children and adolescents was the highest for dairy outbreaks. Food consumption surveys have found similar patterns (Mun and Krebs-Smith, 1997; Shiferaw et al., 2000; Patil et al., 2005; Samuel et al., 2007; Shiferaw et al., 2012). For example, the FoodNet Population Survey found that women consume more fruits and vegetables than men and that men consume more meat and poultry; however, there was no gender difference in ground beef consumption (Shiferaw et al., 2012). Other studies reported that younger children consumed less meat and vegetables than adults (Mun and Krebs-Smith, 1997).
Seasonal variation by food source was another key predictive factor noted in this study, with 40% of beef outbreaks occurring in summer and 44% of vegetable outbreaks occurring in fall. Overall, this study found that STEC outbreaks occurred in warmer seasons, with few outbreaks occurring in winter (8% of beef, 5% of dairy, and 13% of vegetable outbreaks). The incidence of foodborne illness is known to vary seasonally, generally increasing in warmer months (Lal et al., 2012). Bacterial pathogens are sensitive to temperature and moisture, and conditions during warmer seasons increase pathogen survival and proliferation (Money et al., 2010). Studies have found that the prevalence of STEC in cattle peaks in summer months (Edrington et al., 2006; Ferens and Hovde, 2011). Food consumption patterns may differ as well due to seasonal events and holidays, availability, and cost (Ravel et al., 2010; Wilson, 2015; Stelmach-Mardas et al., 2016).
Vegetable STEC outbreaks had the highest median number of cases per outbreak. This is consistent with the higher proportion of vegetable outbreaks that were multistate, which tend to involve large numbers of outbreak-associated cases. Conversely, many dairy outbreaks are due to unpasteurized milk and tend to be local, because unpasteurized dairy products cannot be sold at retail stores, and, therefore, have limited distribution (Angulo et al., 2009; Newkirk et al., 2011). In localized outbreaks with a single exposure site, food is typically contaminated during preparation and served at a single setting (Murphree et al., 2012). In contrast, multistate outbreaks have contamination points earlier in the production chain, and foods that are prone to contamination at these stages may differ from foods that are contaminated at a single setting (Nguyen et al., 2015).
In the multivariable model, six factors (percentage female, number of cases, multistate outbreak, exposure setting, season, and serogroup) were predictive of food sources. Model classification accuracy was the highest for beef, moderate for vegetable, and poor for dairy. Misclassified dairy and vegetable outbreaks were almost exclusively classified as beef, with limited misclassification between dairy and vegetable outbreaks. This was likely driven by unequal sample sizes: there were more beef outbreaks, whereas dairy outbreaks were far fewer and, therefore, contributed less information to the model estimates. Each food category represented an aggregated group of food items. For example, dairy outbreaks included food sources that were pasteurized or unpasteurized, and they included products such as ice cream, cheese, and fluid milk. Each of these items may have unique profiles that do not necessarily justify aggregation, therefore making prediction difficult.
The results indicate that the model could be used most effectively as a rule-out tool. Most misclassified outbreaks had a similar predicted probability for the model's first and second choice. For example, if a dairy outbreak was predicted to be related to beef, the predicted probability could be 48% beef, 47% dairy, and 5% vegetable. Although the model misclassified some outbreaks, 98% of outbreaks in the derivation set were correctly classified by the model's first or second choice. During hypothesis generation, an investigator would consider the first and second food sources, but could be more confident about excluding the third food source.
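The rule-out use described above can be sketched for a single outbreak using the 48%/47%/5% example from the text. The function name `triage` and the 5-percentage-point threshold for flagging a close call are hypothetical; the paper does not define a threshold.

```python
def triage(probs, close_call_margin=0.05):
    """Split food sources into those to consider and the one to rule out.

    close_call_margin is an assumed threshold for flagging outbreaks whose
    first and second choices have similar predicted probabilities.
    """
    ranked = sorted(probs, key=probs.get, reverse=True)
    gap = probs[ranked[0]] - probs[ranked[1]]
    return {
        "consider": ranked[:2],
        "rule_out": ranked[2:],
        "close_call": gap < close_call_margin,
    }


# Example probabilities from the text for a misclassified dairy outbreak.
result = triage({"beef": 0.48, "dairy": 0.47, "vegetable": 0.05})
```

Here the model's first choice (beef) is wrong, but dairy remains under consideration and only vegetables are ruled out, which is the behavior the rule-out framing relies on.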
The results of this study could be translated into a tool for public health professionals to use during outbreak investigation. In the early stages of an outbreak, descriptive profiles based on univariate results could be used when data are only available for a limited number of variables. For example, if an STEC outbreak occurs in fall and a high percentage of cases are female and aged 20–49 years, vegetables may be strongly suspected. Public health professionals could use a tool adapted from the multivariable model at the point of food hypothesis generation after non-foodborne routes of transmission have been ruled out. For STEC outbreaks, the investigator would enter percentage female, number of cases, multistate outbreak, exposure setting, season, and serogroup. The tool would output a predicted probability of an STEC outbreak being beef, dairy, or vegetables. The predicted probability would help direct hypothesis generation, or give additional evidence for existing hypotheses.
For this analysis, all predictors that improved the predictive value of the model were included, using data from finalized investigations. However, some predictors are unlikely to be available at the start of an outbreak, whereas others may change over the course of an investigation. Exposure setting is often determined by the investigation and, therefore, unlikely to be known at the time that this tool would be used. Other predictors change over the investigation. For example, the number of cases may increase as investigators actively find additional cases. Values for predictors may be skewed at the start of an outbreak and may not be representative of the final values for the outbreak, which were used to derive the model.
In addition to use as a practical tool during outbreak investigations, the statistical model developed here could be used in foodborne illness source attribution by estimating the food source distribution in outbreaks with a previously undetermined vehicle. Attribution estimates exclude outbreaks with undetermined food vehicles (Painter et al., 2013; IFSAC, 2015). Using a model to estimate the food source distribution in these outbreaks would increase the sample size of available outbreaks and could provide more accurate estimates of the relative contribution of food sources.
This study has several limitations in addition to those already discussed. First, age variables were excluded because of incomplete data and because each age category was entered as a separate variable, which resulted in non-convergence of the model when included. Exclusion of age, consequently, impacted classification accuracy of dairy outbreaks, which decreased considerably. Second, serogroup was included as a predictor, because it varied by food source; however, inclusion may not be justified. Non-O157, which is increasing in incidence of sporadic illnesses (Gould et al., 2013b), is generally less severe, differs geographically from O157, and is associated with different modes of transmission (animal contact is more common) and different foods (Mathusa et al., 2010; Gould et al., 2013b; Luna-Gierke et al., 2014). Third, there are many other factors that could be predictive of a food source in an outbreak investigation; however, we were limited by what is collected by the surveillance system, as well as by the completeness and consistency of reporting. Finally, an external dataset (e.g., an independent population), often used to validate clinical prediction rules, was unavailable.
It is recommended that future studies build on this work to apply to other pathogens, such as Salmonella, and refine the model to increase accuracy and generalizability. The model should be translated into a user-friendly online tool for investigators. Future work should explore methods to incorporate novel and complex foods and to further characterize outbreaks with undetermined sources. Additional data sources should be explored to supplement outbreak surveillance data and to explore additional potentially important predictors. Finally, additional work should use outbreak surveillance data to build tools for foodborne outbreak investigators.
Conclusion
This study provides evidence for the feasibility of using prior case and outbreak characteristics to predict food sources in foodborne outbreak investigations. This is the first study to demonstrate statistically that case demographics, which are commonly used by experienced investigators to generate hypotheses, are associated with food sources in foodborne outbreak investigations. In addition, this is the first study to propose a model to consider all factors simultaneously in a single prediction model. The complexity of the global food industry and modern challenges to food safety mean that outbreak investigations are increasingly important for identifying sources of foodborne illness. To combat the challenges inherent in public health investigations, analytical, data-driven tools are essential for conducting efficient outbreak investigations and for improving food source identification.
Footnotes
Acknowledgments
This study was funded by the Centers for Disease Control and Prevention (CDC) through the Colorado Integrated Food Safety Center of Excellence (CoE). The authors thank the National Outbreak Reporting System (NORS) group at CDC for providing the data used in this analysis and for their consultation.
Disclosure Statement
No competing financial interests exist.
