Abstract
Information on the causative agent in an enteric disease outbreak can be used to generate hypotheses about the route of transmission and possible vehicles, to guide environmental assessments, and to target outbreak control measures. However, only about 40% of outbreaks reported in the United States include a confirmed etiology. The goal of this project was to identify clinical and demographic characteristics that can be used to predict the causative agent in an enteric disease outbreak and to use these data to develop an online tool for investigators to use during an outbreak when hypothesizing about the causative agent. Using data on enteric disease outbreaks from all transmission routes (animal contact, environmental contamination, foodborne, person-to-person, waterborne, unknown) reported to the U.S. Centers for Disease Control and Prevention, we developed random forest models to predict the etiology of an outbreak based on aggregated clinical and demographic characteristics at both the etiology category (i.e., bacteria, parasites, toxins, viruses) and individual etiology (Clostridium perfringens, Campylobacter, Cryptosporidium, norovirus, Salmonella, Shiga toxin–producing Escherichia coli, and Shigella) levels. The etiology category model had a kappa of 0.85 and an accuracy of 0.92, whereas the etiology-specific model had a kappa of 0.75 and an accuracy of 0.86. The highest sensitivities in the etiology category model were for bacteria and viruses; all categories had high specificities (>0.90). For the etiology-specific model, norovirus and Salmonella had the highest sensitivity and all etiologies had high specificities. When laboratory confirmation is unavailable, information on the clinical signs and symptoms reported by people associated with the outbreak, with other characteristics including case demographics and illness severity, can be used to predict the etiology or etiology category. 
An online publicly available tool was developed to assist investigators in their enteric disease outbreak investigations.
Background
Enteric disease outbreaks are common in the United States with ∼3500 outbreaks reported annually (CDC, 2022). Because many enteric etiologies cause similar gastrointestinal symptoms, identifying the causative bacterial, viral, or parasitic agent is important for solving the outbreak and preventing further illnesses. Specifically, information on the etiology can generate hypotheses about the route of transmission and possible vehicles, guiding environmental assessments and control measures (White et al., 2022; White et al., 2021). Outbreak data with information on the etiology and implicated vehicle are also used to prioritize policy and other interventions aimed at reducing the incidence of enteric illness nationally. Specifically, the Interagency Food Safety Analytics Collaboration (IFSAC) uses foodborne outbreak data to attribute etiology-specific illness estimates to specific foods, providing data for policy development and risk-based decision-making (Batz et al., 2021).
Despite the importance of identifying the causative agent, only ∼40% of reported outbreaks include a confirmed etiology with ∼33% including a suspected etiology (CDC, 2022). Confirming the etiology typically requires the isolation of the organism from the stool specimen of two or more people associated with the outbreak. This can be challenging for several reasons, including hesitancy of ill people to provide a stool sample and limited public health resources to facilitate the collection, storage, and transportation (Torok et al., 2022).
While laboratory testing is the most definitive way to identify the etiology, clinical information on the frequency of specific signs and symptoms (e.g., vomiting, diarrhea) and incubation period has been shown to differentiate between bacteria and norovirus (Hall et al., 2001; Hedberg et al., 2008; Kaplan et al., 1982; Turcios et al., 2006). Other characteristics, including case demographics and illness severity, also differ by etiology (Lund and O'Brien, 2011; Wikswo et al., 2022). The goal of this project was to identify clinical and demographic characteristics that can be used to predict the etiology in an enteric disease outbreak and to use these data to develop an online tool for investigators to use during an outbreak when hypothesizing about the etiology.
Materials and Methods
Data source
We used data on enteric disease outbreaks reported by state, local, territorial, and federal public health agencies through the Centers for Disease Control and Prevention's electronic Foodborne Outbreak Reporting System (eFORS) from 1998 to 2008 (foodborne only) and the National Outbreak Reporting System (NORS) from 2009 to 2020 (all transmission routes: animal contact, environmental contamination, foodborne, person-to-person, waterborne, unknown). Outbreak reports included information on the first reported illness onset date, outbreak setting, aggregated age group (<5 years, ≥50 years), aggregated sex, number of hospitalizations, duration of illness (shortest, median, longest), incubation period (shortest, median, longest), confirmed or suspected etiology, the percentage of outbreak-associated illnesses with specific signs and symptoms (diarrhea, bloody diarrhea, abdominal cramps, fever, and vomiting), and outbreak vehicle (e.g., ground beef, turtles). Data were downloaded on June 6, 2022, and are subject to change.
Data analysis
Outbreaks with a confirmed, single genus etiology were included in the analysis, excluding those caused by Clostridioides difficile, a predominantly hospital-acquired infection, and etiologies that present with distinctive or primarily non-enteric symptoms (ciguatoxin; Clostridium botulinum; hepatitis virus; Legionella; Listeria; Salmonella serotypes Typhi, Paratyphi A, and Paratyphi B; and scombroid toxin).
Candidate predictors of outbreak etiology included those likely to be known early in an investigation: percentage of cases with specific signs and symptoms of enteric infection, percentage of cases who were hospitalized, percentage of cases <5 or ≥50 years of age, percentage of cases who were female, shortest duration of illness, and shortest incubation period. We used the shortest duration and incubation period because they are more likely to be known early in the investigation than the median.
We replaced missing values of the percentage of bloody stools for norovirus with 1%, based on published estimates (Wikswo et al., 2022). We replaced missing values for the shortest incubation period and shortest duration of illness with the median value reported for that outbreak, if available. We categorized the incubation period (<8 h, 8 h to 3 days, >3 days, unknown) and duration (<12 h, 12 h to <1.5 days, 1.5–3.5 days, ≥3.5 days, unknown). We did not exclude missing values and added an unknown category for the incubation period and duration of illness because unknown values could be predictive. Missing data remaining after replacements described above were imputed using a single random forest imputation.
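The fallback and categorization rules described above can be sketched as follows. This is an illustrative Python sketch, not the authors' R code: the function names are hypothetical, values are assumed to be in hours, and the handling of values that fall exactly on a boundary (for example, exactly 3.5 days) is an assumption because the text does not specify it.

```python
from typing import Optional


def first_known(shortest_h: Optional[float], median_h: Optional[float]) -> Optional[float]:
    """Use the shortest value; fall back to the outbreak's median if it is missing."""
    return shortest_h if shortest_h is not None else median_h


def incubation_category(hours: Optional[float]) -> str:
    """Bin an incubation period (hours) into the four categories from the text."""
    if hours is None:
        return "unknown"
    if hours < 8:
        return "<8 h"
    if hours <= 72:  # assumption: exactly 3 days belongs to the middle bin
        return "8 h to 3 days"
    return ">3 days"


def duration_category(hours: Optional[float]) -> str:
    """Bin an illness duration (hours) into the five categories from the text."""
    if hours is None:
        return "unknown"
    if hours < 12:
        return "<12 h"
    if hours < 36:
        return "12 h to <1.5 days"
    if hours < 84:  # assumption: exactly 3.5 days belongs to the last bin
        return "1.5-3.5 days"
    return ">=3.5 days"


# Shortest incubation missing, median of 10 h reported for the outbreak.
print(incubation_category(first_known(None, 10.0)))
# Neither value reported: the outbreak keeps an explicit "unknown" level.
print(duration_category(first_known(None, None)))
```

Keeping "unknown" as its own level, rather than dropping the outbreak, lets the model learn from the fact that a value was not reported.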
The imputed data were split into 70% training set and 30% testing set, stratified by etiology. Two random forest models were developed using the training set with 500 trees (Breiman, 2001). The first model predicted the etiology category (i.e., bacteria, parasite, toxin, virus). The second model predicted specific etiologies. We limited the etiology category model to enteric etiologies with ≥50 outbreaks and the etiology-specific model to enteric etiologies with ≥300 outbreaks to avoid rare etiologies.
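A stratified 70/30 split of this kind can be sketched in plain Python. The function name, the seed, and the toy label counts below are hypothetical, and the sketch is not the authors' R implementation; it only shows that each etiology is split separately so that rare etiologies appear in both sets.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.7, seed=0):
    """Return (train_idx, test_idx), splitting each label group separately."""
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        idx = by_label[label]
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))  # per-etiology training size
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)


# Toy data: an unbalanced mix of three etiologies (counts are made up).
labels = ["norovirus"] * 70 + ["Salmonella"] * 20 + ["Shigella"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because the cut is taken within each etiology, the rarest group in the toy data (10 outbreaks) contributes 7 outbreaks to training and 3 to testing instead of being left to chance.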
Each outbreak in the training dataset for both models was weighted based on the frequency of the etiology to account for unbalanced data. The models were evaluated on accuracy and kappa values using the mlr (2.19.0) package in R (Bischl et al., 2016). Each etiology category and each etiology were evaluated by sensitivity and specificity for the first predicted etiology and by sensitivity for the first or second predicted etiology. The models were developed into an online tool using R Shiny (Chang et al., 2022). All analyses were conducted using R 4.2.1 (R Core Team, 2022). Random forest imputation, modeling, and variable importance used the R package randomForestSRC (3.1.0) (Ishwaran and Kogalur, 2022).
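The weighting and evaluation steps can be illustrated with a short Python sketch. The text does not state the exact weighting formula, so inverse-frequency weights are used here as an assumption; kappa is computed as Cohen's kappa, (observed agreement - chance agreement) / (1 - chance agreement). Function names and the toy labels are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme: weight each outbreak by n / (k * count of its etiology)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[label]) for label in labels]


def accuracy_and_kappa(actual, predicted):
    """Return (accuracy, Cohen's kappa) for paired label lists."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Agreement expected by chance from the marginal label frequencies.
    expected = sum(actual_counts[c] * predicted_counts[c] for c in actual_counts) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


actual = ["virus", "virus", "virus", "bacteria"]
predicted = ["virus", "virus", "bacteria", "bacteria"]

print(inverse_frequency_weights(actual))
print(accuracy_and_kappa(actual, predicted))
```

Kappa is reported alongside accuracy because, with unbalanced data, a model that always predicts the most common etiology can reach high accuracy while agreeing with the truth no better than chance.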
Results
From 1998 to 2020, 56,005 outbreaks were reported, of which 22,149 had a confirmed, single genus etiology. The final dataset included 20,382 outbreaks, after excluding 1767 caused by an excluded etiology. There were 14 etiologies with ≥50 confirmed etiology outbreaks included in the etiology category model (20,204 outbreaks) and 7 etiologies with ≥300 confirmed etiology outbreaks included in the etiology-specific model (19,272 outbreaks) (Tables 1 and 2).
Table 1. Aggregated Outbreak Signs, Symptoms, Demographics, and Incubation Period by Major Pathogen Category, National Outbreak Reporting System, 1998–2020
Etiologies with <50 outbreaks included: Acanthamoeba, adenovirus, amnesic shellfish poison, Anisakis, astrovirus, avian schistosomes, Brucella, chloramines, chlorine, chlorine gas, cleaning agents, copper, cyanotoxin, Enterococcus, heavy metals, Leptospira, Microcystis, monosodium glutamate, Mycobacterium, mycotoxins, neurotoxic shellfish poison, nitrite, Pantoea, paralytic shellfish poison, pesticides, plant/herbal toxins, Pseudomonas, puffer fish tetrodotoxin, Streptococcus, Toxoplasma, Trichinella, Yersinia, other, unknown.
IQR, interquartile range.
Table 2. Aggregated Outbreak Signs, Symptoms, Demographics, Incubation Period, and Illness Duration by Pathogen Genus, National Outbreak Reporting System, 1998–2020
IQR, interquartile range; STEC, Shiga toxin–producing Escherichia coli.
Among etiology category outbreaks, parasitic outbreaks had the longest incubation period (median 7 days; interquartile range [IQR] 6–8 days), and bacterial toxin outbreaks had the shortest incubation period (0.4 day; IQR 0.2–0.5 days) (Table 1). Sex was evenly distributed for all etiology categories except viral outbreaks, where the percentage of females was higher (median 63%; IQR 42–78%). The median hospitalization rate was 12% for bacterial outbreaks and 0 for other categories. Bloody stools were more common among individuals in bacterial outbreaks (28%). Viral outbreaks had lower proportions of cases with abdominal cramps (57% vs. bacterial 85%, bacterial toxins 79%, parasites 81%) and diarrhea (86% vs. all others 100%), but more vomiting (72% vs. 40%, 25%, 41%). The proportion of people reporting fever varied among etiology categories (bacteria 65%, parasites 35%, viruses 22%, and bacterial toxins 5%) (Table 1).
Symptoms of diarrhea and cramps were commonly reported among all cases except for those infected with norovirus. Campylobacter, Salmonella, and Shigella had similar median values for bloody stools (25%, 20%, 25%, respectively), whereas Shiga toxin–producing Escherichia coli (STEC) was higher (75%). Norovirus outbreaks had a higher percentage of patients experiencing vomiting (73%) than other etiologies. The median percentage hospitalized was above zero only for Salmonella (17%) and STEC (33%). The percentage of females was ∼50% for all etiologies except norovirus (63%). Shigella and Cryptosporidium had proportionally more outbreak-associated cases aged <5 years (32% and 11%); all other etiologies had a median of 0–1%. Salmonella, Clostridium perfringens, and norovirus had proportionally more outbreak-associated cases aged ≥50 years (14%, 16%, 25%, respectively) than the other etiologies (0–4%).
The training datasets for the etiology category and the etiology-specific model included 14,143 and 13,490 outbreaks, respectively, after excluding outbreaks with incomplete information. The testing datasets included 6061 outbreaks and 5782 outbreaks, respectively. The etiology category model had a kappa of 0.85 and an accuracy of 0.92, with high specificities for all categories but lower sensitivities for bacterial toxin and parasitic outbreaks (Table 3). The etiology-specific model had a kappa of 0.75 and an accuracy of 0.86. We combined O157 and non-O157 STEC in the etiology-specific model because the model performed worse when only including O157 (<300 non-O157 outbreaks).
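As a quick consistency check, the reported training and testing counts can be compared with the model totals given earlier in the Results (20,204 and 19,272 outbreaks). The following Python sketch uses only numbers stated in the text; the dictionary layout and function name are hypothetical.

```python
# Counts as reported in the Results section.
reported_counts = {
    "etiology category": {"total": 20204, "train": 14143, "test": 6061},
    "etiology-specific": {"total": 19272, "train": 13490, "test": 5782},
}


def check_split(total, train, test):
    """Return (whether train + test equals total, training fraction)."""
    return train + test == total, train / total


for name, counts in reported_counts.items():
    sums_ok, train_frac = check_split(**counts)
    print(f"{name}: train + test == total: {sums_ok}; training fraction = {train_frac:.3f}")
```

Both pairs of counts add up to the model totals, and both training fractions round to 0.700, matching the 70%/30% split described in the Methods.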
Table 3. Pathogen Category Model Comparing Predicted and Documented Pathogen Category
Among specific etiologies, norovirus and Salmonella, respectively, had the highest sensitivities with 98% (3599) and 83% (873) outbreaks correctly predicted (Table 4). Other etiologies had high specificities but lower sensitivities: Campylobacter (0.99 and 0.12, respectively), C. perfringens (1.00 and 0.75, respectively), Cryptosporidium (0.99 and 0.36, respectively), STEC (0.99 and 0.63, respectively), Shigella (0.98 and 0.49, respectively). Many Campylobacter (64%), Shigella (37%), and STEC (23%) outbreaks were predicted as Salmonella; 22% of Cryptosporidium outbreaks were predicted as Salmonella and 22% as Shigella. Sensitivity increased for each etiology when considering the top two predicted etiologies (Table 4). The etiology-specific model conflicted with the etiology category model in <5% of predictions (e.g., etiology type was predicted as a parasite, but the etiology-specific model predicted Salmonella).
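The per-etiology measures reported here treat each etiology as a one-versus-rest problem. A minimal Python sketch of how sensitivity, specificity, and top-two sensitivity can be computed is shown below; the function names and the five toy outbreaks are hypothetical and are not taken from the study data.

```python
def sensitivity_specificity(actual, predicted, target):
    """One-vs-rest sensitivity and specificity for a single etiology."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == target and p == target for a, p in pairs)
    fn = sum(a == target and p != target for a, p in pairs)
    fp = sum(a != target and p == target for a, p in pairs)
    tn = sum(a != target and p != target for a, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


def top_two_sensitivity(actual, ranked, target):
    """Share of target outbreaks whose true etiology is the first or second prediction."""
    hits = [a in r[:2] for a, r in zip(actual, ranked) if a == target]
    return sum(hits) / len(hits)


# Toy example: each inner list ranks etiologies from most to least probable.
actual = ["Campylobacter", "Campylobacter", "Salmonella", "Salmonella", "Salmonella"]
ranked = [
    ["Salmonella", "Campylobacter"],
    ["Campylobacter", "Salmonella"],
    ["Salmonella", "Shigella"],
    ["Salmonella", "Campylobacter"],
    ["Campylobacter", "Salmonella"],
]
first_choice = [r[0] for r in ranked]

print(sensitivity_specificity(actual, first_choice, "Campylobacter"))
print(top_two_sensitivity(actual, ranked, "Campylobacter"))
```

In the toy data, one of the two Campylobacter outbreaks is missed by the first prediction but recovered by the second, which is the pattern that raises sensitivity when the top two predictions are considered. The sketch assumes each etiology has at least one actual case and at least one non-case; otherwise the ratios are undefined.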
Table 4. Pathogen-Specific Model Comparing Predicted and Documented Pathogen Genus
STEC, Shiga toxin–producing Escherichia coli.
The random forest importance of each predictor in the etiology category model was highest for bloody stools (0.42), followed by vomiting (0.28), shortest incubation period (0.26), shortest illness duration (0.13), fever (0.07), age <5 years (0.07), hospitalization rate (0.06), diarrhea (0.05), abdominal cramps (0.03), percentage of female (0.03), and age ≥50 years (0.02). For the etiology-specific model, the variable importance from highest to lowest was bloody stools (0.57), vomiting (0.32), shortest incubation period (0.25), shortest illness duration (0.17), age <5 years (0.15), hospitalization rate (0.12), fever (0.09), percentage of female (0.03), abdominal cramps (0.02), age ≥50 years (0.02), and diarrhea (0.02).
We adapted both models into an online publicly available tool for investigators to use during an outbreak investigation when the etiology is unknown (available at

A screenshot of the online tool with an example outbreak of Clostridium perfringens.
Discussion
Knowing the outbreak etiology during an enteric disease outbreak investigation is important for guiding control and prevention efforts and preventing additional illnesses. When laboratory confirmation is unavailable, information on the clinical signs and symptoms reported by people associated with the outbreak, with other characteristics including case demographics and illness severity, can be used to predict the etiology or etiology category. Using data from previous enteric disease outbreaks, we developed an online tool for investigators to use during an outbreak, providing a data-driven approach to predicting the causative agent.
Viral outbreaks in the etiology category model and norovirus outbreaks in the etiology-specific model can be predicted with a high accuracy. This is important because many outbreaks likely to be caused by norovirus do not have an etiology confirmed, yet clear and effective interventions exist to reduce exposure and control the spread of illness (Kambhampati et al., 2015).
Norovirus, which accounts for most viral outbreaks, has a distinctive clinical profile with lower levels of diarrhea and abdominal cramps and higher levels of vomiting compared with other etiologies. For this reason, previous studies have also reported being able to distinguish norovirus from other groups of etiologies, such as diarrheal toxin-like and Salmonella-like illnesses (Hall et al., 2001; Hedberg et al., 2008; Kaplan et al., 1982; Turcios et al., 2006). Norovirus outbreaks also included a greater percentage of females. This finding likely reflects the demographic profile of residents and health care workers at long-term care facilities, a common setting for norovirus outbreaks, and demonstrates how demographic profiles can sometimes be used as a proxy of setting in an outbreak investigation (Caffrey et al., 2021; Calderwood et al., 2021).
Bacterial outbreaks were also predicted with high confidence, and at the etiology level, this was true for only Salmonella and STEC outbreaks. STEC was likely predicted well because it causes bloody stools more frequently than other etiologies, and bloody stools were identified as an important predictor in both the etiology category model and the etiology-specific model. Salmonella was also well predicted, but it has a similar profile to other bacterial etiologies. Many outbreaks caused by Campylobacter and Shigella were predicted as Salmonella, which is more frequently identified as the etiology in outbreaks. To address this, previous studies looking at clinical profiles have grouped Salmonella together with other bacterial etiologies causing a similar illness. For example, Hedberg et al. (2008) combined Salmonella, Shigella, and Campylobacter into a single “Salmonella-like illness.”
However, with additional information, it is possible for investigators to differentiate between these etiologies. For example, an outbreak among children at a day care center may lead investigators to suspect Shigella, whereas Campylobacter may be suspected in an outbreak among raw milk consumers. While our tool does not currently include setting as a predictor, differentiating between bacterial etiologies may assist in narrowing exposures and implementing control measures. If laboratory testing is not conducted by the end of the outbreak, a suspected etiology can still be useful in research. Suspecting an etiology before laboratory results are available can help indicate whether exclusion policies need to be implemented. The online tool shows the probabilities for both models; the investigator can determine which model is more appropriate given the context of the outbreak.
The models predicting bacterial toxin and parasitic outbreaks and their associated etiologies had lower sensitivities, likely due to fewer outbreaks. One of the distinguishing features of bacterial toxin and parasitic outbreaks is the incubation period; toxins have short incubation periods and parasites have long incubation periods. Using the shortest incubation period caused these otherwise distinct incubation periods to overlap with those of bacterial outbreaks, making the categories difficult to distinguish; however, we chose the shortest incubation period because it is more likely to be known early in an investigation. Previous profiles have not included parasites in the predictions as their own category (Dalton et al., 1999; Hall et al., 2001; Hedberg et al., 2008; Kaplan et al., 1982; Turcios et al., 2006).
Although parasitic outbreaks are less common and often have a characteristically long incubation period, suspecting a parasitic outbreak early in an investigation, before an exposure is known, could help narrow exposures for further investigation (Scallan et al., 2011). In some scenarios, having a predicted etiology category may be sufficient information to direct an investigation; hypothesizing that the etiology is a bacterial toxin rather than a virus is enough information to implement control measures and to hypothesize potential transmission modes and vehicles. Additional information about the outbreak not captured in the model can be used in conjunction with the model to suspect one category over another. For example, secondary cases could indicate a virus rather than a bacterial toxin.
The variables differed in their relative importance to the prediction in each model. Bloody diarrhea had the highest relative importance for both models and was a key predictor for identifying STEC and eliminating norovirus as the likely outbreak etiology. Incubation period had the next highest relative importance for the etiology category model; however, investigators may not know the incubation period during an investigation if the exposure is unknown. We accounted for this by including an unknown option for the incubation period. Vomiting was especially important in differentiating norovirus and viruses, which have a higher rate of vomiting than other etiologies. The percentage of cases aged <5 years was the next most important variable for the etiology-specific model. This age group is more common in Shigella and Cryptosporidium outbreaks, aiding in differentiating those two etiologies. Hospitalization rate was next for the etiology category model, which aided the most in differentiating viral outbreaks. Other tools did not include demographics, such as age, although age <5 years was around the middle of relative importance.
The R Shiny tool is available at
This study is subject to several limitations. First, only a subset of etiologies implicated in past outbreaks was included in both models. While these are the most common enteric disease outbreak-associated etiologies, there are other potential outbreak etiologies not included, and investigators should not eliminate consideration of rare etiologies. The small proportion of conflicting results (e.g., predicted parasite and Salmonella) may be related to the subset of etiologies included, and additional etiologies should be considered. Second, the data reported through NORS are from completed, closed-out outbreak investigations, which may be different from an ongoing investigation and potentially bias the tool. Incomplete information may be all that is available during an investigation, so percentages could change as the investigation progresses. If known, the incubation period should not change, because we used the shortest incubation period rather than the median and included an unknown option.
NORS also only collects information on outbreaks in the United States. Symptoms may be comparable internationally, but clinical characteristics such as hospitalizations may differ. In addition, we only included laboratory-confirmed outbreaks. Thus, the model is biased toward etiologies that are commonly isolated and tested, such as Salmonella. Other etiologies may be common, but less likely to have laboratory confirmation, such as bacterial toxins. We excluded suspected etiology outbreaks to minimize potential mistakes by the model in case the wrong etiology was suspected. We also did not include transmission mode. Transmission mode is likely predictive; however, it can be unclear what the transmission mode is early in an outbreak. Future models could account for additional variables such as transmission mode and setting while allowing for missing values.
We demonstrated success in predicting the etiology category and specific etiologies using random forest models and provided an online, publicly available tool for outbreak etiology prediction. Our online tool is intended to assist with the identification of the causative agent in the absence of laboratory testing. Investigators can use these models with additional information from the investigation to assist with hypothesis generation.
Footnotes
Acknowledgments
The authors would like to thank Conner Jackson and all the state and territorial health departments that report outbreaks through NORS.
Disclaimer
The findings and conclusions of this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
Authors' Contributions
H.K. contributed to the project conception, data acquisition, design, analysis, and interpretation. A.W. contributed to the design and interpretation. B.B.B. contributed to the design, analysis, and interpretation. E.B.R. contributed to the analysis and interpretation. E.S.W. contributed to the design and interpretation. All authors contributed to the article draft and revisions. All authors provided approval for submission and agree to accountability.
Disclosure Statement
No competing financial interests exist.
Funding Information
This article was funded in part by the Colorado Integrated Food Safety Center of Excellence, which is supported by the Epidemiology and Laboratory Capacity for Infectious Disease Cooperative Agreement (CK19-1904) through the Centers for Disease Control and Prevention.
