A Prediction Tool to Identify the Causative Agent of Enteric Disease Outbreaks Using Outbreak Surveillance Data

Abstract

Information on the causative agent in an enteric disease outbreak can be used to generate hypotheses about the route of transmission and possible vehicles, to guide environmental assessments, and to target outbreak control measures. However, only about 40% of outbreaks reported in the United States include a confirmed etiology. The goal of this project was to identify clinical and demographic characteristics that can be used to predict the causative agent in an enteric disease outbreak and to use these data to develop an online tool for investigators to use during an outbreak when hypothesizing about the causative agent. Using data on enteric disease outbreaks from all transmission routes (animal contact, environmental contamination, foodborne, person-to-person, waterborne, unknown) reported to the U.S. Centers for Disease Control and Prevention, we developed random forest models to predict the etiology of an outbreak based on aggregated clinical and demographic characteristics at both the etiology category (i.e., bacteria, parasites, toxins, viruses) and individual etiology (Clostridium perfringens, Campylobacter, Cryptosporidium, norovirus, Salmonella, Shiga toxin–producing Escherichia coli, and Shigella) levels. The etiology category model had a kappa of 0.85 and an accuracy of 0.92, whereas the etiology-specific model had a kappa of 0.75 and an accuracy of 0.86. The highest sensitivities in the etiology category model were for bacteria and viruses; all categories had high specificities (>0.90). For the etiology-specific model, norovirus and Salmonella had the highest sensitivity and all etiologies had high specificities. When laboratory confirmation is unavailable, information on the clinical signs and symptoms reported by people associated with the outbreak, with other characteristics including case demographics and illness severity, can be used to predict the etiology or etiology category. 
An online publicly available tool was developed to assist investigators in their enteric disease outbreak investigations.

Background

Enteric disease outbreaks are common in the United States with ∼3500 outbreaks reported annually (CDC, 2022). Because many enteric etiologies cause similar gastrointestinal symptoms, identifying the causative bacterial, viral, or parasitic agent is important for solving the outbreak and preventing further illnesses. Specifically, information on the etiology can generate hypotheses about the route of transmission and possible vehicles, guiding environmental assessments and control measures (White et al., 2022; White et al., 2021). Outbreak data with information on the etiology and implicated vehicle are also used to prioritize policy and other interventions aimed at reducing the incidence of enteric illness nationally. Specifically, the Interagency Food Safety Analytics Collaboration (IFSAC) uses foodborne outbreak data to attribute etiology-specific illness estimates to specific foods, providing data for policy development and risk-based decision-making (Batz et al., 2021).

Despite the importance of identifying the causative agent, only ∼40% of reported outbreaks include a confirmed etiology, and ∼33% include a suspected etiology (CDC, 2022). Confirming the etiology typically requires the isolation of the organism from the stool specimens of two or more people associated with the outbreak. This can be challenging for several reasons, including hesitancy of ill people to provide a stool sample and limited public health resources to facilitate specimen collection, storage, and transportation (Torok et al., 2022).

While laboratory testing is the most definitive way to identify the etiology, clinical information on the frequency of specific signs and symptoms (e.g., vomiting, diarrhea) and incubation period has been shown to differentiate between bacteria and norovirus (Hall et al., 2001; Hedberg et al., 2008; Kaplan et al., 1982; Turcios et al., 2006). Other characteristics, including case demographics and illness severity, also differ by etiology (Lund and O'Brien, 2011; Wikswo et al., 2022). The goal of this project was to identify clinical and demographic characteristics that can be used to predict the etiology in an enteric disease outbreak and to use these data to develop an online tool for investigators to use during an outbreak when hypothesizing about the etiology.

Materials and Methods

Data source

We used data on enteric disease outbreaks reported by state, local, territorial, and federal public health agencies through the Centers for Disease Control and Prevention's electronic Foodborne Outbreak Reporting System (eFORS) from 1998 to 2008 (foodborne only) and the National Outbreak Reporting System (NORS) from 2009 to 2020 (all transmission routes: animal contact, environmental contamination, foodborne, person-to-person, waterborne, unknown). Outbreak reports included information on the first reported illness onset date, outbreak setting, aggregated age group (<5 years, ≥50 years), aggregated sex, number of hospitalizations, duration of illness (shortest, median, longest), incubation period (shortest, median, longest), confirmed or suspected etiology, the percentage of outbreak-associated illnesses with specific signs and symptoms (diarrhea, bloody diarrhea, abdominal cramps, fever, and vomiting), and outbreak vehicle (e.g., ground beef, turtles). Data were downloaded on June 6, 2022, and are subject to change.

Data analysis

Outbreaks with a confirmed, single genus etiology were included in the analysis, excluding those caused by Clostridioides difficile, a predominately hospital-acquired infection, and etiologies that present with distinctive or primarily non-enteric symptoms (ciguatoxin; Clostridium botulinum; hepatitis virus; Legionella; Listeria; Salmonella serotypes Typhi, Paratyphi A, and Paratyphi B; and scombroid toxin).

Candidate predictors of outbreak etiology included those likely to be known early in an investigation: percentage of cases with specific signs and symptoms of enteric infection, percentage of cases who were hospitalized, percentage of cases <5 or ≥50 years of age, percentage of cases who were female, shortest duration of illness, and shortest incubation period. We used the shortest duration and incubation period because they are more likely to be known early in the investigation than the median.

We replaced missing values of the percentage of bloody stools for norovirus with 1%, based on published estimates (Wikswo et al., 2022). We replaced missing values for the shortest incubation period and shortest duration of illness with the median value reported for that outbreak, if available. We categorized the incubation period (<8 h, 8 h to 3 days, >3 days, unknown) and duration (<12 h, 12 h to <1.5 days, 1.5–3.5 days, ≥3.5 days, unknown). Rather than excluding outbreaks with a missing incubation period or duration of illness, we added an unknown category for each because unknown values could be predictive. Missing data remaining after the replacements described above were imputed using a single random forest imputation.
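The fallback and categorization rules above can be sketched in a few lines. This is an illustrative Python sketch rather than the authors' R code: the function names are hypothetical, values are assumed to be recorded in days, and because the published bins 1.5–3.5 days and ≥3.5 days share a boundary, the sketch assumes that exactly 3.5 days falls in the upper bin.

```python
from typing import Optional

EIGHT_HOURS_IN_DAYS = 8 / 24  # incubation cut point expressed in days


def _first_known(shortest: Optional[float], median: Optional[float]) -> Optional[float]:
    """Use the shortest value; fall back to the outbreak's median if it is missing."""
    return shortest if shortest is not None else median


def categorize_incubation(shortest_days: Optional[float],
                          median_days: Optional[float] = None) -> str:
    """Bin the incubation period into <8 h, 8 h to 3 days, >3 days, or unknown."""
    value = _first_known(shortest_days, median_days)
    if value is None:
        return "unknown"
    if value < EIGHT_HOURS_IN_DAYS:
        return "<8 h"
    if value <= 3:
        return "8 h to 3 days"
    return ">3 days"


def categorize_duration(shortest_days: Optional[float],
                        median_days: Optional[float] = None) -> str:
    """Bin illness duration; exactly 3.5 days goes to the upper bin (assumption)."""
    value = _first_known(shortest_days, median_days)
    if value is None:
        return "unknown"
    if value < 0.5:
        return "<12 h"
    if value < 1.5:
        return "12 h to <1.5 days"
    if value < 3.5:
        return "1.5–3.5 days"
    return "≥3.5 days"


print(categorize_incubation(0.2))         # shortest value known
print(categorize_incubation(None, 7.0))   # falls back to the median
print(categorize_duration(None, None))    # neither value reported
```

Keeping "unknown" as its own level, instead of dropping the record, lets a classifier learn from the fact that a value was not reported.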

The imputed data were split into a 70% training set and a 30% testing set, stratified by etiology. Two random forest models, each with 500 trees, were developed using the training set (Breiman, 2001). The first model predicted the etiology category (i.e., bacteria, parasite, toxin, virus). The second model predicted specific etiologies. We limited the etiology category model to enteric etiologies with ≥50 outbreaks and the etiology-specific model to enteric etiologies with ≥300 outbreaks to exclude rare etiologies.
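A stratified split keeps the mix of etiologies roughly the same in the training and testing sets, which matters when one etiology dominates the data. The following is a minimal standard-library Python sketch of a 70/30 split stratified by label. The function name, the seed, and the toy label counts are hypothetical, and it is not the procedure the authors ran in R.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_indices, test_indices), splitting each label separately."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_indices, test_indices = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_indices.extend(indices[:cut])
        test_indices.extend(indices[cut:])
    return sorted(train_indices), sorted(test_indices)


# Toy, made-up label counts with one dominant class.
labels = ["norovirus"] * 100 + ["Salmonella"] * 30 + ["Shigella"] * 10
train, test = stratified_split(labels)
print(len(train), len(test))
```

Because the cut is made within each label, even the rarest etiology in the toy data contributes to both sets.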

Each outbreak in the training dataset for both models was weighted based on the frequency of the etiology to account for unbalanced data. Each model was evaluated on accuracy and kappa using the mlr (2.19.0) package in R (Bischl et al., 2016). Each etiology category and each etiology were evaluated by sensitivity and specificity for the first predicted etiology and by sensitivity for the first or second predicted etiology. The models were developed into an online tool using R Shiny (Chang et al., 2022). All analyses were conducted using R 4.2.1 (R Core Team, 2022). Random forest imputation, modeling, and variable importance used the R package randomForestSRC (3.1.0) (Ishwaran and Kogalur, 2022).
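The text does not state the exact weighting formula, so the sketch below assumes one common choice: inverse-frequency weights scaled so that the weights sum to the number of outbreaks. The function name and the toy labels are hypothetical, and the scheme should be read as an illustration of frequency-based weighting, not as the authors' implementation.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rarer classes get larger weights.

    n is the number of records and k the number of distinct classes.
    This particular formula is an assumption made for illustration.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


# Toy, made-up training labels with an unbalanced class mix.
toy_labels = ["virus"] * 6 + ["bacteria"] * 3 + ["parasite"] * 1
weights = inverse_frequency_weights(toy_labels)
for label, weight in sorted(weights.items()):
    print(label, round(weight, 3))
```

With this scaling, every class carries the same total weight, so a dominant class such as norovirus cannot drive the fit simply by being numerous.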

Results

From 1998 to 2020, 56,005 outbreaks were reported, of which 22,149 had a confirmed, single genus etiology. The final dataset included 20,382 outbreaks, after excluding 1767 caused by an excluded etiology. There were 14 etiologies with ≥50 confirmed etiology outbreaks included in the etiology category model (20,204 outbreaks) and 7 etiologies with ≥300 confirmed etiology outbreaks included in the etiology-specific model (19,272 outbreaks) (Tables 1 and 2).

Table 1.

Aggregated Outbreak Signs, Symptoms, Demographics, and Incubation Period by Major Pathogen Category, National Outbreak Reporting System, 1998–2020

Characteristic	Bacteria, n = 6291 outbreaks			Bacterial toxins, n = 752 outbreaks			Parasites, n = 714 outbreaks			Viruses, n = 12,447 outbreaks
Characteristic	Median	IQR	Missing, %	Median	IQR	Missing, %	Median	IQR	Missing, %	Median	IQR	Missing, %
Sign/symptom, %
Bloody stools	28	3–50	35	0	0–5	35	0	0–0	46	1	0–1	62
Cramps	85	60–100	30	79	64–93	12	81	57–100	17	57	11–80	45
Diarrhea	100	100–100	17	100	86–100	8	100	100–100	7	86	74–97	18
Fever	65	40–90	26	5	0–15	24	35	14–50	25	22	7–43	37
Nausea	60	43–81	80	58	37–86	63	70	50–100	63	71	38–91	66
Vomiting	40	20–57	27	25	7–80	14	41	22–62	18	72	58–87	18
Hospitalized, %	12	0–33	10	0	0–0	15	0	0–1	7	0	0–2	15
Demographics, %
Female	50	39–67	9	50	35–66	18	50	38–67	4	63	42–78	17
Age <5 years	0	0–20	2	0	0–0	5	0	0–33	3	0	0–0	15
Age ≥50 years	3	0–32	2	13	0–34	3	0	0–11	5	24	0–67	15
Incubation period, days
Shortest	1.0	0.4–2.0	63	0.1	0.0–0.2	14	4.0	2.0–6.0	67	0.6	0.3–1.0	72
Median	2.0	1.0–3.0	66	0.4	0.2–0.5	16	7.0	6.0–8.0	68	1.4	1.2–1.5	73
Longest	4.0	2.0–6.0	63	0.6	0.3–1.0	15	10.0	7.9–13.0	67	2.0	1.7–2.7	72
Illness duration, days
Shortest	3.0	1.0–5.0	54	0.2	0.1–0.5	31	4.0	2.0–8.0	56	1.0	0.3–1.0	43
Median	6.0	4.0–7.5	57	1.0	0.5–1.1	30	9.0	6.0–14.0	57	1.8	1.0–2.0	43
Longest	10.0	7.0–14.0	54	2.0	1.0–3.0	30	15.0	10.4–24.0	56	4.0	2.4–5.0	43

Etiologies with <50 outbreaks included: Acanthamoeba, adenovirus, amnesic shellfish poison, Anisakis, astrovirus, avian schistosomes, Brucella, chloramines, chlorine, chlorine gas, cleaning agents, copper, cyanotoxin, Enterococcus, heavy metals, Leptospira, Microcystis, monosodium glutamate, Mycobacterium, mycotoxins, neurotoxic shellfish poison, nitrite, Pantoea, paralytic shellfish poison, pesticides, plant/herbal toxins, Pseudomonas, puffer fish tetrodotoxin, Streptococcus, Toxoplasma, Trichinella, Yersinia, other, unknown.

IQR, interquartile range.

Table 2.

Aggregated Outbreak Signs, Symptoms, Demographics, Incubation Period, and Illness Duration by Pathogen Genus, National Outbreak Reporting System, 1998–2020

Characteristic	Bacteria
	Salmonella, n = 3521 outbreaks			Shigella, n = 1082 outbreaks			STEC, n = 946 outbreaks			Campylobacter, n = 602 outbreaks
	Median	IQR	Missing, %	Median	IQR	Missing, %	Median	IQR	Missing, %	Median	IQR	Missing, %
Signs/symptom, %
Bloody stools	20	0–46	39	25	6–50	34	75	50–100	25	25	0–50	28
Cramps	83	63–100	33	75	49–100	26	95	67–100	33	92	67–100	16
Diarrhea	100	97–100	20	100	100–100	9	100	100–100	20	100	100–100	7
Fever	68	50–92	28	64	43–100	20	33	13–50	35	74	50–100	14
Nausea	65	50–82	78	41	24–73	79	63	44–82	87	61	50–89	72
Vomiting	40	25–60	29	36	17–53	23	38	20–55	32	31	9–50	19
Hospitalized, %	17	0–35	11	0	0–10	10	33	4–50	8	0	0–13	6
Demographics, %
Female	52	40–67	10	50	36–67	7	54	40–73	7	50	33–67	7
Age <5 years	0	0–9	1	32	0–71	4	0	0–36	1	0	0–7	2
Age ≥50 years	14	0–36	1	0	0–0	4	0	0–20	1	4	0–40	2
Incubation period, days
Shortest	0.8	0.3–1.0	57	1.0	0.6–1.4	84	2.0	1.0–3.0	67	2.0	1.0–2.5	55
Median	1.5	1.0–2.5	61	2.0	1.5–2.1	85	3.8	3.0–4.4	70	3.0	2.0–3.5	59
Longest	3.4	2.0–6.0	58	3.4	2.4–5.0	84	6.0	4.0–8.0	67	4.1	3.0–6.0	56
Illness duration, days
Shortest	3.0	1.0–5.0	53	2.0	1.0–4.0	59	3.0	2.0–5.0	59	3.0	2.0–6.0	45
Median	6.0	4.0–7.6	56	5.0	4.0–7.0	61	6.0	4.0–7.5	63	6.0	4.7–7.5	49
Longest	10.0	7.0–14.0	53	10.0	7.0–14.0	59	9.0	6.9–14.0	59	10.0	7.0–14.0	45

Characteristic	Bacterial toxins			Parasite			Virus
	Clostridium perfringens, n = 428 outbreaks			Cryptosporidium, n = 429 outbreaks			Norovirus, n = 12,264 outbreaks
	Median	IQR	Missing, %	Median	IQR	Missing, %	Median	IQR	Missing, %
Signs/symptom, %
Bloody stools	0	0–6	35	0	0–4	41	1	0–1	61
Cramps	81	69–94	9	85	60–100	17	57	10–80	44
Diarrhea	100	95–100	6	100	100–100	5	85	74–97	18
Fever	5	0–11	25	40	20–51	21	22	7–43	37
Nausea	46	33–60	61	63	50–100	68	71	38–91	66
Vomiting	9	0–19	16	50	29–67	14	73	59–87	18
Hospitalized, %	0	0–0	16	0	0–4	7	0	0–2	15
Demographics, %
Female	50	33–64	15	50	38–67	3	63	42–78	17
Age <5 years	0	0–0	5	11	0–37	4	0	0–0	15
Age ≥50 years	16	0–36	5	0	0–0	4	25	0–67	15
Incubation period, days
Shortest	0.2	0.1–0.3	13	4.0	2.0–6.0	65	0.6	0.3–1.0	72
Median	0.4	0.4–0.5	14	6.0	5.5–8.0	65	1.4	1.2–1.5	72
Longest	0.8	0.6–1.1	14	10.0	7.0–13.0	66	2.0	1.7–2.7	72
Illness duration, days
Shortest	0.2	0.1–0.5	26	4.0	2.0–7.0	51	1.0	0.3–1.0	43
Median	1.0	0.6–1.1	27	7.5	6.0–10.5	49	1.8	1.0–2.0	43
Longest	2.0	1.5–3.0	23	14.0	10.0–18.0	49	4.0	2.3–5.0	43

IQR, interquartile range; STEC, Shiga toxin–producing Escherichia coli.

Among etiology categories, parasitic outbreaks had the longest incubation period (median 7 days; interquartile range [IQR] 6–8 days), and bacterial toxin outbreaks had the shortest (median 0.4 days; IQR 0.2–0.5 days) (Table 1). Sex was evenly distributed for all etiology categories except viral outbreaks, in which the percentage of female cases was higher (median 63%; IQR 42–78%). The median hospitalization rate was 12% for bacterial outbreaks and 0% for the other categories. Bloody stools were more common among cases in bacterial outbreaks (median 28%) than in the other categories (0–1%). Viral outbreaks had lower proportions of cases with abdominal cramps (57% vs. bacterial 85%, bacterial toxins 79%, parasites 81%) and diarrhea (86% vs. all others 100%), but more vomiting (72% vs. 40%, 25%, 41%). The proportion of people reporting fever varied among etiology categories (bacteria 65%, parasites 35%, viruses 22%, and bacterial toxins 5%) (Table 1).

Symptoms of diarrhea and cramps were commonly reported among all cases except for those infected with norovirus. Campylobacter, Salmonella, and Shigella had similar median values for bloody stools (25%, 20%, 25%, respectively), whereas Shiga toxin–producing Escherichia coli (STEC) was higher (75%). Norovirus outbreaks had a higher percentage of patients experiencing vomiting (73%) than other etiologies. The median percentage hospitalized was above zero only for Salmonella (17%) and STEC (33%). The percentage of female cases was ∼50% for all etiologies except norovirus (63%). Shigella and Cryptosporidium had proportionally more outbreak-associated cases aged <5 years (32% and 11%); all other etiologies had a median of 0–1%. Salmonella, Clostridium perfringens, and norovirus had proportionally more outbreak-associated cases aged ≥50 years (14%, 16%, 25%, respectively) than the other etiologies (0–4%).

The training datasets for the etiology category and the etiology-specific model included 14,143 and 13,490 outbreaks, respectively, after excluding outbreaks with incomplete information. The testing datasets included 6061 outbreaks and 5782 outbreaks, respectively. The etiology category model had a kappa of 0.85 and an accuracy of 0.92, with high specificities for all categories but lower sensitivities for bacterial toxin and parasitic outbreaks (Table 3). The etiology-specific model had a kappa of 0.75 and an accuracy of 0.86. We combined O157 and non-O157 STEC in the etiology-specific model because the model performed worse when only including O157 (<300 non-O157 outbreaks).

Table 3.

Pathogen Category Model Comparing Predicted and Documented Pathogen Category

Documented	Predicted				Total
Documented	Bacteria	Parasite	Toxin	Virus	Total
Bacteria	1773 (94%)	47 (2%)	14 (1%)	54 (3%)	1888
Parasite	100 (47%)	97 (45%)	2 (1%)	15 (7%)	214
Toxin	42 (19%)	6 (3%)	146 (65%)	31 (14%)	225
Virus	107 (3%)	12 (0%)	33 (1%)	3582 (96%)	3734
Total	2022	162	195	3682	6061
Sensitivity	0.939	0.453	0.649	0.959
Specificity	0.940	0.989	0.992	0.957
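
The accuracy, kappa, sensitivity, and specificity values reported for the etiology category model can be recomputed from the counts in Table 3. The Python sketch below does this with the standard library only. The function and variable names are hypothetical, and the calculation uses Cohen's kappa, which is assumed here to be the kappa the authors report.

```python
LABELS = ["Bacteria", "Parasite", "Toxin", "Virus"]

# Rows are documented categories, columns are predicted categories (Table 3).
TABLE_3 = [
    [1773, 47, 14, 54],
    [100, 97, 2, 15],
    [42, 6, 146, 31],
    [107, 12, 33, 3582],
]


def summarize(matrix):
    """Return accuracy, Cohen's kappa, and per-class sensitivity and specificity."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    accuracy = sum(matrix[i][i] for i in range(k)) / n
    # Agreement expected by chance, from the row and column margins.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)

    sensitivity = [matrix[i][i] / row_totals[i] for i in range(k)]
    specificity = []
    for i in range(k):
        negatives = n - row_totals[i]                 # documented as another class
        false_positives = col_totals[i] - matrix[i][i]
        specificity.append((negatives - false_positives) / negatives)
    return accuracy, kappa, sensitivity, specificity


accuracy, kappa, sensitivity, specificity = summarize(TABLE_3)
print(f"accuracy={accuracy:.2f} kappa={kappa:.2f}")
for label, sens, spec in zip(LABELS, sensitivity, specificity):
    print(f"{label}: sensitivity={sens:.3f} specificity={spec:.3f}")
```

Kappa is lower than accuracy because it discounts the agreement expected by chance, which is large here since viral outbreaks make up most of the testing set.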

Among specific etiologies, norovirus and Salmonella had the highest sensitivities, with 98% (3599) and 83% (873) of outbreaks correctly predicted, respectively (Table 4). Other etiologies had high specificities but lower sensitivities (specificity and sensitivity, respectively): Campylobacter (0.99 and 0.12), C. perfringens (1.00 and 0.75), Cryptosporidium (0.99 and 0.36), STEC (0.99 and 0.63), and Shigella (0.98 and 0.49). Many Campylobacter (64%), Shigella (37%), and STEC (23%) outbreaks were predicted as Salmonella; 22% of Cryptosporidium outbreaks were predicted as Salmonella and 22% as Shigella. Sensitivity increased for each etiology when considering the top two predicted etiologies (Table 4). The etiology-specific model conflicted with the etiology category model in <5% of predictions (e.g., the etiology category model predicted a parasite, but the etiology-specific model predicted Salmonella).

Table 4.

Pathogen-Specific Model Comparing Predicted and Documented Pathogen Genus

Documented	Predicted
Documented	Clostridium perfringens	Campylobacter	Cryptosporidium	Norovirus	Salmonella	Shigella	STEC	Total
C. perfringens	96 (75%)	1 (1%)	4 (3%)	8 (6%)	16 (13%)	2 (2%)	1 (1%)	128
Campylobacter	1 (1%)	22 (12%)	5 (3%)	3 (2%)	116 (64%)	21 (12%)	13 (7%)	181
Cryptosporidium	2 (2%)	10 (8%)	46 (36%)	9 (7%)	28 (22%)	28 (22%)	6 (5%)	129
Norovirus	2 (0%)	2 (0%)	7 (0%)	3599 (98%)	60 (2%)	3 (0%)	6 (0%)	3679
Salmonella	6 (1%)	32 (3%)	12 (1%)	38 (4%)	873 (83%)	53 (5%)	42 (4%)	1056
Shigella	1 (0%)	16 (5%)	5 (2%)	9 (3%)	119 (37%)	159 (49%)	16 (5%)	325
STEC	1 (0%)	7 (2%)	6 (2%)	2 (1%)	64 (23%)	23 (8%)	181 (64%)	284
Total	109	90	85	3668	1276	289	265	5782
First predicted pathogen
Sensitivity	0.750	0.122	0.357	0.978	0.827	0.489	0.634
Specificity	0.998	0.988	0.993	0.967	0.915	0.976	0.985
First and second predicted pathogens
Sensitivity	0.836	0.480	0.605	0.990	0.956	0.745	0.803

STEC, Shiga toxin–producing Escherichia coli.
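
The "first and second predicted pathogens" sensitivity in Table 4 counts an outbreak as correctly classified when the documented etiology is among the two most probable predictions. The Python sketch below shows that calculation on three made-up outbreaks. The function name and all probabilities are hypothetical and are not model output.

```python
def top_k_sensitivity(documented, probabilities, k=2):
    """Share of outbreaks per documented label found among the k most probable labels."""
    hits, totals = {}, {}
    for truth, probs in zip(documented, probabilities):
        ranked = sorted(probs, key=probs.get, reverse=True)[:k]
        totals[truth] = totals.get(truth, 0) + 1
        hits[truth] = hits.get(truth, 0) + (truth in ranked)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up predicted probabilities for three outbreaks.
documented = ["Campylobacter", "Campylobacter", "Salmonella"]
probabilities = [
    {"Salmonella": 0.55, "Campylobacter": 0.30, "Shigella": 0.15},
    {"Salmonella": 0.50, "Shigella": 0.35, "Campylobacter": 0.15},
    {"Salmonella": 0.70, "Campylobacter": 0.20, "Shigella": 0.10},
]

top_one = top_k_sensitivity(documented, probabilities, k=1)
top_two = top_k_sensitivity(documented, probabilities, k=2)
print(top_one)
print(top_two)
```

In this toy example, the first Campylobacter outbreak is missed by the top prediction but recovered by the second, which mirrors the pattern in Table 4 where Salmonella is often ranked first for other bacterial etiologies.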

The random forest importance of each predictor in the etiology category model was highest for bloody stools (0.42), followed by vomiting (0.28), shortest incubation period (0.26), shortest illness duration (0.13), fever (0.07), age <5 years (0.07), hospitalization rate (0.06), diarrhea (0.05), abdominal cramps (0.03), percentage of female (0.03), and age ≥50 years (0.02). For the etiology-specific model, the variable importance from highest to lowest was bloody stools (0.57), vomiting (0.32), shortest incubation period (0.25), shortest illness duration (0.17), age <5 years (0.15), hospitalization rate (0.12), fever (0.09), percentage of female (0.03), abdominal cramps (0.02), age ≥50 years (0.02), and diarrhea (0.02).

We adapted both models into an online publicly available tool for investigators to use during an outbreak investigation when the etiology is unknown (available at https://coe-foodsafety.shinyapps.io/pathogen-prediction). All inputs (percentage of bloody stools, cramps, fever, vomiting, diarrhea, hospitalized, female, age <5 years, and age ≥50 years; shortest incubation period; and shortest duration of illness) are required. After an investigator inputs information about the outbreak, the tool displays two bar graphs with the probability of each etiology category and each specific etiology (Fig. 1).

FIG. 1.

A screenshot of the online tool with an example outbreak of Clostridium perfringens.

Discussion

Knowing the outbreak etiology during an enteric disease outbreak investigation is important for guiding control and prevention efforts and preventing additional illnesses. When laboratory confirmation is unavailable, information on the clinical signs and symptoms reported by people associated with the outbreak, with other characteristics including case demographics and illness severity, can be used to predict the etiology or etiology category. Using data from previous enteric disease outbreaks, we developed an online tool for investigators to use during an outbreak, providing a data-driven approach to predicting the causative agent.

Viral outbreaks in the etiology category model and norovirus outbreaks in the etiology-specific model can be predicted with high accuracy. This is important because many outbreaks likely to be caused by norovirus do not have an etiology confirmed, yet clear and effective interventions exist to reduce exposure and control the spread of illness (Kambhampati et al., 2015).

Norovirus, which accounts for most viral outbreaks, has a distinctive clinical profile with lower levels of diarrhea and abdominal cramps and higher levels of vomiting compared with other etiologies. For this reason, previous studies have also reported being able to distinguish norovirus from other groups of etiologies, such as diarrheal toxin-like and Salmonella-like illnesses (Hall et al., 2001; Hedberg et al., 2008; Kaplan et al., 1982; Turcios et al., 2006). Norovirus outbreaks also included a greater percentage of females. This finding likely reflects the demographic profile of residents and health care workers at long-term care facilities, a common setting for norovirus outbreaks, and demonstrates how demographic profiles can sometimes be used as a proxy for setting in an outbreak investigation (Caffrey et al., 2021; Calderwood et al., 2021).

Bacterial outbreaks were also predicted well at the category level; at the etiology level, this was true only for Salmonella and STEC outbreaks. STEC was likely predicted well because it causes bloody stools more frequently than other etiologies, and bloody stools were identified as an important predictor in both the etiology category model and the etiology-specific model. Salmonella was also well predicted, but it has a similar profile to other bacterial etiologies. Many outbreaks caused by Campylobacter and Shigella were predicted as Salmonella, which is more frequently identified as the etiology in outbreaks. To address this, previous studies looking at clinical profiles have grouped Salmonella together with other bacterial etiologies causing a similar illness. For example, Hedberg et al. (2008) combined Salmonella, Shigella, and Campylobacter into a single “Salmonella-like illness.”

However, with additional information, it is possible for investigators to differentiate between these etiologies. For example, an outbreak among children at a day care center may lead investigators to suspect Shigella, whereas Campylobacter may be suspected in an outbreak among raw milk consumers. While our tool does not currently include setting as a predictor, differentiating between bacterial etiologies may assist in narrowing exposures and implementing control measures; if laboratory testing is not conducted by the end of the outbreak, a suspected etiology can still be useful in research. Suspecting an etiology before laboratory results are available can help indicate whether exclusion policies need to be implemented. The online tool shows the probabilities for both models; the investigator can determine which model is more appropriate given the context of the outbreak.

The models had lower sensitivities for bacterial toxin and parasitic outbreaks and their associated etiologies, likely due to fewer outbreaks. One of the distinguishing features of bacterial toxin and parasitic outbreaks is the incubation period; toxins have short incubation periods and parasites have long incubation periods. Using the shortest incubation period caused these otherwise distinct incubation periods to overlap with those of bacterial outbreaks, making the categories difficult to distinguish; however, we chose the shortest incubation period because it is more likely to be known early in an investigation. Previous profiles have not included parasites in the predictions as their own category (Dalton et al., 1999; Hall et al., 2001; Hedberg et al., 2008; Kaplan et al., 1982; Turcios et al., 2006).

Although parasitic outbreaks are less common and often have a characteristically long incubation period, suspecting a parasitic outbreak early in an investigation, before an exposure is known, could help narrow exposures for further investigation (Scallan et al., 2011). In some scenarios, having a predicted etiology category may be sufficient information to direct an investigation; hypothesizing that the etiology is a bacterial toxin rather than a virus is enough information to implement control measures and to hypothesize potential transmission modes and vehicles. Additional information about the outbreak not captured in the model can be used in conjunction with the model to suspect one category over another. For example, secondary cases could indicate a virus rather than a bacterial toxin.

The variables differed in their relative importance to the prediction in each model. Bloody diarrhea had the highest relative importance for both models and was a key predictor for identifying STEC and eliminating norovirus as the likely outbreak etiology. Incubation period had the next highest relative importance for the etiology category model; however, investigators may not know the incubation period during an investigation if the exposure is unknown. We accounted for this by including an unknown option for the incubation period. Vomiting was especially important in differentiating norovirus and viruses, which have a higher rate of vomiting than other etiologies. The percentage of cases aged <5 years was the next most important variable for the etiology-specific model. This age group is more common in Shigella and Cryptosporidium outbreaks, aiding in differentiating those two etiologies. Hospitalization rate was next for the etiology category model, which aided the most in differentiating viral outbreaks. Other tools did not include demographics, such as age, although age <5 years was around the middle of relative importance in our models.

The R Shiny tool is available at https://coe-foodsafety.shinyapps.io/pathogen-prediction. Investigators can use both models in the tool to assist with hypothesis generation. Investigators should incorporate additional aspects of the outbreak, such as geographic spread, to assist in differentiating between the potential predicted etiologies. The tool may be particularly helpful during outbreaks where it is not possible or practical to conduct laboratory testing for the etiology or when waiting for laboratory results. In an event-based outbreak, the investigator may only want to differentiate between a toxin and a virus to provide more targeted infection prevention education. Model performance improved when considering the top two predicted etiologies, substantially increasing the sensitivity for Campylobacter, Cryptosporidium, and Shigella. Investigators should consider the top two predicted etiologies, particularly when Salmonella is the first prediction, because Salmonella is overpredicted for Campylobacter, Cryptosporidium, and Shigella outbreaks.

This study is subject to several limitations. First, only a subset of etiologies implicated in past outbreaks was included in both models. While these are the most common enteric disease outbreak-associated etiologies, there are other potential outbreak etiologies not included, and investigators should not eliminate consideration of rare etiologies. The small proportion of conflicting results (e.g., predicted parasite and Salmonella) may be related to the subset of etiologies included, and additional etiologies should be considered. Second, the data reported through NORS are from completed, closed-out outbreak investigations, which may differ from an ongoing investigation and potentially bias the tool. Incomplete information may be all that is available during an investigation, so percentages could change as the investigation progresses. If known, the incubation period input should not change, because we used the shortest rather than the median incubation period and included an unknown option.

NORS also only collects information on outbreaks in the United States. Symptoms may be comparable internationally, but clinical characteristics such as hospitalizations may differ. In addition, we only included laboratory-confirmed outbreaks. Thus, the model is biased toward etiologies that are commonly isolated and tested, such as Salmonella. Other etiologies may be common, but less likely to have laboratory confirmation, such as bacterial toxins. We excluded suspected etiology outbreaks to minimize potential mistakes by the model in case the wrong etiology was suspected. We also did not include transmission mode. Transmission mode is likely predictive; however, it can be unclear what the transmission mode is early in an outbreak. Future models could account for additional variables such as transmission mode and setting while allowing for missing values.

We demonstrated success in predicting the etiology category and specific etiologies using random forest models and provided an online, publicly available tool for outbreak etiology prediction. Our online tool is intended to assist with the identification of the causative agent in the absence of laboratory testing. Investigators can use these models with additional information from the investigation to assist with hypothesis generation.

Footnotes

Acknowledgments

The authors would like to thank Conner Jackson and all the state and territorial health departments that report outbreaks through NORS.

Disclaimer

The findings and conclusions of this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.

Authors' Contributions

H.K. contributed to the project conception, data acquisition, design, analysis, and interpretation. A.W. contributed to the design and interpretation. B.B.B. contributed to the design, analysis, and interpretation. E.B.R. contributed to the analysis and interpretation. E.S.W. contributed to the design and interpretation. All authors contributed to the article draft and revisions. All authors provided approval for submission and agree to accountability.

Disclosure Statement

No competing financial interests exist.

Funding Information

This article was funded in part by the Colorado Integrated Food Safety Center of Excellence, which is supported by the Epidemiology and Laboratory Capacity for Infectious Disease Cooperative Agreement (CK19-1904) through the Centers for Disease Control and Prevention.

References

Batz, Richardson, Bazaco, et al. Recency-weighted statistical modeling approach to attribute illnesses caused by 4 pathogens to food sources using outbreak data, United States. Emerg Infect Dis, 2021; 27(1):214–222; doi: 10.3201/eid2701.203832

Bischl, Lang, Kotthoff, et al. mlr: Machine learning in R. J Mach Learn Res, 2016; 17(170):1–5.

Breiman. Random forests. Mach Learn, 2001; 45(1):5–32; doi: 10.1023/A:1010933404324

Caffrey, Sengupta, Melekin. Residential care community resident characteristics: United States, 2018. National Center for Health Statistics, Division of Health Care Statistics: Hyattsville, MD; 2021.

Calderwood, Wikswo, Mattison, et al. Norovirus outbreaks in long-term care facilities in the United States, 2009–2018: A decade of surveillance. Clin Infect Dis, 2021; 74(1):113–119; doi: 10.1093/cid/ciab808

CDC. National Outbreak Reporting System Dashboard. Atlanta, GA; 2022. Available from: wwwn.cdc.gov/norsdashboard [Last accessed: April 14, 2022].

Chang, Cheng, Allaire, et al. shiny: Web Application Framework for R. 2022.

Dalton, Mintz, Wells, et al. Outbreaks of enterotoxigenic Escherichia coli infection in American adults: A clinical and epidemiologic profile. Epidemiol Infect, 1999; 123(1):9–16; doi: 10.1017/s0950268899002526

Hall, Goulding, Bean, et al. Epidemiologic profiling: evaluating foodborne outbreaks for which no pathogen was isolated by routine laboratory testing: United States, 1982–9. Epidemiol Infect, 2001; 127(3):381–387; doi: 10.1017/s0950268801006161

Hedberg, Palazzi-Churas, Radke, et al. The use of clinical profiles in the investigation of foodborne outbreaks in restaurants: United States, 1982–1997. Epidemiol Infect, 2008; 136(1):65–72; doi: 10.1017/S0950268807008199

Ishwaran, Kogalur. Fast Unified Random Forests for Survival, Regression, and Classification (RF-SRC). 2022.

Kambhampati, Koopmans, Lopman. Burden of norovirus in healthcare facilities and strategies for outbreak control. J Hosp Infect, 2015; 89(4):296–301; doi: 10.1016/j.jhin.2015.01.011

Kaplan, Feldman, Campbell, et al. The frequency of a Norwalk-like pattern of illness in outbreaks of acute gastroenteritis. Am J Public Health, 1982; 72(12):1329–1332; doi: 10.2105/ajph.72.12.1329

Lund, O'Brien. The occurrence and prevention of foodborne disease in vulnerable people. Foodborne Pathog Dis, 2011; 8(9):961–973; doi: 10.1089/fpd.2011.0860

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing: Vienna, Austria; 2022.

Scallan, Hoekstra, Angulo, et al. Foodborne illness acquired in the United States—Major pathogens. Emerg Infect Dis, 2011; 17(1):7–15; doi: 10.3201/eid1701.p11101

Torok, White, Butterfield, et al. Barriers to stool specimen collection during foodborne and enteric illness outbreak investigations in Arizona and Colorado. J Food Prot, 2022; 2022:100012; doi: 10.1016/j.jfp.2022.11.004

Turcios, Widdowson, Sulka, et al. Reevaluation of epidemiological criteria for identifying outbreaks of acute gastroenteritis due to norovirus: United States, 1998–2000. Clin Infect Dis, 2006; 42(7):964–969; doi: 10.1086/500940

White, Jackson, Kisselburgh, et al. Using outbreak data for hypothesis generation: A vehicle prediction tool for disease outbreaks caused by Salmonella and Shiga toxin–producing Escherichia coli. Foodborne Pathog Dis, 2022; 19(4):281–289; doi: 10.1089/fpd.2021.0090

White, Smith, Booth, et al. Hypothesis generation during foodborne-illness outbreak investigations. Am J Epidemiol, 2021; 190(10):2188–2197; doi: 10.1093/aje/kwab118

Wikswo, Roberts, Marsh, et al. Enteric illness outbreaks reported through the National Outbreak Reporting System, United States, 2009–19. Clin Infect Dis, 2022; 74(11):1906–1913; doi: 10.1093/cid/ciab771