Abstract
Escherichia coli O157:H7 strains associated with several recent (2017–2020) multi-state outbreaks linked to leafy green vegetables have been characterized as “reoccurring, emerging, and persistent” (REP). Our recent unpublished work demonstrated that the REP strains had significantly enhanced potential for biofilm formation. In this study, comparative genomic analyses were conducted for a better understanding of the mechanisms behind the enhanced biofilm formation, and thereby potentially increased environmental fitness, by the REP strains. Phylogenetically, the recent outbreak strains formed two distinct clusters represented by REPEXH01 and REPEXH02. Compared with EDL933 and other previous outbreak reference strains, the REP strains (clustering with REPEXH02) exhibiting strong biofilm formation were found to have acquired two genes encoding proteins of unknown functions (hypothetical proteins) and lost certain prophage-related genes. In addition, several single nucleotide polymorphisms in genes related to biofilm formation were identified.
Introduction
Shiga toxin-producing Escherichia coli (STEC), most notably the serotype O157:H7, is a major public health concern due to its ability to cause severe foodborne infections, including hemolytic uremic syndrome. Since its first isolation in 1982, linked to two outbreaks of bloody diarrhea from undercooked hamburgers (Riley et al., 1983), Escherichia coli (E. coli) O157:H7 and other STEC strains have been linked to numerous foodborne outbreaks involving a variety of foods, particularly undercooked meat products. E. coli O157:H7 was even labeled the pathogen of “hamburger disease” (Eastern Health, 2019). However, the large-scale multi-state E. coli O157:H7 outbreak associated with spinach in 2006 (Centers for Disease Control and Prevention (CDC), 2006) challenged this perception and underscored the food safety risks of STEC contamination and persistence in fresh produce growing and processing environments. Recently, fresh produce has surpassed meat and meat products as the most frequent food vehicles associated with STEC outbreaks (Interagency Food Safety Analytics Collaboration (IFSAC), 2023).
Comparative genomic analyses have shown that the emergence of STEC was mostly driven by lateral gene transfer events mediated by prophages and plasmids (Zhang et al., 2007). As STECs evolve, new strains with apparently increased fitness for fresh produce production environments continue to emerge (Carter et al., 2014; Cherry, 2022). Between 2017 and 2020, a series of multi-state E. coli O157:H7 outbreaks were traced to Romaine lettuce or other leafy green products from the central production regions in California and Arizona (CDC, 2024). The isolates from these outbreaks, which are closely clustered genomically, have been characterized as “reoccurring, emerging, and persistent” (REP) by CDC and the Food and Drug Administration (FDA) (CDC, 2023; Chen et al., 2023). FDA has designated these REP STECs as a “reasonably foreseeable hazard” in California’s central leafy green production region and has called for increased longitudinal studies to better understand their persistence and to prevent future resurgence (FDA, 2021).
Our recent research (unpublished data, submission under review) compared several key features of selected REP strains with those of selected previous landmark outbreak strains (reference strains), focusing on biofilm formation, curli expression, acid tolerance, and cellulose biosynthesis. This phenotypic comparison demonstrated that the REP strains exhibited significantly increased curli and cellulose expression and potential for biofilm formation. These factors likely enhance the REP strains’ fitness in leafy green production environments, contributing to their persistence. Therefore, this study aims to examine the genomic basis for the enhanced biofilm formation and potentially other factors that may improve the fitness of the REP strains for persisting in fresh produce production environments.
Materials and Methods
Phylogenetic analysis
A total of 30 E. coli O157:H7 strains, isolated from various food production and clinical sources, were used for genomic analyses. The genome accession numbers and isolation sources are listed in Table 1. Sequencing data for all strains were downloaded from the NCBI SRA database (https://www.ncbi.nlm.nih.gov/sra/) and subsequently uploaded to EnteroBase. All raw reads were assembled in EnteroBase using SPAdes 3.9.0 to ensure uniformity. The phylogenetic relationships among the strains were inferred from their core genomes. A maximum-likelihood phylogenetic tree was constructed using Randomized Axelerated Maximum Likelihood (RAxML), based on non-repetitive core single nucleotide polymorphisms (SNPs), through the EnteroBase SNP project dendrogram module (Zhou et al., 2020), with E. coli O157:H7 strain EDL933 serving as the reference. Default parameters were used for the analysis.
Strains Used in This Study
Strains listed in bold indicate those previously tested for biofilm formation and used for pangenomic analyses in this study.
Pangenome and SNP analyses
To investigate the potential genomic features associated with the environmental persistence of recent outbreak strains, further analysis was conducted on the genomes of seven strains (2705C, 2705D, PNUSAE019890, EC4115, MOD1-EC4306, MOD1-EC4308, and EDL933) previously tested in our study (unpublished data). Assembled genomes were retrieved from NCBI. To identify genes that were variably distributed among genomes, the genome assemblies were annotated using Prokka (Seemann, 2014) with default settings. The resulting GFF files were uploaded to Roary to generate a pangenome matrix under default settings (Page et al., 2015). To predict protein localization within the cell, predicted gene sequences were translated, and the resulting amino acid sequences were uploaded to SignalP 6.0 to identify signal peptides (Teufel et al., 2024).
Core genome SNPs among the study genomes were identified using Parsnp from the Harvest package (Treangen et al., 2014), with EDL933 as the reference (-r) and all genomes included in the analysis (-c). The SNP locations from the Variant Call Format (VCF) file were then analyzed to determine which genes contained the SNPs, along with the resulting amino acid substitutions (S = synonymous, NS = nonsynonymous). Genes containing SNPs were identified by querying the nucleotide position in the annotated reference genome and are reported by the gene name and locus tag of the reference genome (EDL933). The NCBI genome accession numbers of strains 2705C, 2705D, PNUSAE019890, EC4115, MOD1-EC4306, MOD1-EC4308, and EDL933 are ABKIUR000000000.1, ABKIWW000000000.1, AARNWN000000000.1, NC_011353.1, NWPR00000000.1, NWPP00000000.1, and CP008957.1, respectively.
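The gene lookup and S/NS classification described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in this study: the toy genome, gene coordinates, and locus tag are hypothetical, only forward-strand genes are handled, and the standard genetic code is assumed.

```python
from typing import NamedTuple, Optional

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order map onto AMINO_ACIDS.
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


class Gene(NamedTuple):
    locus_tag: str  # hypothetical tag, not a real EDL933 locus
    start: int      # 1-based, inclusive, forward strand only
    end: int        # 1-based, inclusive


def find_gene(genes: list[Gene], position: int) -> Optional[Gene]:
    """Return the gene whose coordinates contain a 1-based genome position."""
    for gene in genes:
        if gene.start <= position <= gene.end:
            return gene
    return None  # the SNP is intergenic


def classify_snp(genome: str, gene: Gene, position: int, alt: str) -> tuple[str, str, str]:
    """Return (reference aa, alternate aa, class) for a forward-strand SNP."""
    offset = position - gene.start                     # 0-based offset in the CDS
    codon_start = gene.start - 1 + (offset // 3) * 3   # 0-based index in the genome
    ref_codon = genome[codon_start:codon_start + 3]
    idx = offset % 3
    alt_codon = ref_codon[:idx] + alt + ref_codon[idx + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "S"
    elif alt_aa == "*":
        kind = "NS (nonsense)"
    else:
        kind = "NS (missense)"
    return ref_aa, alt_aa, kind


# Toy reference: 3 intergenic bases followed by a 9-bp CDS (ATG CAA GGT = M Q G).
toy_genome = "TTT" + "ATGCAAGGT"
toy_genes = [Gene("TOY_0001", 4, 12)]

gene = find_gene(toy_genes, 7)
print(classify_snp(toy_genome, gene, 7, "T"))  # substitution turns CAA into TAA
```

Reverse-strand genes would need the codon and the alternate base reverse-complemented before lookup, which this sketch omits for brevity.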
Results and Discussion
Phylogenetic analysis
From 2017 to 2020, there were at least seven multi-state E. coli O157:H7 outbreaks linked to the consumption of contaminated Romaine lettuce or mixed leafy green products (CDC, 2024). Based on genomic sequence availability, selected isolates from these outbreaks were used for genomic analyses, along with selected isolates from environmental samples and strains from previous landmark outbreaks. The phylogenetic relationship of these 30 E. coli O157:H7 genomes is shown in a SNP-based dendrogram generated using EnteroBase (Fig. 1).

Fig. 1. Maximum-likelihood phylogenetic tree of 30 E. coli O157:H7 strains from various sources. The tree was based on RAxML of non-repetitive core SNPs using the EnteroBase SNP Project dendrogram module against the reference genome E. coli O157:H7 EDL933. Colors indicate different strain isolation sources or outbreaks. SNPs, single nucleotide polymorphisms.
These analyses identified two distinct clusters. The first group, mostly composed of isolates from the large 2018 outbreak implicating romaine lettuce from Yuma, Arizona (CDC, 2018; Stanton et al., 2020), as represented by REPEXH01 (PNUSAE013425) (CDC, 2023), closely clustered with the strain from the 2006 outbreak implicating spinach from Salinas, CA (Teng et al., 2020). It also distantly clustered with a strain (MOD1-EC4308) from a 2006 outbreak associated with lettuce (Yin et al., 2023). The second group, represented by REPEXH02 (PNUSAE020169) (Chen et al., 2023), formed a tight cluster encompassing isolates from outbreaks in 2018 and 2019 implicating romaine lettuce from California’s central coast production regions and isolates from FDA environmental sampling of nearby cattle feeding operations. These REP strains also showed loose clustering with a strain from a 2016 outbreak (PNUSAE005245) and with some strains from a 2006 outbreak associated with lettuce (TW14588 and MOD1-EC4306). The REP strains (2705C, 2705D, and PNUSAE019890) that demonstrated high biofilm formation (unpublished data) are all among the isolates in this cluster. Interestingly, the strain MOD1-EC4306, which was shown to be an intermediate biofilm producer, loosely clustered with these REP strains.
It is noteworthy that the REP strains in the REPEXH01 and REPEXH02 groups were phylogenetically distant. For example, strains in the REPEXH02 group showed closer genomic relatedness to EDL933 than to REPEXH01. Conversely, strains linked to the spring 2018 lettuce outbreak (including REPEXH01) exhibited higher genetic similarity to the strain associated with the 2006 spinach outbreak (EC4115) than to those in the REPEXH02 group.
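One simple way to see the kind of relatedness comparison made here is to count pairwise differences across core SNP sites. The sketch below does only that. It is not the RAxML maximum-likelihood inference used for the dendrogram, and the strain labels and alleles are invented for illustration.

```python
from itertools import combinations


def snp_distance(a: str, b: str) -> int:
    """Count the sites at which two equal-length core SNP strings differ."""
    if len(a) != len(b):
        raise ValueError("core SNP strings must cover the same sites")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical alleles at eight core SNP sites (not real data).
core_snps = {
    "ref_A": "ACGTACGT",
    "group1_x": "ACGTTTGA",
    "group2_y": "ACGAACGT",
}

# Pairwise distances keyed by the unordered pair of strain labels.
distances = {
    frozenset(pair): snp_distance(core_snps[pair[0]], core_snps[pair[1]])
    for pair in combinations(core_snps, 2)
}

for pair, d in distances.items():
    print(sorted(pair), d)
```

In this toy data, group2_y differs from ref_A at fewer sites than it does from group1_x, which mirrors the pattern described for REPEXH02, EDL933, and REPEXH01.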
REP strain SNP analyses
To better understand the environmental persistence of the REP strains, Parsnp (Treangen et al., 2014) was used to identify core genome SNPs among strains that exhibited strong biofilm formation (2705C, 2705D, and PNUSAE019890) when aligned with reference strains (EC4115, MOD1-EC4306, MOD1-EC4308, and EDL933) that showed weak or no biofilm formation. A total of 72 single-nucleotide changes were identified in the genomes of the REP strains compared to the reference strains (Supplementary Table S1), of which 37 resulted in nonsynonymous changes (missense or nonsense mutations) in the coding regions, leading to protein sequence alterations. Based on annotations of the affected proteins, these mutations could affect multiple cellular functions to varying degrees.
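The comparison between the two strain groups can be sketched as a filter over per-site allele calls. The criterion used below (all REP strains share one allele that no reference strain carries) is an assumption about how group-specific SNPs might be selected, and the positions and alleles are hypothetical. Only the strain names come from this study.

```python
REP = ["2705C", "2705D", "PNUSAE019890"]
REFERENCE = ["EC4115", "MOD1-EC4306", "MOD1-EC4308", "EDL933"]


def make_row(rep_alleles: str, ref_alleles: str) -> dict[str, str]:
    """Pair one allele character with each strain, REP strains first."""
    return dict(zip(REP + REFERENCE, rep_alleles + ref_alleles))


def group_specific_sites(calls: dict[int, dict[str, str]]) -> list[int]:
    """Return positions where all REP strains share an allele absent from every reference strain."""
    hits = []
    for pos, alleles in sorted(calls.items()):
        rep_alleles = {alleles[s] for s in REP}
        ref_alleles = {alleles[s] for s in REFERENCE}
        if len(rep_alleles) == 1 and rep_alleles.isdisjoint(ref_alleles):
            hits.append(pos)
    return hits


# Hypothetical calls keyed by reference position (not real data).
toy_calls = {
    100: make_row("TTT", "CCCC"),  # shared by all REP strains, absent from references
    250: make_row("AAA", "AGAA"),  # REP allele also found in reference strains
    400: make_row("GGA", "AAAA"),  # present in only some REP strains
}

print(group_specific_sites(toy_calls))
```

Under this criterion, a change carried by only some REP strains (position 400 in the toy data) is excluded.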
Based on the annotation information, several of these SNPs are considered most likely to be consequential to biofilm formation and the overall environmental fitness of REP strains (Table 2). A nonsense mutation in fdeC would result in a truncated reverse autotransporter adhesin FdeC (Cherry, 2022). Disruption of fdeC has been associated with increased adhesion to mammalian cells and enhanced mobility (Aleksandrowicz et al., 2024). Two missense mutations in the gene encoding an Ig-like domain-containing protein would lead to two single amino acid changes in the protein. Ig-like domains in E. coli are commonly found in cell surface proteins, functioning as structural components of pilus and non-pilus fimbrial systems, as well as members of the intimin/invasin family of outer membrane adhesins (Bodelón et al., 2013). These domains play critical roles in host cell adhesion and invasion by pathogenic strains (Bodelón et al., 2013). These SNPs have been previously identified among REP strains (Cherry, 2022).
Unique Mutations in the REP Strains Associated with Biofilm Formation (b)
(a) The start, end, and nucleotide position indicate the location of the gene within the genome of strain EDL933, with the accession number NZ_CP008957.1.
(b) The REP strains include 2705C, 2705D, and PNUSAE019890.
*An asterisk (“*”) indicates a stop codon at the corresponding nucleotide positions.
REP, reoccurring, emerging, and persistent.
SNPs resulting in missense mutations were identified in rcsC and bcsA, which encode proteins involved in the production of extracellular polysaccharides (EPS). RcsC is a two-component system sensor histidine kinase that regulates the production and secretion of colanic acid, an EPS important for biofilm maturation (Ferrières and Clarke, 2003). BcsA is a uridine 5′-diphosphate (UDP)-forming cellulose synthase catalytic subunit responsible for synthesizing and secreting cellulose, the main component of EPS in bacterial biofilms (Acheson et al., 2021). These mutations could play roles in the enhanced curli and cellulose production and biofilm formation among the REP strains. Cherry (2022) also identified an SNP among the “Santa Marina Clade” of the REP strains in the arsR gene, which encodes a repressor of genes required for resistance to arsenic and antimony. This nonsense mutation, which truncates the ArsR protein, is expected to enhance the tolerance of the REP strains to soil arsenic in the production regions. However, this SNP is only present in the REP strain PNUSAE019890 (a strain in the “Santa Marina Clade”) and is absent in 2705C and 2705D (strains clustered with those in the “Salinas Clade”).
REP strain accessory genome
During ongoing microbial evolution, bacterial strains continuously gain or lose non-essential genes that alter cell functions. Therefore, in addition to mutations in the core genome, changes in the REP strain accessory genome could also impact their survival or fitness in various environments. Roary (Page et al., 2015) was used to identify the unique genes present or absent in the REP strains compared to the reference strains.
These analyses showed that two genes were exclusively present in the REP strains with enhanced biofilm formation (unpublished data) (Table 3). Both genes were annotated as “hypothetical proteins” with unknown functions, necessitating further laboratory studies to determine their roles in environmental persistence. A signal peptide could not be identified in these predicted proteins, indicating that, if functional, these proteins may not be secreted or transported to specific cellular locations but rather function within the cell. The REP strains were also determined to lack 11 genes that were present in the reference strain pangenome. Those with annotated functions encode proteins involved in prophage activities.
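The presence/absence comparison can be sketched as a filter over a pangenome matrix. The table below is a simplified, hypothetical stand-in for such a matrix (a real Roary output carries additional summary columns), with an empty cell meaning the gene is absent from that genome. The strict "all versus none" criterion is an assumption made for illustration.

```python
import csv
import io

REP = ["2705C", "2705D", "PNUSAE019890"]
REFERENCE = ["EC4115", "MOD1-EC4306", "MOD1-EC4308", "EDL933"]

# Hypothetical gene clusters and locus tags; an empty cell means "absent".
toy_csv = """Gene,Annotation,2705C,2705D,PNUSAE019890,EC4115,MOD1-EC4306,MOD1-EC4308,EDL933
group_1,hypothetical protein,a_1,b_1,c_1,,,,
group_2,phage tail protein,,,,d_2,e_2,f_2,g_2
group_3,hypothetical protein,a_3,,c_3,,,,
group_4,core enzyme,a_4,b_4,c_4,d_4,e_4,f_4,g_4
"""


def exclusive_genes(rows: list[dict[str, str]]) -> tuple[list[str], list[str]]:
    """Split gene clusters into REP-only and reference-only lists."""
    only_rep, only_ref = [], []
    for row in rows:
        in_rep = [bool(row[s]) for s in REP]
        in_ref = [bool(row[s]) for s in REFERENCE]
        if all(in_rep) and not any(in_ref):
            only_rep.append(row["Gene"])   # present in every REP strain only
        elif all(in_ref) and not any(in_rep):
            only_ref.append(row["Gene"])   # missing from every REP strain
    return only_rep, only_ref


rows = list(csv.DictReader(io.StringIO(toy_csv)))
gained, lost = exclusive_genes(rows)
print("REP-only:", gained)
print("Reference-only:", lost)
```

Clusters found in only some strains of a group (group_3 here) fall into neither list under this criterion.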
Genes That Are Exclusively Present or Absent in the REP Strains
The REP strains include 2705C, 2705D, and PNUSAE019890, while the reference strains consist of EC4115, MOD1-EC4306, MOD1-EC4308, and EDL933.
Y and N indicate presence and absence, respectively.
REP, reoccurring, emerging, and persistent.
Comparing the REP strain genomes with the reference strains, Roary also identified 12 sequences common among the REP strains as potential redundant copies of core genome genes (Table 4). These genes are associated with various cellular functions, including antibiotic resistance (ampC), stress tolerance (blc), nutrient transport and utilization (pepD and phoE), polypeptide biosynthesis (prfB), protein synthesis (rsgA, efp, and epmB), toxin production (stxA and stxB), curli formation (crl), and nucleic acid synthesis (gpt). Crl is a global regulator that positively regulates the activity of the alternative sigma factor RpoS (Xu et al., 2019), which plays a central role in bacterial stress responses. It also directly stimulates the transcription of csgBA, genes encoding the subunits of curli (Arnqvist et al., 1992). The presence of a redundant copy of crl would be consistent with the observation of increased curli and cellulose production among the REP strains. However, these REP strain genomes were not closed but rather consisted of multiple contigs, and therefore, the presence of redundant copies of these genes needs further validation.
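A copy-number comparison of this kind can be sketched by counting annotated gene names per genome. The sketch assumes that repeated gene names carry numeric suffixes such as crl_1 and crl_2 in the annotation output. The genome labels and gene lists are hypothetical, and this is not necessarily how Roary reported the sequences in Table 4.

```python
import re
from collections import Counter


def copy_numbers(gene_names: list[str]) -> Counter:
    """Count copies per gene after stripping numeric suffixes such as 'crl_2'."""
    return Counter(re.sub(r"_\d+$", "", name) for name in gene_names)


def extra_copy_genes(rep: dict[str, list[str]], ref: dict[str, list[str]]) -> list[str]:
    """Return genes with more than one copy in every REP genome and at most one in every reference genome."""
    rep_counts = [copy_numbers(names) for names in rep.values()]
    ref_counts = [copy_numbers(names) for names in ref.values()]
    candidates = set.intersection(*(set(c) for c in rep_counts))
    return sorted(
        gene for gene in candidates
        if all(c[gene] > 1 for c in rep_counts)
        and all(c[gene] <= 1 for c in ref_counts)
    )


# Hypothetical annotations (not real data).
rep_toy = {
    "rep_1": ["crl_1", "crl_2", "gpt", "ampC"],
    "rep_2": ["crl_1", "crl_2", "gpt_1", "gpt_2", "ampC"],
}
ref_toy = {"ref_1": ["crl", "gpt", "ampC"]}

print(extra_copy_genes(rep_toy, ref_toy))
```

In a fragmented assembly, a single gene split across two contigs could be counted twice by such a tally, which is one reason apparent extra copies call for validation against closed genomes.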
Genes with Redundant Copies in the REP Strains
The REP strains include 2705C, 2705D, and PNUSAE019890, while the reference strains consist of EC4115, MOD1-EC4306, MOD1-EC4308, and EDL933.
Y and N indicate presence and absence, respectively.
REP, reoccurring, emerging, and persistent.
Authors’ Contributions
Y.Y.: Conceptualization, data analyses, writing (original draft). X.Y. and B.J.H.: Data curation and analyses. C.L.: Writing (review/editing). X.N.: Conceptualization, writing (review/editing), resource, supervision. All authors read and approved the submitted article.
Footnotes
Acknowledgment
This research was partly conducted by appointees to the Agricultural Research Service Research Participation Program administered by the Oak Ridge Institute for Science and Education (ORISE) through an interagency agreement between the Department of Energy (DOE) and the U.S. Department of Agriculture (USDA).
Author Disclosure Statement
No competing financial interests exist.
Funding Information
No external funding was received for this article.
Disclaimer
All opinions expressed in this paper are the authors’ and do not necessarily reflect the policies and views of USDA, DOE, or Oak Ridge Associated Universities (ORAU)/ORISE. USDA is an equal opportunity provider and employer.
