Abstract

Nevertheless, my analysis of these secondary data (detailed in the Supplementary Data) confirms the anomalies noted in my original letter, and finds additional questionable errors in all assemblies. These discrepancies can only be addressed with access to all primary sequence data, including all 454, Illumina and Sanger sequencing output reportedly used to generate and close the assemblies. The plasmid assemblies for pHK02-026 and pM16-13 are wholly inaccurate, and the CPZ-55 prophage locus, from the E. coli J53 host strain chromosome, does not belong in the pHK02-026 plasmid genome.
For plasmids pHK02-026 and pM16-13, the authors incorrectly reported assembling 8 and 6 contigs, respectively, from 454 pyrosequencing data (with gaps completed by sequencing of PCR products). Instead, the files for pHK02-026 and pM16-13 contained 17 and 35 contigs, respectively. All contigs are fully syntenic with matches to multiple GenBank genomes, indicating that they are likely reasonably accurate assemblies of individual reads. However, there are major issues with how these contigs were used to produce the assemblies, which violate generally accepted research practices.
Full documentation is in the Supplementary Data. In brief, for pHK02-026, 8 of the 17 contigs (c1–c7, c10) with good read coverage (∼100-fold; shaded green in Supplementary Table S1) map at least partially to their assembly, although many have unexplained single nucleotide polymorphisms (SNPs) (Supplementary Fig. S1). The majority of the assembly has contig coverage, except for a 135 bp gap at 67335-67469 between c5 and c7, and the 2–3 kb of sequence flanking c12, the E. coli prophage locus. Disturbingly, three other high-coverage contigs (c8, c13, c15; shaded red in Supplementary Table S1) with complete matches to other GenBank plasmid entries were not used.
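The gap check described above can be sketched in a few lines. This is only an illustration, not the procedure used for the Supplementary Data: the function names and all contig placements are hypothetical, except that the two boundaries flanking the reported gap (67334 and 67470) are taken from the 67335-67469 coordinates given in the text.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end) on the assembly


def uncovered_gaps(mapped: List[Interval], assembly_length: int) -> List[Interval]:
    """Return the regions of a linearised assembly to which no contig maps."""
    gaps = []
    next_uncovered = 1  # first position not yet covered by any contig
    for start, end in sorted(mapped):
        if start > next_uncovered:
            gaps.append((next_uncovered, start - 1))
        next_uncovered = max(next_uncovered, end + 1)
    if next_uncovered <= assembly_length:
        gaps.append((next_uncovered, assembly_length))
    return gaps


def span(interval: Interval) -> int:
    """Length in bp of a 1-based inclusive interval."""
    start, end = interval
    return end - start + 1


# HYPOTHETICAL placements and assembly length, for illustration only.
# Only the boundaries 67334 and 67470 come from the text.
placements = [(1, 40000), (39950, 67334), (67470, 90000)]
gaps = uncovered_gaps(placements, assembly_length=90000)
print(gaps, [span(g) for g in gaps])
```

With real data, the placements would come from aligning each contig to the assembly, and overlapping or nested contigs are handled by the running `next_uncovered` position.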
Of six low-coverage (less than fourfold) contigs, two were chimeric and internal to c1, c2, or c3; four other low-coverage contigs (c9, c11, c12, and c14) all showed identity to the E. coli J53 host strain chromosome (Supplementary Table S1), but only c12, derived from only four reads, was used in the assembly as an internal part of the CPZ-55 prophage locus. Read coverage data alone should have excluded this contig, as was done for the other low-coverage contigs. Significant errors have also been made in the plasmid backbone. The majority of contigs (reported to have been joined by PCR and Sanger sequencing) were just directly joined to the adjacent contig (Supplementary Table S2). Contigs appear to have been excessively trimmed of repetitive sequences, as many start and/or finish with short matches to the ends of insertion sequence (IS)1 or IS26, with no attempt to complete the IS elements.
Inexplicably, several contigs have been only partially used: ends were discarded, internal sections were deleted, or sections were inverted, duplicated, or extended, with no explanation of why this was done. Every discontinuity noted in my letter (ref. 1) is associated with a contig junction or splice point (Supplementary Fig. S1). As all contigs are contiguous in other genomes, this is cause for concern. A novel 122 kb closed circular assembly can be made, using all of their plasmid-related contigs (c1–c8, c10, c13, and c15), unaltered, joined by full IS elements, and with no prophage sequences, scaffolded on two of their comparator plasmids (pC15-1a and pKPS30) and pNDM5, CP050169 (Supplementary Table S3). Homologies to three other plasmids confirm the junction sequences (Supplementary Fig. S2). This assembly can only be confirmed by mapping the primary 454 read data to it.
The 111 kb genome assembly for pM16-13 may be even more inaccurate; they reported generating 6 contigs, but instead provided 35 contigs assembled from <1,000 reads (Supplementary Table S4)—compared with >22,000 reads for pHK02-026. Contig2, not used in the assembly, has 15-fold coverage, and at 9,294 bp appears to be a separate small plasmid identical to GenBank entry CP052548, pB16KP0089-2; all other contigs show sixfold or lower coverage. Ten very low-coverage contigs that match the E. coli J53 host chromosome were correctly not used; three of these are prophage-related, two from lambda, one from Rac (Supplementary Table S4). Three other unused contigs (c26, c30, c33) match other GenBank plasmid genomes, suggesting they have been erroneously omitted, and >3 kb from the remaining contigs were not used.
Their assembly has only 70 kb of contig coverage, and >38 kb in at least 16 gaps have no coverage (Supplementary Fig. S3). Only 21 contigs mapped at least in part to the assembly (Supplementary Table S3 and Supplementary Fig. S3), most with several SNPs; four were only partially used (Supplementary Table S4). Coupled with the low coverage for all contigs, this is indicative of a low-quality incomplete genomic library. They provide no details as to how the 38 kb of sequence in 16 gaps “were closed by PCR and standard Sanger sequencing,” as reported. Surprisingly, although they report their assembly is “similar to pC15-1a” and “pKPS30-like” (ref. 2), an identical assembly can be directly constructed from a 48.4 kb internally deleted section of pC15-1a (and a 3.26 kb sequence found in many other plasmids) inserted into most of pKPS30 (60.2 kb; Supplementary Data and Supplementary Fig. S4).
In brief, 88558-61551 of pC15-1a (deleted for 1-16947) joined to 25774-29033 of NDM_IncR (CP050169) are inserted at the origin of pKPS30, deleted for 1-936. The resulting junctions match exactly where several contigs abruptly lose homology with the assembly. For example, the 936 bp deletion (and 51,669 bp insertion of pC15-1a and NDM_IncR sequences) is fully internal to the 2.8 kb c13 (Supplementary Fig. S4). Any assembly not using, or accounting for, all (properly assembled) full contigs must be incorrect. Although different contigs map to portions of pC15-1a, pKPS30, and NDM_IncR, coverage is so incomplete and fragmented that it is not possible to produce an accurate assembly from the contig data.
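The segment sizes quoted above follow from simple coordinate arithmetic on a circular genome. The sketch below is illustrative only and rests on two assumptions that are not stated in this letter: that the interval 88558-61551 runs through the pC15-1a origin, and that pC15-1a is 92,353 bp long. The helper name `circular_span` is hypothetical.

```python
from typing import Optional


def circular_span(start: int, end: int, genome_length: Optional[int] = None) -> int:
    """Length of a 1-based inclusive interval; if start > end, the interval
    is taken to run through the origin of a circular genome."""
    if start <= end:
        return end - start + 1
    if genome_length is None:
        raise ValueError("genome_length is required for an interval crossing the origin")
    return (genome_length - start + 1) + end


# ASSUMPTION: pC15-1a length in bp; this value is not given in the letter.
PC15_1A_LENGTH = 92_353

wrapped = circular_span(88558, 61551, PC15_1A_LENGTH)  # 88558 -> origin -> 61551
deleted = circular_span(1, 16947)                      # internal deletion 1-16947
pc15_segment = wrapped - deleted                       # retained pC15-1a sequence
ndm_segment = circular_span(25774, 29033)              # NDM_IncR segment

print(pc15_segment, ndm_segment)
```

Under these assumptions the retained pC15-1a segment is 48,400 bp and the NDM_IncR segment is 3,260 bp, in line with the 48.4 kb and 3.26 kb figures given in the text.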
Their Illumina-generated assemblies of pIN03-01 and pTH02-34 have fewer but no less significant problems. Described as “very similar”, the two assemblies differ only by the presence of a 381 bp duplication in pIN03-01. The authors provided single linear contigs for each plasmid, assembled from an unknown number of reads, that begin at different positions but are otherwise identical to each other, as I predicted in my original letter. The pIN03-01 contig has only a single copy of the 381 bp sequence that is duplicated in their assembly for pIN03-01, KY499796. Both contigs are the same size as the GenBank entry for pTH02-34, KY499797, but with the same four unexplained SNPs. Three of their assembly SNPs are very rare (unique, or only in two or four other plasmids).
In summary, there is no support for a pHK02-026 ancestor gaining a segment of a prophage genome, which ultimately derives from host chromosomal DNA co-purifying with the plasmid DNA used to make the sequencing library. The assembly for pM16-13 may have been wholly produced from other plasmid sequences. None of the plasmids in this study can be accurately described as carrying de facto prophage genes in their genomes. Unless the authors can show otherwise by providing primary sequence data covering all of their genomes, the assemblies for pHK02-026 and pM16-13 should be considered so corrupt and incomplete that they should be retracted and deleted from the GenBank database.
Footnotes
Addendum
The authors volunteered to re-sequence pHK02-026 only. They provided a BAM file containing 4.8 million 300 bp paired reads. These reads mapped with near complete coverage to three genomes: the presumed E. coli host strain J53 (4.3 Mb, NZ_CP028702; 7-fold average coverage), and two plasmids, pHK02-026 (112 kb, KY751926; 5,500-fold average coverage) and pKD46 (6.3 kb, AY048746; 102,000-fold average coverage), an arabinose-inducible clone of lambda exo-red-gam genes, used in recombineering. For the pHK02-026 KY751926 alignment, discontinuities were found at every site noted in my original letter (corresponding to their 454 contig boundaries), and there was extremely low (6-fold, CPZ-55 locus) or no (CPZ-55 to plasmid backbone junctions) coverage for the prophage region in their assembly, compared to 5,500-fold coverage for the plasmid backbone. There was full and similarly low coverage of the CPZ-55 locus, continuing into flanking chromosomal genes, when mapped to the J53 genome. These new reads, however, mapped completely, and with no discontinuities, to my alternate 122 kb assembly for pHK02-026, with only several SNPs in IS26 elements, three 1-2 bp insertions or deletions near the end of their contig1, and a 444 bp insertion in a repeat region internal to their contig8 (not used in their assembly). This proves conclusively that the CPZ-55 locus is not present in pHK02-026 and that their assembly is wholly incorrect. Despite repeated requests, the authors have declined to provide any insight into how they actually produced their plasmid genome assemblies: the reader is left to draw their own conclusions.
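The coverage argument in this addendum can be made explicit with a small calculation. The sketch below is illustrative only: the function name is hypothetical, and comparing depths on a log scale is my own choice for the example rather than the method used for the analysis. The depth values are the average fold coverages quoted above.

```python
import math
from typing import Dict


def closest_replicon(region_depth: float, reference_depths: Dict[str, float]) -> str:
    """Return the replicon whose average depth is closest to the region's
    depth, compared on a log10 scale (i.e. by fold difference)."""
    if region_depth <= 0:
        raise ValueError("region_depth must be positive")
    return min(
        reference_depths,
        key=lambda name: abs(math.log10(region_depth / reference_depths[name])),
    )


# Average fold coverages reported in the addendum.
depths = {
    "E. coli J53 chromosome": 7,
    "pHK02-026 backbone": 5500,
    "pKD46": 102000,
}

prophage_locus_depth = 6  # CPZ-55 locus in the KY751926 alignment
print(closest_replicon(prophage_locus_depth, depths))
print(round(depths["pHK02-026 backbone"] / prophage_locus_depth))  # fold difference
```

A region that genuinely belonged to the plasmid would be expected to show a depth close to that of the backbone; here the prophage locus depth is within a fraction of the chromosomal depth and roughly three orders of magnitude below the backbone.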
