Abstract

To the Editor:
This study, using next-generation sequencing technology and looking for evidence of clonality in blaCTX-M-15 plasmids of mostly Asian origin, instead found that all five plasmids sequenced by this group had cryptic prophage genes in their genomes. 1 My analysis of the data suggests that this is not the case. Correctly assembling plasmid genomes is challenging, with problems including fragmented genomes and chromosomal contamination. 2 Misassembly is a common problem, as plasmids often contain highly repetitive regions, such as IS elements, that are often beyond the resolution of short-read methodologies. 3,4 In this study, I find problems with the backbone structure of both IncFII plasmid assemblies (pHK02-026 and pM16-13), suggesting misassembly. The major issue, I believe, is that the presence of the CPZ-55 locus in the pHK02-026 plasmid assembly is highly questionable: the prophage region is identical to a locus present in the genome of the Escherichia coli host strain from which the sequencing library was prepared and is not, in fact, present in the actual plasmid. No data are presented in support of, or to validate, these unique assemblies.
The pHK02-026 genome is here reported to have a 6,745 bp insertion that shows “high similarity” to the CPZ-55 locus of DH10B, a laboratory strain of E. coli K-12; it is, in fact, 100% identical. Other natural isolates of E. coli with chromosomal inserts at the CPZ-55 locus are quite divergent, and they are less than 12 kb in size. All are unlikely to be prophage related: they have putative integrase genes (intZ) in common, but the following 7–15 genes are mostly unrelated and annotated as hypothetical genes. The methods are unclear on how each plasmid was sequenced. The authors report using Roche 454 and MiSeq technologies, but they do not specify whether either or both were used for each plasmid; read length, use of paired-end sequencing, and coverage depth were not detailed. Assemblies were reported with 8, 6, 1, and 1 contigs, respectively, but without reference to individual plasmids. Contig gaps of undefined size or position were reported to be closed with polymerase chain reaction (PCR) and Sanger sequencing.
There are at least 11 sequence discontinuities (non-syntenic regions) when comparing pHK02-026 with pNY9_03 and/or pKPS30 (used in their figure 2 comparisons) that indicate that pHK02-026 is not just divergent but also highly misassembled. Individually, each anomaly could potentially be correct, but the number, precise nature, and specificity of many of these discontinuities are causes for concern. For clarity, I will here briefly describe these in order of occurrence in the pHK02-026 assembled sequence, and in more detail in the Supplementary Data; the acquisition of the CPZ-55 locus is the sixth anomaly. This anomaly, and the location of all anomalies, found only in pHK02-026 and in no other sequence in GenBank as of March 17th 2019, are detailed in Fig. 1.

Fig. 1. Anomalies in the assembly of pHK02-026. Anomalies are numbered 1–11 and their locations are shown on the genome map in the lower panel. Middle and upper panels show anomaly 6 at the CPZ-55 locus. Upper panel: the sequence of pHK02-026 flanking the CPZ-55 insertion point is shown with the 5′ and 3′ ends of CPZ-55 DNA in lower case (6,435 bp not shown). The start and end of the IntZ and YffS proteins are shown in upper case above the sequence. The end of the ISEc23 transposase protein is shown below the sequence. Note that the junctions of the plasmid and CPZ-55 DNA are either TAA stop or ATG start codons. Uppercase sequence 5′ of the CPZ-55 DNA corresponds to the end of a 30 bp-truncated ISEc23 element. The second row shows the sequence of the complete end of the ISEc23 element in pNY9_3 (the 30 bp missing in pHK02-026 are underlined), inserted into the yehA coding region (translation shown in lower case, representing encoding on the opposite strand). Six hundred twenty-four bp linking to the yeiA locus are not shown for clarity. The third row shows the sequence that in pHK02-026 (7,713 bp) links directly to the 15 bp 5′ of the yeiA gene; this corresponds to the first anomaly noted in the text; the truncated IS1′ sequence is underlined. The middle panel shows a graphical representation of the pHK02-026 and pNY9_3 regions from the top panel, expanded to show the full CPZ-55 locus and the 5′ flanking region showing linkage to the tra locus. The pM16-13 ISEc23 locus is identical to that in pNY9_3. These are compared with the same locus in pK15fos (GenBank accession MK433206.1), which appears to have an ancestral wild-type locus before acquisition of the ISEc23 element, marked by the dashed lines. The relative locations of regions A, B, and C, marked by solid lines, are shown on the pHK02-026 map in the lower panel; the dashed line for region B indicates that homologous sequences extend beyond the region shown.
The first discontinuity is at bp 7,713 (an almost complete loss of an IS1-like element, present in full in pKSP30 and pM13-16), which is joined proximal to the tra locus; in other plasmids, this region is on the distal side of the ISEc23 insertion. A second discontinuity, at 59,085 bp, joins sequences following tetR to a truncated catB3 gene, which is also inverted in comparison to pNY9_3 and pKSP30. The third, at 67,469 bp, immediately following the stop codon of tnpR, is a deletion of 4,762 bp compared with pNY9_03. The fourth, at 79,173 bp, is a 39 bp deletion immediately adjacent to the start codon of the ssb gene. The fifth, at 79,702 bp, is a 62 bp duplication of the end of the ssb gene, immediately after its stop codon. The sixth is at 86,570 bp, where the CPZ-55 locus begins: an ISEc23 sequence (their annotation is IS682-IS66) is truncated by 30 bp immediately after the stop codon of the encoded transposase (Fig. 1). A complete ISEc23 element is present at the same position in pM13-06, pNY9_3, and many other plasmids; IS-free loci syntenic with pM13-16 are also found in many other plasmids (e.g., Klebsiella pneumoniae plasmid pKPC2_020037, accession CP036372.1) (Fig. 1). Sequences immediately following the CPZ-55 locus in pHK02-026 are discontinuous with all other similar ISEc23 loci; the tra loci 3′ of ISEc23 in pM13-16 and pNY9_3 are not syntenic in pHK02-026, being linked to the first discontinuity noted earlier (Fig. 1). Seventh, at 93,044 bp, immediately following the stop codon of the CPZ-55 encoded yffS gene, is a 134 bp duplication of an internal fragment of the following mph(A) gene. Eighth, this mph(A) gene has a premature amber stop codon proximal to a 207 bp deletion of the 3′ end of the gene at 93,892 bp, which ends at the start codon of the following mrx gene. Ninth, at 97,198 bp, the 306 bp padR coding sequence in pHK02-026, from its start to stop codons, is specifically inverted relative to all other plasmids with this gene.
Tenth, although pHK02-026 and pKPS30 homologies are generally on the complementary strand, from 107,698 bp to the end (the stop codon of resD) they show 100% identity on the coding strand, indicating a non-syntenic inversion in the pHK02-026 assembly. The 11th discontinuity in pHK02-026 joins the stop codon of resD (110,970 bp) to the start codon of repB (bp 1), whereas in all other similar plasmids these two genes are separated by a highly conserved 1.1 kb sequence encompassing an IncFIA replicon, which is missing here.
That the CPZ-55 locus begins and ends with sequence anomalies, and includes only the coding regions (from the ATG of intZ to the yffS stop codon) without any flanking sequences, is highly troubling, since the only other matches in GenBank showing >99% identity to the CPZ-55 locus in this plasmid assembly occur in the chromosomes of certain laboratory strains of E. coli K-12. These include J53, 5 the strain used here as the conjugal recipient and from which the plasmid DNA was purified for sequencing. This strongly indicates that its presence in this assembly is spurious. Unique instances like this demand a higher level of proof than the uncorroborated output of a computer program before they should be accepted as fact, especially since neither details of library quality, size, or coverage nor any experimental support for this phenomenon were reported.
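The observation that junctions coincide with start or stop codons can be tallied mechanically. The following Python sketch is illustrative only and is not the procedure used for the analysis in this letter; the function names, the 0-based breakpoint convention, and the toy sequences are assumptions introduced here.

```python
# Minimal sketch: does an assembly junction abut a start or stop codon?
# All names and the coordinate convention are hypothetical.

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def codons_at_junction(seq, breakpoint):
    """List codon boundaries that coincide with a junction.

    `breakpoint` is the 0-based index of the first base to the right of
    the junction. Both strands are checked.
    """
    seq = seq.upper()
    left = seq[max(0, breakpoint - 3):breakpoint]   # 3 bases before the junction
    right = seq[breakpoint:breakpoint + 3]          # 3 bases after the junction
    hits = []
    if left in STOP_CODONS:
        hits.append("left:stop(+)")    # forward-strand gene ends at the junction
    if len(left) == 3 and revcomp(left) == START_CODON:
        hits.append("left:start(-)")   # reverse-strand gene starts at the junction
    if right == START_CODON:
        hits.append("right:start(+)")  # forward-strand gene starts at the junction
    if len(right) == 3 and revcomp(right) in STOP_CODONS:
        hits.append("right:stop(-)")   # reverse-strand gene ends at the junction
    return hits


if __name__ == "__main__":
    toy = "CCCTAA" + "ATGGGG"          # invented sequence, junction at index 6
    print(codons_at_junction(toy, 6))
```

A single hit means little on its own, since 3 of the 64 possible trinucleotides are stop codons; it is the proportion of junctions with such hits across an assembly that would be informative.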
Significant problems are also found with the pM13-16 assembly. Both IncFII plasmids in this study (pHK02-026 and pM16-13) were circularized arbitrarily, like many others, starting with bp 1 as the start codon of repB (IncFIA locus, pHK02-026) or repA2 (IncFII, pM16-13) and ending with resD. Both assemblies appear anomalous, as pHK02-026 has no sequence between resD and repB, whereas pM16-13 does have the highly conserved 1,107 bp IncFIA origin following resD. However, this is joined directly to the start codon of repA2, part of the IncFII replicon that is found in both plasmids, but in different locations; no other plasmids are known that link resD (IncFIA) with repA2 (IncFII). These two plasmids are highly mosaic when compared with each other, and many of the boundaries involve coding region start or stop codons, which is highly unusual. In addition, pHK02-026 appears to have a full set of F-type conjugal transfer genes, whereas pM16-13 is missing essential components of the apparatus. 6 Completely absent from the pM16-13 assembly (relative to pHK02-026) are all the conjugation genes immediately following the stop codon of traC (missing all sequences up to the start codon of repA2 in the pHK02-026 assembly); oddly, in pM16-13, the stop codon of traC abuts the stop codon of a convergent psiA2 gene. This is one end of a 3.26 kb region in pM13-16 (beginning with the psiA2 stop codon and ending with the start codon, on the opposite strand, of an unannotated conserved open reading frame that I here denote parX, a parB homolog) that is 100% identical in many other plasmids, but never found with these flanking sequences. This linkage is again unique to pM13-16. Although pM13-16 is reported to be conjugally proficient (since it was mated into J53), it is missing homologs of most auxiliary/coupling proteins and the relaxase TraI, and, thus, this assembly is almost certainly not conjugation-proficient. 7 None of these anomalies was noted by the authors.
The two IncX plasmids, pIN03-01 and pTH02-34, are described only as very similar, downplaying the fact that they are 100% identical to each other (more than 50 kb of sequence) except for a 381 bp direct duplication in pIN03-01, unique in GenBank (again starting with an ATG on the coding strand), which is found in single copy in pTH02-34. This duplication is not directly mentioned, but it covers the region described as encoding actX only in pIN03-01, when, in fact, both plasmids have this gene (which is perversely annotated as rfaH in pTH02-34). In addition, the GenBank entry for pIN03-01 is wholly unannotated, complicating comparison. In an effort to support their plasmid-prophage linkage hypothesis, the authors describe these two plasmids as encoding genes similar to an aminoglycoside N(3′)-acetyltransferase (yokD) found in a Bacillus prophage; however, these plasmid gene products are only distantly related to YokD (with only 27% identity, 83/301 residues), whereas they are 99–100% identical to other Klebsiella and enterobacterial plasmid genes similarly annotated. The phage shock operons are not, as implied, prophage related, as identical loci are found in a large number of plasmids; they are homologs of a genomic locus originally described in E. coli that is induced in response to filamentous phage infection, 8 and their products are more accurately described as general envelope stress response proteins. 9 Overall, given the earlier facts, the association of “prophage” genes with these plasmids does not hold and is given far more significance than warranted.
The curious joining of homology blocks that are not contiguous in other plasmids, with many blocks beginning or ending with start or stop codons, suggests a problem with the assembly algorithm or programs, or possibly with the quality of the plasmid library sample. Without further experimentation, it remains possible that some or even all of the differences between these plasmids and similar resistance plasmids in GenBank are real, although a more plausible hypothesis is that these assemblies are seriously flawed. I believe that the CPZ-55 locus is not actually present in pHK02-026 and that its appearance in the assembly results from incorrect splicing into the plasmid backbone of a contig assembled from contaminating host DNA; many of the differences from other plasmids may also be the result of incorrect assembly. For pM13-16 to be conjugally proficient, it must have tra genes not present in this assembly. The accuracy of any genome assembly can be most easily assessed by mapping the reads back to the assembly and investigating any regions of low or no (or even high) coverage, as well as the percentage of reads not used (indicating either library contamination or regions not incorporated into the assembly). Submission of a BAM file to the database would allow an independent review of the assembly. Independent confirmation of highly unusual assemblies such as these should be required before publication (for example, confirming by PCR across junctions that plasmid and prophage genes are, in fact, physically linked, or confirming that the macro-structure of the plasmid, as determined by restriction digest, matches that predicted by the assembly). The anomalies outlined here need to be independently investigated and confirmed or refuted, with the presence or absence of the CPZ-55 locus ideally addressed in the original Klebsiella isolate.
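The read-mapping check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any pipeline used by the authors or by me: alignments are supplied as plain 0-based, half-open (start, end) tuples on a linearized assembly (in practice they would be extracted from a read aligner's output), circular wrap-around is ignored, and the function names and thresholds are hypothetical.

```python
# Minimal sketch: per-base coverage of an assembly and flagging of
# suspicious regions. Names, thresholds, and the input format are assumptions.

def depth_profile(assembly_length, alignments):
    """Per-base read depth from 0-based, half-open (start, end) intervals."""
    diff = [0] * (assembly_length + 1)       # difference array
    for start, end in alignments:
        start = max(0, start)
        end = min(assembly_length, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    depth, running = [], 0
    for i in range(assembly_length):
        running += diff[i]
        depth.append(running)
    return depth


def flag_regions(depth, low, high):
    """Return (start, end, label) runs where depth < low or depth > high."""
    regions = []
    run_start, run_label = None, None
    for pos, d in enumerate(depth + [None]):  # None is an end-of-sequence sentinel
        label = None
        if d is not None:
            if d < low:
                label = "low"
            elif d > high:
                label = "high"
        if label != run_label:
            if run_label is not None:
                regions.append((run_start, pos, run_label))
            run_start, run_label = pos, label
    return regions


def unused_fraction(total_reads, mapped_reads):
    """Fraction of reads in the library that did not map to the assembly."""
    return 1 - mapped_reads / total_reads


if __name__ == "__main__":
    toy_depth = depth_profile(10, [(0, 5), (3, 8)])   # invented toy alignments
    print(toy_depth)
    print(flag_regions(toy_depth, low=1, high=1))
    print(unused_fraction(total_reads=200, mapped_reads=150))
```

Regions flagged "low" around a junction would argue against the join, and a large unused fraction would point to contaminating DNA or to sequence left out of the assembly; appropriate thresholds depend on the library and are not specified here.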
Rather than “identifying cryptic prophage genes in all five … plasmids” I surmise that bona fide prophage genes are found in none of the plasmid genomes produced from this study, and that these assemblies do not accurately reflect the genome organization of said plasmids. This does not materially affect their conclusion that the prevalence of blaCTX-M-15-producing K. pneumoniae ST11 clones in Asian countries is not due to the dissemination of a single strain, even though the Indonesian and Thai IncX plasmids may, in fact, be identical.
These issues highlight problems that are not unique to this journal or research group, namely that journals need to develop policies requiring authors to document experiments done to validate assemblies, to address the problem of relying on the output of software algorithms as gospel. Overextended reviewers should not be expected to attempt to validate the data from scratch but instead should confirm that researchers have done the necessary controls to validate their assemblies in advance.
