Abstract
The principal function of archaeal and bacterial CRISPR-Cas systems is antivirus adaptive immunity. However, recent genome analyses identified a variety of derived CRISPR-Cas variants at least some of which appear to perform different functions. Here, we describe a unique repertoire of CRISPR-Cas-related systems that we discovered by searching archaeal metagenome-assembled genomes of the Asgard superphylum. Several of these variants contain extremely diverged homologs of Cas1, the integrase involved in CRISPR adaptation as well as casposon transposition. Strikingly, the diversity of Cas1 in Asgard archaea alone is greater than that detected so far among the rest of archaea and bacteria. The Asgard CRISPR-Cas derivatives also encode distinct forms of Cas4, Cas5, and Cas7 proteins, and/or additional nucleases. Some of these systems are predicted to perform defense functions, but possibly not programmable ones, whereas others are likely to represent previously unknown mobile genetic elements.
Introduction
CRISPR-Cas is an adaptive immunity system that protects bacteria and archaea from viruses and other invasive genetic elements.1–4 The cas genes generally can be partitioned into three sometimes overlapping modules that are responsible for consecutive stages of the immune response: (1) adaptation (i.e., incorporation of new spacers into CRISPR arrays), (2) processing of pre-CRISPR RNAs (pre-crRNAs) that generates mature crRNAs, and (3) interference when the crRNAs are employed as guides to recognize and cleave the cognate target DNA or RNA molecules.5,6 In addition, many CRISPR-Cas systems include various accessory genes that regulate different functions of the basic CRISPR machinery. 6
In addition to full-fledged, interference-competent CRISPR-Cas systems, a variety of derived, defective variants have been identified by comparative genomics analysis.7,8 These variants lack some of the essential components, often the active moiety of the interference module and also the adaptation module as well as the CRISPR array itself. They are generally thought to perform functions distinct from adaptive immunity and, in some cases, are apparently not involved in defense at all. A striking example of such exaptation of derived CRISPR-Cas systems is provided by the interference-defective subtype I-F and subtype V-K variants encoded in different groups of Tn7-like transposons that are involved in RNA-guided transposition.9–12 The functions of other derived variants remain poorly understood. Type IV CRISPR-Cas systems, which are carried by numerous plasmids, appear to be the most common of such defective variants and seem to have evolved via partial degradation of type III systems, including the loss of the interference module.6,13 The prevalence of spacers targeting plasmid DNA suggests that type IV systems are involved in inter-plasmid competition. 14 Indeed, it has been demonstrated that a type IV-A CRISPR-Cas system from Pseudomonas aeruginosa mediates RNA-guided interference against a plasmid in vivo, although the mechanism of this interference remains obscure. 15 Another notable case is the Halobacterial Repeat-Associated Mysterious Proteins (HRAMP) systems that are widespread in Halobacteria and consist of extremely diverged variants of Cas5 and Cas7 (two families of the RAMP superfamily) along with additional nucleases and uncharacterized conserved proteins. 16 The HRAMP systems are not associated with CRISPR arrays and lack adaptation modules.
Given the presence of the Cas5 and Cas7 proteins that form crRNA-binding complexes in Class 1 CRISPR-Cas systems, it has been proposed that HRAMPs are involved in RNA-dependent, although probably not programmable, defense functions. 16
Apart from the derived CRISPR-Cas variants, homologs of some individual cas genes, in particular cas1, cas2 and cas4, have been identified in the non-CRISPR context. The most prominent case is that of casposons, a distinct family of transposons that employ a Cas1 homolog as the integrase.17–21 In addition, some cas1 homologs are “solo” genes, without any conserved genomic context and therefore without any hint of the function. 18 Some of the casposons also encode a Cas4-like protein, although the role of this nuclease in the transposon life cycle (if any) remains unknown. Furthermore, Cas4 homologs are encoded in various contexts, suggesting involvement in repair processes or other defense functions. 22
Here, we report a surprising diversity of previously unknown CRISPR-Cas derivatives and highly diverged Cas1 homologs encoded in the genomes of many Asgard archaea. The Asgard archaea are a recently discovered archaeal superphylum that is rapidly expanding thanks to metagenomic sequencing.23–27 They are best known for their apparent evolutionary relationships with eukaryotes, which are thought to share common ancestry with one of the Asgard lineages. Otherwise, however, the biology of this major group of archaea is poorly understood, in large part because of their recalcitrance to growth in culture. 28
We sought to characterize the antivirus defense mechanisms in Asgard archaea and, in particular, the CRISPR-Cas systems. The presence of a unique repertoire of CRISPR-Cas derivatives and extremely divergent Cas1 variants might reflect some specific aspects of the Asgard mobilome that remain to be investigated.
Methods
The sequences of 41 metagenome-assembled genomes (MAGs) of Asgard archaea with high coverage were obtained from various environments, including marine sediments and seawater (Supplementary Table S1). The only currently available complete genome of an Asgard archaeon, the anaerobic archaeon MK-D1, 28 was obtained from GenBank (GCF_008000775.1). The protein sequences encoded in the Asgard genomic contigs were assigned to the 2015 version of arCOGs 29 using PSI-BLAST, 30 with the arCOG alignments used as the position-specific scoring matrix sources, as previously described. 31
Iterative profile searches using PSI-BLAST 30 with a cutoff e-value of 0.0001 were employed to search for distantly similar sequences in either the non-redundant (NR) database or the protein sequence database of the 41 Asgard genomes. To detect distant sequence similarity, a CD-search 32 with a cutoff e-value of 0.01 and low complexity filtering turned off and a HHpred search with default parameters 33 were run against PDB, Pfam, and CDD profile databases. Protein secondary structure was predicted using Jpred 4. 34
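As an illustration of the e-value cutoff described above, the following minimal sketch filters hits from a tab-separated search report. It assumes the common 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, e-value, bit score); the function name `filter_hits` and the inclusive comparison are assumptions made for illustration and are not taken from the original pipeline.

```python
from typing import Iterable, List, Tuple

# 0-based column index of the e-value in the assumed 12-column tabular layout.
EVALUE_COL = 10


def filter_hits(lines: Iterable[str], max_evalue: float = 1e-4) -> List[Tuple[str, str, float]]:
    """Return (query, subject, e-value) for hits at or below the cutoff.

    Hypothetical helper: blank lines and '#' comment lines are skipped.
    """
    kept: List[Tuple[str, str, float]] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        evalue = float(fields[EVALUE_COL])
        if evalue <= max_evalue:
            kept.append((fields[0], fields[1], evalue))
    return kept
```

Calling `filter_hits(report_lines, max_evalue=0.01)` would apply the looser cutoff quoted for the CD-search instead.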
The set of 2,512 Cas1 domain sequences 35 was augmented with 69 sequences from Asgard archaea and 20 sequences from subtype V-F 36 and their homologs in NR. All protein sequences were clustered using MMSEQS2 37 (--min-seq-id 0.9) and aligned using MUSCLE. 38 Alignments were iteratively compared to each other using HHSEARCH and aligned using HHALIGN. 39 An approximate ML tree of Cas1 was reconstructed using FastTree with the WAG evolutionary model and gamma-distributed site rates. 40 A similarity dendrogram for TnpB and type V effectors was built, as previously described, using the same alignments of TnpB protein sequences 35 combined with two alignments of Asgard-specific TnpB families.
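The idea behind clustering at a 0.9 sequence identity threshold can be sketched with a simple greedy procedure. The code below is not the MMSEQS2 algorithm: it swaps in a crude identity proxy (longest common subsequence length divided by the length of the longer sequence) and a longest-first greedy assignment, and all function names are hypothetical. It is meant only to show how a threshold of 0.9 groups near-identical sequences under a single representative.

```python
from typing import Dict, List


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def identity(a: str, b: str) -> float:
    """Crude identity proxy (an assumption): LCS length over the longer length."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def greedy_cluster(seqs: Dict[str, str], min_identity: float = 0.9) -> Dict[str, List[str]]:
    """Assign each sequence to the first representative it matches, longest first."""
    clusters: Dict[str, List[str]] = {}
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        for rep in clusters:
            if identity(seqs[name], seqs[rep]) >= min_identity:
                clusters[rep].append(name)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters
```

Each key of the returned dictionary is a cluster representative, which is the role played by the sequences carried forward to the alignment step.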
The CRISPR-Cas loci were identified and annotated, as previously described, using custom profiles derived from multiple alignments of Cas proteins to search for cas genes. 6 CRISPR arrays were detected using the minCED tool (https://github.com/ctSkennerton/minced) 41 with default parameters. The search for protospacers homologous to spacers from Asgard CRISPR arrays was performed using BLASTN with 90% sequence identity, otherwise as previously described. 42
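The protospacer search can be pictured as scanning a target sequence, on both strands, for windows that match a spacer at 90% identity or better. The sketch below is a deliberately simplified, ungapped stand-in for BLASTN (no seeding, no gaps, no e-values); the function names and the returned tuple layout are assumptions made for illustration.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_protospacers(spacer: str, target: str, min_identity: float = 0.9) -> List[Tuple[int, str, float]]:
    """Return (start, strand, identity) for ungapped matches of the spacer.

    Hypothetical helper: every window of spacer length in the target is
    compared with the spacer ('+') and with its reverse complement ('-').
    """
    spacer = spacer.upper()
    target = target.upper()
    n = len(spacer)
    hits: List[Tuple[int, str, float]] = []
    if n == 0 or len(target) < n:
        return hits
    for strand, query in (("+", spacer), ("-", reverse_complement(spacer))):
        for start in range(len(target) - n + 1):
            window = target[start:start + n]
            matches = sum(1 for x, y in zip(query, window) if x == y)
            ident = matches / n
            if ident >= min_identity:
                hits.append((start, strand, ident))
    return hits
```

In practice a heuristic aligner is needed for speed and for gapped matches; the sliding window only makes the identity criterion explicit.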
Results
By using the previously developed approaches for the identification of CRISPR-Cas systems and CRISPR arrays, 6 we detected only class 1 systems in four of the six major Asgard lineages (Odin, Thor, Loki, Hel) in at least 12 of the 41 analyzed MAGs (Supplementary Tables S1 and S2). 27 These systems belong to subtypes I-E and D or subtypes III-A, B, and D. Most of these systems were found to be partial because of the small contig size (i.e., truncated at the end of a contig). These are “garden variety” CRISPR-Cas systems, closely resembling those found in other archaea and bacteria. A search for protospacers homologous to the spacers from these CRISPR-Cas systems yielded no hits to any known virus genomes but several hits to putative proviruses integrated in Asgard genomes (Supplementary Table S3), supporting the role of these systems in antivirus immunity.
In addition to the typical Class 1 CRISPR-Cas systems, we identified unusual novel variants that have no counterparts outside the Asgard archaea. Notably, these unique CRISPR-Cas-related entities are more abundant in Asgard than typical CRISPR-Cas systems, with the latter comprising only about 30% of the CRISPR-related loci (see Supplementary Table S1). One group of such apparent derived CRISPR-Cas systems, present in several Asgard MAGs, is distantly related to the HRAMP systems and has been mentioned in the latest publication on CRISPR-Cas classification. 6 The HRAMP systems contain three core genes, two of which, cas7 and cas5, are shared with class 1 CRISPR-Cas systems, where they are subunits of the crRNA-binding complex, and one gene remains uncharacterized; they are often also associated with additional nucleases. 16 In the analyzed set of Asgard MAGs, there are at least three loci with the same organization as a typical HRAMP, but the majority of the related systems from Asgard have a unique organization (Fig. 1A and Supplementary Table S2). Most of these loci lack the uncharacterized core gene present in HRAMP and instead encode an HD nuclease domain fused to Cas5. Furthermore, this Cas5 domain is truncated and contains only the C-terminal region including a glycine-rich loop, the hallmark of the RAMP superfamily. 43 By analogy to HRAMP, we provisionally denote these systems ARAMPs (Asgard RAMPs). The ARAMP loci also typically encode two nucleases: PD-DExK, predicted to cleave DNA, and PIN, a predicted RNase. Additionally, and unlike HRAMPs, several ARAMPs include a gene encoding a large protein that contains a domain distantly related to Cas1 (hereafter aCas1_1), the integrase involved in the insertion of new spacers into CRISPR arrays, as well as the transposition of the casposons. 44 This protein often also contains a Zn finger and, in some cases, a DNA-binding helix-turn-helix (HTH) domain (Supplementary Fig. S1).
Like HRAMP, none of the ARAMPs are adjacent to a CRISPR array. The ARAMPs or aCas1_1s are found in several MAGs of the Gerd, Heimdall, Hel, and Thor lineages of Asgard archaea, in most of which other known CRISPR-Cas systems were not identified (Fig. 1A and Supplementary Tables S1 and S2). Although the function and mechanism of ARAMP are currently enigmatic, based on the presence of nucleases and remnants of a Class 1 effector complex, it appears highly likely that ARAMP is an RNA-dependent defense system, although perhaps not a programmable one.

Genomic organization of Asgard-specific derivatives of CRISPR-Cas systems and Cas1-containing modules. For each locus, the metagenome-assembled genome contig identifier and the respective nucleotide coordinates are indicated. The genes of a representative locus are shown by arrows. The arrows indicate the transcription direction of the respective gene. Genes are shown roughly to scale. Homologous genes are colored according to the respective insets. Genes are shown to the same scale in
We also identified distant homologs of aCas1_1 (hereafter aCas1_2) that are not linked to a Cas7–Cas5 RNA-binding complex but instead are strongly associated with a Cas4-like DNase and an HTH-domain-containing DNA-binding protein. aCas1_2 is shorter than aCas1_1 and has a distinct Zn finger domain (Supplementary Fig. S1). These variants are also found in the MAGs of Heimdall, Hel, and Gerd lineages (Fig. 1B and Supplementary Table S2). Given the absence of RNA-binding proteins, this could be a defense system that functions at the DNA level.
Another unusual group of CRISPR-Cas-related elements was identified, mostly in Thorarchaeota but also in several Heimdall MAGs, in the course of examination of “dark matter islands,” genomic loci enriched in unannotated genes. 45 This variant is centered around a large protein (aCas1_3) that contains three identifiable but highly diverged domains, namely the catalytic domain of Cas1, a PD-DExK nuclease and a P-loop ATPase that is often inactivated, as indicated by the substitution of the aspartate critical for catalysis in the Walker B site 46 (Fig. 1C and D, Supplementary Fig. S1, and Supplementary Table S2). These three domains are located in the N-terminal region of the large protein, whereas the remaining portion (about 800 aa) contains no identifiable domains and shows no detectable sequence similarity to any proteins in the current databases. Nevertheless, this region is predicted to adopt an alpha/beta secondary structure, suggesting that it consists of globular domains (Supplementary Fig. S1). This gene is strongly linked to genes encoding TnpB-like family proteins of two distinct subfamilies, both with intact catalytic RuvC-like motifs, indicating that these are active nucleases (Supplementary Fig. S1). The TnpB-like proteins of both subfamilies are larger than typical transposon-encoded TnpB (∼570 and ∼750 aa, respectively; hereafter, TnpB-570 and TnpB-750 families), which is comparable to the size of the smallest type V effectors (Fig. 2A). 47 The aCas1_3-TnpB modules are not accompanied by CRISPR repeats, suggesting that these might not be functional CRISPR-Cas systems. Several TnpB-like proteins of both families are encoded by stand-alone genes. For stand-alone TnpB-570 family genes, we identified traces of recent transposition events in AS_002 MAG and detected flanking inverted terminal repeats, suggesting that these genes represent non-autonomous transposable elements (Supplementary Fig. S1).
No tyrosine recombinase or serine recombinase genes were identified in this MAG, which makes aCas1_3 the best candidate for the role of the recombinase responsible for the in trans transposition of these elements. The functionality of aCas1_3-TnpB modules remains unclear, given that they share some features with both transposable (IS) elements and type V CRISPR-Cas systems (Fig. 2B). Considering the presence of several copies of this module (albeit not identical) in some of the Thorarchaeota MAGs (up to three of each variety in the As_083 genome; Supplementary Table S2) and the clear evidence of transposition of the stand-alone TnpB-570 elements, the more plausible hypothesis appears to be that this is a novel IS-like mobile genetic element (MGE).
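The flanking inverted terminal repeats reported for the stand-alone TnpB-570 elements correspond to a simple sequence property: the start of the element reads as the reverse complement of its end. A minimal sketch of that check, assuming perfect (mismatch-free) repeats and a hypothetical function name, could look as follows; real repeats are often imperfect, so this would only serve as a first-pass screen.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def terminal_inverted_repeat_length(element: str, max_len: int = 50) -> int:
    """Length of the perfect inverted repeat shared by the two termini.

    Hypothetical helper: compares the element with its own reverse
    complement from the 5' end and stops at the first mismatch. The scan
    is capped at half the element length so the two arms cannot overlap.
    """
    seq = element.upper()
    rc = reverse_complement(seq)
    limit = min(max_len, len(seq) // 2)
    k = 0
    while k < limit and seq[k] == rc[k]:
        k += 1
    return k
```

A return value of zero means that the two ends share no perfect inverted repeat under these assumptions.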

Phylogenomic analysis of Asgard aCas1 and TnpB.
As discussed above, Asgard archaea encode three distinct groups of Cas1 homologs that are more diverged from both the CRISPR-associated Cas1 and the casposases than any previously identified proteins within the Cas1 superfamily. We constructed a phylogenetic tree of the entire Cas1 superfamily, including the Asgard homologs (Fig. 2C). The tree contains a well-supported clade that includes all three aCas1 families; this clade splits into two branches, one of which consists of aCas1_3 and the other that includes aCas1_1 and aCas1_2 along with a minor, distinct variant (aCas1_1_v). One of the genes encoding aCas1_1_v (As_098-p_00945), like most of aCas1 genes, is located next to an ARAMP locus. Mapping of the genomic neighborhoods of aCas1_3 to the tree showed that aCas1_3 did not co-evolve with the two TnpB families. It appears that aCas1 genes freely combined with genes of the two TnpB families, suggesting interchangeable functionalities (Fig. 2C).
Discussion
The unexpected findings presented here show that Asgard archaea encompass a unique repertoire of highly diverged, derived CRISPR-Cas variants as well as putative novel MGE that share domains with CRISPR-Cas. These observations support and expand the previously noted trends in the evolution of CRISPR-Cas systems, namely their apparent recruitment for functions different from adaptive immunity as well as the evolutionary entanglement with MGE.10,48,49
Of special interest is the tight association between aCas1_3 and TnpB-like nucleases. TnpB appears to be the evolutionary ancestor of Cas12 proteins, the effectors of type V CRISPR-Cas systems.47,49 The Cas12 proteins of the multiple subtypes of type V seem to have evolved from different TnpB subfamilies on multiple independent occasions. 47 The likely evolutionary intermediates are large TnpB-like proteins that acquired various additional still poorly characterized domains on the evolutionary route to mature Cas12 effectors. These putative evolutionary intermediates have been shown to function as CRISPR effectors that, however, prefer single-stranded DNA or RNA substrates in contrast to the typical effector proteins, such as Cas12a or Cas12b, that cleave dsDNA.36,50 Strikingly, in the case of aCas1_3 in Asgard archaea, we observed what appears to be yet another independent association between TnpB and a Cas protein homolog, in this case a large protein containing a diverged Cas1 domain. In previously analyzed bacterial and archaeal genomes, the TnpB proteins are encoded either on their own by non-autonomous transposons or together with RayT-like tyrosine or serine superfamily transposases, TnpA, in less abundant autonomous transposons 51 (Fig. 2B). Unlike most DDE superfamily transposases (so denoted after their triad of catalytic amino acids), these two families of transposases do not require terminal inverted sequences. 51 In Asgard genomes, we identified inverted terminal repeats flanking non-autonomous TnpB-570 transposons, indicating that transposition of these elements is unlikely to involve TnpA, which indeed was not detected in most Asgard MAGs. By contrast, the Cas1-like transposase of the casposons (casposase) does require inverted terminal repeats. 21 Thus, the tight link between aCas1_3 and TnpB-570 strongly suggests that aCas1_3 functions as the transposase of these novel MGE that might also act in trans.
Perhaps the most remarkable aspect of these findings is the major expansion of the diversity of Cas1 in terms of both sequence and domain architecture of proteins containing the Cas1 domain (Figs. 1D and 2). Strikingly, the Asgard MAGs alone encompass a greater diversity of Cas1-containing proteins than the rest of bacteria and archaea taken together. The apparent monophyly of all aCas1 families implies a major expansion of the Cas1 superfamily in Asgard archaea. The lack of association of these Cas1 homologs with other cas genes (with the exception of some ARAMP modules) suggests that these proteins function as transposases (recombinases) in Asgard-specific MGE. The unique complex domain architectures of Asgard Cas1 homologs imply that these recombinases employ novel molecular mechanisms. Finally, it is perhaps worth noting that notwithstanding the apparent evolutionary affinity of the Asgard archaea with eukaryotes, there is nothing eukaryote-like in Asgard-specific CRISPR-Cas-related elements: these are wonders of the archaeal world. Experimental characterization of the derived CRISPR-Cas variants and unique Cas1 homologs of Asgard archaea should illuminate both the functional plasticity of CRISPR-Cas and Asgard biology.
Footnotes
Author Disclosure Statement
No competing financial interests exist.
Funding Information
K.S.M., Y.I.W., S.A.S., and E.V.K. are supported by the Intramural Research Program of the National Institutes of Health of the USA (National Library of Medicine). M.L. and Y.L. are supported by National Natural Science Foundation of China (Grant No. 91851105, 31970105 and 31700430).
References
