Abstract
Abstract
The abstraction augmented Markov model (AAMM) is an extension of a Markov model that can be used for the analysis of genetic sequences. It is developed using the frequencies of all possible consecutive words with same length (p-mers). This article will review the theory behind AAMM and apply the theory behind AAMM in metagenomic classification.
1. Introduction
S
Once the NGS data are available, most metagenomic data analyses proceed in the identification of short reads by aligning sequences within the sample to their closest relatives using reference databases containing full-length sequences (Kim et al., 2013). BLAST (Altschul et al., 1990) is a popular choice for organism classification, and often the top hits from BLAST results are used to identify the taxa of query sequences. However, since reads from metagenomic data are often short DNA fragments, the top hits are not always accurate. Furthermore, the computational load for alignment-based schema increases as a power function with the length of the sequences, which makes it infeasible to search large genetic databases (Zhu, 2014). Other algorithms, such as MEGAN (Huson et al., 2007) and TANGO (Clemente et al., 2011), improve speed and accuracy in taxonomic classification by using BLAST in conjunction with other consensus measures. A naive Bayesian classifier implemented in the Ribosomal Database Project [RDP, Wang et al. (2007)] is very computationally efficient, as it trains sequences using only eight-word alignments. It further gives a probability of a correct assignment along with bootstrapped confidence intervals. While all of the previously mentioned algorithms have produced meaningful results, the algorithms rely on alignment of sequences that may arise from heretofore unknown and unsequenced organisms. As a result of this lack of information, and for the sake of computational efficiency, many researchers have developed so-called “alignment-free” algorithms.
Alignment-free methods to compare biological sequences were first proposed by Blaisdell (Blaisdell, 1986, 1989). Since that time, the pace of development and improvement for alignment-free methods has increased dramatically. There are two main categories of proposed methods: methods based on word (nucleotides/amino acids) counts, and those that are not based on fixed word-length segments. As for the methods in the first group, sequence similarity is measured using procedures based on metrics defined in coordinate space on word-count vectors. Such metrics include Euclidean or Mahalanobis distance (Torney et al., 1990; Hide et al., 1994; Wu et al., 2001) and correlation/covariance methods (Fichant and Gautier, 1987; Gibbs et al., 1971; Solovyev and Makarova, 1993; van Heel, 1991). Other methods do not involve counting the frequency of segments with fixed length, rather, these algorithms use methods from information theory (Wu et al., 1997), Kolmorgorov complexity theory (Li et al., 2001), angle metrics (Stuart et al., 2002), and iterative maps (Almeida and Vinga, 2002).
Alignment-free methods provide a computationally efficient way to overcome the shortcomings of traditional alignment-based methods. However, important sequence structural information is sacrificed (Zhu, 2014). Markov models, in particular Hidden Markov Models (HMMs), provide a scheme for maintaining structure within the sequence (Yoon, 2009). The Markov structure specifies that the probability of observing a particular nucleotide in a sequence depends only on the previous nucleotide. Methods for sequence classification using HMMs typically rely on alignment to a reference sequence. Additionally, HMM algorithms often require long sequences in order to incorporate long-term sequence dependency. Long sequences are not always available in metagenomic data. Furthermore, HMMs require user specification of several parameters that, if misspecified, could produce misleading results (Kotamarti et al., 2010).
There are other Markov-based sequence classification methods that have been shown to model sequence structure, exhibit computational efficiency, and do not need alignment. One such method is “quasi-alignment,” which is based on the extensible Markov model (Dunham et al., 2004), extended to biological sequence analysis (Hahsler and Nagar, 2014; Kotamarti et al., 2010). Quasi-alignment consists of three standard steps: preprocessing sequences, learning target EMMs, and scoring query sequences against target EMMs. In the context of metagenomics, one can construct an EMM using sequences of the same phylogenetic class from a well-known database, such as Greengenes (DeSantis et al., 2008). An example of an EMM constructed from the 302 metagenomic 16S rRNA sequences for the genus Escherischia is shown in chapter 1 of Zhu (2014). Each EMM represents one taxon at a particular phylogenetic rank. When an unknown sequence is to be classified, it is scored against all the EMMs at the specified rank and is assigned to the class with the highest score. Details of the scoring process are given in chapter 2 of Zhu (2014) and summarized in Kotamarti et al. (2010). The method is implemented in the R package QuasiAlign (Hahsler and Nagar, 2014). In this way, one can assign sequences within any environmental sample without having to align the sequences against a reference database. Instead, the query sequences can be quasi-aligned against an EMM that represents an entire phylogenetic taxon. Quasi-alignment has been shown to outperform naive Bayes, as implemented in RDP, by a large margin (Nagar, 2013). In addition, quasi-alignment can provide an accurate classification at the genus and species level, which is not possible for other methods due to the small intra-sequence variability at such levels.
Another Markov-based model is the abstraction augmented Markov model (AAMM). Like quasi-alignment, it is a computationally efficient method that does not require alignment of query sequences to a target sequence database. As a result, neither method is affected by minor sequencing errors or multiple sequence alignment errors. AAMMs effectively reduce the number of numeric parameters of a standard Markov model through abstraction, or the grouping of strings of length p into hierarchical clusters (Caragea et al., 2009). In AAMMs, the abstraction acts as a regularizer that helps minimize overfitting when the training sequence set is limited in size. Therefore, AAMMs can perform much more robustly than an HMM (Caragea, 2009). Previously, AAMMs have been evaluated on three protein subcellular localization prediction tasks (Caragea, 2009). It was shown that AAMMs are able to use significantly smaller number of features than traditional higher order Markov models and perform significantly better than variable Markov models. AAMMs have a theoretical advantage over the EMMs used in quasi-alignment, in that the base model for an AAMM does indeed have the Markov property (Caragea, 2009). Even though EMMs as originally defined are Markov (Dunham et al., 2004), when applying EMM in quasi-alignment, the Markov property is no longer tenable. In quasi-alignment, each state represents a cluster of NSVs that are created from overlapping gene sequence segments of length p. These states cannot be specified prior to creating the model, and the NSV's within the states are not statistically independent (Zhu, 2014).
The purpose of this article is to present the AAMM as a method of metagenomic analysis that carries the theoretical properties of Markov models while retaining the efficiency of alignment-free methods. The formal definition of an AAMM and an algorithm for obtaining the abstractions are presented in section 2. Details of the implementation are in section 3, where we also show that AAMMs are able to classify correctly approximately 95% of sequences to their appropriate taxa, even at the genus level. Finally, we discuss future developments and applications in section 4.
2. Methods
Abstraction augmented Markov models have been evaluated on three protein subcellular localization prediction tasks (Caragea, 2009), where it was shown that AAMMs were able to use a significantly smaller number of parameters than traditional higher order Markov models, and perform significantly better in predictive accuracy than variable Markov models. AAMMs are able to reduce the number of numeric parameters of higher order Markov models through successively grouping all possible consecutive words of length p according to the word frequencies in training data set. When constructing an AAMM, a set of all existing p-mers in the training data set is clustered into an abstraction hierarchy according to some similarity measure of p-mer frequencies. This section gives formal definitions and explains the theory behind the AAMM.
2.1. Definitions
An abstraction hierarchy is a rooted tree model whose root represents a set of all existing p-mers, while each leaf-node represents one individual p-mer. Formally,
• The root of • The leaves of • The children of each node {a1,…,am} ∈
Figure 1 shows an example of abstraction hierarchy on a set

An example of an abstraction hierarchy learned from the training sequence abracadabra. The root node (top) represents the entire set S, and the leaf nodes (bottom) indicate each individual 2-mer. Nodes are merged from the leaf to the root, forming abstraction hierarchies.
The root of the above abstraction hierarchy a12 represents the entire set
Specifically, an m-cut γm partitions the set

An example of a two-cut
AAMMs incorporate new variables 1. Input a set of p-mers 2. Initialize 3. Recursively merge pairwise abstractions ai that are most similar to each other on the basis of Equation 2. 4. An abstraction hierarchy
2.2. Similarity between abstractions
Before discussing the similarity between two abstractions, we need to define the concept of the context of a p-mer, which will be used to calculate the similarity between two abstractions.
In Equation 1, L is the total number of nucleotide sequences in the training set
The distance between two abstractions au and av, based on their contexts, is measured using weighted Jensen–Shannon divergence (Lin, 1991). The distance between au and av is given by Equation 2 below
where aw = {au ∪ av} and p(a) represents the prior probability of an abstraction a, and πu, πv denote the prior probabilities of au and av, respectively, in the union aw. The estimate
Equation 4 can be used to calculate πu as follows:
where π v can be calculated analogously.
Figure 1 showed an abstraction hierarchy
Figure 3 shows an example of abstraction hierarchy for genus Propionibacterium. Members of the genus Propionbacterium are gram-positive bacteria that live either on the skin of humans and other animals (so-called “cutaneous” type) or in dairy products, particularly in cheese (“classical” type) (Charfreitag and Stackebrandt, 1989). This abstraction hierarchy was trained on 3-mers using 4936 16S rRNA sequences of Propionibacterium from the Greengenes database (DeSantis et al., 2008).

An abstraction hierarchy for the genus Propionibacterium using words of length 3. The leaf nodes at the top represent all 64 possible 3-mers.
In this abstraction hierarchy, the terminal nodes {s1,…,s64} represent all 64 possible 3-mers. Nodes were recursively merged on the basis of the distance measure given in Equation 2. This abstraction hierarchy describes the similarity of occurrence behavior of 3-mers in genus Propionibacterium; therefore, it could be viewed as a graphical representation of this genus and act as a target model in sequence classification.
2.3. Selecting an M-cut
Recall that an m-cut γm through an abstraction hierarchy
The purpose of scoring an unknown sequence against a target abstraction hierarchy of a known taxon at a preselected level of m-cut is to obtain the posterior probability of unknown query sequence belonging to that known taxon. Abstraction hierarchies are trained by using sequences from the same phylogenetic taxon. Evaluation of each m-cut is based on the posterior probability of the AAMM for a training data set. Given an unknown query sequence {x = x0, x1,…,x(n−1)} and a target abstraction hierarchy with a preselected m-cut gammam, the posterior probability for the query sequence is calculated using Equation 5:
where p(x0…,xk−1) is the prior probability of p-mer
The process of m-cut selection is given by the following algorithm:
1. Specify all possible values of m, {m0,m1,…}. 2. Score all training sequences against the abstraction hierarchy for each mi. 3. Calculate the average of the posterior probabilities for each abstraction (using Equation 5). 4. Return the maximum average posterior probability and the corresponding m-cut.
The main purpose of m-cut selection is to search for the appropriate level of abstractions that best describes the training data set. The performance for sequence classification varies at different levels of target abstractions. Implementing a more specific level (larger m value) of abstractions does not necessarily improve the accuracy of classification, because the parameters may not be accurately estimated if the amount of data at that level is insufficient. On the other hand, classifying sequences at a more general level (smaller m value) of abstractions would not guarantee an improvement in classification performance because there may not be enough terms on the cut to differentiate each class. A simple classification experiment is performed here to illustrate how the m-cut selection affects the classification accuracy.
In all the experiments to evaluate the validity and accuracy of metagenomic classification using AAMM theory, all model training and test sequences are from 16S rRNA sequences, the most widely used genetic marker for bacterial genomes (Case et al., 2007). The sequences are available from the Greenegenes project website (DeSantis et al., 2008). For all experiments, we mainly use 2-mer abstraction hierarchies as the target models in sequence classification, because 16S rRNA sequences, whose average length is around 1500, are comparatively short genetic sequences. Therefore, the parameters in the AAMMs may not be accurately estimated for p-mer abstraction hierarchies if p is greater than two.
In this experiment, the test sequences are from the genera Bacillus, Streptococcus, Clostridium, and Escherichia, since these genera each contain a large number of sequences. We randomly select 20% of the sequences from the Greengenes database in these four genera as the training data set. The test sequences are classified against the corresponding target abstraction hierarchies at all possible levels of abstraction. Figure 4 shows the classification accuracy at all abstraction levels from 2-cut to 16-cut.

Classification accuracy rate for m-cuts (abstraction levels) from two to sixteen for four genera from the Greengenes database. The x-axis represents the m-cut level from 2-cut to 16-cut, and the y-axis denotes the classification accuracy rate. Each colored dotted line represents the classification rate for a specific genus. The solid black line is the average classification rate.
From Figure 4, the classification rate varies when the classification is performed at different levels of abstraction (i.e., m-cuts). When classifying test sequences at abstraction level of five-cut, the average classification accuracy is only 60%; however, this changes to nearly 100% when moving to a more specific level of abstraction. It is also clear that each genera has a slightly different pattern of response. For example, the classification rate for the genus Escherichia varies dramatically depending on the abstraction level, while the genus Streptococcus displays a more stable classification rate at various values of m. The main reason is due to the amount of variability within each genus. Genera (and, in general, taxa) with larger intra-variability tend to have unstable classification rates at different m-cut levels. Thus, different genera may have different optimal abstraction levels, and it is important to select an appropriate m-cut for each taxon before performing sequence classification. For these particular genera, the classification rate stabilizes for m ≥ 8.
3. Results
Sequence classification experiments were carried out to evaluate the validity and accuracy of metagenomic classification using AAMMs. For the experiments here, both the Greengenes data set and Yellowstone National Park data sets (described in section 3.1) were imported and processed to use in model learning, m-cut selection, and sequence classification. The Greengenes database was used as training for the Yellowstone data, in order to avoid the bias of classification results due to the sequence replication within one data source. The software was implemented in R (R Core Team, 2014), and the code is available at https://github.com/MonnieMcGee/ZhuMcGeeJCB2015.
3.1. Data sets used
The Greengenes database is a 16S rRNA gene database that addresses limitations of public repositories by providing chimera screening, standard alignment, and taxonomic classification using multiple published taxonomies (DeSantis et al., 2008). The unaligned version of the 16S rRNA sequences from the Greengenes databases are used to train AAHMs to represent each taxon. The total number of sequences we use in this dataset is 406,997, which belong to 94 phyla.
The Department of Energy Joint Genome Institute (DOE-JGI) produced a bacteria-specific 16S rRNA gene library from Yellowstone National Park samples. Samples are from 20 geothermal sites with 35 to 50 megabases of sequence data per site in Yellowstone National Park. Standard Sanger shotgun sequencing (Staden, 1979) was used to obtain genomic samples. This method assembles randomly broken-up DNA segments into continuous sequences using multiple overlapping segments. We refer to this dataset as YNP, and it contains 6026 16S rRNA sequences in 30 phyla. This data set is used to test the performance of AAMMs in metagenomic classification.
3.2. Classification at the phylum level
In order to demonstrate the performance of AAMM for classification of metagenomic data at the phylum level, we designed two phases of experiments. For the first phase of experiments, sequences from the five most abundant phyla in Yellowstone National Park data set were selected as the test sequences. Ten percent of sequences from the Greengenes database in the same five phyla were randomly selected to learn abstraction hierarchies. For each target phylum, the m-cut is selected as described in section 2.3. Then, the entire test sequences were combined and scored against the five abstraction hierarchies at the preselected level of m-cut and were assigned to the phylum with the highest score. The predicted phylum was compared with the actual phylum of the sequence to obtain the classification rate. The list of experimental designs and results is shown in Table 1. The average classification rate for this experiment is 94.26%. As for m-cut selection, three out of five abstraction hierarchies have 15-cut as their optimal level of abstractions, and the other two selected m-cuts of 13 and 14.
For the first five rows in the table, the test sequences for each phylum were selected from the YNP data. The same five sequences were selected from the Greengenes database to train the AAMM. For phyla marked with *, all sequences were from the Greengenes database. Ninety percent of the sequences in each phylum were used to train the AAMM, and 10% were used for testing.
In the second phase of phylum level experiments, which are denoted by * in Table 1, sequences from six medium-sized phyla in the Greengenes database were selected to evaluate the performance of gene classification using AAMMs. Each phylum was split up into a ratio of 90% and 10% for training 2-mer abstraction hierarchies and test sequences, respectively. For both large and midsize phyla, all classification rates are above 86%.
3.3. Classification at the genus level
Since it has already been shown that quasi-alignment is more effective than naive Bayes or HMMs for classification at the genus level (Nagar, 2013), we compare the percent of sequences correctly classified from several genera for both AAMM and quasi-alignment. The results are shown in Table 2. For the first phase of this experiment, sequences from the ten most abundant genera in the YNP data set were selected as the test sequences. The training sequences were from the Greengenes data set. We randomly selected 20% of the sequences from the Greengenes database in the same genera to learn abstraction hierarchies and select an optimal m-cut. All test sequences were combined and scored against the ten abstraction hierarchies at preselected abstraction levels and were assigned to the genus with the highest score. The predicted genus was compared with the actual genus of the sequence to obtain the classification rate. The average classification rate for this experiment is 98.62%, and all classification rates are greater than 93%. All ten abstraction hierarchies have 15-cut as their optimal level of abstraction. In each case, the AAMM performs as good as or better than quasi-alignment.
Test sequences for each genus were selected from the YNP data. The same five sequences were selected from the Greengenes database to train the AAMM. The same training and test sets were used for quasi-alignment. In most instances, the AAMM outperforms quasi-alignment in terms of percent correctly classified.
Similarly to the phyla experiments, sequences from 20 medium-sized genera, which contained 154 to 175 sequences, in the Greengenes database were selected to evaluate the performance of gene classification using AAMM's. Each genus was split up into a ratio of 90% and 10% for training and test sequences, respectively. For this experiment, most of the classification rates are 100%; therefore, the results are not shown in detail. For a detailed table of results, see chapter 3 of Zhu (2014).
4. Conclusion
In this article, we reviewed the theory behind AAMM and applied AAMM in metagenomic classification. Detailed algorithms of metagenomic classification we proposed were illustrated, which include model learning, m-cut selection, and sequence classification. Experiments were conducted using 16S rRNA sequences of known microbes. The sequences from the Greengenes data set and Yellowstone National Park data set were divided into training and test sequences at each taxon level. The test sequences were scored against all target abstraction hierarchies at a preselected level in order to determine the predicted taxon. This method has been able to provide accurate classification results at both high and low taxon level, and the average classification rate is above 95%.
The analyses can easily be extended to any set of DNA and RNA sequences. We further compared the sequence classification performance of quasi-alignment method and AAMM method. The two techniques perform similarly; however, the AAMM slightly outperformed quasi-alignment in most cases. These two techniques have the potential to be used in identifying novel species in environmental or biological samples.
With regard to assessing the statistical significance of quasi-alignment and AAMM methods, in future work, we will develop hypothesis tests for the null hypothesis that the query sequence does not belong to a specific taxon in order to provide further information in the classification accuracy. In particular, we will focus on deriving theoretical background distributions for both EMM's supported transition score and AAMM's posterior probability score in order to obtain an accurate representation of the null distribution, and thus an accurate p-value for a hypothesis test. Especially, in regard to EMM construction, further research will concentrate on analytically deriving the distribution of the NSV to the setting where sequence preprocessing step is a renewal process of a Markov Chain. Further research is also required to see how the power of the test behaves in order to better evaluate it. For the extension of application for current quasi-alignment and AAMM methods in sequence classification, further research will aim to apply them in other types of genetic sequences, such as amino acid sequences.
6. Appendix
The following example shows a detailed procedure for calculating the context of the three-cut of the abstraction hierarchy from Figure 1. The procedure for calculating the context of other cuts is similar. In a real sequence classification problem, the value of m is determined by the posterior probability of model training sequences, given in Equation 5. Here, we selected the three-cut as an example for convenience and concreteness.
For the abstraction hierarchy in Figure 1,
which means that the conditional probability of observing character a right after 2-mer s1 = {ra} is one-sixth. We can calculate the context of all two-mers in a similar manner. The results are given in Table 3.
For example, the conditional probability of observing character a right after two-mer s4 = {br} is 3/7, and the conditional probability of observing character c right after two-mer s3 = {da} is 1/6.
Now that we have the context of the 2-mers, we can calculate the context of the abstractions. Since a10 = {ra, ca, da} = {s1, s2, s3}, on the basis of the Poisson distribution (Karlin and Altschul, 1990), the context of abstraction a10 for character a is.
which means that the conditional probability of observing character a right after a set of 2-mers a10 = {ra, ca, da} = 1/6. Similarly, Table 4 lists the context of abstractions {a10, a3, a8}. Once the distance between two abstractions is defined, we can train abstraction hierarchies using known sequences.
For example, the conditional probability of observing character b immediately after a set of two-mers a10 = {ra,ca,da} is 3/14, and the conditional probability of observing character d immediately after a set of two-mers a8 = {br, ac, ad} is also 3/14.
Footnotes
Acknowledgments
This work was funded by the National Institutes of Health Developmental Research Grant Program (Parent R21) to Michael Hahsler (PI) and Monnie McGee (Co-PI), grant number 1 R21 HG005912-01A1.
Author Disclosure Statement
No competing financial interests exist.
