Abstract
Abstract
The functional characterization of genes involved in many complex traits (phenotypes) of plants, animals, or humans can be studied from a computational point of view using different tools. We propose prediction—from the machine learning point of view—to search for the genetic basis of these traits. However, trying to predict an exact value of a phenotype can be too difficult to obtain a confident model, but predicting an approximation, in the form of an interval of values, can be easier. We shall see that trustable and useful models can be obtained from this relaxed formulation. These predictors may be built as extensions of conventional classifiers or regressors. Although the prediction performance in both cases are similar, we show that, from the classification field, it is straightforward to obtain a principled and scalable method to select a reduced set of features in these genetic learning tasks. We conclude by comparing the results so achieved in a real-world data set of barley plants with those obtained with state-of-the-art methods used in the biological literature.
1. Introduction
From a computational point of view, when we do not consider any pedigree information, genetic descriptions may be represented by vectors of feature values or genetic markers that are discrete items codifying the allelic composition, usually via SNP (single nucleotide polymorphism). On the other hand, a phenotype may be either a continuous or a discrete variable.
The type of relationship between genotypes and phenotypes, at its most basic level, is simply an association between a genetic marker and a phenotype that one can measure (Lynch and Walsh, 1998). However, most traits (phenotypes) are genetically complex; they are influenced by environmental causes and involve the joint action of many markers (Mauricio, 2001). This relationship has been recently formalized in terms of prediction power of phenotypes from genotypes (Bedo et al., 2008; Lee et al., 2008; Wray et al., 2007; Meuwissen et al., 2001).
Besides setting a formal framework, prediction has some interesting benefits (Lee et al., 2008). Thus, prediction is a natural way to assess the genetic risk of a disease in healthy individuals (Wray et al., 2007). Moreover, it has been reported (Meuwissen et al., 2001) that artificial selection on genetic values predicted from markers could substantially increase the rate of genetic gain in animals and plants.
In the references quoted above, the tools used for prediction are based on regression, even when the phenotype is represented by a binary variable. Regression is the obvious approach when the phenotype—i.e., the target value to be predicted—is continuous. However, frequently the genotype description is only partially available, and the environmental influence in the phenotype acts as a noise source, hindering the induction of a predictor. The consequence is that regressors may exhibit only modest performance scores, even when a strong relationship between genotype and phenotype exists.
To overcome the weaknesses of regression in this kind of learning tasks, we shall try to learn models to predict approximations to the phenotype values instead of exact values. To implement this idea, we relax the specifications of metric regression changing target points by intervals. Obviously, interval predictors are more reliable than metric regressors: the targets are broader.
There are two basic routes to make effective this approach. From the point of view of regression, to learn hypotheses whose predictions are intervals of a radius ε > 0, we may use the so-called ε-insensitive loss function, the loss used by regression support vector machines.
Additionally, we shall explore a second option, based on classification. We make a simple discretization of phenotype values into intervals of equal frequency; thus, predicting one of such intervals is a classification task. Moreover, we can take advantage of the ordering of those intervals; in fact, in this framework, learning the genetic basis of a phenotype is an ordinal regression task.
Moreover, the explanatory power of these classifiers can be additionally improved if we use set-valued classifiers. These classifiers return only one class (interval) when it is sufficiently sure, but in doubtful situations they may opt for predicting two or more classes (a wider interval). Set-valued classifiers have received different names in the literature. In Corani and Zaffalon (2008), the authors present an algorithm to learn set-valued classifiers called Nave Credal; it is an extension of the Nave Bayes classifier to imprecise probabilities.
In Alonso et al. (2008) and del Coz et al. (2009), these classifiers are called nondeterministic. We used them both for multiclass classification and for ordinal regression. The aim of the so-called nondeterministic hypothesis is to predict a set of classes (consecutive in ordinal regression) as small as possible, while still containing the true class. We proved that an optimal trade-off between size of predictions and successful classifications can be derived from the set of posterior probabilities for each entry of the input space. In other words, nondeterministic classifiers are built over a deterministic classifier (which always predicts one class) that provides posterior probabilities.
In this article, we present a slight modification of the algorithms of Alonso et al., (2008) and del Coz et al. (2009) to approach the genetic learning task introduced above. But the main novelty is that we establish that the 0/1 loss of a deterministic classifier is an upper bound of the nondeterministic loss of its nondeterministic counterpart. The consequence is that those tools specially devised to improve deterministic classifiers' performance can be used in nondeterministic learning tasks. More specifically, feature selection algorithms designed for deterministic classifiers can be used with nondeterministic learners to obtain a fully scalable method, suitable for dealing with large genetic data sets. Notice that there are no methods for selecting features specially devised for regressors producing intervals.
Feature selection is very important in genetic tasks. When the genetic descriptions have physical locations in the genome sequence, the set of relevant attributes (i.e., positions in the genome sequence) to predict a quantitative (continuous valued) phenotype are called QTL (quantitative trait loci).
We shall see that the search for QTLs and for prediction models can be successfully tackled with the approach presented here. For this purpose, in the last section of this article, we report a set of experiments done with a well-known data set of barley plants (Wenzl et al., 2006; Bedo et al., 2008). We studied several traits, with different strength of genetic basis. We compared the prediction scores obtained by metric and interval regression, multiclass classification, ordinal regression, and set-valued or nondeterministic classification. Additionally, the selection of QTLs was compared with state-of-the-art methods used in the biological literature.
2. Regressions and Classifications
From a formal point of view, learning tasks can be presented in the following general framework. Let
where Δ(h(
Depending on the type of the output space
In this kind of task, the aim is to maximize the number of successful classifications.
If
Correlation between predictions and true values is another typical way to measure the goodness of a hypothesis.
On the other hand, when the focus is the relative ordering of predictions and its coherence with the ordering of the true classes, the goodness of a hypothesis is established with the area under a receiver operating characteristic (ROC) curve (AUC for short). If h is a hypothesis, its loss evaluated on a test set S′ is 1 − AUC, where
Given a data set
3. Nondeterministic Predictions
When the output space
Definition 1.
A nondeterministic (or set-valued) hypothesis is a function h from the input space to the set of non-empty intervals (subsets of consecutive ranks) of
In the next subsections, we introduce two approaches to learn these kind of hypothesis, one based on classification and the other based on regression.
In nondeterministic classifications, we would like to favor those decisions of h that contain the true ranks, and a smaller rather than a larger number of ranks. In other words, we interpret the output h(x) as an imprecise answer to a query about the right rank of an entry
On the other hand, in regression we shall make use of the loss used by regression support vector machines. The aim is to optimize the distances of true values to intervals of a given width centered in the values returned by regression hypotheses.
3.1. Measuring the goodness of nondeterministic predictions
In information retrieval, performance is compared using different measures in order to consider different perspectives. The most frequently used are Recall (proportion of all relevant documents that are found by a search) and Precision (proportion of retrieved documents that are relevant). In symbols, in a query (i.e., an entry x), if y is the true (relevant) class, Recall is defined by
Precision is given by
However, it is more informative to measure a tradeoff between Recall and Precision. The harmonic average of the two amounts is used to capture the goodness of a hypothesis in a single measure. In the weighted case, the measure is called Fβ. Moreover, since Fβ is bounded in [0, 1] and it measures the goodness, the loss (Δ
ND
) is given by the complementary: 1 − Fβ. Thus, for a nondeterministic classifier h and a pair (x, y),
Once we have the definition of Fβ for individual entries, it is straightforward to extend it to a test set. So, when S′ is a test set of size m, the average nondeterministic loss on it will be computed by
It is important to realize that for a deterministic hypothesis h this amount is the average 0/1 loss, since all predictions are singletons, |h(
Furthermore, the average Recall and Precision on test sets can be similarly defined. In this case, the Recall on a test set is the proportion of times that h(
Finally, from the point of view of regression, the effect of the discretization of continuous classes can be simulated somehow if we use the so-called ε-insensitive loss function, the loss used by regression support vector machines. If ε is a positive value, this loss does not penalize predictions whose distance to true values is below ε; in symbols,
In this context, the Recall (Eq. 10) can be generalized to
3.2. Deterministic and nondeterministic classifiers
In this section, we present a couple of results that allow us to build nondeterministic from deterministic classifiers. In the general ordinal regression setting presented in Section 2, let x be an entry of the input space
where |I| stands for the number of ranks included in the interval I. We shall prove that an optimal hypothesis hND(x) can be computed by Algorithm 1, a version of the algorithms presented in Alonso et al. (2008) and del Coz et al. (2009).
Proposition 1 (Correctness).
If the conditional probabilities Pr(j|x) are known, Algorithm 1 returns the prediction h(x) of one or two consecutive ranks that minimizes the risk given by the loss 1 − Fβ (Eq. 9).
Algorithm for computing the optimal prediction with one or two consecutive ranks for an entry x provided that the posterior probabilities of ranks are given
with
First we shall prove that when Z is an interval, say Z = [s0, s1], given x, the value of Equation (14) can be expressed in function of the length of the interval l = s1 − s0 + 1 and the probability of the interval. In fact, with a probability of 1 − Pr(Z|
Therefore, the intervals with lowest loss of length 1 (respectively, 2) can be determined only by their posterior probabilities. According to Algorithm 1, let [t, t] and [s, s + 1] be the best intervals of length 1 and 2 respectively. Thus, the interval with the lowest risk, given
and will be [s, s + 1] otherwise, which is equivalent to the predicate of the conditional of Algorithm 1 (line 6) as we wanted to prove. ▪
In practice, the actual role of β in Algorithm 1 is to fix the thresholds to decide the number of ranks to predict. In the experiments reported below, the aim is to optimize F1. With higher values of β in Algorithm 1, the size of predictions would increase; we would obtain a higher Recall with a lower Precision.
Using the nomenclature of this section, the deterministic counterpart of hND is given by
The next proposition establishes the relationship between these hypothesis in terms of the corresponding loss functions.
Proposition 2 (Bound)
The risk given by the 0/1 loss of the deterministic hD is an upper bound of the risk given by the nondeterministic loss of the hypothesis hND
If the size of hND(x) is 1, then both loss functions return the same value. When the length of the nondeterministic prediction is 2, according to Algorithm 1, there are some
Again according to Algorithm 1 (see Eq. 17), we have that
Hence, using (Eq. 16),
The consequence of this proposition is that reducing the loss of hD is a reasonable strategy to reduce the loss of its nondeterministic counterpart, hND. Therefore, all the efforts to improve somehow nondeterministic classifiers will be supported by the deterministic part. This includes the grid search to adjust parameters, and the use of filters designed to select features in deterministic classifiers. ▪
4. Experimental Results
To test the proposals of this article, we performed a number of experiments. The first aim of the experiments reported in this section was to compare the performance of point-valued versus interval-valued predictions. To predict intervals we used ε-regression, multiclass and nondeterministic classifiers. The second objective of the experiments was to test nondeterministic selection of features.
We used a publicly available data set about barley plants whose source is Wenzl et al. (2006). However, in order to compare the results obtained, we used the version described in Bedo et al. (2008). The genotypes considered are a subset of a collection of RFLP, DArT, and SSR markers. The genotypes of individuals of barley plants in the Bedo et al. (2008) version are codified by 367 binary (0/1) features to express the presence/absence of parental allele calls.
The phenotypic data for 9 traits, measured in up to 16 different environments, were downloaded from the GrainGenes website (http://wheat.pw.usda.gov/ggpages/SxM) and preprocessed by the authors of Bedo et al. (2008). These traits are α-amylase, diastatic power, heading date, plant height, lodging, malt extract, pubescent leaves, grain protein content, and yield. The number of individuals described in this data set is 94.
In 8 times out of 9, the traits were represented by continuous values. The exception is the presence of pubescent leaves: a binary predicate. Since this trait clearly induces a classification task, we did not consider it in the experiments reported here given that the focus of this paper is to handle continuous traits. Therefore, we used the 8 continuous traits. To standardize the scores achieved by the learners used in these experiments, we used the percentiles of traits instead of their actual values. Therefore, all trait values range in the interval [0, 100].
Additionally, as was explained in Section 2, we discretized the trait values into 5 bins of equal frequency. The discretized version of traits were used both to employ the filter FCBF (Yu and Liu, 2004) and to handle the data sets as classification learning tasks. Let us recall that FCBF orders the attributes of a learning task according to their relevance to induce class values. For this purpose FCBF uses the so-called symmetrical uncertainty (a normalized version of the mutual information) between attribute and class values, which must be discrete, not necessarily ordered. Then the filter rejects those attributes that somehow are redundant with other attributes higher in the order of relevance. The filter is very fast, and performed very well in these experiments. In fact, FCBF has been frequently used for dealing with genetic data, (Saeys et al., 2007).
In all cases, the performance of predictions reported in this section were estimated using a 10-fold cross validation. In all the scores, we used the filter FCBF in an internal loop.
Metric regression.
First, we discuss the results achieved by metric regression. We used a popular SVM implementation for dealing with regressions tasks, LibSVM (Chang and Lin, 2001) with a linear kernel. To adjust the C parameter, we performed an internal grid search (a 10-fold cross validation) with C = 10
e
,
The first column of Table 1 shows that all traits have a modest performance. In fact, the average of all correlations between traits and predictions is only 0.72. In Bedo et al. (2008), the authors used the fraction of variance explained to measure the goodness of the prediction models; in any case, the scores of the predictive power of phenotypes from genotypes reported in Bedo et al. (2008) are equivalent to those reported in Table 1. Supported on these scores, the authors stated that their regression method (a variant of SVM) identified the relevant genetic features (QTL) for the barley data set that broadly coincided with known QTL locations.
Considering the hypotheses learned as a tool to order barleys according to a given phenotype, using the AUC, we measured the coherence of the orderings induced by the hypotheses and the true orderings; see Section 2. The second column of Table 1 reports these scores. Roughly speaking, AUC scores repeat the information given by correlations; in fact, the correlation between the first two columns is 0.97.
However, the most disappointing results are those concerning the prediction power of regressors. To explain the scores of the remaining columns of Table 1, let
We report the average of the absolute values of residuals in the column labeled by mad (mean absolute deviation), see (Eq. 3). The residuals have normal distributions with a mean and a standard deviation shown in the last two columns of Table 1.
Using approximated values for phenotypes with deterministic classifiers.
As was mentioned above, the way used in this paper to represent phenotypes values as approximations is to discretize them into a reduced number of ordered bins. We used 5 bins of equal frequency. With the learning tasks so transformed, we conducted a number of experiments trying to explain the relationship between genotypes and phenotypes by means of multiclass classifiers and ordinal regressors given that classes (or ranks) are ordered. First we discuss the scores obtained with deterministic classifiers. In all cases, we used a linear kernel.
The first two columns of Table 2 report the scores of SVOR (Chu and Keerthi, 2005), a state-of-the-art learner for ordinal regression tasks. The classifiers learned by SVOR return one rank in the 5-scale range for each individual.
The label suc. stands for the proportion of successful classifications (Recall).
The achievements of SVOR in AUC are similar to those obtained by metric regression. The proportion of successful classifications vary from 0.33 to 0.47; the average is 0.41. Since 0.20 would be the average score of a random classifier, the performance of SVOR is not very good for some traits.
The scores of multiclass classifiers are showed in the remainder columns of Table 2; they are clearly worse than those of ordinal regressors. The reason is that the classes are ordered, and ordinal regressors can take advantage of this fact. We used the Logistic Regression (LR) implementation of LibLinear, a large-scale logistic regression method (Lin et al., 2008) specially devised for handling data sets with thousands or millions of features and thousands of entries; in other words, LibLinear is an adequate learner for dealing with genetic data. To complete the comparisons, we used the multiclass classification version of LibSVM (Chang and Lin, 2001). In both cases, we used a linear kernel, and we implemented an internal grid search (a 10-fold cross validation) for the parameter C = 10
e
,
Approximated values for phenotypes with nondeterministic classifiers and regressors.
In order to improve the proportion of times that predictions include the true rank (Recall, Eq. 10), we used nondeterministic predictors. Let us first deal with nondeterministic classifiers. Thus, using the Algorithm 1, and aiming to optimize the F1 score, we built two nondeterministic classifiers using the multiclass classifiers employed for the experiments reported in Table 2. The first one, based on Logistic Regression is called nd(LR), the second, based on SVM, is called nd(SVM) (See Table 3).
The label suc. stands for the proportion of successful classifications (Recall). The average size of predictions is represented as the average tube
To compute posterior probabilities of ranks, we took advantage of their ordering. Thus, if k is the number of ranks, we trained k − 1 classifiers to learn posterior probabilities Pr(y ≤ i|x) for
assuming that Pr(y ≤ k|x) = 1 and Pr(y ≤ 0|x) = 0. Some problematic cases could return negative probabilities for some ranks; in these cases, we normalize the values.
The scores in Recall (successful classifications) achieved by nondeterministic classifiers outperform the accuracy of SVOR. In 7 out of 8 datasets, nd-classifiers performed better than SVOR. Of course, this is due to the fact that nd-classifiers are allowed to predict more than one rank. But if we want to grasp the relationship between genotypes and phenotypes, the increase in successful classifications compensate for the slight increase in the size of predictions. The average radius (referred to as ε) of such predictions is just 12.3 in the case of nd(LR) and 14.0 for nd(SVM). Notice that SVOR used an ε = 10: all intervals predicted by SVOR include 20 percentiles.
As was mentioned in Section 3.1, we can simulate the set-valued (or nondeterministic) approach extending metric regression to become ε-regression (see Eq. 11). Thus, we have tested the proportion of successful predictions (Recall, Eq. 12) with two different radiuses. With ε = 10 we were trying to compare ε-regression with ordinal regression; the scores of SVOR outperform those of ε-regression in 5 out of 8 datasets. The differences with nd-classifiers are almost always against ε-regression with ε = 10. On the other hand, the scores of nd-classifiers are lower in general than those of ε-regression when we used ε = 15.
4.1. Feature selection for nondeterministic classifiers
Using the Proposition 2 of Section 3.2, we made a first selection of the features of barley datasets using the filter FCBF. The number of features so selected varied from 12 to 17 with an average of 14.3, see Table 4. The second step performed was designed in order to obtain a scalable method able to be used in other situations where the number of features selected by the filter could still be too high. Thus, we used a simple linear search, using a 10-fold cross validation, in the ordered list of features returned by the filter. The subset selected was the one with the best deterministic accuracy, using the deterministic counterpart of the nd-classifiers (Eq. 18). The results were the same with SVM and LR, and the number of features so selected were reduced to a range of 3 to 14 with an average of 8.9.
The scores of Recall and average radius (ε) are quite similar to those achieved without selection (Table 4). The features obtained for each trait are listed in Table 5. This table includes the markers selected by some state-of-the-art methods frequently used in the biological literature (Bedo et al., 2008): Composite interval mapping (CIM), single marker regression (MR), and statistical machine learning (SML). The features (markers in the biological sense) are represented by centimorgans (cM): distance along a chromosome. The features selected by our method are similar to those selected by the others methods. However, the main difference can be found in the computational principles used.
The column labeled by nd selec. reports the ordered list of markers returned by the filter FCBF followed by the selection process described in the text. The next columns report (Bedo et al., 2008) the markers selected by Composite Interval Mapping (CIM), single Marker Regression (MR), and Statistical Machine Learning (SML). The second part of the table lists markers selected by CIM and MR ordered by chromosome. In all cases, markers are represented by centimorgans (cM): distance along a chromosome.
Plant height.
To illustrate the differences between regression-based approaches and the nondeterministic selection method, we are going to discuss in detail the selection about the trait plant height. The prediction power of all alternative methods discussed here have poor scores in Recall. Figure 1 depicts the true plant heights percentiles (represented by bullets), and the intervals predicted (using a 10-fold cross validation) by the nondeterministic classifier (nd(LR) with selection of features) and by metric regression with ε = 15. Notice that the size of the tubes are quite different; in the nd(LR) classifier, the average size of the radius (ε) is 11.8, while the regression predictor uses a constant radius of 15 percentiles. However, in both cases, around 50% of cases are not in the predicted interval.

True plant height percentiles (•) and the intervals predicted (using a 10-fold cross validation) by nd(LR) with selection of features
For plant height, the first attribute selected by the nondeterministic method is 2H(118.9); it is also reported as a relevant (QTL) for this trait by the methods SML, CIM and MR (Bedo et al., 2008). The merits of this attribute can be straightforwardly explained in a discretized phenotype setting. It provides a description of individuals from which it is possible to predict the right class 39% of times. Since the attribute is binary, it is only possible to reach a 40% of successful classifications; notice that it is only possible to produce 2 different predictions while the set of ranks has 5 elements. Therefore, 2H(118.9) is an almost perfect predictor. This classifier is telling us that all (with the exception of 1 sample) barleys with percentiles in (80, 100] have value 0 in 2H(118.9); and all barleys with percentiles in barleys [0, 20] have value 1 in 2H(118.9). This explanation can not be drawn from the report that the correlation of 2H(118.9) and the percentile of plant height is 0.65. In this context, correlation is quite opaque.
The top 5 features selected by our method are the following markers: 2H(118.9), 2H(0.0), 3H(117.4), 3H(74.5), and 5H(65.9). The first, third, and fifth attributes of this list are also reported as relevant (QTL) for this trait by the methods SML, CIM, and MR (Bedo et al., 2008). In general, it seems that a bit more sophisticated selection procedure (instead of a simple linear search) could remove some features from our list of QTLs. But we did not implement such a procedure in order to ensure scalability.
5. Conclusion
Once it is established that there is a genetic basis for a complex behavior expressed by a measurable phenotype, the aim is to explain this relationship in a useful way. In many important cases, phenotypes are represented by continuous numbers and then the genotype/phenotype relations can be typically explained using regression tools.
However, phenotypes are frequently influenced by environmental causes, and the joint action of many genetic features. In these cases, simple regression models provide predictors with poor performance. The approach presented here proposes to handle continuous phenotypes as approximations represented by intervals of percentiles. We have proved, in a real case, that the reliability of interval-valued predictors are considerably higher than that of point-valued regressors. We discussed two implementations of this idea. The first is based on regression with a loss function that uses a tolerance parameter ε. The second one discretizes the target values into a set of k bins of equal frequency, and then employs classifiers. However, the sizes of predicted intervals do not need to be the same for all genotypes, and this option can be incorporated if we use set-valued or nondeterministic classifiers: they may predict wider intervals for doubtful genotypes.
Additionally, handling phenotypes as approximations opens the possibility of using feature selectors of high performance, and whose complexity in terms of computation time is very low. In this sense, we have proved that deterministic classifiers can be delegated to select a set of features for nondeterministic tasks. Therefore, without heuristic jumps, we can use filters designed for deterministic classifiers to obtain a fully scalable method suitable for dealing with large genetic data sets.
Footnotes
Acknowledgments
We acknowledge those researchers who shared the barley data used in this article: Wenzl et al. (2006) and Bedo et al. (
). This work was supported in part by the Spanish Ministerio de Ciencia e Innovacin (grant TIN2008-06247 to O.L., J.R.Q., J.D., J.J.C., and A.B.; and grant AGL2007-65563-C02-01/GAN to M.P.-E.).
Disclosure Statement
No competing financial interests exist.
