Prediction of Rare Single-Nucleotide Causative Mutations for Muscular Diseases in Pooled Next-Generation Sequencing Experiments

Abstract

Next-generation sequencing (NGS) is a recent approach in biomedical research that is useful for the diagnosis of genetic diseases in extremely heterogeneous conditions. In this work, we describe how data generated by high-throughput NGS experiments can be analyzed to find single nucleotide polymorphisms (SNPs) in DNA samples of patients affected by neuromuscular disorders. In particular, we consider untagged pooled NGS data, where DNA samples of different individuals are combined in a single experiment, while still providing information with an uncertainty limited to only two patients. At present, only a few publications address the problem of SNP detection in pooled experiments, and existing tools are often inaccurate. We propose a computational procedure consisting of two parts. In the first, data are filtered by means of decision rules. The second is based on a supervised classification technique. We compare different de facto standard supervised and unsupervised procedures to identify and classify variants potentially related to muscular diseases, and we discuss the results in terms of statistical and biological validation.

1. Introduction

Since its introduction in 2005, next- or second-generation sequencing (NGS) has been revolutionizing genetic and genomic research, overcoming some limitations of capillary electrophoresis (CE)-based Sanger sequencing (Shendure and Ji, 2008). In particular, the ability to perform parallel reactions on a physical support enables rapid DNA sequencing, dramatically increasing the throughput and, at the same time, decreasing the cost (Metzker, 2010). NGS procedures have been used for several different applications, such as de novo sequencing, resequencing, transcriptome analyses, or metagenomics studies. However, one of the most fascinating purposes of NGS is its use for diagnostics and therapeutics. To this aim, the resequencing of the human exome (whole-exome approach) or of specific regions of interest (targeted approach) allows the identification of causative variants for genetic diseases, as widely demonstrated in the literature (Gilissen et al., 2012; Goldstein et al., 2013; Rabbani et al., 2012; Torella et al., 2013). This approach is particularly successful for the diagnosis of complex genetic disorders in which mutations in multiple genes can cause the disease.

Neuromuscular disorders (NMD) are a group of muscular diseases that weaken the musculoskeletal system and hamper locomotion. They represent a perfect example of complex genetically heterogeneous disorders in which, as shown by Kaplan (2012) in his annual table, there are 681 disease phenotypes associated with neuromuscular disorders, 321 causative genes, and 92 mapped loci awaiting gene identification.

In such heterogeneous genetic conditions, about 40% of patients do not obtain a molecular diagnosis by means of a traditional approach, because it would require sequencing a large number of genes, which is expensive and time-consuming. As we will show in the following, NGS is useful for molecular diagnosis (Nigro and Piluso, 2013), because it provides an easy analytic pipeline, together with high specificity and sensitivity. Furthermore, it reduces the overall execution time.

NGS techniques are currently used to sequence single genomes, and the process is still very expensive in terms of cost and time. In studies with a large number of patients, an alternative is to group individuals in pools and sequence their genomes together. This pooling technique can reduce time and cost, although a model is needed to predict whether the single nucleotide polymorphisms (SNPs) are sufficiently covered and still detectable in the experiment. On the other hand, pooled NGS is often theoretically more effective in mutation discovery, providing more accurate allele frequency estimates, as described by Futschik and Schlotterer (2010). Nevertheless, only a few publications address the problem when pooled experiments are considered, and they lack formal mathematical models that can assess the validity of the study. This is the case, for example, of the article by Calvo et al. (2011), who introduce high-throughput, pooled sequencing to identify mutations in NUBPL and FOXRED1 in human complex I deficiency. That study involves seven pools with a total of 103 cases and 42 healthy controls, and it illustrates how large-scale sequencing, coupled with functional prediction and experimental validation, can be used to identify causative mutations in individual cases. Its limitation is that only two genes are considered. The authors were able to confirm mutations in only half of the 103 subjects, which might be related to the lack of a priori experimental design. Wang et al. (2011) sequence pooled mtDNA of multiple individuals to estimate allele frequency using the Illumina Genome Analyzer (GA) II sequencing system. Each pool includes 20 subjects that have been previously sequenced using Sanger sequencing, and each pool is replicated to assess variation of the sequencing error between pools. The proposed technique is not resilient to sequencing errors and thus produces a large number of false positives. Finally, Ding et al. (2012) compare four standard supervised machine learning algorithms to predict causative SNPs in tumor/normal pooled NGS experiments. In order to evaluate these approaches (random forest, Bayesian additive regression tree, support vector machine, and logistic regression), features are constructed to represent 3369 candidate somatic SNPs from 48 breast cancer genomes, originally predicted with naive methods and subsequently revalidated to establish ground truth labels. The solution depends on third-party software packages, and no planning of the experiment has been done.

The aim of this article is to describe an efficient method to detect rare and putative causative mutations in pooled experiments on 98 genes in a population of 128 patients for which a clinical diagnosis of muscular disease was available. The exon regions of the 98 genes have been analyzed by means of the Agilent HaloPlex Target Enrichment system (Agilent, 2014). In a preliminary phase, loci not containing probable mutations are filtered out by means of decision rules. Then, features of 6502 candidate causative mutations are constructed and subsequently used to build a classification model. The last part of the computational procedure uses this model to predict the potential mutations, which are ranked and filtered again with respect to the prediction of SIFT (Ng and Henikoff, 2003), which indicates whether the mutation is likely to damage the encoded protein.

The article is organized as follows. The following section deals with materials and methods for managing NGS data, DNA preparation and protocols, target genes, and methods for SNP discovery. In Section 3, results are described and compared with those obtained by other methods. In Section 4, results are discussed. Finally, in Section 5, some concluding comments and open problems are addressed.

2. Materials and Methods

2.1. DNA preparation and pooling

A total of 128 DNA samples from patients with a clinical diagnosis of neuromuscular disease have been extracted using standard procedures. DNA quality and quantity have been assessed using both spectrophotometric (Nanodrop ND 1000, Thermo Scientific Inc., Rockford, IL) and fluorometry-based (Qubit 2.0 Fluorometer, Life Technologies, Carlsbad, CA) methods. In all, 8 pools of 16 different samples each have been created. Each of the first 8 pools contains a control sample with known mutations. Pools are replicated using two samples from each original pool, for a total of 16 pools. The pool organization is reported in Tables 1 and 2. Each row represents a pool, where samples are numbered from 1 to 128.

Table 1.

Pooling Organization: The Original Pools (1–8)

Pool
#1	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
#2	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31	32
#3	33	34	35	36	37	38	39	40	41	42	43	44	45	46	47	48
#4	49	50	51	52	53	54	55	56	57	58	59	60	61	62	63	64
#5	65	66	67	68	69	70	71	72	73	74	75	76	77	78	79	80
#6	81	82	83	84	85	86	87	88	89	90	91	92	93	94	95	96
#7	97	98	99	100	101	102	103	104	105	106	107	108	109	110	111	112
#8	113	114	115	116	117	118	119	120	121	122	123	124	125	126	127	128

The control samples are indicated in bold.

Table 2.

Pooling Organization: The Replicated Pools (9–16)

Pool	#1	#1	#2	#2	#3	#3	#4	#4	#5	#5	#6	#6	#7	#7	#8	#8
#9	1	2	17	18	33	34	49	50	65	66	81	82	97	98	113	114
#10	3	4	19	20	35	36	51	52	67	68	83	84	99	100	115	116
#11	5	6	21	22	37	38	53	54	69	70	85	86	101	102	117	118
#12	7	8	23	24	39	40	55	56	71	72	87	88	103	104	119	120
#13	9	10	25	26	41	42	57	58	73	74	89	90	105	106	121	122
#14	11	12	27	28	43	44	59	60	75	76	91	92	107	108	123	124
#15	13	14	29	30	45	46	61	62	77	78	93	94	109	110	125	126
#16	15	16	31	32	47	48	63	64	79	80	95	96	111	112	127	128

The control samples are indicated in bold.

Since each pool contains a control sample with a known heterozygous mutation, the control variant is present in 1 out of 32 alleles, that is, approximately 3.12% of reads.
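The pooling layout in Tables 1 and 2 follows a regular pattern, so the two candidate samples behind a variant seen in one original pool and one replicated pool can be recovered by set intersection. The sketch below is illustrative only: the function names are hypothetical, and the index formulas are inferred from the two tables rather than taken from the authors' software.

```python
SAMPLES_PER_POOL = 16
N_ORIGINAL_POOLS = 8


def original_pool(p):
    """Samples in original pool p (1..8), as laid out in Table 1."""
    start = SAMPLES_PER_POOL * (p - 1) + 1
    return set(range(start, start + SAMPLES_PER_POOL))


def replicated_pool(q):
    """Samples in replicated pool q (9..16), as laid out in Table 2.

    Replicated pool 8 + r takes the (2r-1)-th and 2r-th sample of every
    original pool.
    """
    r = q - N_ORIGINAL_POOLS
    samples = set()
    for p in range(1, N_ORIGINAL_POOLS + 1):
        first = SAMPLES_PER_POOL * (p - 1) + 2 * r - 1
        samples.update({first, first + 1})
    return samples


def candidate_samples(p, q):
    """Samples shared by original pool p and replicated pool q."""
    return sorted(original_pool(p) & replicated_pool(q))


# One heterozygous carrier among 16 diploid samples: 1 of 32 alleles.
expected_fraction = 1 / (2 * SAMPLES_PER_POOL)

print(candidate_samples(3, 15))     # [45, 46]
print(f"{expected_fraction:.4%}")   # 3.1250%
```

A variant observed in pools 3 and 15, for instance, can only come from samples 45 or 46, which is the two-patient uncertainty mentioned in the abstract.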

According to the manufacturer's instructions (HaloPlex Target Enrichment System For Illumina Sequencing, Protocol version D, August 2012, Agilent Technologies, Santa Clara, CA), for each pool, 200 ng of genomic DNA has been digested in eight different restriction reactions to create a library of fragments. These fragments have been hybridized for 16 hours to specific probes, which contain index sequences for Illumina sequencing and direct circularization of the target DNA fragments. After the capture of biotinylated target DNA using streptavidin beads, nicks in the circularized fragments have been closed by a ligase. Finally, captured target DNA has been eluted with NaOH and amplified by polymerase chain reaction (PCR). Amplified target molecules have been purified using Agencourt AMPure XP beads (Beckman Coulter Genomics, Bernried am Starnberger See, Germany). Enriched target DNA in each library sample has been validated and quantified by microfluidics analysis using the Bioanalyzer High Sensitivity DNA Assay kit (Agilent Technologies) and the 2100 Bioanalyzer with the 2100 Expert Software.

2.2. Target genes

A literature search has identified the 98 genes used in the present analysis for a total of 486480 bp. All the genes investigated are causative of muscular dystrophies or congenital myopathies and are listed in Kaplan's table (Kaplan, 2012).

2.3. SNP discovery

2.3.1. Decision rules

The proposed computational procedure consists of two parts: a rule-based filtering and a supervised classification algorithm. The former is devised to eliminate the loci where there is no evidence of an SNP, whereas the latter is devised to predict the positions in which there might be a mutation. We decided not to filter out the loci listed in the single nucleotide polymorphism database (dbSNP), because although already known, some of them are not related to specific diseases. In the first step, for each position, the frequencies of the four nucleotides aligned in all reads on that position are computed.

Since some patients can be represented in the experiment with a quantity of DNA greater than the others, we applied the following reasoning. The samples are divided equally into two groups according to their DNA contribution: half of the patients have an average contribution equal to one, and the other half contribute two on average. This situation can be formally represented by means of a mixture of two generalized beta distributions with supports [0.35, 1.65] and [1.35, 2.65], respectively, both with parameters α = 3 and β = 3.

The generalized beta random variable X_j with support [c_j, d_j] and parameters α_j and β_j has probability density function

\begin{align*} f_j(x) = \frac{(x - c_j)^{\alpha_j - 1} (d_j - x)^{\beta_j - 1}}{B(\alpha_j, \beta_j) \, (d_j - c_j)^{\alpha_j + \beta_j - 1}}, \tag{1} \end{align*}

where B(α_j, β_j) is the beta function, which for integer parameters is equal to $\frac{(\alpha_j - 1)! \, (\beta_j - 1)!}{(\alpha_j + \beta_j - 1)!}$, and the factor $(d_j - c_j)^{\alpha_j + \beta_j - 1}$ ensures that the density integrates to one over [c_j, d_j]. The probability density function of a mixture of k beta distributions is

\begin{align*} f(x) = \sum_{j = 1}^{k} p_j \, f_j(x), \tag{2} \end{align*}

where, for $j = 1, \ldots, k$, 0 < p_j < 1 and $\sum\nolimits_{j = 1}^{k} p_j = 1$. The overall mean is equal to 1.5, and the support of the mixture of the above two beta distributions is [0.35, 2.65]. Since a heterozygous mutation is present in approximately 3.12% of reads, if we consider the average contribution of each patient, it is present in approximately 4.7% of reads. Furthermore, since all contributions range from 0.35 to 2.65, the read fraction of a mutation ranges from 1.1% to 8.3%. Hence, fixing an error rate equal to 0.01, a mutation is characterized by one base with a frequency in the interval [1.1%, 8.3%] and two bases below 1%.
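The numbers in this paragraph can be checked numerically. The following is a minimal sketch, assuming equal mixture weights of 0.5 (half of the patients in each group) and a density that includes the factor (d − c)^(α+β−1) in the denominator so that it integrates to one; the helper names are hypothetical.

```python
from math import factorial

ALPHA, BETA = 3, 3
SUPPORTS = [(0.35, 1.65), (1.35, 2.65)]   # supports of the two components
WEIGHTS = [0.5, 0.5]                      # assumption: equal-sized groups


def beta_fn(a, b):
    """Beta function for integer arguments."""
    return factorial(a - 1) * factorial(b - 1) / factorial(a + b - 1)


def gen_beta_pdf(x, c, d, a=ALPHA, b=BETA):
    """Generalized beta density on [c, d], scaled to integrate to one."""
    if not c <= x <= d:
        return 0.0
    return ((x - c) ** (a - 1) * (d - x) ** (b - 1)
            / (beta_fn(a, b) * (d - c) ** (a + b - 1)))


def mixture_pdf(x):
    """Two-component mixture density of the per-sample DNA contribution."""
    return sum(p * gen_beta_pdf(x, c, d) for p, (c, d) in zip(WEIGHTS, SUPPORTS))


def integrate(f, lo, hi, n=20000):
    """Midpoint rule; accurate enough for a smooth density."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


lo, hi = SUPPORTS[0][0], SUPPORTS[1][1]
total = integrate(mixture_pdf, lo, hi)
mean = integrate(lambda x: x * mixture_pdf(x), lo, hi)

het_fraction = 1 / 32   # one heterozygous carrier among 16 samples
bounds = [100 * het_fraction * c for c in (lo, mean, hi)]
print(round(total, 4), round(mean, 4))
print([round(b, 2) for b in bounds])
```

Under these assumptions the mixture mean comes out at 1.5, and scaling the 1/32 allele fraction by the smallest, average, and largest contribution gives roughly 1.1%, 4.7%, and 8.3%, matching the values quoted in the text.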

Finally, we required an average of 20 reads for each individual in each pool covering every locus. Taking into account that the average contribution of DNA of each patient is equal to 1.5, we select only the loci with a coverage of at least 20 × 1.5 × 16 = 480, which we round up to 500. The obtained dataset is then classified by means of supervised techniques.

2.3.2. Classification

The problem of detecting an SNP can be posed as a classification problem. Once a classification model is determined, each position can be tested and classified as a probable mutation or not. The ingredients to build a classification model are a set of positive and negative training examples, described by some features, and a classification algorithm. The next step consists of constructing the features used in the learning phase of the classification procedure. The supervised classification is based on a training set and a testing set. In the training phase, each classifier learns from the data on the basis of some features. In this study, taking into account that the aim is to predict causative mutations, we construct 19 features mainly related to the base frequency, the mapping quality, and the base quality. In detail, for each position we consider the following: the frequency of each base (‘T’, ‘G’, ‘C’, ‘A’); the minimum, the maximum, and the average mapping quality; and the minimum, the maximum, and the average base quality related to each base. The features related to base quality play an important role, as they indicate whether the mutated base is reliable.
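The 19 features (4 base frequencies, 3 mapping-quality statistics, and 3 base-quality statistics for each of the 4 bases) can be assembled as in the sketch below. The input format, the feature order, and the handling of bases that are never observed are assumptions made for illustration; the text does not specify them.

```python
from statistics import mean

BASES = ("T", "G", "C", "A")


def position_features(reads):
    """Build the 19-value feature vector for one locus.

    `reads` is a list of (base, base_quality, mapping_quality) tuples, one per
    read covering the locus (a hypothetical input format).
    """
    if not reads:
        raise ValueError("no reads cover this position")
    coverage = len(reads)

    # 4 features: relative frequency of each base.
    features = [sum(1 for b, _, _ in reads if b == base) / coverage
                for base in BASES]

    # 3 features: minimum, maximum, and average mapping quality.
    mapq = [mq for _, _, mq in reads]
    features += [min(mapq), max(mapq), mean(mapq)]

    # 12 features: minimum, maximum, and average base quality per base.
    for base in BASES:
        quals = [bq for b, bq, _ in reads if b == base]
        # Assumption: a base never observed at this locus gets zeros.
        features += [min(quals), max(quals), mean(quals)] if quals else [0, 0, 0]

    return features


# Toy locus: 95 reads supporting C and 5 reads supporting T.
reads = [("C", 38, 60)] * 95 + [("T", 35, 60)] * 5
vector = position_features(reads)
print(len(vector))   # 19
```

In practice the tuples would be extracted from the pileup of a BAM file at each locus; that step is omitted here.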

A binary classification problem can be formulated as a generalized eigenvalue problem. This idea was first introduced by Mangasarian and Wild (2006) to generalize support vector machines (SVM) (Vapnik, 1982), a de facto standard classification algorithm. This generalization has been devised to overcome some issues related to the computational complexity of SVM. The main idea is to describe each class with a plane (a line in two dimensions) that is as close as possible to the points of that class and as far as possible from the points of the other class (Fig. 1).

FIG. 1.

Example of two classes, A and B, and the planes obtained by generalized eigenvalue proximal SVM (GEPSVM).

Let A and B be the matrices whose rows are the points of the two classes. A and B have the same number n of columns (features) and as many rows as the number of points in the corresponding class. Let x^T ω − γ = 0 be a plane in the n-dimensional feature space, where ω is the vector of the coefficients and γ the intercept, and let e denote a vector of ones of appropriate dimension. The solution to the following optimization problem:

\begin{align*} \min_{\omega, \gamma \neq 0} \frac{\| A\omega - e\gamma \|^2}{\| B\omega - e\gamma \|^2}, \tag{3} \end{align*}

provides the coefficients ω_A and constant γ_A of the plane x^T ω_A − γ_A = 0 describing class A. The minimization of the inverse of the same quotient provides the solution (ω_B, γ_B) for class B. To assign a label to a new point x_u, for which the class is unknown, we use the following formula:

\begin{align*} \mathrm{class}(x_u) = \arg\min_{i \in \{A, B\}} \frac{| \omega_i^T x_u - \gamma_i |}{\| \omega_i \|}, \tag{4} \end{align*}

which assigns the point to the class with the closest plane. The algorithm used to compute the classification model is based on the work of Guarracino et al. (2007), which introduces a regularization technique to speed up computation (ReGEC). Cifarelli et al. (2007) introduce an incremental version of the algorithm, which can be used in the case of a large training set.
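For readers who want to see where the eigenvalue problem comes from, the quotient in Equation (3) can be rewritten as a Rayleigh quotient. The derivation below is a sketch that follows the formulation of Mangasarian and Wild (2006); the symbols G, H, and z are introduced here for illustration, e is taken to be a vector of ones, and the regularization terms used in practice (including the one specific to ReGEC) are omitted.

```latex
\begin{align*}
z &= \begin{bmatrix} \omega \\ \gamma \end{bmatrix}, \qquad
G = [A \;\; -e]^T [A \;\; -e], \qquad
H = [B \;\; -e]^T [B \;\; -e], \\
\frac{\| A\omega - e\gamma \|^2}{\| B\omega - e\gamma \|^2}
  &= \frac{z^T G z}{z^T H z}, \\
G z &= \lambda H z .
\end{align*}
```

When H is positive definite, the stationary points of the quotient are the eigenvectors of this generalized problem. The eigenvector associated with the smallest eigenvalue gives the plane for class A, and the eigenvector associated with the largest eigenvalue gives the plane for class B.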

3. Results

3.1. Benchmark dataset

The experiments were conducted using a benchmark dataset containing 205 positions biologically verified with Sanger sequencing. Among those, 148 were found to contain mutations, whereas the remaining 57 were false positives. The BAM files were obtained by filtering out all reads not covering the selected positions.

3.2. Other tools

In this section, we briefly review four software tools that can be used to analyze pooled NGS data and that are de facto standards in SNP detection. GATK is a software package developed at the Broad Institute to analyze next-generation sequencing data (DePristo et al., 2011; McKenna et al., 2010); it is composed of different software packages, called walkers. Among the walkers offered, the most important are those for variant discovery and genotyping. GATK was designed using the functional programming paradigm of MapReduce (Dean and Ghemawat, 2008).

SNVer is a software tool for calling common and rare variants in pools of individuals (Wei et al., 2011). The software can analyze loci with any (low) coverage. The depth of coverage will be quantitatively taken into account in the detection.

FreeBayes is a Bayesian genetic variant detector software designed to detect small polymorphisms such as SNPs, Indels, and multi-nucleotide polymorphisms (MNPs) smaller than the length of the short reads obtained in the sequencing (Garrison and Marth, 2012).

CRISP uses a cross-pool comparison approach to distinguish sequencing errors from rare variants. It can be used to evaluate pooled sequencing datasets (human and bacterial) generated by the Illumina sequencing platforms (Bansal, 2010).

We check the positions of the benchmark dataset by means of the above software. GATK recognizes 104 of the 148 mutations (true positives) and misses 44 (false negatives). Only 4 of the 57 non-mutation positions are called as mutations (false positives). FreeBayes yields 82 false negatives out of 148 and no false positives. SNVer has 0% specificity, and the accuracy of CRISP is 50%, lower than that of a null classifier. The values of accuracy, sensitivity, and specificity are represented in Figure 2.
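The three measures can be recomputed from the counts given above. In this sketch the true negatives for GATK (57 − 4) and the true positives for FreeBayes (148 − 82) are derived from the stated totals rather than quoted from the text; the function name is hypothetical.

```python
def metrics(tp, fn, fp, tn):
    """Return (accuracy, sensitivity, specificity) from confusion counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity


# Benchmark: 148 verified mutations and 57 verified non-mutations.
counts = {
    "GATK": dict(tp=104, fn=44, fp=4, tn=53),
    "FreeBayes": dict(tp=66, fn=82, fp=0, tn=57),
}

for tool, c in counts.items():
    acc, sens, spec = metrics(**c)
    print(f"{tool}: accuracy={acc:.1%} sensitivity={sens:.1%} specificity={spec:.1%}")
```

For GATK this gives an accuracy of about 77% and a specificity of about 93%, in line with the values reported in the Discussion.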

FIG. 2.

Accuracy, sensitivity and specificity of GATK, SNVer, FreeBayes and CRISP.

3.3. Decision rules results

In the first step of the computational procedure, the 486480 exon positions are filtered by selecting only those with coverage greater than 500, one base with frequency in the interval [1.1%, 8.3%], and two bases with frequency below 1%. In addition, each candidate has to be observed in two pools, one of the original group and one of the replicated group. Since each replicated pool shares exactly two samples with each original pool, this narrows the carrier down to two samples.

In Figure 3, some positions fulfilling and not fulfilling the decision rules are reported. In particular, blue bars are associated with bases whose frequency is lower than 1%. Red bars refer to bases with frequency in the interval [1.1%, 8.3%] and green bars to bases with frequency greater than 8.3%. Hence, a position corresponds to a possible mutation if it is characterized by two blue bars, one red, and one green; is replicated in the two pools; and has coverage greater than 500. In detail, positions 152990583, 152990660, and 152990694 of gene ABCD1 on chromosome X correspond to possible mutations. Positions 152991046 and 152991050 are not selected by the decision rules since both have coverage lower than 500. Furthermore, the latter is also characterized by two green bars, so it does not have two bases with frequencies lower than 1%.

FIG. 3.

Barplots of base frequencies at eight positions of gene ABCD1 on chromosome X.

Three positions associated with possible mutations, according to the above rules, are reported in Table 3. In detail, position 152990560 (reported also in Fig. 3) is characterized by a coverage equal to 555; two bases, G and A, below 1%; and base T equal to 4.9% in pool 3. In addition, this position is replicated in pool 15. An analogous situation is present in position 100326823 of gene AGL (chromosome 1) and position 22271787 of gene ANO5 (chromosome 11). By means of the described filtering rules, 6502 candidate mutations have been selected.

Table 3.

Examples of Positions Associated with Mutations According to the Rules Classification

Position	Gene	Chr	T	G	C	A	Coverage	Pool
152990560	ABCD1	X	27	0	528	0	555	3
152990560	ABCD1	X	52	0	968	1	1021	15
100326823	AGL	1	0	9	0	641	650	2
100326823	AGL	1	3	15	6	1149	1191	14
22271787	ANO5	11	635	6	19	2	677	8
22271787	ANO5	11	692	5	13	2	724	15

The values in the interval [1.1%, 8.3%] are indicated in bold.
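The decision rules can be written as a short filter and applied to the rows of Table 3. This is a minimal sketch with hypothetical function names; the thresholds are those stated in the text, and the requirement of exactly one base above 8.3% follows the "two blue, one red, one green" description of Figure 3.

```python
MIN_COVERAGE = 500          # 20 reads x 1.5 average contribution x 16 samples = 480, rounded up
LOW, HIGH = 0.011, 0.083    # frequency interval for the candidate variant base
ERROR_RATE = 0.01           # bases below this frequency are treated as noise


def passes_rules(counts, coverage):
    """Check one pool at one locus. `counts` maps base -> read count."""
    if coverage <= MIN_COVERAGE:
        return False
    freqs = [n / coverage for n in counts.values()]
    in_interval = sum(LOW <= f <= HIGH for f in freqs)   # "red" bars
    below_error = sum(f < ERROR_RATE for f in freqs)     # "blue" bars
    above = sum(f > HIGH for f in freqs)                 # "green" bars
    return in_interval == 1 and below_error == 2 and above == 1


def candidate(original, replicate):
    """Keep a locus only if it passes in an original and a replicated pool."""
    return passes_rules(*original) and passes_rules(*replicate)


# Rows of Table 3 for ABCD1, position 152990560: (base counts, coverage).
abcd1_pool3 = ({"T": 27, "G": 0, "C": 528, "A": 0}, 555)
abcd1_pool15 = ({"T": 52, "G": 0, "C": 968, "A": 1}, 1021)
print(candidate(abcd1_pool3, abcd1_pool15))   # True
```

The same check returns True for the AGL and ANO5 rows of Table 3, while any locus with coverage of 500 or less is rejected regardless of its base frequencies.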

3.4. Classification results

In order to check the adequacy of the ReGEC algorithm, we use the set of 148 true mutations and 57 nonmutation positions (confirmed previously by biologists) as a training set. By means of a 10-fold cross-validation analysis, we obtain robust estimates of accuracy, sensitivity, and specificity on labeled training data. For this classifier we obtain an accuracy equal to 94.25%, a sensitivity equal to 93%, and a specificity equal to 98%. Furthermore, ReGEC performs better than the other classifiers considered in terms of accuracy, sensitivity, and specificity. In detail, we consider support vector machines using sequential minimal optimization (SMO) (Platt, 1998), Bayesian network classifier (Friedman et al., 1997), k-nearest neighbours (k-NN) (Fix and Hodges, 1951), classification via clustering (simple k-means), multivariable functional interpolation and adaptive networks (Broomhead and Lowe, 1988), simple logistic (Landwehr et al., 2005; Sumner et al., 2005), complement naive Bayes (Rennie et al., 2003), Bayesian logistic regression (Mitchell, 1997), and primal estimated sub-gradient solver for SVM (Pegasos) (Shalev-Shwartz et al., 2007). The results are reported in Figure 4.
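The 10-fold cross-validation protocol can be sketched independently of the classifier. In the example below, the stratified fold assignment, the `fit`/`predict` callables, and the majority-class model are assumptions chosen for illustration; the text does not state how the folds were drawn, and ReGEC itself is not implemented here.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign each item to one of k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def cross_validate(features, labels, fit, predict, k=10):
    """Return pooled (tp, fn, fp, tn) over the k held-out folds."""
    tp = fn = fp = tn = 0
    for held_out in stratified_folds(labels, k):
        test = set(held_out)
        train = [i for i in range(len(labels)) if i not in test]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        for i in held_out:
            guess = predict(model, features[i])
            if labels[i] == 1:
                tp += guess == 1
                fn += guess != 1
            else:
                fp += guess == 1
                tn += guess != 1
    return tp, fn, fp, tn


# 148 mutations (label 1) and 57 non-mutations (label 0); features are dummies.
labels = [1] * 148 + [0] * 57
features = [[0.0]] * len(labels)

# Null classifier: always predict the majority class of the training fold.
fit = lambda X, y: max(set(y), key=y.count)
predict = lambda model, x: model

print(cross_validate(features, labels, fit, predict))   # (148, 0, 57, 0)
```

The majority-class model gives a baseline accuracy of 148/205, about 72%, against which the cross-validated figures of any real classifier can be compared.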

FIG. 4.

Accuracy, sensitivity, and specificity of different classification methods.

By means of ReGEC we predict positions related to possible mutations. The set of 148 true mutations and 57 nonmutation positions is used as a training set. The 6502 positions obtained by decision rules are tested, and 1300 are predicted as mutations. We consider only the loci corresponding to mutations in both pools. These predicted mutations are ranked using their distances from the plane of the class to which they are assigned. In further detail, for each position x_u we construct the following rank:

\begin{align*} \mathrm{rank} = \frac{\dfrac{| \omega_{+}^T x_u - \gamma_{+} |}{\| \omega_{+} \|}}{\dfrac{| \omega_{-}^T x_u - \gamma_{-} |}{\| \omega_{-} \|}}, \tag{5} \end{align*}

where the numerator is the distance of x_u from the plane of the class to which it is assigned (+) and the denominator is the distance from the plane of the other class (−). The ranked list is used by biologists to validate experimentally whether loci are true mutations.
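The ranking in Equation (5) only needs the two planes and the feature vector of each position. The sketch below uses made-up two-feature planes and loci purely for illustration; in the actual procedure the planes would come from the trained classifier and the vectors would have 19 components.

```python
from math import sqrt


def plane_distance(x, w, gamma):
    """Euclidean distance of point x from the plane x^T w - gamma = 0."""
    dot = sum(xi * wi for xi, wi in zip(x, w))
    return abs(dot - gamma) / sqrt(sum(wi * wi for wi in w))


def rank_score(x, plane_own, plane_other):
    """Ratio of Equation (5); smaller means closer to the assigned class.

    Undefined when x lies exactly on the plane of the other class.
    """
    return plane_distance(x, *plane_own) / plane_distance(x, *plane_other)


# Hypothetical planes, given as (w, gamma).
plane_own = ([1.0, 0.0], 1.0)     # x1 = 1
plane_other = ([0.0, 1.0], 3.0)   # x2 = 3
points = {"locus_a": [1.5, 1.0], "locus_b": [1.1, 0.0], "locus_c": [2.0, 2.0]}

ranked = sorted(points, key=lambda name: rank_score(points[name], plane_own, plane_other))
print(ranked)   # ['locus_b', 'locus_a', 'locus_c']
```

Sorting in increasing order puts first the positions that lie close to their own plane and far from the other one, which is the order in which they would be handed over for validation.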

3.5. SIFT

In this section, a structural analysis of the predicted mutations is conducted using SIFT (sorting intolerant from tolerant). SNP studies identify amino acid substitutions in protein-coding regions, and each substitution has the potential to affect protein function. SIFT is a program that predicts whether an amino acid substitution affects protein function. It can distinguish between functionally neutral and deleterious amino acid changes in mutagenesis studies and in human polymorphisms (for more details, see Ng and Henikoff, 2003). Among the 1300 positions associated with possible mutations, 600 have been identified by SIFT as 100% damaging, that is, as possibly causative mutations.

3.6. Detection rate

Biologists check these mutations following the order of ranking and considering only exon positions. So far, 142 positions, none of which were used in the training set, have been analyzed. In detail, 114 mutations have been confirmed, including the control mutations, and 28 mutations have not been confirmed (false positives). This implies a detection rate equal to 80.3%, as reported in Figure 5.

FIG. 5.

Detection rate.

The positive predictive value, that is, the number of true positives divided by the sum of true and false positives, is reported on the y-axis; the x-axis represents the number of positions classified as mutations.
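A curve of this kind can be computed by walking down the ranked list and updating the positive predictive value after each validated position. In the sketch below only the totals (114 confirmed, 28 not confirmed) come from the text; the order of the outcomes is made up for illustration, and the function name is hypothetical.

```python
def cumulative_ppv(outcomes):
    """PPV after each validated position, in ranking order.

    `outcomes` is a list of booleans: True if the position was confirmed.
    """
    confirmed = 0
    curve = []
    for n, ok in enumerate(outcomes, start=1):
        confirmed += ok
        curve.append(confirmed / n)
    return curve


# Hypothetical ordering with 114 confirmed and 28 unconfirmed positions.
outcomes = [True] * 100 + [True, False] * 14 + [False] * 14
curve = cumulative_ppv(outcomes)
print(f"{curve[-1]:.1%}")   # 80.3%
```

The last point of the curve is 114/142, the overall detection rate; the earlier points show how the rate evolves as lower-ranked positions are included.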

4. Discussion

In this article we consider pooled next-generation sequencing data. We propose a computational procedure to identify and classify variants potentially related to human diseases. This procedure consists of two parts. The first step involves decision rules, and in the second step, we use a supervised classifier; we focus on causative mutations for muscular diseases. Taking into account the distribution of each patient's contribution to a pool and the percentage of reads associated with a mutation, we construct decision rules for filtering the 486480 loci. We obtain 6502 loci corresponding to possible mutations.

Using the set of 148 biologically verified mutations and 57 nonmutation positions as the benchmark dataset, we compute the accuracy, sensitivity, and specificity of the de facto standard SNP software: GATK, SNVer, FreeBayes, and CRISP. The best accuracy of these methods on the benchmark dataset is 77% (GATK), with a specificity of 93%. In order to check the adequacy of the ReGEC classifier, we compare different classification algorithms by means of a cross-validation procedure using the training set previously mentioned. Results show that among eleven de facto standard supervised algorithms, ReGEC has the highest accuracy, sensitivity, and specificity. By means of the ReGEC algorithm, we test the 6502 positions obtained in the filtering step, and we predict 1300 possible mutations. These predicted mutations are ranked using their distances from the plane of the class to which they are assigned. Predicted mutations closer to the class of training mutations are more similar to these and therefore have a higher probability of being SNPs. These positions are then tested with SIFT to predict whether a mutation would damage protein structure and therefore affect protein function, reducing the total to 600 possibly causative mutations. The biological checking of these mutations follows the ranking order; so far, 114 mutations have been confirmed, including the control mutations, and 28 mutations have not been confirmed (false positives).

Results show that when biological evidence exists of loci containing true and false SNP, these data can be used to build a supervised classification model predicting mutations with high detection rate. Furthermore, this method also provides a ranked list of possible SNPs, thus providing a useful tool to help decide where to focus attention first. Overall, since sequencing pooled samples is certainly a cost-effective method to detect variants, the bioinformatic tool described in this article provides a powerful procedure to this aim. In particular, this methodology represents a robust option for the analysis of a high number of samples in genetically heterogeneous diseases.

5. Concluding Remarks

Since standard tools are not appropriate to discover SNPs when pooled NGS experiments are considered, we propose a computational procedure to identify and classify variants potentially related to human diseases. This procedure is divided into two phases: first, the loci are filtered by means of decision rules, and second, positions related to possible mutations are predicted by using a supervised classifier. In the near future, it will be interesting to construct more features related to the data and to find integrated software solutions.

Acknowledgments

The authors would like to thank T. Giugliano, M. Iacomino, A. Torella, A. Garofalo, C. Pisano, F. Del Vecchio Blanco, and G. Piluso (Department of Biochemistry, Biophysics and General Pathology, Second University of Naples); M. Mutarelli, V. Singh Marwah, and M. Dionisi (TIGEM); and the Italian LGMD network. This work has been partially funded by the Italian Flagship project Interomics and by project PON02_00619.

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

Agilent. 2014. Next-Gen sequencing: Advancing sequencing for a better world. Agilent Technologies target enrichment solutions. Available at www.agilent.com/genomics/ngs

Bansal. 2010. A statistical method for the detection of variants from next-generation resequencing of DNA pools. Bioinformatics, 26, 318–324.

Broomhead and Lowe. 1988. Multivariable functional interpolation and adaptive networks. Complex Systems, 2, 321–355.

Calvo, Tucker, Compton, et al. 2011. High-throughput, pooled sequencing identifies mutations in NUBPL and FOXRED1 in human complex I deficiency. Nature Genetics, 42, 851–860.

Cifarelli, Guarracino, Seref, et al. 2007. Incremental classification with generalized eigenvalues. J. Classif., 24, 205–219.

Dean and Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51, 107–113.

DePristo, Banks, Poplin, et al. 2011. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet., 43, 491–498.

Ding, Bashashati, Roth, et al. 2012. Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data. Bioinformatics, 28, 167–175.

Fix and Hodges. 1951. Discriminatory analysis, nonparametric discrimination: Consistency properties. Technical Report 4. USAF School of Aviation Medicine, Randolph Field, Texas.

Friedman, Geiger, and Goldszmidt. 1997. Bayesian network classifiers. Machine Learning, 29, 131–163.

Futschik and Schlotterer. 2010. The next generation of molecular markers from massively parallel sequencing of pooled DNA samples. Genetics, 186, 207–218.

Garrison and Marth. 2012. Haplotype-based variant detection from short-read sequencing. Technical report.

Gilissen, Hoischen, Brunner, and Veltman. 2012. Disease gene identification strategies for exome sequencing. Eur. J. Hum. Genet., 20, 490–497.

Goldstein, Allen, Keebler, et al. 2013. Sequencing studies in human genetics: Design and interpretation. Nat. Rev. Genet., 14, 460–470.

Guarracino, Cifarelli, Seref, and Pardalos. 2007. A classification algorithm based on generalized eigenvalue problems. Optim. Method. Softw., 22, 73–81.

Kaplan. 2012. The 2013 version of the gene table of monogenic neuromuscular disorders (nuclear genome). Neuromuscular Disorders, 22, 1108–1135.

Landwehr, Hall, and Frank. 2005. Logistic model trees. Machine Learning, 59, 161–205.

Mangasarian and Wild. 2006. Multisurface proximal support vector machine classification via generalized eigenvalues. IEEE Trans. Pattern Anal. Mach. Intell., 28, 69–74.

McKenna, Hanna, Banks, et al. 2010. The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res., 20, 1297–1303.

Metzker. 2010. Sequencing technologies: The next generation. Nat. Rev. Genet., 11, 31–46.

Mitchell. 1997. Machine Learning. McGraw Hill, New York.

Ng and Henikoff. 2003. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res., 31, 3812–3814.

Nigro and Piluso. 2013. Next generation sequencing (NGS) strategies for the genetic testing of myopathies. Acta Myol., 31, 196–200.

Platt. 1998. Fast training of support vector machines using sequential minimal optimization. In Schoelkopf, Burges, and Smola, eds. Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA.

Rabbani, Mahdieh, Hosomichi, et al. 2012. Next-generation sequencing: Impact of exome sequencing in characterizing Mendelian disorders. J. Hum. Genet., 57, 621–632.

Rennie, Shih, Teevan, and Karger. 2003. Tackling the poor assumptions of naive Bayes text classifiers. Proceedings of the Twentieth International Conference on Machine Learning, 616–623.

Shalev-Shwartz, Singer, and Srebro. 2007. Pegasos: Primal estimated sub-gradient solver for SVM. 24th International Conference on Machine Learning, 807–814.

Shendure and Ji. 2008. Next-generation DNA sequencing. Nat. Biotechnol., 26, 1135–1145.

Sumner, Frank, and Hall. 2005. Speeding up logistic model tree induction. 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, 675–683.

Torella, Fanin, Mutarelli, et al. 2013. Next-generation sequencing identifies transportin 3 as the causative gene for LGMD1F. PLoS One, 8, e63536.

Vapnik. 1982. Estimation of Dependences Based on Empirical Data [in Russian]. [English translation]: Springer Verlag, New York.

Wang, Pradhan, Ye, et al. 2011. Estimating allele frequency from next-generation sequencing of pooled mitochondrial DNA samples. Frontiers in Genetics, 2.

Wei, Wang, Hu, et al. 2011. SNVer: A statistical tool for variant calling in analysis of pooled or individual next-generation sequencing data. Nucleic Acids Research, 39, 1–13.

Pool
#1	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
#2	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31	32
#3	33	34	35	36	37	38	39	40	41	42	43	44	45	46	47	48
#4	49	50	51	52	53	54	55	56	57	58	59	60	61	62	63	64
#5	65	66	67	68	69	70	71	72	73	74	75	76	77	78	79	80
#6	81	82	83	84	85	86	87	88	89	90	91	92	93	94	95	96
#7	97	98	99	100	101	102	103	104	105	106	107	108	109	110	111	112
#8	113	114	115	116	117	118	119	120	121	122	123	124	125	126	127	128

Pool	#1	#1	#2	#2	#3	#3	#4	#4	#5	#5	#6	#6	#7	#7	#8	#8
#9	1	2	17	18	33	34	49	50	65	66	81	82	97	98	113	114
#10	3	4	19	20	35	36	51	52	67	68	83	84	99	100	115	116
#11	5	6	21	22	37	38	53	54	69	70	85	86	101	102	117	118
#12	7	8	23	24	39	40	55	56	71	72	87	88	103	104	119	120
#13	9	10	25	26	41	42	57	58	73	74	89	90	105	106	121	122
#14	11	12	27	28	43	44	59	60	75	76	91	92	107	108	123	124
#15	13	14	29	30	45	46	61	62	77	78	93	94	109	110	125	126
#16	15	16	31	32	47	48	63	64	79	80	95	96	111	112	127	128
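
The pooling layout tabulated above can be checked programmatically. The following minimal Python sketch is not part of the original pipeline, and its function names are illustrative. It encodes the layout as read from the two tables (pools #1 to #8 hold 16 consecutive patients each; pools #9 to #16 each take one pair of patients from every pool #1 to #8) and assumes that a variant observed in one pool of each group comes from a single carrier.

```python
ROW_POOLS = range(1, 9)      # pools #1-#8: 16 consecutive patients each
COLUMN_POOLS = range(9, 17)  # pools #9-#16: one pair from each of pools #1-#8

def pool_members(pool):
    """Return the set of patient numbers (1-128) assigned to a pool."""
    if pool in ROW_POOLS:
        start = 16 * (pool - 1) + 1
        return set(range(start, start + 16))
    if pool in COLUMN_POOLS:
        offset = 2 * (pool - 9)  # position of the pair inside each row pool
        return {16 * row + offset + d for row in range(8) for d in (1, 2)}
    raise ValueError(f"unknown pool #{pool}")

def pools_of(patient):
    """Return the (pool in #1-#8, pool in #9-#16) pair containing a patient."""
    if not 1 <= patient <= 128:
        raise ValueError(f"unknown patient {patient}")
    row = (patient - 1) // 16 + 1
    column = ((patient - 1) % 16) // 2 + 9
    return row, column

def candidate_patients(row_pool, column_pool):
    """Patients shared by the two pools in which a variant was detected."""
    return sorted(pool_members(row_pool) & pool_members(column_pool))

if __name__ == "__main__":
    print(candidate_patients(3, 12))  # [39, 40]
    print(pools_of(100))              # (7, 10)
```

Every combination of one pool from #1 to #8 and one pool from #9 to #16 shares exactly two patients, which is why a variant seen in both pools can be attributed to a pair of patients and no further.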
