Population Model–Based Inter-Diplotype Similarity Measure for Accurate Diplotype Clustering

Abstract

Classification of individuals' genotype data is important in various kinds of biomedical research. There are many sophisticated clustering algorithms, but most of them require an appropriate similarity measure between the objects to be clustered. Hence, accurate inter-diplotype similarity measures are required for the classification of diplotypes. In this article, we propose a new inter-diplotype similarity measure, which we call the population model-based distance (PMD), so that individuals with diplotype SNPs data (i.e., unphased-diplotypes) can be clustered with higher accuracy. For unphased-diplotypes, the allele sharing distance (ASD) has been the standard measure of the genetic distance between the diplotypes of individuals. To achieve higher clustering accuracy, the PMD makes use of a given population model, which the ASD does not utilize. As the population model, we propose to use a hidden Markov model (HMM)–based model. We call the PMD based on this model the HHD (HIT HMM–based Distance). We demonstrate the impact of the HHD on diplotype classification through comprehensive large-scale experiments over 8930 genome-wide data sets derived from the HapMap SNPs database. The experiments revealed that the HHD enables significantly more accurate clustering than the ASD.

1. Introduction

Single nucleotide polymorphisms (SNPs) are the most fundamental genetic polymorphisms in human genomes (Kim and Misra, 2007), and classification of individuals based on their SNPs data is very useful in various kinds of biomedical research, especially in population genetics and genetic epidemiology (Conrad et al., 2006; Jakobsson et al., 2008). Accurate classification of individual SNPs data will help the study of genotype variations, especially when different genotypes prevail in different populations or subgroups.

There are various sophisticated clustering methods for general data (not limited to SNPs data), many of which (e.g., Ward's method [Team RDC, 2007; Ward, 1963; Ward and Hook, 1963], k-Medoid [Kaufman and Rousseuw, 1990], DBSCAN [Ester et al., 1996], and most of the phylogenetic clustering algorithms such as the well-known neighbor joining method [Saitou and Nei, 1987]) require an appropriate similarity measure between target objects. Designing an accurate similarity measure for the objects to be clustered is essential for these similarity-based clustering algorithms.

For SNPs data, various algorithms have been proposed for clustering haplotypes (i.e., haplotype-alleles, not diplotypes),1 and various types of similarity measures have been proposed for haplotype data (Jin et al., 2010; Li and Jiang, 2005; Li et al., 2006).2 But the human genome is diallelic, and in many cases we observe only the unordered (i.e., unphased) pair of alleles at each locus, instead of ordered (i.e., phased) allele data, because of the high cost of resolving unphased allele data into accurate phased data. In this article, we call a phased pair of haplotypes a “haplotype-diplotype,” and we call an unphased pair of haplotypes an “unphased-diplotype.”

Much work has been done on clustering unphased-diplotype data. The existing methods can be categorized into two types: distance-based methods (Bowcock et al., 1994; Gao and Starmer, 2007) and statistics-based methods (Falush et al., 2003; Pritchard et al., 2000). The distance-based methods utilize a distance measure between two objects, while the statistics-based methods are based on the statistical behavior of objects. In this article, we focus on the distance-based clustering methods for unphased-diplotype data. Most previous distance-based methods utilize a similarity measure called the allele sharing distance (ASD) (Gao and Martin, 2009; Jakobsson et al., 2008; Mao et al., 2007; Witherspoon et al., 2007) (see Section 2.1.1). The ASD is a simple and straightforward extension of the Hamming distance, and it is the most standard and frequently used similarity measure between a pair of unphased-diplotypes.

In genetic analysis, it is very important to consider properties that differ among genetically distinct populations (Beaty et al., 2005; Fallin et al., 2001; Witherspoon et al., 2007). The same should hold when designing similarity measures for unphased-diplotypes. But the ASD does not utilize any population information in obtaining the similarity values. Thus, in this article, we first propose a new similarity measure for unphased-diplotypes called the population model-based distance (PMD), which incorporates population information from an appropriate population model. As the model, we propose to use a hidden Markov model (HMM)–based model predicted by a standard HMM-based phasing software called HIT (Rastas et al., 2005). We call the PMD based on this model the HHD (the HIT HMM-based distance). We show the superiority of the HHD over the previous standard ASD through comprehensive experiments over the genome-wide HapMap data (International HapMap Consortium, 2005).

The organization of this article is as follows. In Section 2, we describe previous work on which our method is based. In Section 3, we describe our new measure. In Section 4, we compare the ASD and the HHD through comprehensive experiments over large-scale HapMap data sets to evaluate the impact of the HHD. In Section 5, we conclude.

1.1. Notations and definitions

We assume all SNPs are diallelic. We consider n diplotypes over m SNP loci from the same chromosome. These loci are numbered $1, 2, \cdots, m$ in the physical order. A SNP-allele for a SNP locus is an element of the set ${\cal S} = \{1, 0\}$, where 1 and 0 denote the major and minor SNP-alleles, respectively. A haplotype-allele is a sequence of SNP-alleles and is represented by a sequence in ${\cal S}^m$ (e.g., $10101 \in {\cal S}^5$). A SNP-diplotype for a SNP locus is an unordered pair of SNP-alleles in ${\cal D} = {\cal S} \times {\cal S}$ (e.g., $\{0, 1\} \in {\cal D}$). An unphased-diplotype is a sequence of SNP-diplotypes and is represented by a sequence in ${\cal D}^m$ (e.g., $\{1, 0\} - \{0, 0\} - \{1, 0\} - \{1, 1\} - \{1, 0\} \in {\cal D}^5$). Given unphased-diplotypes, the phasing problem is to find the most probable corresponding haplotype-allele pairs that could have generated the unphased-diplotypes. A phased haplotype-allele pair is called a haplotype-diplotype (e.g., {10010, 00111}).

2. Previous Work

In this section, we describe previous work on which our work is based. In Section 2.1, we describe the definitions of measures in previous work (e.g., the ASD). In Section 2.2, we describe the HIT algorithm on which our new distance measure is based. In Section 2.3, we describe a clustering algorithm and an evaluation method for clustering that we will use in the experiments in Section 4.

2.1. Previous measures for inter-individual genetic distances

2.1.1. Allele sharing distance

The most standard inter-diplotype distance is the ASD (Gao and Martin, 2009; Jakobsson et al., 2008; Mao et al., 2007; Witherspoon et al., 2007). For two unphased-diplotypes ${\bf g}, {\bf g}^{\prime} \in {\cal D}^m$ (i.e., m is the number of SNP loci), the ASD between the diplotypes g and g′ is defined as follows:

$$D({\bf g}, {\bf g}^{\prime}) = \frac{1}{2m} \sum_{\ell = 1}^{m} d({\bf g}[\ell], {\bf g}^{\prime}[\ell]), \tag{1}$$

where g[ℓ] denotes the ℓ-th SNP-diplotype of unphased-diplotype g, and d(g[ℓ],g′[ℓ]) is the number of SNP-alleles which are not shared between g and g′ at the ℓ-th locus.
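Equation (1) can be sketched in a few lines of Python. The sketch is illustrative only and is not taken from the original implementation: the names `unshared_alleles` and `asd` are hypothetical, each SNP-diplotype is encoded as a pair of 0/1 alleles, and d is computed as the absolute difference in the number of major alleles, which is one straightforward reading of "the number of SNP-alleles not shared" at a diallelic locus.

```python
from typing import Sequence, Tuple

# Assumed encoding: one unordered pair of SNP-alleles (each 0 or 1) per locus.
SnpDiplotype = Tuple[int, int]


def unshared_alleles(x: SnpDiplotype, y: SnpDiplotype) -> int:
    """d(g[l], g'[l]): number of alleles not shared at one diallelic locus.

    With alleles coded 0/1, this equals the difference in the count of 1s.
    """
    return abs(sum(x) - sum(y))


def asd(g: Sequence[SnpDiplotype], g_prime: Sequence[SnpDiplotype]) -> float:
    """Allele sharing distance of Equation (1)."""
    if len(g) != len(g_prime) or len(g) == 0:
        raise ValueError("both diplotypes must cover the same, non-empty set of loci")
    m = len(g)
    return sum(unshared_alleles(x, y) for x, y in zip(g, g_prime)) / (2 * m)


if __name__ == "__main__":
    g = [(1, 0), (0, 0), (1, 0), (1, 1), (1, 0)]   # example from Section 1.1
    all_major = [(1, 1)] * 5                       # hypothetical comparison diplotype
    print(asd(g, all_major))  # per-locus d = 1, 2, 1, 0, 1, so 5 / 10 = 0.5
```

Because only allele counts enter the computation, the value does not depend on the order in which the two alleles of a locus are written.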

2.1.2. Haplotype similarity measure

The most common and simplest measure of the similarity between DNA sequences, including haplotype-allele data, is the Hamming distance (Cover and Thomas, 1991; Isaev, 2004; Lesk, 2005; Li and Jiang, 2005; Tzeng et al., 2003). For a haplotype-allele ${\bf h} \in {\cal S}^m$ (where m is the length of h), let h[k] denote the SNP-allele at the k-th locus of h. The Hamming distance between two haplotype-alleles h and h′ is defined as

$$s({\bf h}, {\bf h}^{\prime}) = \sum_{k = 1}^{m} I({\bf h}[k], {\bf h}^{\prime}[k]), \tag{2}$$

where I(a, b) = 0 if a = b and I(a, b) = 1 otherwise. As the Hamming distance is length-dependent, we define the following A(h, h′) as a length-independent distance between haplotype-alleles h and h′:

$$A({\bf h}, {\bf h}^{\prime}) = \frac{s({\bf h}, {\bf h}^{\prime})}{m}. \tag{3}$$

2.2. HIT algorithm

The Haplotype Inference Technique (HIT) algorithm (Rastas et al., 2005) is an HMM-based algorithm for phasing unphased-diplotypes. The algorithm utilizes the HMM (Rabiner and Juang, 1986). The HMM of the HIT is designed to simulate a set of multiple ancestors (i.e., founders).3 The HMM is trained from a set of unphased-diplotypes in an unsupervised way with the EM algorithm (Durbin et al., 1998). Figure 1 shows the HMM used in the HIT. The HIT algorithm phases an unphased-diplotype by heuristically finding the haplotype-diplotype with the highest emission probability from the HMM.

FIG. 1.

The HMM model of the HIT. In the HMM, a set of nodes in a row corresponds to states of one founder (i.e., ancestor) haplotype-allele. A set of nodes in a column corresponds to states of one locus. Each node (except for the start and end nodes) emits 1 or 0 with some estimated probabilities, which correspond to the major and minor alleles respectively. A path from the start node to the end node corresponds to a haplotype-allele. The HMM emits a haplotype-diplotype as an unordered pair of two paths from the start node to the end node, randomly based on the probabilities estimated for edges. The observers can only see the unphased-diplotype that corresponds to the emitted haplotype-diplotype.

2.3. Clustering methods

In this section, we describe the clustering method and the method for evaluating the results, which we will use in Section 4.

2.3.1. Ward's method

We use Ward's minimum variance algorithm (Team RDC, 2007; Ward, 1963; Ward and Hook, 1963), a widely used hierarchical clustering method, to infer clusters in Section 4 based on the ASD or the HHD.4 Given n items $I_1, I_2, \cdots, I_n$, a distance matrix {w_ij} where w_ij denotes the distance between I_i and I_j, and some fixed positive integer k (k < n), Ward's method clusters the n items into k clusters in n − k merging steps.5 At first, the algorithm considers n clusters, each of which contains only one item, i.e., ${\cal C}_1 = \{\{I_1\}, \{I_2\}, \cdots, \{I_n\}\}$. The algorithm then reduces the number of clusters by one in each step as follows. In the m-th step of the algorithm (here m indexes the steps, not the SNP loci), two clusters are merged into one cluster so as to minimize $\sum\nolimits_{C \in {\cal C}_{m + 1}} \sum\nolimits_{I_i, I_j \in C} w_{ij}^2 / |C|$, where ${\cal C}_i$ denotes the set of clusters before the i-th step of the algorithm. This bottom-up approach is repeated until $|{\cal C}_m| = k$.
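The merging rule described above can be illustrated with a deliberately naive Python sketch. It is not the implementation used in the experiments (which rely on an existing statistical package); the function names `within_cost` and `ward_clusters` are hypothetical, items are referred to by their indices 0..n−1, and every candidate merge is evaluated exhaustively, so the sketch is only suitable for small n.

```python
from itertools import combinations
from typing import List, Sequence


def within_cost(cluster: Sequence[int], w: Sequence[Sequence[float]]) -> float:
    """Sum of squared pairwise distances inside one cluster, divided by |C|."""
    return sum(w[i][j] ** 2 for i, j in combinations(cluster, 2)) / len(cluster)


def ward_clusters(w: Sequence[Sequence[float]], k: int) -> List[List[int]]:
    """Greedy bottom-up clustering from a distance matrix w into k clusters.

    Each step merges the pair of clusters whose union gives the smallest
    total cost. Only the two merged clusters change, so it is enough to
    compare the cost increase caused by each candidate merge.
    """
    n = len(w)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    clusters: List[List[int]] = [[i] for i in range(n)]
    while len(clusters) > k:
        best = None  # (cost increase, index a, index b) with a < b
        for a, b in combinations(range(len(clusters)), 2):
            merged = clusters[a] + clusters[b]
            increase = (within_cost(merged, w)
                        - within_cost(clusters[a], w)
                        - within_cost(clusters[b], w))
            if best is None or increase < best[0]:
                best = (increase, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    points = [0.0, 1.0, 10.0, 11.0]  # hypothetical one-dimensional items
    w = [[abs(p - q) for q in points] for p in points]
    print(ward_clusters(w, 2))
```

Starting from n singleton clusters, the loop performs exactly n − k merges. Summing over unordered pairs instead of ordered pairs only rescales the objective by a constant factor and therefore does not change which merge is selected.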

2.3.2. How to evaluate the clustering results

To evaluate the clustering results, we use the classification error rate (CER) (Gao and Starmer, 2007). The CER is the fraction of elements that are assigned to incorrect clusters in a clustering result. To decide whether an assignment is correct, we need to know the label of each cluster, but Ward's algorithm does not assign any labels to its output clusters. In the experiments, we therefore evaluate a clustering result by the minimum CER over all possible assignments of the population labels to the clusters.
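The minimum CER over label assignments can be sketched as follows. This is an illustrative brute-force version under stated assumptions, not the evaluation code used in the experiments: the function name `min_cer` is hypothetical, each cluster is assigned a distinct population label, and all assignments are tried exhaustively, which is practical only for a small number of populations.

```python
from itertools import permutations
from typing import Hashable, Sequence


def min_cer(true_labels: Sequence[Hashable], cluster_ids: Sequence[Hashable]) -> float:
    """Minimum classification error rate over all cluster-to-label assignments."""
    if len(true_labels) != len(cluster_ids) or len(true_labels) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    populations = sorted(set(true_labels), key=repr)
    clusters = sorted(set(cluster_ids), key=repr)
    if len(clusters) > len(populations):
        raise ValueError("this sketch assumes no more clusters than populations")
    best_errors = len(true_labels)
    # Try every way of giving each cluster a distinct population label.
    for assigned in permutations(populations, len(clusters)):
        label_of = dict(zip(clusters, assigned))
        errors = sum(label_of[c] != t for c, t in zip(cluster_ids, true_labels))
        best_errors = min(best_errors, errors)
    return best_errors / len(true_labels)


if __name__ == "__main__":
    truth = ["A", "A", "A", "B", "B", "B"]  # hypothetical population labels
    found = [1, 1, 0, 0, 0, 0]              # hypothetical clustering output
    print(min_cer(truth, found))            # one misplaced individual out of six
```

Because the minimum is taken over assignments, the result does not depend on how the clustering algorithm happens to number its clusters.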

3. New Unphased-Diplotype Distance Measures

In this section, we first propose in Section 3.1 a new measure for the distance between two unphased-diplotypes, the PMD. The PMD is a general concept of distance measures, and we will give an example of the PMD which we call the HHD in Section 3.2. In Section 3.3, we discuss the properties of the proposed measures.

3.1. Population model–based distance

Before defining our new measure called the PMD, we first extend the haplotype similarity measure described in Section 2.1.2 so that we can deal with the distances between two haplotype-diplotypes instead of haplotype-alleles, as follows. Let a = {h₁, h₂} and $a^{\prime} = \{{\bf h}_1^{\prime}, {\bf h}_2^{\prime}\}$ be haplotype-diplotypes to be compared, where ${\bf h}_1, {\bf h}_2, {\bf h}_1^{\prime}, {\bf h}_2^{\prime} \in {\cal S}^m$. We define the distance between haplotype-diplotypes a and a′ as

$$H(a, a^{\prime}) = \min \left\{ \frac{A({\bf h}_1, {\bf h}_1^{\prime}) + A({\bf h}_2, {\bf h}_2^{\prime})}{2}, \frac{A({\bf h}_1, {\bf h}_2^{\prime}) + A({\bf h}_2, {\bf h}_1^{\prime})}{2} \right\}, \tag{4}$$

where A is the haplotype similarity measure defined in Section 2.1.2. But we cannot compute this value for unphased-diplotypes, as we cannot know the actual haplotype-diplotypes. To enable it, we extend the above haplotype-diplotype distance H to unphased-diplotypes by utilizing some given population model ${\cal M}$ as follows.

For any unphased-diplotype, we can enumerate the corresponding haplotype-diplotype candidates.6 For example, there are four haplotype-diplotype candidates for the unphased-diplotype {1, 0} − {1, 0} − {1, 0}, i.e., {111, 000}, {110, 001}, {101, 010}, and {011, 100}. For unphased-diplotypes ${\bf g}, {\bf g}^{\prime} \in {\cal D}^m$, let c_i = {h_i1, h_i2} (1 ≤ i ≤ M) and $c_j^{\prime} = \{{\bf h}_{j1}^{\prime}, {\bf h}_{j2}^{\prime}\}$ (1 ≤ j ≤ M′) be the i-th and the j-th candidate haplotype-diplotypes for g and g′, respectively, where M and M′ are the numbers of haplotype-diplotype candidates for g and g′, respectively.
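The enumeration of candidates can be sketched in Python as below. The sketch is illustrative only: the function name `enumerate_candidates` and the string encoding of haplotype-alleles are assumptions made here. It relies on the fact that only heterozygous loci are ambiguous, and it fixes the allele of the first heterozygous locus in the first haplotype so that each unordered pair is produced exactly once.

```python
from itertools import product
from typing import List, Sequence, Tuple

# Assumed encoding: one (allele, allele) pair per locus, alleles coded 0/1.
Unphased = Sequence[Tuple[int, int]]
HapDip = Tuple[str, str]  # a haplotype-diplotype; the order of the two strings is irrelevant


def enumerate_candidates(g: Unphased) -> List[HapDip]:
    """List every haplotype-diplotype consistent with the unphased-diplotype g."""
    het = [i for i, (x, y) in enumerate(g) if x != y]
    h1 = [x for x, _ in g]  # homozygous loci are already determined
    h2 = [y for _, y in g]
    if het:
        first, rest = het[0], het[1:]
        h1[first], h2[first] = 1, 0  # break the symmetry between the two haplotypes
    else:
        rest = []
    candidates = []
    for bits in product((1, 0), repeat=len(rest)):
        for locus, bit in zip(rest, bits):
            h1[locus], h2[locus] = bit, 1 - bit
        candidates.append(("".join(map(str, h1)), "".join(map(str, h2))))
    return candidates


if __name__ == "__main__":
    for pair in enumerate_candidates([(1, 0), (1, 0), (1, 0)]):
        print(pair)
```

With k ≥ 1 heterozygous loci the function returns 2^(k−1) candidates, and a fully homozygous unphased-diplotype has a single candidate, so the number of candidates grows exponentially with the number of heterozygous loci.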

If we are given a population model ${\cal M}$, we can compute the probability $Prob(c \mid {\bf g}, {\cal M})$ that a haplotype-diplotype candidate c is correct for the unphased-diplotype data g. Let $p_i = Prob(c_i \mid {\bf g}, {\cal M})$ and $p_j^{\prime} = Prob(c_j^{\prime} \mid {\bf g}^{\prime}, {\cal M})$ be the conditional probabilities of the candidate haplotype-diplotypes c_i and $c_j^{\prime}$ under the model ${\cal M}$. Then the $PMD_{\cal M}$ between two unphased-diplotypes g and g′ is defined as follows:

$$PMD_{\cal M}({\bf g}, {\bf g}^{\prime}) = \sum_{i = 1}^{M} \sum_{j = 1}^{M^{\prime}} H(c_i, c_j^{\prime}) \cdot q_i \cdot q_j^{\prime}, \tag{5}$$

where $q_i = p_i / (\sum\nolimits_{k = 1}^{M} p_k)$ and $q_j^{\prime} = p_j^{\prime} / (\sum\nolimits_{k = 1}^{M^{\prime}} p_k^{\prime})$. q_i and $q_j^{\prime}$ are the normalized predicted conditional probabilities of the candidate haplotype-diplotypes c_i and $c_j^{\prime}$, respectively.7 Note that the PMD is the expected value of the distance between candidate haplotype-diplotypes, $H(c_i, c_j^{\prime})$, under the population model ${\cal M}$.
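Putting Equations (3) to (5) together, the PMD can be sketched in Python as below. This is a minimal illustration under stated assumptions rather than the authors' implementation: all function names are hypothetical, the population model is represented simply as a callable that returns an unnormalized probability for a candidate, and the example model `frequency_model` (product of haplotype frequencies) with its frequency table is invented here for demonstration.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Unphased = Sequence[Tuple[int, int]]  # one (allele, allele) pair per locus
HapDip = Tuple[str, str]              # haplotype-diplotype as two 0/1 strings


def normalized_hamming(h: str, h2: str) -> float:
    """A(h, h') of Equation (3)."""
    return sum(x != y for x, y in zip(h, h2)) / len(h)


def hap_dip_distance(a: HapDip, b: HapDip) -> float:
    """H(a, a') of Equation (4): the better of the two haplotype pairings."""
    straight = (normalized_hamming(a[0], b[0]) + normalized_hamming(a[1], b[1])) / 2
    crossed = (normalized_hamming(a[0], b[1]) + normalized_hamming(a[1], b[0])) / 2
    return min(straight, crossed)


def enumerate_candidates(g: Unphased) -> List[HapDip]:
    """All haplotype-diplotypes consistent with g, each unordered pair once."""
    het = [i for i, (x, y) in enumerate(g) if x != y]
    h1, h2 = [x for x, _ in g], [y for _, y in g]
    rest = het[1:]
    if het:
        h1[het[0]], h2[het[0]] = 1, 0
    out = []
    for bits in product((1, 0), repeat=len(rest)):
        for locus, bit in zip(rest, bits):
            h1[locus], h2[locus] = bit, 1 - bit
        out.append(("".join(map(str, h1)), "".join(map(str, h2))))
    return out


def normalized_weights(g: Unphased, prob: Callable[[HapDip], float]):
    """Candidates of g paired with their normalized probabilities q."""
    cands = enumerate_candidates(g)
    p = [prob(c) for c in cands]
    total = sum(p)
    if total <= 0:
        raise ValueError("the model gives zero probability to every candidate")
    return [(c, x / total) for c, x in zip(cands, p)]


def pmd(g: Unphased, g_prime: Unphased, prob: Callable[[HapDip], float]) -> float:
    """Equation (5): expected H over candidate pairs under the model."""
    left, right = normalized_weights(g, prob), normalized_weights(g_prime, prob)
    return sum(hap_dip_distance(c, c2) * q * q2 for c, q in left for c2, q2 in right)


def frequency_model(freqs: Dict[str, float]) -> Callable[[HapDip], float]:
    """Hypothetical model: candidate weight = product of haplotype frequencies."""
    return lambda c: freqs.get(c[0], 0.0) * freqs.get(c[1], 0.0)


if __name__ == "__main__":
    # Invented frequency table; only these five haplotype-alleles occur.
    model = frequency_model({h: 0.2 for h in ("1011", "0110", "1101", "1111", "1000")})
    g1 = [(1, 0), (1, 0), (1, 1), (1, 0)]
    g2 = [(1, 1), (1, 0), (1, 0), (1, 0)]
    print(pmd(g1, g2, model))
```

When the model concentrates all probability on a single candidate for each unphased-diplotype, the double sum collapses to one term and the PMD coincides with H for those candidates. Since every candidate pair is visited, the cost of this direct evaluation is proportional to M · M′.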

3.2. HIT HMM-based Distance

To compute the PMD in Section 3.1, we need an appropriate model for the population. In the following, we propose an example of the PMD that we call the HHD.8 To define the HHD, we propose to use the HMM model used in the HIT algorithm (Rastas et al., 2005) (described in Section 2.2) as the population model for the PMD as follows.

The HMM defined in the HIT algorithm can be considered as a predicted population model. Thus, we first train the HMM from all the available unphased-diplotype data, and then we define the HHD as follows. Let ${\cal M}^*$ denote the HMM obtained with the HIT. Then we define the HHD as

$$HHD({\bf g}, {\bf g}^{\prime}) = PMD_{{\cal M}^*}({\bf g}, {\bf g}^{\prime}). \tag{6}$$

Note that the probability of each haplotype-diplotype candidate is computed as the conditional emission probability of the candidate from the HMM, which can be computed by the forward algorithm (Durbin et al., 1998) for the HMM.

3.3. Discussions on the PMD

3.3.1. The PMD and the multiple founder hypothesis

In many regions (especially in important regions) of the human genome, the haplotype-alleles of the majority in populations can be categorized into a small number of types (Bhatia et al., 2010; Cirulli and Goldstein, 2010), which suggests that only a small number of founder (or ancestral) haplotype-alleles spread over the population in those regions. This hypothesis of the existence of (a few but) multiple founder haplotype-alleles is very important and effective for various kinds of research, for example, the design of linkage disequilibrium mapping experiments (Chung et al., 2008; Gonzalez et al., 1999; Haiman et al., 2003) and the analysis of the evolutionary history of populations (Ahmad et al., 2002; Gaudieri et al., 1997).

The PMD reflects the existence of the founder haplotype-alleles well. In the example given in Figure 2, there are three individuals with haplotype-diplotypes a = {1011, 0110}, b = {1101, 0110}, and c = {1111, 1000}, but we assume that we know only the unphased-diplotypes, i.e., {1, 0} − {1, 0} − {1, 1} − {1, 0}, {1, 0} − {1, 1} − {1, 0} − {1, 0}, and {1, 1} − {1, 0} − {1, 0} − {1, 0}, respectively. We can easily see that the ASD between any two of these three individuals is 0.25 (Table 1(1)), and therefore we cannot cluster these three individuals based on the ASD.
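As a check of this value for the pair a and b, Equation (1) can be applied locus by locus with m = 4. Here ${\bf g}_a$ and ${\bf g}_b$ are shorthand introduced only for this worked example and denote the unphased-diplotypes of individuals a and b:

$$D({\bf g}_a, {\bf g}_b) = \frac{1}{2 \cdot 4} \Bigl( d(\{1, 0\}, \{1, 0\}) + d(\{1, 0\}, \{1, 1\}) + d(\{1, 1\}, \{1, 0\}) + d(\{1, 0\}, \{1, 0\}) \Bigr) = \frac{0 + 1 + 1 + 0}{8} = 0.25.$$

The other two pairs likewise differ by one unshared allele at exactly two loci (loci 1 and 3 for a and c, loci 1 and 2 for b and c), which gives the same value.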

FIG. 2.

Haplotype-diplotype examples for which the ASD and the PMD behave differently.

Table 1.

Distances between the Individuals in Figure 2

(1) ASD				(2) $H = PMD_{{\cal M}_1}$				(3) $PMD_{{\cal M}_2}$
	a	b	c		a	b	c		a	b	c
a	0	0.25	0.25	a	0	0.25	0.5	a	0	0.301	0.450
b	—	0	0.25	b	—	0	0.5	b	—	0	0.500
c	—	—	0	c	—	—	0	c	—	—	0

The distance between two sequences is often measured by the number of point mutations between them (i.e., we consider two sequences to be very distant from each other if there are many mutations between them). We can define the number of mutations under the assumption that multiple founder haplotype-alleles exist (for details, see the Appendix). Table 2 shows this number under the assumption that there are two founder haplotype-alleles. According to the table, the clustering result for the three individuals should be the one in Figure 3, which cannot be obtained with the ASD. Note that the clustered individuals a and b share the same haplotype-allele, i.e., 0110, which also supports the validity of the clustering result.

FIG. 3.

Clustering results for the individuals in Figure 2 based on the numbers of mutations (Table 2), the $H = PMD_{{\cal M}_1}$ distances (Table 1(2)), or the $PMD_{{\cal M}_2}$ distances (Table 1(3)). The ASD values (Table 1(1)), on the other hand, cannot produce this result.

Table 2.

Number of Mutations between Each Pair of Individuals Under the Assumption that There Are Two Founders

	a	b	c
a	0	2	4
b	—	0	4
c	—	—	0

See the Appendix for how we obtain the number of mutations for each pair of individuals.

Unlike the ASD, the haplotype-diplotype distance H reflects the numbers in Table 2 very well. The H value between individuals a and b is 0.25, which is the same value as the ASD, but H between a and c and H between b and c are both 0.5 (Table 1(2)), which enables us to cluster the individuals as in Figure 3. This means that the H values are more appropriate than the ASD values when founder haplotype-alleles exist, at least in this case.
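The entries of Table 1(1) and Table 1(2) can be reproduced with a short Python sketch. It assumes forms of the two measures that agree with those table entries: the ASD is taken as the per-locus half-difference in the number of 1-alleles, averaged over loci, and H is taken as the smaller total Hamming distance over the two possible pairings of haplotype-alleles, divided by 2m. The function names and the dictionary `PHASED` (filled with the candidates that have probability 1 under model M_1 in Table 4) are illustrative and are not the authors' implementation.

```python
from itertools import combinations

# Phased haplotype-diplotypes of the individuals in Figure 2
# (the candidates with probability 1 under model M_1 in Table 4).
PHASED = {
    "a": ("1011", "0110"),
    "b": ("1101", "0110"),
    "c": ("1111", "1000"),
}


def hamming(x, y):
    """Number of positions at which two haplotype-alleles differ."""
    return sum(s != t for s, t in zip(x, y))


def asd(p, q):
    """Allele sharing distance, using only unphased genotype information.

    Assumed form: per locus, half the absolute difference in the number
    of 1-alleles, averaged over the m loci.
    """
    m = len(p[0])
    counts_p = [int(p[0][i]) + int(p[1][i]) for i in range(m)]
    counts_q = [int(q[0][i]) + int(q[1][i]) for i in range(m)]
    return sum(abs(u - v) for u, v in zip(counts_p, counts_q)) / (2 * m)


def h_distance(p, q):
    """Haplotype-diplotype distance H (assumed form): best of the two
    pairings of haplotype-alleles, normalized by 2m."""
    m = len(p[0])
    straight = hamming(p[0], q[0]) + hamming(p[1], q[1])
    crossed = hamming(p[0], q[1]) + hamming(p[1], q[0])
    return min(straight, crossed) / (2 * m)


for x, y in combinations("abc", 2):
    print(x, y, asd(PHASED[x], PHASED[y]), h_distance(PHASED[x], PHASED[y]))
```

Under these assumptions all three ASD values equal 0.25, whereas H separates c from the pair a and b.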

But we cannot compute the real H values unless we know the real haplotype-diplotypes. Instead, we can estimate them by computing the PMD if we are given some population model. Consider the two population models given in Table 3, where haplotype frequencies in the population are given (see footnote 9). Under the model ${\cal M}_1$, we can phase any of the three individuals' unphased-diplotypes correctly with 100% confidence, and the resulting $PMD_{{\cal M}_1}$ values are the same as the H values (Table 1(2)). But in many cases we cannot phase unphased-diplotypes with such high confidence, as in the case of the population model ${\cal M}_2$, where we have multiple haplotype-diplotype candidates for each unphased-diplotype (see Table 4 and Table 1(3)).

Table 3.

Population Model Examples Given as Haplotype-Allele Frequencies

	Frequency in population
Haplotype-allele	(i) ${\cal M}_1$	(ii) ${\cal M}_2$
1111	0.40	0.20
1110	0.00	0.07
1101	0.20	0.08
1011	0.25	0.10
0011	0.00	0.05
0110	0.10	0.30
0101	0.00	0.05
1100	0.00	0.05
1000	0.05	0.10
Others	0.00	0.00

Table 4.

Conditional Probabilities of Candidate Haplotype-Diplotypes for Individuals in Figure 2 Based on the Population Models in Table 3

			Conditional probability
Individual	Unphased-diplotype	Candidate haplotype-diplotype	(i) ${\cal M}_1$	(ii) ${\cal M}_2$
a	{1,0}-{1,0}-{1,1}-{1,0}	{1011, 0110}	1.0000	0.8955
		{1110, 0011}	0.0000	0.1045
		Others	0.0000	0.0000
b	{1,0}-{1,1}-{1,0}-{1,0}	{1101, 0110}	1.0000	0.8727
		{1110, 0101}	0.0000	0.1273
		Others	0.0000	0.0000
c	{1,1}-{1,0}-{1,0}-{1,0}	{1111, 1000}	1.0000	0.8000
		{1011, 1100}	0.0000	0.2000
		Others	0.0000	0.0000
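The conditional probabilities in Table 4 can be recovered from the haplotype-allele frequencies in Table 3. The Python sketch below assumes random mating, so that the weight of a candidate haplotype-diplotype is the product of its two haplotype-allele frequencies, normalized over all candidates compatible with the unphased-diplotype. The names `FREQ_M2`, `GENOTYPES`, and `candidate_probabilities` are hypothetical, and genotypes are encoded as the number of 1-alleles per locus.

```python
from itertools import product

# Haplotype-allele frequencies of model M_2, copied from Table 3.
FREQ_M2 = {
    "1111": 0.20, "1110": 0.07, "1101": 0.08, "1011": 0.10, "0011": 0.05,
    "0110": 0.30, "0101": 0.05, "1100": 0.05, "1000": 0.10,
}


def candidate_probabilities(genotype, freq):
    """Conditional probabilities of the haplotype-diplotypes that are
    compatible with an unphased-diplotype.

    genotype: number of 1-alleles (0, 1 or 2) at each locus.
    freq: haplotype-allele frequencies; missing haplotypes count as 0.
    Assumption: the weight of a pair is the product of its frequencies.
    """
    het = [i for i, g in enumerate(genotype) if g == 1]
    weights = {}
    for bits in product("01", repeat=len(het)):
        # Homozygous loci are fixed; heterozygous loci get complementary alleles.
        h1 = ["1" if g == 2 else "0" for g in genotype]
        h2 = list(h1)
        for locus, bit in zip(het, bits):
            h1[locus] = bit
            h2[locus] = "1" if bit == "0" else "0"
        pair = tuple(sorted(("".join(h1), "".join(h2)), reverse=True))
        w = freq.get(pair[0], 0.0) * freq.get(pair[1], 0.0)
        if w > 0:
            weights[pair] = w  # an unordered pair may be visited twice, same weight
    total = sum(weights.values())
    if total == 0:
        return {}
    return {pair: w / total for pair, w in weights.items()}


# Unphased-diplotypes of individuals a, b, c as counts of 1-alleles.
GENOTYPES = {"a": (1, 1, 2, 1), "b": (1, 2, 1, 1), "c": (2, 1, 1, 1)}
for name, genotype in GENOTYPES.items():
    for pair, prob in candidate_probabilities(genotype, FREQ_M2).items():
        print(name, pair, round(prob, 4))
```

For individual a, for example, the two surviving candidates have weights 0.10 × 0.30 and 0.07 × 0.05, which normalize to the 0.8955 and 0.1045 listed in the table.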

If we cluster the three individuals based on the $H = PMD_{{\cal M}_1}$ values, we can obtain the same clusters as in Figure 3. Furthermore, we can still get the same clusters even if we use the $PMD_{{\cal M}_2}$ values instead. Thus, we assume that the PMD is more suitable than the ASD under the multiple founder hypothesis, if we are given an appropriate population model.
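The PMD values in Table 1(3) are consistent with reading the PMD as the expected H distance when each individual's haplotype-diplotype is drawn independently from its conditional distribution. The following Python sketch works under that assumption, using the probabilities of Table 4 as input. The names `CANDIDATES_M2`, `h_distance`, and `pmd` are illustrative only.

```python
from itertools import combinations

# Candidate haplotype-diplotypes and their conditional probabilities
# under model M_2, copied from Table 4.
CANDIDATES_M2 = {
    "a": {("1011", "0110"): 0.8955, ("1110", "0011"): 0.1045},
    "b": {("1101", "0110"): 0.8727, ("1110", "0101"): 0.1273},
    "c": {("1111", "1000"): 0.8000, ("1011", "1100"): 0.2000},
}


def hamming(x, y):
    """Number of positions at which two haplotype-alleles differ."""
    return sum(s != t for s, t in zip(x, y))


def h_distance(p, q):
    """Haplotype-diplotype distance H (assumed form): best of the two
    pairings of haplotype-alleles, normalized by 2m."""
    m = len(p[0])
    straight = hamming(p[0], q[0]) + hamming(p[1], q[1])
    crossed = hamming(p[0], q[1]) + hamming(p[1], q[0])
    return min(straight, crossed) / (2 * m)


def pmd(cands_p, cands_q):
    """Expected H over independently drawn candidates (assumed form)."""
    return sum(
        prob_p * prob_q * h_distance(dip_p, dip_q)
        for dip_p, prob_p in cands_p.items()
        for dip_q, prob_q in cands_q.items()
    )


for x, y in combinations("abc", 2):
    print(x, y, round(pmd(CANDIDATES_M2[x], CANDIDATES_M2[y]), 3))
```

With these inputs the sketch gives 0.301, 0.45, and 0.5 for the pairs (a, b), (a, c), and (b, c), so a and b remain the closest pair even though no individual can be phased with certainty.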

3.3.2. Influences of the linkage equilibrium

It is easy to imagine that the linkage equilibrium (LE) and the linkage disequilibrium (LD) should affect the similarity measures. In fact, if the loci are independent of each other, the variance of the distribution of the ASD values among the individuals should converge to some value in Θ(1/m) according to the central limit theorem, where m is the number of SNP loci in the region. This means that the variance of the ASD values should be smaller on regions near LE. The PMD and its example HHD should also be influenced by the LE/LD. We compared the influences of the LE/LD on the ASD and the HHD by checking distances on the LE/LD regions obtained from the HapMap database (release 24) (International HapMap Consortium, 2005) as follows.
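The Θ(1/m) behavior can be illustrated with a small Python sketch under strong simplifying assumptions that are not taken from the article: every locus is independent, has the same allele frequency p, follows Hardy-Weinberg proportions, and contributes the per-locus distance |g1 − g2|/2, where g1 and g2 are the numbers of 1-alleles carried by two independently drawn individuals. The variance of the mean over m such loci is then the per-locus variance divided by m.

```python
from itertools import product


def genotype_distribution(p):
    """Hardy-Weinberg probabilities of carrying 0, 1 or 2 copies of an
    allele with population frequency p (an assumption of this sketch)."""
    return {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}


def per_locus_variance(p):
    """Variance of the per-locus distance |g1 - g2| / 2 between two
    independently drawn individuals."""
    dist = genotype_distribution(p)
    mean = 0.0
    second_moment = 0.0
    for (g1, w1), (g2, w2) in product(dist.items(), repeat=2):
        d = abs(g1 - g2) / 2
        mean += w1 * w2 * d
        second_moment += w1 * w2 * d * d
    return second_moment - mean ** 2


def asd_variance(p, m):
    """Variance of the ASD over m independent, identically distributed loci."""
    return per_locus_variance(p) / m


for m in (25, 50, 100, 200):
    print(m, asd_variance(0.3, m))
```

Doubling m halves the variance in this idealized setting. Correlated loci (LD) reduce the effective number of independent terms, which is why a larger variance is expected on LD regions.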

We can determine whether a region is near to LE or to LD by counting the number of haplotype tagging SNPs (htSNPs) (Carlson et al., 2004; Johnson et al., 2001; Ke and Cardon, 2003; Meng et al., 2003; Rinaldo et al., 2005). The htSNPs are selected so that each SNP in the given region has a correlation larger than a threshold with at least one of the htSNPs. Thus, the regions with many htSNPs can be considered to be near the LE, and regions with few htSNPs can be considered to be near the LD.

We divided the set of SNPs in chromosome 1 into 658 blocks, each of which consists of 100 consecutive SNPs. For each block B, we counted the number h_B of htSNPs obtained by the software Tagger (de Bakker et al., 2005) with the default settings. We selected 100 blocks with the 100 smallest h_B values as the LD regions and also selected 100 blocks with the 100 largest h_B values as the LE regions.
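The block selection described above can be sketched in Python as follows. The helper names are hypothetical, the h_B values are assumed to have been computed beforehand (for example, with Tagger), and dropping an incomplete final block is an assumption, since the text does not state how leftover SNPs are handled.

```python
def split_into_blocks(snps, block_size=100):
    """Consecutive blocks of exactly block_size SNPs.

    A shorter tail is dropped (an assumption of this sketch).
    """
    n_blocks = len(snps) // block_size
    return [snps[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


def select_ld_le_blocks(h_values, k=100):
    """Indices of the k blocks with the fewest htSNPs (near LD) and of the
    k blocks with the most htSNPs (near LE)."""
    order = sorted(range(len(h_values)), key=lambda b: h_values[b])
    return order[:k], order[-k:]


# Toy usage with made-up h_B values for six blocks.
toy_h = [12, 55, 31, 78, 20, 64]
ld_blocks, le_blocks = select_ld_le_blocks(toy_h, k=2)
print(ld_blocks, le_blocks)
```

The two index lists are disjoint as long as 2k does not exceed the number of blocks, which holds for k = 100 and 658 blocks.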

For each of these regions, we computed the ASD and the HHD measures among the 270 individuals in HapMap (which are the same as the 270 individuals used in Section 4), and computed the variances among the obtained 270 × 269/2 = 36,315 distances of the ASD and of the HHD. Table 5 shows the difference between the variances of the ASD and the HHD measures. According to the P-values in the table, the HHD reflects the LD/LE effects more than the ASD.
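As a minimal sketch of this step, the variance over all unordered pairs of individuals can be computed as below. The function name is hypothetical, the distance function is passed in as a parameter, and the use of the population variance (rather than the sample variance) is an assumption, since the text does not specify which was used.

```python
from itertools import combinations
from statistics import pvariance


def pairwise_distance_variance(individuals, distance):
    """Number of unordered pairs and the variance of their distances.

    distance: any symmetric function of two individuals, for example an
    ASD or HHD implementation (not provided here).
    """
    values = [distance(p, q) for p, q in combinations(individuals, 2)]
    return len(values), pvariance(values)


# Toy usage: three "individuals" on a line with absolute difference as distance.
n_pairs, variance = pairwise_distance_variance([0, 1, 3], lambda p, q: abs(p - q))
print(n_pairs, variance)
```

For 270 individuals the number of unordered pairs is 270 × 269/2 = 36,315, matching the count given in the text.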

Table 5.

Means of Variances of ASD/HHD Measures on the Regions Where the SNPs Are Weakly Correlated and Highly Correlated in Chromosome 1

	Mean of variances
	LE	LD	P-value
ASD	0.00267	0.00546	2.066·10⁻¹⁶
HHD	0.00248	0.00539	1.637·10⁻¹⁷

The LE and LD columns show the means of variances on the LE regions (i.e., regions with many htSNPs) and those on the LD regions (i.e., regions with few htSNPs), respectively. The difference of the variances between weakly and highly correlated regions is tested by a t-test for each of the measures. The P-value column shows the P-value of the t-test.

4. Application to HapMap Data Sets

4.1. Data sets

In the experiments in Section 4.2, we use the unphased-diplotype data sets of the 22 autosomal chromosomes and the X chromosome derived from HapMap release 24 (International HapMap Consortium, 2005). The data sets consist of unphased-diplotypes of 270 individuals: 90 Yoruba in Ibadan, Nigeria (YRI); 90 Utah residents with ancestry from northern and western Europe (CEU, from the CEPH diversity panel); and 90 Japanese in Tokyo, Japan, and Han Chinese in Beijing, China (CHB + JPT). There are 894,398 SNPs that are genotyped for all of the above 270 individuals, and we used these SNPs for our experiments. We divided the SNP set into 8,930 blocks, each of which consists of 100 consecutive SNPs, and we perform comprehensive experiments on each of these blocks in Section 4.2.

4.2. Experimental results

In this section, we demonstrate the impact of incorporating the population information by comparing the clustering accuracies obtained with the ASD and those obtained with the HHD on the HapMap data described in Section 4.1. For each of the 8,930 blocks, we performed Ward's clustering algorithm (see Section 2.3.1) based on the ASD and also based on the HHD, and compared the CERs (see Section 2.3.2) of their results (Table 6). The difference of the results in relation to the number of htSNPs, i.e., h_B (see Section 3.3.2), is also shown.
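The article's own definition of the CER is given in Section 2.3.2 and is not restated here. As an illustration only, the Python sketch below assumes the common pair-counting form of the classification error rate, namely the fraction of pairs of individuals on which two partitions disagree about whether the pair belongs to the same cluster (one minus the Rand index). The function name and the toy labels are hypothetical, and the article's definition may differ.

```python
from itertools import combinations


def classification_error_rate(labels_true, labels_pred):
    """Fraction of unordered pairs on which the two partitions disagree
    (assumed pair-counting definition; lower is better)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    disagreements = sum(
        (labels_true[i] == labels_true[j]) != (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return disagreements / len(pairs)


# Toy usage: cluster names do not matter, only the grouping does.
print(classification_error_rate([0, 0, 1, 1], [1, 1, 0, 0]))
print(classification_error_rate([0, 0, 1, 1], [0, 1, 0, 1]))
```

A measure of this kind is invariant to relabeling of the clusters, which is needed when the output of a clustering algorithm is compared with the known population labels.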

Table 6.

The Experimental Results and Their Relationships to the h_B Values

		Mean of CERs		Comparison of CERs
h_B	♯blocks	ASD	HHD	CER_ASD < CER_HHD	CER_HHD < CER_ASD	CER_ASD = CER_HHD	P-value of sign test
0 ∼ 10	1	0.5630	0.5630	0 (0.0)	0 (0.0)	1 (1.0)	—
10 ∼ 20	44	0.4733	0.4678	9 (0.2045)	13 (0.2955)	22 (0.5)	0.5235
20 ∼ 30	223	0.4363	0.4305	62 (0.2780)	82 (0.3677)	79 (0.3543)	0.1130
30 ∼ 40	993	0.4240	0.4207	380 (0.3827)	418 (0.4209)	195 (0.1964)	0.1902
40 ∼ 50	2364	0.3929	0.3877	975 (0.4124)	1131 (0.4784)	258 (0.1091)	7.276·10⁻⁴*
50 ∼ 60	3063	0.3567	0.3514	1327 (0.4332)	1528 (0.4989)	208 (0.06793)	1.808·10⁻⁴*
60 ∼ 70	1822	0.3052	0.2997	772 (0.4237)	970 (0.5324)	80 (0.04391)	2.303·10⁻⁶*
70 ∼ 80	399	0.2584	0.2465	165 (0.4135)	211 (0.5288)	23 (0.05764)	0.02018*
80 ∼ 90	21	0.2178	0.1944	6 (0.2857)	13 (0.6190)	2 (0.09524)	0.1671
90 ∼ 100	0	—	—	—	—	—	—
Total	8930	0.3611	0.3557	3696 (0.4139)	4366 (0.4889)	868 (0.09720)	8.98·10⁻¹⁴*

The ♯blocks column shows the number of blocks with the specified h_B values. In the Comparison of CERs columns, the CER_ASD < CER_HHD, CER_HHD < CER_ASD, and CER_ASD = CER_HHD columns show the numbers (and the ratios) of data sets with the specified h_B values for which the ASD performed better, the HHD performed better, and the two measures performed exactly the same, respectively. x ∼ y indicates that x ≤ h_B < y, and * means that the result of the sign test is significant (i.e., P ≤ 0.05).

The mean of CERs based on the HHD (i.e., 0.3557) is better than that for the ASD (i.e., 0.3611). The P-value of the t-test for the difference between them is 0.004177, which means that the CERs of the HHD are significantly better than those of the ASD. The numbers of data sets on which the HHD (or the ASD) shows better performance than the ASD (or the HHD) are compared with the sign test. Among all the data sets, the HHD is superior to the ASD on 4366 data sets and inferior to the ASD on 3696 data sets. The results of the two measures were the same on the other 868 data sets. The P-value of the sign test over all of these results is 8.98 · 10⁻¹⁴, which means that the HHD is significantly superior to the ASD.
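The sign-test P-values can be approximated with an exact two-sided binomial test. The Python sketch below assumes that ties are discarded and that each remaining data set is equally likely to favor either measure under the null hypothesis. The function name is hypothetical, and the authors' exact procedure or software is not stated in this section.

```python
from fractions import Fraction
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact sign test (ties are assumed to be discarded).

    Returns twice the upper binomial tail at max(wins_a, wins_b) with
    success probability 1/2, capped at 1.
    """
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1))
    # Exact rational arithmetic avoids underflow for large n.
    return min(1.0, float(Fraction(2 * upper_tail, 2 ** n)))


print(sign_test_p_value(9, 13))       # row 10 ~ 20 of Table 6
print(sign_test_p_value(3696, 4366))  # all 8,930 blocks
```

For the row 10 ∼ 20 this sketch gives approximately 0.5235, in agreement with Table 6. For the totals it gives a value of the order of 10⁻¹³, the same order as the 8.98 · 10⁻¹⁴ reported in the text.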

The CERs decrease with increasing h_B for both the ASD and the HHD, but the differences of CERs between the ASD and the HHD also increase as h_B increases (Fig. 4). We call the result the HHD's success if the HHD's CER is lower than that of the ASD, and vice versa. The ratio of the HHD's success increases with increasing h_B. The ratio of the ASD's success also increases with increasing h_B. The difference of the ratios of success between the ASD and the HHD becomes larger as h_B increases. The ratio of cases in which the ASD and the HHD have the same results becomes lower as h_B increases (Fig. 5).

FIG. 4.

The plot of h_B values and the means of CERs for both the ASD and the HHD. x ∼ y indicates that x ≤ h_B < y. The HHD is superior to the ASD in all the cases.

FIG. 5.

The plot of h_B values and the ratios of success for both the ASD and the HHD. The line ASD = HHD indicates the results in which the performance of the two measures is exactly the same. x ∼ y indicates that x ≤ h_B < y. The HHD is superior to the ASD in all the cases.

The HHD is superior to the ASD especially when 80 ≤ h_B < 90. This is a reasonable result, as we should be able to cluster individuals better if we have more information (i.e., LE). The difference of the ratios of success between the ASD and the HHD also becomes largest when 80 ≤ h_B < 90. In this case, the HHD is superior on 13 data sets, while the ASD is superior on only six data sets, among the 19 data sets on which the two measures differ (Table 6).

5. Conclusion

We proposed a new inter-diplotype similarity measure that we call the PMD. The PMD improves the previous ASD measure by utilizing a population model. As one such population model, we proposed to use the HMM population model used in the phasing algorithm HIT. We call the PMD based on the HIT's HMM the HHD. The HHD utilizes the predicted conditional probabilities of the haplotype-diplotypes of each unphased-diplotype emitted from the HIT's HMM. Based on comprehensive experiments over 8930 genome-wide data sets of HapMap, we showed that the HHD significantly outperforms the ASD. We also discussed the relationships between the clustering accuracies and the LD.

There are many future tasks to do related to this work. The HHD requires much larger computation time than the ASD, and one future task should be to improve the computation speed of the HHD. There are still data sets for which the HHD is not superior to the ASD. It would be very interesting if we can predict the regions where the HHD is inferior to the ASD, before computing these measures. Another future task is to improve the population model, as it should directly improve the performance of the PMD. From the biological viewpoint, it would also be very interesting if we can utilize our clustering algorithms to identify gene functions of the target genome regions, especially the regions that affect the disease prevalence and drug responses (Bamshad et al., 2004; Wiencke, 2004; Wilson et al., 2001).

6. Appendix

Counting number of mutations under founder hypothesis

Suppose that founder haplotype-alleles ${\bf f}_1, \ldots, {\bf f}_m$ have evolved into the present-day haplotype-alleles of individuals p and q, without any recombinations. Let p₁ and p₂ be the haplotype-alleles of p, and let q₁ and q₂ be the haplotype-alleles of q. We can define the number of mutations between p and q under the assumption of founders ${\bf f}_1, \ldots, {\bf f}_m$ as

\begin{align*}
S_{{\bf f}_1, \ldots, {\bf f}_m}(p, q) &= \min \bigg\{ \sum_{i = 1}^{2} \min_{j = 1}^{m} \{ dist({\bf p}_i, {\bf f}_j) + dist({\bf q}_i, {\bf f}_j) \}, \\
&\quad \sum_{i = 1}^{2} \min_{j = 1}^{m} \{ dist({\bf p}_i, {\bf f}_j) + dist({\bf q}_{3 - i}, {\bf f}_j) \} \bigg\}, \tag{7}
\end{align*}

where dist() denotes the ordinary number of mutations between the two sequences.

But we cannot know the appropriate set of founder haplotype-alleles. Instead, we can define the number of mutations between two individuals under the assumption that there are m founders as

\begin{align*}
S_m^*(p, q) = \min_{{\bf f}_1, \ldots, {\bf f}_m} S_{{\bf f}_1, \ldots, {\bf f}_m}(p, q). \tag{8}
\end{align*}

Table 2 shows all the $S_2^*()$ values for all the pairs among individuals a, b, and c in Figure 2. Figure 6 shows the founder pair f₁, f₂ that minimizes the $S_{{\bf f}_1, {\bf f}_2}(a, b)$ value.
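For haplotype-alleles as short as those in Figure 2, the values in Table 2 can be checked by exhaustive search over all founder sets. The Python sketch below evaluates the two sums of Equation (7), where the second sum pairs each haplotype-allele of p with the other haplotype-allele of q, and then minimizes over every choice of m founders as in Equation (8). The function names are hypothetical, dist() is taken to be the Hamming distance, and the phased pairs are those with probability 1 under model M_1 in Table 4. Brute force is feasible only for very short sequences.

```python
from itertools import combinations, product

# Phased haplotype-diplotypes of the individuals in Figure 2.
PHASED = {
    "a": ("1011", "0110"),
    "b": ("1101", "0110"),
    "c": ("1111", "1000"),
}


def hamming(x, y):
    """Number of positions at which two haplotype-alleles differ."""
    return sum(s != t for s, t in zip(x, y))


def mutations_given_founders(p, q, founders):
    """Equation (7): number of mutations for a fixed set of founders."""
    def via_best_founder(x, y):
        return min(hamming(x, f) + hamming(y, f) for f in founders)

    straight = via_best_founder(p[0], q[0]) + via_best_founder(p[1], q[1])
    crossed = via_best_founder(p[0], q[1]) + via_best_founder(p[1], q[0])
    return min(straight, crossed)


def min_mutations(p, q, m=2):
    """Equation (8): exhaustive minimum over all founder sets of size m."""
    length = len(p[0])
    haplotypes = ["".join(bits) for bits in product("01", repeat=length)]
    return min(
        mutations_given_founders(p, q, founders)
        for founders in product(haplotypes, repeat=m)
    )


for x, y in combinations("abc", 2):
    print(x, y, min_mutations(PHASED[x], PHASED[y], m=2))
```

With m = 2 the search returns 2, 4, and 4 for the pairs (a, b), (a, c), and (b, c), which are the entries of Table 2.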

FIG. 6.

The optimal founder haplotype-allele pair (when m = 2) for the individuals a and b in Figure 2.

Acknowledgments

The experiments in this work were done on the Super Computer System of the Human Genome Center, the Institute of Medical Science, the University of Tokyo.

Disclosure Statement

No competing financial interests exist.

Footnotes

1. There are also many algorithms proposed for clustering SNP loci (Yang and Tabus, 2007), instead of individuals, but we do not deal with these problems in this article.

2. Various inter-population distances have also been proposed (Cornuet et al., 1999), but we will not deal with these in this article.

3. According to Rastas et al. (2005), the optimal number of ancestors is around 7 for most cases. Thus, we also use the HMM model with 7 ancestors in the experiments in .

4. We used the statistical software, R, to implement this algorithm.

5. The ASD or the HHD values will be used as w_ij in .

6. Phasing is the process of finding the most probable haplotype-diplotype, utilizing some population information.

7. Note that $\sum\nolimits^{M}_{k = 1} p_k = \sum\nolimits^{M}_{k = 1} p^{\prime}_k = 1$ and there is no need to normalize the probabilities if we enumerate all the candidates. But we need to normalize them in case we ignore the candidates with very small probabilities. When we compute the HHD (which will be introduced in ), we ignore candidates with very small probabilities.

8. We also introduce other simpler examples of the PMD in .

9. The population models could be represented by many other methods. For example, we consider HMM-based models in .

References

1. Ahmad, Neville, Marshall S.E., et al. 2002. Haplotype-specific linkage disequilibrium patterns define the genetic topography of the human MHC. Hum. Mol. Genet., 12:647–656.

2. Bamshad, Wooding, Salisbury B.A., et al. 2004. Deconstructing the relationship between genetics and race. Nat. Rev. Genet., 5:598–609.

3. Beaty T.H., Fallin M.D., Hetmanski J.B., et al. 2005. Haplotype diversity in 11 candidate genes across four populations. Genetics, 171:259–267.

4. Bhatia, Bansal, Harismendy, et al. 2010. A covering method for detecting genetic associations between rare variants and common phenotypes. PLoS Comput. Biol., 6:1–12.

5. Bowcock A.M., Ruiz-Linares, Tomfohrde, et al. 1994. High resolution of human evolutionary trees with polymorphic microsatellites. Nature, 368:455–457.

6. Carlson C.S., Eberle M.A., Rieder M.J., et al. 2004. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am. J. Hum. Genet., 74:106–120.

7. Chung P.Y.J., Beyens, Guanabens, et al. 2008. Founder effect in different European countries for the recurrent P392L SQSTM1 mutation in Paget's disease of bone. Calcif. Tissue. Int., 83:34–42.

8. Cirulli E.T., Goldstein D.B. 2010. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat. Rev. Genet., 11:415–425.

9. Conrad D.F., Jakobsson, Coop, et al. 2006. A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nat. Genet., 38:1251–1260.

10. Cornuet J.M., Sylvain, Luikart, et al. 1999. New methods employing multilocus genotypes to select or exclude populations as origins of individuals. Genetics, 153:1989–2000.

11. Cover T.M., Thomas J.A. 1991. Elements of Information Theory. John Wiley & Sons: New York.

12. de Bakker P.I.W., Yelensky, Pe'er, et al. 2005. Efficiency and power in genetic association studies. Nat. Genet., 37:1217–1223.

13. Durbin, Eddy, Krogh, et al. 1998. Biological Sequence Analysis. Cambridge Press: New York.

14. Ester, Kriegel H.P., Sander, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. Proc. 2nd Int. Conf. Knowl. Discov. Data Mining, 226–231.

15. Fallin, Cohen, Essioux, et al. 2001. Genetic analysis of case/control data using estimated haplotype frequencies: application to APOE locus variation and Alzheimer's disease. Genome Res., 11:143–151.

16. Falush, Stephens, Pritchard J.K. 2003. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics, 164:1567–1587.

17. Gao, Martin E.R. 2009. Using allele sharing distance for detecting human population stratification. Hum. Hered., 68:3.

18. Gao, Starmer. 2007. Human population structure detection via multilocus genotype clustering. BMC Genet., 8:34.

19. Gaudieri, Leelayuwat, Tay G.K., et al. 1997. The Major Histocompatability Complex (MHC) contains conserved polymorphic genomic sequences that are shuffled by recombination to form ethnic-specific haplotypes. J. Mol. Evol., 45:17–23.

20. Gonzalez, Bamshad, Sato, et al. 1999. Race-specific HIV-1 disease-modifying effects associated with CCR5 haplotypes. Proc. Natl. Acad. Sci. USA, 96:12004–12009.

21. Haiman C.A., Stram D.O., Pike M.C., et al. 2003. A comprehensive haplotype analysis of CYP19 and breast cancer risk: The Multiethnic Cohort. Hum. Mol. Genet., 12:2679–2692.

22. International HapMap Consortium. 2005. A haplotype map of the human genome. Nature, 437:1299–1320. www.hapmap.org. 2011 November 1.

23. Isaev. 2004. Introduction to mathematical methods to bioinformatics. Springer: New York.

24. Jakobsson, Scholz S.W., Scheet, et al. 2008. Genotype, haplotype and copy-number variation in worldwide human populations. Nature, 451:998–1003.

25. Jin, Zhu, Guo. 2010. Genome-wide association studies using haplotype clustering with a new haplotype similarity. Genet. Epidemiol., 34:633–641.

26. Johnson G.C.L., Esposito, Barratt B.J., et al. 2001. Haplotype tagging for the identification of common disease genes. Nat. Genet., 29:233–237.

27. Kaufman, Rousseeuw. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons: New York.

28. Ke, Cardon L.R. 2003. Efficient selective screening of haplotype tag SNPs. Bioinformatics, 19:287–288.

29. Kim, Misra. 2007. SNP genotyping: technologies and biomedical applications. Annu. Rev. Biomed. Eng., 9:289–320.

30. Lesk A.M. 2005. Introduction to Bioinformatics, 2nd ed. Oxford: New York.

31. Li, Jiang. 2005. Haplotype-based linkage disequilibrium mapping via direct data mining. Bioinformatics, 21:4384–4393.

32. Li, Zhou, Elston R.C. 2006. Haplotype-based quantitative trait mapping using a clustering algorithm. BMC Bioinform., 7:258.

33. Mao, Bigham A.W., Mei, et al. 2007. A genomewide admixture mapping panel for Hispanic/Latino populations. Am. J. Hum. Genet., 80:1171–1178.

34. Meng, Zaykin D.V., Xu, et al. 2003. Selection of genetic markers for association analyses, using linkage disequilibrium and haplotypes. Am. J. Hum. Genet., 73:115–130.

35. Pritchard J.K., Stephens, Donnelly. 2000. Inference of population structure using multilocus genotype data. Genetics, 155:945–959.

36. Rabiner L.R., Juang B.H. 1986. An introduction to hidden Markov models. IEEE ASSP Mag., 3:4–16.

37. Rastas, Koivisto P.M., Mannila, et al. 2005. A hidden Markov technique for haplotype reconstruction. Lect. Notes Bioinform., 3692:140–151.

38. Rinaldo, Bacanu, Devlin, et al. 2005. Characterization of multilocus linkage disequilibrium. Genet. Epidemiol., 28:193–206.

39. Saitou, Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4:406–425.

40. Small K.M., Mialet-Perez, Seman C.A., et al. 2004. Polymorphisms of cardiac presynaptic α_2C adrenergic receptors: diverse intragenic variability with haplotype-specific functional effects. Proc. Natl. Acad. Sci. USA, 101:13020–13025.

41. Team RDC. 2007. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing.

42. Tzeng J.Y., Devlin, Wasserman, et al. 2003. On the identification of disease mutations by the analysis of haplotype similarity and goodness of fit. Am. J. Hum. Genet., 72:891–902.

43. Ward J.H. 1963. Hierarchical grouping procedure to optimize an objective function. J. Am. Stat. Assoc., 58:236–244.

44. Ward J.H., Hook M.E. 1963. Application of an hierarchical grouping procedure to a problem of grouping profiles. Educ. Psychol. Measure., 23:69–81.

45. Wiencke J.K. 2004. Impact of race/ethnicity on molecular pathways in human cancer. Nat. Rev. Cancer, 4:79–84.

46. Wilson F.W., Weale M.E., Smith A.C., et al. 2001. Population genetic structure of variable drug response. Nat. Genet., 29:265–269.

47. Witherspoon D.J., Wooding, Rogers A.R., et al. 2007. Genetic similarities within and between human populations. Genetics, 176:351–359.

48. Yang, Tabus. 2007. Haplotype block partitioning using a normalized maximum likelihood model. Proc. IEEE Genomic Signal Process. Stat., 1–4.