Bayesian interpretation to generalize adaptive mean shift algorithm

Abstract

The Adaptive Mean Shift (AMS) algorithm is a popular and simple non-parametric clustering approach based on Kernel Density Estimation. In this paper the AMS is reformulated in a Bayesian framework, which permits a natural generalization in several directions and is shown to improve performance. The Bayesian framework considers the AMS to be a method of obtaining a posterior mode. This allows the algorithm to be generalized with three components which are not considered in the conventional approach: node weights, a prior for a particular location, and a posterior distribution for the bandwidth. Practical methods of building the three different components are considered.

Keywords

Adaptive mean shift algorithm, kernel density estimation

1 Introduction

The mean shift (MS) algorithm is used to find the stationary points of a probability distribution from observed data [1–3]. This is done by maximizing a non-parametric approximation of the distribution through a kernel. From this maximization process, the algorithm converges from any initial point to a particular stationary point, and the region of initial points that lead to the same stationary point is called a basin of attraction. In this sense, the MS algorithm is a non-parametric statistical clustering method: data points in the same basin of attraction belong to the same cluster. Moreover, this clustering approach neither requires prior knowledge of the number of clusters nor constrains their shape. It has been widely used in signal/image processing and computer vision applications including brain imaging [4, 5]. Since using an appropriate kernel bandwidth is essential for successful MS implementation, adaptive mean shift (AMS) algorithms attempt to determine an optimized bandwidth.

Many researchers have studied conventional approaches to bandwidth selection for kernel density estimation such as cross-validation and smoothed bootstrap [6], fast adaptive mean shift (FAMS) based on k-nearest neighbours [7], annealed MS [8] and other adaptive MS algorithms [2, 9–12]. An alternative method uses a pre-processing scheme incorporating the idea of a heterogeneous node weight, which encodes knowledge about the location of each point with respect to all others [13].

This paper proposes a generalized framework called Bayesian Adaptive Mean Shift (BAMS). The interpretation of kernel density estimation (KDE) from a Bayesian perspective allows a more flexible kernel which can incorporate prior information and feature bandwidth tuning.

2 The (Adaptive) mean shift algorithm

The MS algorithm is based on the Parzen window technique [14]. Given N data points $Y = \{y_{n}\}_{n = 1}^{N}$ in the d-dimensional space $ℝ^{d}$ , the multivariate kernel density estimator of the distribution of y with a kernel k (·) and symmetric positive definite d × d bandwidth matrices B = {B₁, …, B_N} is given by $\hat{f} (y | B) = \frac{1}{N} \sum_{n = 1}^{N} | B_{n} |^{- 1 / 2} k (B_{n}^{- 1 / 2} (y - y_{n})) .$ In this paper, the independent and isotropic Gaussian kernel, where B_n = τ_nI_d×d and k (v) = exp(- ∥ v ∥ ²/2), is used with a vector of specified bandwidths τ = (τ₁, …, τ_N) [15]: $\begin{matrix} \hat{f} (y | τ) = \frac{1}{N} \sum_{n = 1}^{N} \frac{1}{τ_{n}^{d / 2}} exp (- \frac{{∥ y - y_{n} ∥}^{2}}{2 τ_{n}}) . \end{matrix}$ (1) It is noted that in many cases the τ_n are set to be equal, reducing the bandwidth specification to a single value τ. A natural estimator of the gradient of f is the gradient of $\hat{f}$ , which can be derived as: $\begin{matrix} \nabla \hat{f} (y | τ) = \frac{2}{N} [\sum_{n = 1}^{N} \frac{1}{τ_{n}^{(d + 2) / 2}} g (\frac{{∥ y - y_{n} ∥}^{2}}{τ_{n}})] \\ \times [\frac{\sum_{n = 1}^{N} y_{n} g (\frac{{∥ y - y_{n} ∥}^{2}}{τ_{n}})}{\sum_{n = 1}^{N} g (\frac{{∥ y - y_{n} ∥}^{2}}{τ_{n}})} - y], \end{matrix}$ where g (v) = exp(- v/2). Thus, the gradient is a product of two terms [3]: the first term $\sum_{n = 1}^{N} \frac{1}{τ_{n}^{(d + 2) / 2}} g (\frac{{∥ y - y_{n} ∥}^{2}}{τ_{n}})$ is strictly positive and the second term is $s (y) = \frac{\sum_{n = 1}^{N} y_{n} g (\frac{{∥ y - y_{n} ∥}^{2}}{τ_{n}})}{\sum_{n = 1}^{N} g (\frac{{∥ y - y_{n} ∥}^{2}}{τ_{n}})} - y .$ This latter term is known as the mean shift, being the difference between a weighted mean of the data and y. Furthermore, s (y) = 0 defines the stationary points of $\hat{f} (y | τ)$ , leading to the formation of the MS algorithm: initialize at some value y⁽⁰⁾ and iteratively compute y⁽ⁱ⁺¹⁾ = y⁽ⁱ⁾ + s (y⁽ⁱ⁾) until convergence.
This MS procedure is guaranteed to converge to a local stationary point of $\hat{f} (y | τ)$ since MS iterations satisfy the conditions required by the Capture theorem [3, 16]. A cluster is defined by those data points within the same basin of attraction. Therefore the clustering of y₁, …, y_N is achieved by running the MS algorithm N times, with run n using y_n as its initial value, until convergence to a stationary point. Those y_n for which the algorithm converges to the same stationary point are assigned to the same cluster.
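The iteration and the clustering rule described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the convergence tolerance `tol` and the mode-merging tolerance `merge_tol` are assumptions introduced here.

```python
import numpy as np

def mean_shift_mode(y0, Y, tau, tol=1e-6, max_iter=500):
    """Iterate y <- y + s(y) from y0 using isotropic Gaussian kernels.

    Y   : (N, d) array of data points y_n
    tau : scalar or (N,) array of bandwidth parameters tau_n
    """
    y = np.asarray(y0, dtype=float)
    tau = np.broadcast_to(np.asarray(tau, dtype=float), (Y.shape[0],))
    for _ in range(max_iter):
        sq_dist = np.sum((Y - y) ** 2, axis=1)
        g = np.exp(-0.5 * sq_dist / tau)      # g(||y - y_n||^2 / tau_n)
        y_new = g @ Y / g.sum()               # weighted mean, i.e. y + s(y)
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y

def mean_shift_cluster(Y, tau, merge_tol=1e-2):
    """Run one mean shift per data point and group points by converged mode."""
    modes = []
    labels = np.empty(len(Y), dtype=int)
    for n, y_n in enumerate(Y):
        m = mean_shift_mode(y_n, Y, tau)
        for c, mu in enumerate(modes):
            if np.linalg.norm(m - mu) < merge_tol:   # same basin of attraction
                labels[n] = c
                break
        else:
            modes.append(m)                          # a new mode was found
            labels[n] = len(modes) - 1
    return np.array(modes), labels
```

Starting each run from a data point keeps the denominator `g.sum()` away from zero, since the kernel centred on the starting point itself contributes a weight of one.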

Adaptive mean shift algorithms attempt to specify the bandwidth automatically. A simple example is a kth nearest neighbour rule [11]. For a given k, let y_n,k be the kth nearest neighbour of y_n. Then, the bandwidth for the nth observation is $\sqrt{τ_{n}} = ∥ y_{n} - y_{n, k} ∥$ where the L₁ norm is used for ease of implementation. [11] suggests that k should be large enough to ensure that there is an increase in density within the support of most kernels of bandwidth $\sqrt{τ_{n}}$ . A disadvantage of this approach to clustering is that the result is very sensitive to k. This conventional adaptive mean shift algorithm, C-AMS [11], is improved in this paper by embedding the node weighting method outlined in our previous paper [13] to create a node weight embedded AMS algorithm, W-AMS.
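As a small illustration of the kth nearest neighbour rule, the sketch below computes τ_n from the L₁ distance to the kth neighbour. The function name `knn_bandwidths` is hypothetical, and the brute-force pairwise distance matrix is an assumption suitable only for small N.

```python
import numpy as np

def knn_bandwidths(Y, k):
    """Return tau_n = ||y_n - y_{n,k}||_1^2 for every data point.

    Y : (N, d) array of data points, k : neighbour rank (1 <= k < N).
    """
    # Pairwise L1 distances, shape (N, N).
    dist = np.abs(Y[:, None, :] - Y[None, :, :]).sum(axis=2)
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the kth nearest neighbour.
    kth = np.sort(dist, axis=1)[:, k]
    return kth ** 2   # sqrt(tau_n) is the distance, so tau_n is its square
```

For example, for the one-dimensional points 0, 1, 3 and 7 with k = 1, the nearest-neighbour distances are 1, 1, 2 and 4, giving τ = (1, 1, 4, 16).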

3 The Bayesian mean shift algorithm

3.1 A Bayesian interpretation of the MS algorithm

For a known τ, a Bayesian interpretation of the goal of kernel density estimation is to construct $p (y | Y, τ)$ , a predictive distribution of y conditional on a random sample of data $Y = \{y_{1}, \dots, y_{N}\}$ and bandwidths τ. Thus y is treated as an unknown random quantity whose posterior distribution we wish to infer from $Y$ . Each kernel term is interpreted as a likelihood term p (y_n | y, τ_n), e.g. for the isotropic kernel of Equation (1) one has $p (y_{n} | y, τ_{n}) = τ_{n}^{- d / 2} exp (- 0.5 {∥ y - y_{n} ∥}^{2} / τ_{n}) .$ The key novelty of this paper is to generalize the notion of an adaptive kernel through a Bayesian interpretation. Assume that, for any y, i is the component of $Y$ that is allocated to define $\hat{f} (y | τ)$ , i.e. we define the conditional distribution of y given i to be $p (y | τ, i, Y) = p (y | τ_{i}, y_{i});$ this is analogous to the component index in a mixture model that facilitates inference through Expectation maximization (EM) [17] and Markov chain Monte Carlo (MCMC) [18]. Hence $\begin{matrix} p (y | τ, Y) = \sum_{n = 1}^{N} p (y | τ, i = n, Y) p (i = n | τ, Y) \\ = \sum_{n = 1}^{N} p (y | τ_{n}, y_{n}, i = n) p (i = n | τ, Y) \\ = \sum_{n = 1}^{N} \frac{p (y_{n} | τ_{n}, y, i = n) p (y | τ_{n}, i = n)}{p (y_{n} | τ_{n}, i = n)} \\ \times p (i = n | τ, Y) \\ \propto \sum_{n = 1}^{N} p (y_{n} | τ_{n}, y, i = n) p (y | τ_{n}, i = n) \\ \times p (i = n | τ, Y) \end{matrix}$ where p (y_n|τ_n, i = n) = 1/N for all n ∈ {1, 2, ⋯, N}. It is further assumed that τ and $Y$ hold no information about i, i.e. $p (i = n | τ, Y) = p (i = n)$ , and that y and τ_n are independent given i = n, so that: $p (y | τ, Y) \propto \sum_{n = 1}^{N} p (i = n) p (y | i = n) p (y_{n} | τ_{n}, y) .$ (2)

Thus the kernel density estimate $p (y | τ, Y)$ is a weighted sum of kernels p (y_n|τ_n, y) with component weights p (i = n) p (y|i = n). This estimate replaces $\hat{f} (y)$ in the MS algorithm. While p (y_n|τ_n, y) can be thought of as a likelihood, p (y|i = n) is a prior distribution of a particular point y and p (i = n) is a posterior distribution of the component. A simple option is to assume that both are uniform distributions, which implies $p (y | τ, Y) \propto \sum_{n = 1}^{N} p (y_{n} | τ_{n}, y)$ , the usual kernel density estimate. If, in addition, the likelihood terms are Gaussian with covariances τ_nI_d×d then $p (y | τ, Y) = \sum_{n = 1}^{N} \frac{1}{(2 π τ_{n})^{d / 2}} exp \{- \frac{1}{2} \frac{{∥ y - y_{n} ∥}^{2}}{τ_{n}}\},$ which gives the conventional mean shift algorithm based on a mixture of N Gaussian kernels. This raises the question, “Can mean shift be improved by specifying alternative but much richer distributions for p (y|i) and p (i)?”

The mean shift clustering method can be defined in terms of a Bayesian interpretation of the kernel density estimate. It can be seen that there are two choices to be made and to explore: the specification of the terms in Equation (2) that go into $p (y | τ, Y)$ and the choice of optimization algorithm for finding the maximum a posteriori (MAP) value $arg max_{y} p (y | τ, Y)$ . For the latter, a fast Laplace approximation is used. In this algorithm, μ is the set of the detected clusters such that μ_{c_i} is the c_i-th mode, and c is the vector of class ids, so that c_i ∈ {1, 2, ⋯ , C} given C classes for i = 1, 2, ⋯ , N. In the rest of this section, options for the components of Equation (2) are explored.

Algorithm 1 Non-parametric clustering via Kernel density estimator with a fixed bandwidth $\sqrt{τ}$ for C clusters

1: Set μ = [] and C = 0.

2: For n = 1 to N do

3: Set an initial point: commonly set by the n-th measurement.

4: Obtain the mode (local optimum) of $p (y | τ, Y)$ by Laplace approximation: ${\hat{y}}_{n} = arg max_{y} p (y | Y, τ)$

5: if $| | {\hat{y}}_{n} - μ_{c} | | < ε$ for c ∈ {1, 2, ⋯ , C} then

6: Assign the id to the observation, c_n = c

7: else

8: Update C = C + 1.

9: Add a new mode $μ_{C} = {\hat{y}}_{n}$

10: Assign c_n = C.

11: end if

12: end for

3.2 Adaptive bandwidth based Mean shift algorithm in a Bayesian framework

When τ is assumed unknown then $p (y, τ | Y)$ should be calculated; the clustering is then based either on computing $(\hat{y}, \hat{τ}) = arg max_{y, τ} p (y, τ | Y)$ or by marginalizing to obtain $p (y | Y)$ and finding $\hat{y} = arg max_{y} p (y | Y) .$ Conditioning on i and τ from Equation (2) gives $p (y | Y) \propto \sum_{n = 1}^{N} ω_{n} η_{n} \int_{τ_{n}} p (τ_{n} | Y) p (y_{n} | τ_{n}, y) d τ_{n}$ (3) where ω_n and η_n denote p (i = n) and p (y|i = n) respectively. Equation (3) is the generalized formula which will be used in the proposed general mean shift algorithm. There are four different terms:

ω_n = p (i = n): this term is the weight of the nth data point in the kernel density estimator. [13] showed the importance of this term empirically by embedding Voronoi diagrams and k-nearest neighbours. They argue that mean shift can obtain approximate global prior information through node weights by using a geometric prior such as Delaunay triangulation or k-NN. A mean shift with heterogeneous node weights speeds up the conventional mean shift and corrects misled mean shifts. This term is also analogous to the component weights of an EM clustering algorithm with N clusters. In what follows it will be denoted ω_n;

η_n = p (y|i = n): this term is the prior on y;

p (y_n|τ_n, y): this term is the likelihood of y_n given a position y and bandwidth τ_n. It can be defined through a common kernel function e.g. Gaussian with mean y and variance τ_n, which is used in this paper: $p (y_{n} | τ_{n}, y) = | 2 π τ_{n} I |^{- \frac{1}{2}} exp {- \frac{(y_{n} - y)^{T} (y_{n} - y)}{2 τ_{n}}}$ . In general it is a function of the distance between y_n and y;

$p (τ_{n} | Y)$ : this is the posterior distribution of τ_n.

Each is discussed in turn below.

3.2.1 Specification of ω_n = p (i = n)

This prior distribution is the first term of interest on the right side of Equation (3), and it corresponds to the node weights of [13]. They showed that conventional MS can be improved by assigning the node weight of a datum as a decreasing function of the distances to other data points. That idea is generalized here by defining ω_n to be the inverse of the average distance between y_n and its k nearest neighbours: $ω_{n} = \left( \frac{\sum_{j \in {ne}_{k} (n)} ∥ y_{n} - y_{j} ∥}{k} \right)^{- 1}$ (4) where ne_k (n) denotes the k nearest neighbours (k-NN) of y_n. This approach, referred to as the ‘NN node weighting approach’ in this paper, is rather empirical and has its own critical issue arising from the difficulty of finding an optimal k for k-NN, so an alternative approach is required. Interestingly, p (i = n) can be interpreted as the weights of N Gaussian mixture components in EM clustering. A revision of the EM clustering algorithm reflecting this observation is explained next. Given a certain model $M$ , we can define a Gaussian Mixture Model (GMM) with K components: $p (Y | M) = \sum_{k = 1}^{K} p (i = k) p (Y | μ_{k}, Σ_{k})$ where ω_k = p (i = k), $\sum_{k = 1}^{K} ω_{k} = 1$ and μ_k and Σ_k are the mean and covariance of the kth component of the GMM. In this paper, suppose that K = N and that the mean of each of the N components is fixed at the location of an individual data point, with the covariance fixed either at the corresponding bandwidth or at the identity matrix; then we have $p (Y | M) = \sum_{n = 1}^{N} p (i = n) p (Y | μ_{n} = y_{n}, Σ_{n} = I) = \sum_{n = 1}^{N} ω_{n} p (Y | μ_{n} = y_{n}, Σ_{n} = I) .$ Now, by using the well-known EM clustering algorithm for the GMM we can estimate the responsibilities p_k,n associated with data point y_k and the n-th Gaussian component with mean y_n (the n-th data point) and fixed covariance Σ_n = I:

$\begin{matrix} p_{k, n} = \frac{ω_{n} N (y_{k}; μ_{n} = y_{n}, Σ_{n} = I)}{\sum_{j = 1}^{N} ω_{j} N (y_{k}; μ_{j} = y_{j}, Σ_{j} = I)}, and \\ ω_{n}^{'} = \frac{1}{N} \sum_{k = 1}^{N} p_{k, n}, \end{matrix}$ (5) where $N (\cdot; a, b)$ denotes a normal distribution of mean a and covariance b. In the equation, $ω_{n}^{'}$ denotes the updated node weights. Note that no further calculation of the means and covariances of the N components is needed in this EM step, since the means are always fixed at the values of the data points and the covariances are fixed at the identity matrix.
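The two node weighting options of Equations (4) and (5) could be computed along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, the EM weights are initialized uniformly, and the number of EM iterations `n_iter` is an arbitrary choice that the text does not specify.

```python
import numpy as np

def nn_node_weights(Y, k):
    """Equation (4): inverse of the mean Euclidean distance to the k nearest neighbours."""
    dist = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    knn = np.sort(dist, axis=1)[:, 1:k + 1]     # drop the zero self-distance
    return 1.0 / knn.mean(axis=1)

def em_node_weights(Y, n_iter=50):
    """Equation (5): EM weight updates with means y_n and covariances I held fixed."""
    N = Y.shape[0]
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)  # sq[k, n] = ||y_k - y_n||^2
    lik = np.exp(-0.5 * sq)   # N(y_k; y_n, I) up to a constant that cancels in p_{k,n}
    w = np.full(N, 1.0 / N)   # assumed uniform initial weights
    for _ in range(n_iter):
        resp = lik * w                             # numerator of p_{k,n}
        resp /= resp.sum(axis=1, keepdims=True)    # normalize over components n
        w = resp.mean(axis=0)                      # omega'_n = (1/N) sum_k p_{k,n}
    return w
```

Note that `nn_node_weights` divides by the mean neighbour distance, so it assumes that no point has k exact duplicates in the data.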

3.2.2 Specification of η_n = p (y|i = n)

This is the prior over the solution space (y). The simplest non-informative choice for η_n is a uniform distribution, but a more informative specification can be applied if we use heuristic adaptivity when finding the modes. Let μ_1:C be the C modes which represent local optima and Σ be a common covariance matrix with very large elements giving a flat distribution, i.e. $Σ = Cov (Y)$ . Since the optimal C is unknown, we introduce an adaptivity to the modes. Note that this adaptivity is not due to bandwidth adaptivity but due to the model adaptivity which is well known in approximation algorithms such as adaptive Monte Carlo. Let μ and c be the set of the means and IDs of the detected clusters. Initially, we set μ as an empty set and c = 0, which represents an N × 1 zero vector. If the mean shift of the nth measurement converges to a particular basin of attraction (local optimum), we check whether the converged mode has already been detected and stored in μ. If the converged local optimum ${\hat{y}}_{n}$ satisfies $∥ {\hat{y}}_{n} - μ_{i} ∥ < ε$ for some i ∈ {1, 2, ⋯ , C}, where μ_i is the mode of the ith cluster and ε is the tolerance value, then we set c_n = i and update $μ_{i} = μ_{c_{n}} = \frac{\sum_{j = 1}^{n} y_{j} δ (c_{j} - c_{n})}{\sum_{j = 1}^{n} δ (c_{j} - c_{n})}$ where δ (·) is a delta function. Otherwise, we add the new mode $μ_{C + 1} = {\hat{y}}_{n}$ and update C = C + 1. Now, given this structure we obtain an adaptive prior as the following mixture of distributions: $η_{n} = p (y | i = n) = \frac{1}{V} - \frac{λ}{V} δ_{n} + λ N (y; μ_{c_{n}}, Σ_{c_{n}}) δ_{n}$ where δ_n is an indicator which is zero if the n-th observation has not been visited and one otherwise, and V is the volume of the search space. In other words, if the nth measurement has already been visited and assigned to one of the clusters, then the prior is a mixture of the uniform distribution and a normal distribution centred at the mode of that cluster; otherwise, we simply use the uniform distribution. In addition, λ is the probability that the cluster mode of the solution space y is the same as that of the n-th observation which has already been assigned to a cluster. For the examples shown in the paper, for simplicity, we use Σ_{c_n} = Σ for n ∈ {1, 2, ⋯ , N}.
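The adaptive prior above can be evaluated with a few lines of code. The following is only an illustrative sketch: the function name `adaptive_prior` and its argument names are assumptions, and the Gaussian density is evaluated directly from the mean and covariance of the cluster to which the n-th observation was assigned.

```python
import numpy as np

def adaptive_prior(y, mu_c, Sigma_c, visited, lam, V):
    """eta_n = 1/V - (lam/V) delta_n + lam * N(y; mu_c, Sigma_c) delta_n.

    visited : True if the n-th observation has already been assigned (delta_n = 1)
    lam     : mixing probability lambda
    V       : volume of the search space
    """
    if not visited:                       # delta_n = 0: plain uniform prior
        return 1.0 / V
    y = np.asarray(y, dtype=float)
    diff = y - np.asarray(mu_c, dtype=float)
    d = y.shape[0]
    quad = diff @ np.linalg.solve(Sigma_c, diff)           # Mahalanobis term
    log_norm = -0.5 * (d * np.log(2.0 * np.pi) + np.linalg.slogdet(Sigma_c)[1])
    gauss = np.exp(log_norm - 0.5 * quad)                  # N(y; mu_c, Sigma_c)
    return (1.0 - lam) / V + lam * gauss
```

With λ = 0 the function reduces to the uniform prior 1/V regardless of whether the observation has been visited.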

3.2.3 Specification of $p (τ_{n} | Y)$

The performance of the AMS algorithm depends on the selection of a bandwidth for the kernel. In this paper, we mainly propose a strategy which uses a functional conjugate prior for $p (τ_{n} | Y)$ (see Appendix A.1), although other non-informative priors such as a uniform distribution and the Jeffreys prior can also be used, as shown in Appendices A.1 and A.2. That is, the $p (τ_{n} | Y) = IG (τ_{n}; α_{n}, β_{n})$ are assumed to be independent inverse gamma distributions with shape α_n and scale β_n. By using this functional conjugate prior (FCP) the target posterior distribution of Equation (3) becomes

$\begin{matrix} p (y | Y) \propto \sum_{n = 1}^{N} ω_{n} η_{n} \int_{τ_{n}} p (y_{n} | y, τ_{n}) p (τ_{n} | y_{n}, y_{- n}) d τ_{n} \\ \propto \sum_{n = 1}^{N} ω_{n} η_{n} \frac{Γ ({\tilde{α}}_{n})}{Γ (α_{n})} β_{n}^{α_{n}} {\tilde{β}}_{n}^{- {\tilde{α}}_{n}} \propto \sum_{n = 1}^{N} {\tilde{ω}}_{n} η_{n} {\tilde{β}}_{n}^{- {\tilde{α}}_{n}} \end{matrix}$ (6)

where ${\tilde{α}}_{n} = d / 2 + α_{n}$ , ${\tilde{β}}_{n} = (y_{n} - y)^{T} (y_{n} - y) / 2 + β_{n}$ and ${\tilde{ω}}_{n} = ω_{n} \frac{Γ ({\tilde{α}}_{n})}{Γ (α_{n})} β_{n}^{α_{n}}$ . For the functional conjugate prior, we can specify α_n and β_n with an expectation and a variance of τ_n as follows: $\begin{matrix} E (τ_{n}) = \int_{τ_{n}} τ_{n} p (τ_{n} | Y) d τ_{n} \\ = \int_{τ_{n}} τ_{n} IG (τ_{n}; α_{n}, β_{n}) d τ_{n} = \frac{β_{n}}{α_{n} - 1} and \\ V (τ_{n}) = \int_{τ_{n}} (τ_{n} - E (τ_{n}))^{2} p (τ_{n} | Y) d τ_{n} \\ = \int_{τ_{n}} (τ_{n} - E (τ_{n}))^{2} IG (τ_{n}; α_{n}, β_{n}) d τ_{n} \\ = \frac{β_{n}^{2}}{(α_{n} - 1)^{2} (α_{n} - 2)} . \end{matrix}$

Therefore, we have $α_{n} = \frac{E (τ_{n})^{2}}{V (τ_{n})} + 2$ and $β_{n} = (\frac{E (τ_{n})^{2}}{V (τ_{n})} + 1) E (τ_{n})$ . For simplicity, we set E (τ_n) = ||y_n - y_n,k||² and V (τ_n) = 100||y_n - y_n,k||² for an almost flat prior in the inverse gamma distribution $p (τ_{n} | Y)$ in this manuscript. Note that we still select k for the bandwidths, as in the conventional AMS of [11]. As already mentioned, the result of the conventional AMS is very sensitive to k. However, our generalised model makes the algorithm less sensitive to the number of neighbours selected by adding and integrating prior information. That is, our proposed model handles the selected k not deterministically but statistically, considering uncertainty, so it can obtain accurate results even when a poorly chosen k is selected. In addition, the proposed BAMS estimates the bandwidth in a Bayesian framework. If required, the bandwidth can be explicitly calculated in a maximum a posteriori (MAP) style with a specified prior, in a similar way to the MAP estimate of [4]. Moreover, our proposed BAMS can implicitly estimate the kernel bandwidth by marginalizing it out as shown in Equation (6). That is, the bandwidth is regarded as a nuisance parameter. The removal of the nuisance parameter by marginalization can make BAMS more efficient and stable (Rao-Blackwellization) [19] by reducing the variance of the system [20].
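The moment matching for α_n and β_n and the closed form of one marginalized kernel term in Equation (6) can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the kernel term omits the constant factor (2π)^{-d/2}, which does not depend on y.

```python
import numpy as np
from math import lgamma

def inverse_gamma_params(mean, var):
    """Shape alpha and scale beta of IG(alpha, beta) with the given mean and variance."""
    ratio = mean ** 2 / var
    alpha = ratio + 2.0
    beta = (ratio + 1.0) * mean
    return alpha, beta

def marginal_kernel(sq_dist, alpha, beta, d):
    """Gamma(alpha~)/Gamma(alpha) * beta^alpha * beta~^(-alpha~) for one data point.

    sq_dist : squared Euclidean distance ||y_n - y||^2
    d       : dimension of the data
    """
    alpha_t = alpha + d / 2.0
    beta_t = 0.5 * sq_dist + beta
    # Work in log space so that large alpha does not overflow the gamma function.
    log_val = (lgamma(alpha_t) - lgamma(alpha)
               + alpha * np.log(beta) - alpha_t * np.log(beta_t))
    return float(np.exp(log_val))
```

For instance, a prior mean of 2 and variance of 4 give α = 3 and β = 4, which indeed satisfy β/(α − 1) = 2 and β²/((α − 1)²(α − 2)) = 4.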

3.3 Finding Bayesian means in BAMS

As we can see in Equation (6), the posterior distribution has the general form: $p (y | Y) = \sum_{n = 1}^{N} {\tilde{ω}}_{n} η_{n} {\tilde{β}}_{n}^{- {\tilde{α}}_{n}}$ where ${\tilde{α}}_{n} = d / 2 + α_{n}$ , ${\tilde{β}}_{n} = (y_{n} - y)^{T} (y_{n} - y) / 2 + β_{n}$ and ${\tilde{ω}}_{n} = ω_{n} \frac{Γ ({\tilde{α}}_{n})}{Γ (α_{n})} β_{n}^{α_{n}}$ . To find the local maxima of the above equation, we set $\frac{dp (y | Y)}{d y} = 0$ . Therefore, the new mean of our proposed BAMS algorithm is given by Equation (7): $y^{'} = B^{- 1} A$ (7) where $\begin{matrix} A & = & \sum_{n = 1}^{N} [{\tilde{ω}}_{n} {\tilde{β}}_{n}^{- {\tilde{α}}_{n}} (η_{n} - \frac{1}{V} + \frac{λ}{V} δ_{n}) Σ_{c_{n}}^{- 1} μ_{c_{n}} \\ + {\tilde{ω}}_{n} η_{n} {\tilde{α}}_{n} {\tilde{β}}_{n}^{- {\tilde{α}}_{n} - 1} y_{n}], and \\ B & = & \sum_{a = 1}^{N} [{\tilde{ω}}_{a} {\tilde{β}}_{a}^{- {\tilde{α}}_{a}} (η_{a} - \frac{1}{V} + \frac{λ}{V} δ_{a}) Σ_{c_{a}}^{- 1} \\ + {\tilde{ω}}_{a} η_{a} {\tilde{α}}_{a} {\tilde{β}}_{a}^{- {\tilde{α}}_{a} - 1} I_{d \times d}] . \end{matrix}$

4 Simulation Results

In this section we demonstrate the applicability, accuracy and flexibility of one of the possible variants of our proposed BAMS on distinct synthetic and real data sets. There are two different ways to evaluate the clustering output: the ‘internal evaluation’, without a ground truth, and the ‘external evaluation’, with a ground truth to which the results can be compared. The different evaluation metrics are discussed in [21, 22]. In this paper, we measure the goodness of our results with two external evaluation metrics: the F-measure and the Jaccard index.
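The text does not state which definitions of the two metrics are used. As one common possibility, the sketch below computes pair-counting versions of the Jaccard index and the F-measure from true and predicted labels; the pair-counting definition and the function name `pair_counting_scores` are assumptions made here for illustration.

```python
import numpy as np

def pair_counting_scores(true_labels, pred_labels):
    """Return (jaccard, f_measure) computed over all pairs of observations."""
    t = np.asarray(true_labels)
    p = np.asarray(pred_labels)
    upper = np.triu_indices(len(t), k=1)               # each unordered pair once
    same_true = (t[:, None] == t[None, :])[upper]
    same_pred = (p[:, None] == p[None, :])[upper]
    tp = int(np.sum(same_true & same_pred))    # together in both partitions
    fp = int(np.sum(~same_true & same_pred))   # together only in the prediction
    fn = int(np.sum(same_true & ~same_pred))   # together only in the ground truth
    if tp == 0:                                # avoid division by zero
        return 0.0, 0.0
    jaccard = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2.0 * precision * recall / (precision + recall)
    return jaccard, f_measure
```

Both scores are invariant to a relabelling of the clusters, which matters here because mean shift assigns cluster ids in order of discovery.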

Algorithm 2 General adaptive mean shift algorithms

1: Set μ = [] and set λ to a particular value.

2: Calculate {ω_t} _t=1:N using either NN node weighting of Equation (4) or EM algorithm of Equation (5).

3: Select one of the priors from the list.

4: Calculate α_t and β_t for t ∈ {1, 2, 3, ⋯ , N}.

5: for j = 1 to N do

6: Set an initial point (commonly set by the j-th measurement, y = y_j) and a random point (y^′ = y + 2ε).

7: while ||y^′ - y|| ≥ ε do

8: Set the current mean y = y^′.

9: for n = 1 to N do

10: Calculate ${\tilde{α}}_{n}$ and ${\tilde{β}}_{n}$ by ${\tilde{α}}_{n} = d / 2 + α_{n}$ and ${\tilde{β}}_{n} = (y_{n} - y)^{T} (y_{n} - y) / 2 + β_{n}$ where $α_{n} = \frac{E (τ_{n})^{2}}{V (τ_{n})} + 2$ and $β_{n} = (\frac{E (τ_{n})^{2}}{V (τ_{n})} + 1) E (τ_{n})$ .

11: Reassign the weights: ${\tilde{ω}}_{n} = \begin{cases} ω_{n} \frac{β_{n}^{α_{n}} Γ ({\tilde{α}}_{n})}{Γ (α_{n})} & \text{for empirical prior} \\ ω_{n} & \text{for simple priors} . \end{cases}$

12: end for

13: if adaptivity is used for acceleration then

14: Set δ_n ← 0 if the n-th observation has not been visited and δ_n ← 1 otherwise, then compute

$\begin{matrix} η_{n} = p (y | i = n) \\ = \frac{1}{V} - \frac{λ}{V} δ_{n} + λ N (y; μ_{c_{n}}, Σ_{c_{n}}) δ_{n} . \\ A = \sum_{n = 1}^{N} [{\tilde{ω}}_{n} {\tilde{β}}_{n}^{- {\tilde{α}}_{n}} (η_{n} - \frac{1}{V} + \frac{λ}{V} δ_{n}) Σ_{c_{n}}^{- 1} μ_{c_{n}} \\ + {\tilde{ω}}_{n} η_{n} {\tilde{α}}_{n} {\tilde{β}}_{n}^{- {\tilde{α}}_{n} - 1} y_{n}] . \\ B = \sum_{a = 1}^{N} [{\tilde{ω}}_{a} {\tilde{β}}_{a}^{- {\tilde{α}}_{a}} (η_{a} - \frac{1}{V} + \frac{λ}{V} δ_{a}) Σ_{c_{a}}^{- 1} \\ + {\tilde{ω}}_{a} η_{a} {\tilde{α}}_{a} {\tilde{β}}_{a}^{- {\tilde{α}}_{a} - 1} I_{d \times d}] \end{matrix}$ (8)

15: Update a mean, y^′ = B^-1A

16: else

17: Calculate $ρ_{n} = {\tilde{ω}}_{n} {\tilde{α}}_{n} {\tilde{β}}_{n}^{- {\tilde{α}}_{n} - 1}$ and update a mean, $y^{'} = \frac{\sum_{n = 1}^{N} ρ_{n} y_{n}}{\sum_{e = 1}^{N} ρ_{e}}$ .

18: end if

19: end while

20: ${\hat{y}}_{j} = y'$ .

21: if $| | {\hat{y}}_{j} - μ_{c} | | < ε$ for any c ∈ {1, 2, ⋯ , C} then

22: Assign the id to the observation, c_j = c.

23: else

24: Update C = C + 1 and add a new mode $μ_{C} = {\hat{y}}_{j}$ . Then, assign c_j = C.

25: end if

26: end for.
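The inner loop of Algorithm 2 without the adaptive prior (steps 9 to 12 and step 17) might be sketched as follows. This is an illustration under assumptions, not the authors' code: the function name `bams_mode` is hypothetical, the weights use the full form of ω̃_n with the gamma terms, and the ρ_n are computed in log space and rescaled, which leaves the ratio in step 17 unchanged.

```python
import numpy as np
from math import lgamma

def bams_mode(y0, Y, omega, alpha, beta, tol=1e-6, max_iter=500):
    """Iterate y' = sum_n rho_n y_n / sum_n rho_n with a uniform prior eta_n.

    Y : (N, d) data; omega, alpha, beta : (N,) arrays of node weights and
    inverse gamma parameters for each data point.
    """
    d = Y.shape[1]
    alpha_t = alpha + d / 2.0
    # log(omega~_n * alpha~_n): the part of rho_n that does not depend on y.
    log_const = (np.log(omega) + alpha * np.log(beta)
                 + np.array([lgamma(a) for a in alpha_t])
                 - np.array([lgamma(a) for a in alpha])
                 + np.log(alpha_t))
    y = np.asarray(y0, dtype=float)
    for _ in range(max_iter):
        beta_t = 0.5 * np.sum((Y - y) ** 2, axis=1) + beta
        log_rho = log_const - (alpha_t + 1.0) * np.log(beta_t)
        rho = np.exp(log_rho - log_rho.max())   # common factor cancels in the ratio
        y_new = rho @ Y / rho.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y
```

Because each marginalized kernel decays polynomially rather than exponentially in the distance, distant points retain more influence than under a Gaussian kernel with a fixed bandwidth.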

4.1 Synthetic data

The synthetic datasets are generated adopting traditional finite mixture models with the number of clusters C ∈ {2, 5, 7}: $\begin{matrix} N \sim P (\cdot; 10^{3} C), π \sim D (1 / C, \dots, 1 / C), and \\ c_{n} \sim M (π) for n \in \{1, \dots, N\} \\ μ_{i}, Σ_{i} \sim NIW (\cdot; 0_{d \times 1}, a, I_{d \times d}, b) \end{matrix}$ for i ∈ {1, ⋯ , C} and $y_{n} \sim N (\cdot; μ_{c_{n}}, Σ_{c_{n}})$ where $P$ , $D$ , $M$ and $NIW$ denote Poisson, Dirichlet, Multinomial and Normal-Inverse-Wishart distributions respectively. π is a probability vector that is drawn from the Dirichlet distribution and used as the parameter of the multinomial distribution. In this paper we simply set a = 50 and b = 5.

As can be seen in Fig. 1, in most cases (K > 20), our proposed BAMS approaches, including W-AMS, outperform conventional C-AMS in accuracy and sensitivity. C-AMS seems the best for K < 20, but the actual accuracy is very low and expected output is not obtained since the bandwidth is too small, i.e. the maximum value of the Jaccard index for C-AMS is only 0.5. However, our proposed BAMS performs best when K > 50. In addition, we find that most of our BAMS algorithms perform similarly with most synthetic data generated by the Gaussian Mixture model.

4.2 Real dataset

For a more complete evaluation of our proposed approach, we tested it on different real datasets: USPS digital data and Yale Face Database B data with 10 clusters, and Dublin bus trajectory data. The first experiment was conducted on the USPS digital dataset, which consists of 4649 images of handwritten digits, each stored as a 256 × 1 vector. We used both a principal component analysis (PCA) and a locality preserving projection (LPP) to reduce the images into 3 dimensions for feature selection. After applying PCA to remove unwanted dataset noise, LPP is applied to reduce the dimension to 3 while preserving the locality. With the second dataset, Yale Face Database B [23], we perform experiments similar to those presented in [24]. For light computation, we down-sample each image to 30 × 40 pixels, obtaining 5850 images with 1200 dimensions. As for the USPS dataset, we applied PCA and LPP techniques to reduce the images to dimension 5. The last dataset consists of trajectories of public buses in Dublin city. Nowadays, one of the key topics in pervasive and mobile computing is the development of accurate vehicle tracking systems using historical trajectories [25]. However, the vehicle trajectory patterns differ according to time and date. In order to predict and filter a particular trajectory of interest more accurately, historical trajectories need to be clustered to allow the development of more accurate tracking systems which can select only the relevant trajectories. For the example dataset, we have collected the GPS trajectories of bus travelling times along line 46 in Dublin for one month (June 2011). The dataset consists of 1500 historical trajectories, each with dimension 188, obtained by collecting GPS signals every 100 meters of movement. Before applying our proposed approach, we reduced the dimension of the data from 188 to 7 by principal component analysis (PCA), retaining over 99.5% of the variance.

Each row of Fig. 2 shows the performance of conventional adaptive mean shift algorithms and various BAMS algorithms on the USPS digital data (top), Yale Face Database B data (centre) and Dublin bus trajectory data (bottom). The columns (a), (b), (c), (d) and (e) describe, respectively, the actual data with hidden labels, the number of clusters, the Jaccard index and the F-measure of the estimated clusters, and the time elapsed during execution. From Fig. 2, considering USPS, we can observe that BAMS approaches including W-AMS outperform C-AMS in terms of the Jaccard index and F-measure. In addition, FCP-NN and AFCP-NN are more accurate than conventional approaches when 30 < K < 50 and W-AMS works efficiently for 10 < K < 20. For the Yale dataset, unlike the USPS dataset, proper node weights result in higher clustering accuracy, so FCP-EM and AFCP-EM perform the best. C-AMS fails as K increases; moreover, it fails to cluster the Dublin bus data, achieving a Jaccard index of only 0.2. However, the accuracy of the BAMS approaches is relatively high, scoring from 0.5 to 0.8 for 60 < K < 200. Note that FCP-EM outperforms the other algorithms with this dataset. In addition, we compared the elapsed times with all three datasets. The elapsed time during execution was ordered as follows: W-AMS < C-AMS < FCP-NN < FCP-EM < AFCP-NN < AFCP-EM. That is, W-AMS and C-AMS are computationally relatively cheap while the other BAMS algorithms featuring adaptivity require heavy computation. As we can see in this figure, there is a trade-off to be made between execution time and accuracy when selecting the type of BAMS to use.
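As an illustration of the PCA pre-processing used for the bus trajectory data, the sketch below keeps the smallest number of principal components whose cumulative explained variance reaches a given fraction. It covers only the PCA step, not LPP; the function name `pca_reduce` and the SVD-based implementation are assumptions, since the text does not describe how PCA was computed.

```python
import numpy as np

def pca_reduce(X, var_fraction=0.995):
    """Project X onto the fewest principal components explaining var_fraction of the variance.

    X : (N, D) data matrix. Returns the (N, m) projection and m.
    """
    Xc = X - X.mean(axis=0)                            # centre the data
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt are principal axes
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)     # cumulative variance ratio
    m = int(np.searchsorted(explained, var_fraction)) + 1
    m = min(m, len(s))                                 # guard against rounding at 1.0
    return Xc @ Vt[:m].T, m
```

Applied to the trajectory data described above, a threshold of 0.995 would play the role of the 99.5% variance criterion, with the resulting m depending on the data.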

From Fig. 2 we can see that BAMS variants need to be designed carefully according to the data being analyzed, since the performance of the clustering algorithms varies as different priors and weights are applied to different datasets. In this paper, we provide a formal and generalized framework to flexibly modify and specify adaptive mean shift algorithms for such varied datasets.

5 Discussion

This paper focuses on the generalization of the conventional AMS in a Bayesian framework. This generalized formulation provides developers and researchers using the mean shift with a more flexible and richer model than a conventional AMS. Indeed, it introduces a generalized methodological framework based on a Bayesian interpretation, which can be useful when building variants of efficient AMS algorithms in further research. In addition, BAMS can be improved further by adopting two more techniques: truncated kernels and non-Euclidean distance measures. A truncated kernel is commonly used to speed up the AMS. It is straightforward to modify the algorithms featured in this paper with the truncated kernel by excluding data points that are far away from y. Only Euclidean distance has been considered in the paper but we can also use non-Euclidean measures of distance by replacing d_E = ∥y - y_n∥ with d_nE = f (y, y_n). Although this paper does not explicitly comment on automatic determination of the number of clusters, the feature is a key property of the conventional mean shift algorithm of which our proposed BAMS is a generalized version. Therefore, BAMS can determine the number of clusters automatically in the same way as conventional mean shift algorithms do.

6 Conclusion

This paper introduces a new general framework for the adaptive mean shift (AMS) algorithm. The AMS is formally reformulated in a Bayesian framework, named BAMS (Bayesian Adaptive Mean Shift). In this new definition of AMS, three components that are typically ignored are taken into account. For instance, a node weighting technique is embedded in the BAMS to capture geometric structure and to correct misled mean shifts. BAMS can also add prior bandwidth information to make the algorithm less sensitive to the number of neighbours selected, which is an open issue in conventional k-nearest neighbours based AMS.

Acknowledgment

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (NRF-2013R1A1A1012797).

References

1. Fukunaga and Hostetler, The estimation of the gradient of a density function, with applications in pattern recognition, IEEE Trans Inf Theory 21(1) (1975), 32–40.

2. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 2nd edition, 1990.

3. Comaniciu and Meer, Mean shift: A robust approach toward feature space analysis, IEEE Trans Pattern Anal and Mach Intell 24(5) (2002), 603–619.

4. Mahmood, Chodorowski, Mehnert and Persson, A novel bayesian approach to adaptive mean shift segmentation of brain images, In Computer-Based Medical Systems (CBMS), 2012 25th International Symposium on (2012), pp. 1–6.

5. Mahmood, Chodorowski, Ehteshami B.B. and Persson, A fully automatic unsupervised segmentation framework for the brain tissues in MR images, 2014.

6. Jones M.C., Marron J.S. and Sheather S.J., A brief survey of bandwidth selection for density estimation, Journal of the American Statistical Association 90 (1995).

7. Shimshoni, Georgescu and Meer, Nearest-Neighbor Methods in Learning and Vision: Theory and Practice (Neural Information Processing), The MIT Press, 2006, Chapter 9: Adaptive Mean Shift Based Clustering in High Dimensions.

8. Shen, Brooks M.J. and Van Den Hengel, Fast global kernel density mode seeking: Applications to localization and tracking, IEEE Trans Image Process 16(5) (2007), 1457–1469.

9. Silverman B.W., Density Estimation for Statistics and Data Analysis, Chapman & Hall/CRC, 1986.

10. Comaniciu, Ramesh and Meer, The variable bandwidth mean shift and data-driven scale selection, In Proceedings of 8th International Conference on Computer Vision, volume 1, 2001, pp. 438–445.

11. Georgescu, Shimshoni and Meer, Mean shift based clustering in high dimensions: a texture classification example, In Proceedings of 9th IEEE International Conference on Computer Vision, volume 1, 2003, pp. 456–463.

12. Ren Y.-Z., Domeniconi, Zhang and Yu G.-X., A Weighted Adaptive Mean Shift Clustering Algorithm, In SDM, 2014, pp. 794–802.

13. Yoon and Wilson S.P., Improved Mean Shift Algorithm With Heterogeneous Node Weights, In International Conference on Pattern Recognition (ICPR), 2010.

14. Scott D.W., Multivariate Density Estimation: Theory, Practice, and Visualization, Wiley-Interscience, 1992.

15. Bradski G.R., Computer vision face tracking for use in a perceptual user interface, In Proceedings of IEEE Workshop on Applications of Computer Vision, 1998, pp. 214–219.

16. Bertsekas D.P., Nonlinear Programming, Athena Scientific, 2nd edition, 1999.

17. Dempster A.P., Laird N.M. and Rubin D.B., Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B 39(1) (1977), 1–38.

18. Andrieu, de Freitas, Doucet and Jordan, An introduction to MCMC for machine learning, Machine Learning 50(1-2) (2003), 5–43.

19. Casella and Robert C.P., Rao-Blackwellisation of sampling schemes, Biometrika 83(1) (1996), 81–94.

20. Liu J.S., Monte Carlo Strategies in Scientific Computing, Springer, first edition, 2008.

21. Amigó, Gonzalo, Artiles and Verdejo, A comparison of extrinsic clustering evaluation metrics based on formal constraints, Inf Retr 12(4) (2009), 461–486.

22. Liu, Li, Xiong, Gao and Wu, Understanding of Internal Clustering Validation Measures, In Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM ’10, Washington, DC, USA, 2010, pp. 911–916. IEEE Computer Society.

23. Georghiades A.S., Belhumeur P.N. and Kriegman D.J., From few to many: Illumination cone models for face recognition under variable lighting and pose, IEEE Trans Pattern Anal and Mach Intell 23 (2001), 643–660.

24. Breitenbach and Grudic G.Z., Clustering through Ranking on Manifolds, In Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005, pp. 73–80. ACM Press.

25. Coffey, Pozdnoukhov and Calabrese, Time of arrival predictability horizons for public bus routes, In Proceedings of the 4th ACM SIGSPATIAL International Workshop on Computational Transportation Science, CTS ’11, New York, NY, USA, 2011, pp. 1–5. ACM.
