Abstract
The Adaptive Mean Shift (AMS) algorithm is a popular and simple non-parametric clustering approach based on Kernel Density Estimation. In this paper the AMS is reformulated in a Bayesian framework, which permits a natural generalization in several directions and is shown to improve performance. The Bayesian framework considers the AMS to be a method of obtaining a posterior mode. This allows the algorithm to be generalized with three components which are not considered in the conventional approach: node weights, a prior for a particular location, and a posterior distribution for the bandwidth. Practical methods of building the three different components are considered.
Introduction
The mean shift (MS) algorithm is used to find the stationary points of a probability distribution from observed data [1–3]. This is done by maximizing a non-parametric approximation of the distribution through a kernel. From this maximization process, the algorithm converges from any initial point to a particular stationary region which is called a basin of attraction. In this sense, the MS algorithm is a non-parametric statistical clustering method that clusters data points in the same basin of attraction belong to the same cluster. Moreover, this clustering approach neither requires prior knowledge of the number of clusters nor constrains their shape. It has been widely used in signal/image processing and computer vision applications including brain imaging [4, 5]. Since using an appropriate kernel bandwidth is essential for successful MS implementation, adaptive mean shift (AMS) algorithms attempt to determine an optimized bandwidth.
Many researchers have studied conventional approaches to kernel density estimation such as cross-validation and smoothed bootstrap [6], fast adaptive mean shift (FAMS) based on k-nearest neighbours [7], annealed MS [8] and other adaptive MS algorithms [2, 9–12]. Alternative methods of determining optimal bandwidth use a pre-processing scheme incorporating the idea of a heterogeneous node weight. This approach incorporates knowledge about the location of each point w.r.t. all others [13].
This paper proposes a generalized framework called Bayesian Adaptive Mean Shift (BAMS). The interpretation of the kernel density estimation (KDE) from a Bayesian perspective allows a more flexible kernel which can incorporate prior information and feature bandwidth tunning.
The (Adaptive) mean shift algorithm
The MS algorithm is based on the Parzen window technique [14]. Given N data points in the d-dimensional space , the multivariate kernel density estimator of the distribution of
Adaptive mean shift algorithms attempt to specify the bandwidth automatically. A simple example is a kth nearest neighbour rule [11]. For a given k, let
A Bayesian interpretation of the MS algorithm
For a known τ, a Bayesian interpretation of the goal of kernel density estimation is to construct , a predictive distribution of
Thus the kernel density estimate is a weighted sum of kernels p (
The mean shift clustering method can be defined in terms of a Bayesian interpretation of the kernel density estimate. It can be seen that there are two choices to be made and to explore: the specification of the terms in Equation (2) that go into and the choice of optimization algorithm for finding the maximum a posteriori (MAP) value . For the latter, a fast Laplace approximation is used. In this algorithm, μ is the set of the detected clusters such that μ
c
i
is the c
i
-th mode.
1: Set μ = [].
2: For n = 1 to N do
3: Set an initial point: commonly set by the n-th measurement.
4: Obtain the mode (local optima) of by Laplace approximation:
5:
6: Assign the id to the observation, c n = c
7:
8: Update C = C + 1.
9: Add a new mode
10: Assign c n = C.
11:
12:
When τ is assumed unknown then should be calculated; the clustering is then based either on computing or by marginalizing to obtain and finding Conditioning on i and τ from Equation (2) gives
ω
n
= p (i = n): this term is the weight of the nth data point in the kernel density estimator. [13] have shown the importance of this term empirically by embedding Voronoi diagram and k-nearest neighbours. They insist on that mean shift can obtain approximated global prior information with node weights by using geometric prior such as Delaunay Triangulation and k-NN. A mean shift with heterogeneous node weights speed up the conventional mean shift and corrects misled mean shift. Recently, we have realized that this term is analogous to the weight of EM clustering algorithms with N clusters. In what follows it will be denotedω
n
; η
n
= p ( p ( : this is the posterior distribution of τ
n
.
Each is discussed in turn below.
This prior distribution is the first term of interest on the right side of Equation (3), and it corresponds to the node weights of [13]. They showed that conventional MS can be improved by assigning the node weight of a datum as a decreasing function of the distances to other data points. That idea is generalized here by defining ω
n
to be the inverse of the average distance between
This is the prior of solution space (
Specification of
The performance of the AMS algorithm depends on the selection of a bandwidth for the kernel. In this paper, we mainly propose a strategy which uses a functional conjugate prior for (see Appendix A.1) although we can use other non-informative priors like a uniform distribution and Jeffrey’s prior as shown in Appendix A.1 and A.2. That is, if the are assumed to be independent and identically inverse gamma distributed with scale α n and shape β n . By using this functional conjugate prior (FCP) the target posterior distribution of Equation. (3) becomes
where , and . For the functional conjugate prior, we can specify α
n
and β
n
with an expectation and a variance of τ
n
as follows:
Therefore, we have and . For simplicity, we set E (τ
n
) = ||
As we can see in Equation (6), we have the posterior distribution as a general form: where , and . To find the local maxima of the above equation, we calculate . Therefore, the new mean of our proposed BAMS algorithm is presented in Equation (7);
In this section we demonstrate the applicability, accuracy and flexibility of one of the possible variants of our proposed BAMS on distinct synthetic and real data sets. There are two different ways to evaluate the clustering output: the ‘internal evaluation’, without a ground truth, and the ‘external evaluation’ with a ground truth to which the results can be compared. A discussion regarding the different evaluation metrics has been given in recent works: [21, 22]. In this paper, we measure the goodness of our results with different evaluation metrics such as: the F-measure and the Jaccard index.
1: Set μ = [] and set λ to a particular value.
2: Calculate {ω t } t=1:N using either NN node weighting of Equation (4) or EM algorithm of Equation (5).
3: Select one of priors among the list.
4: Calculate α t and β t for t ∈ {1, 2, 3, ⋯ , N}.
5:
6: Set an initial point (commonly set by the j-th measurement,
7:
8: Set the current mean
9:
10: Calculate and by and where and .
11: Reassign the weights:
12:
13:
14: A delta function δ n ← 0 if the n-th observation is not visited and is one otherwise.
15: Update a mean,
16:
17: Calculate and update a mean, .
18:
19:
20: .
21: if for any c ∈ {1, 2, ⋯ , C} then
22: Assign the id to the observation,
23:
24: Update C = C + 1 and add a new mode . Then, assign
25:
26:
The synthetic datasets are generated adopting traditional finite mixture models with the number of clusters C ∈ {2, 5, 7}:
As can be seen in Fig. 1, in most cases (K > 20), our proposed BAMS approaches, including W-AMS, outperform conventional C-AMS in accuracy and sensitivity. C-AMS seems the best for K < 20, but the actual accuracy is very low and expected output is not obtained since the bandwidth is too small, i.e. the maximum value of the Jaccard index for C-AMS is only 0.5. However, our proposed BAMS performs best when K > 50. In addition, we find that most of our BAMS algorithms perform similarly with most synthetic data generated by the Gaussian Mixture model.
Real dataset
For a further complete evaluation of our proposed approach, we tested it on different real datasets: USPS digital data and Yale Face-Databased B data with 10 clusters. The first experiment was conducted on the USPS digital dataset, which consists of 4649 images of handwritten digits maintained by a 256 × 1 size vector. We used both a principal component analysis (PCA) and a locality preserving projection (LPP) to reduce the images into 3 dimensions for feature selection. After applying PCA to remove unwanted dataset noise, LPP is applied to reduce the dimension to 3 while preserving the locality. With the second dataset, Yale Face Database B [23], we perform experiments similar to those presented in [24]. For light computation, we down-sample each image to 30 × 40 pixels, obtaining 5850 images with 1200 dimensions. As for the USPS dataset, we applied PCA and LPP techniques to reduce the images to the dimension 5. The last dataset is trajectories of public buses from Dublin city. Nowadays, one of the key topics in pervasive and mobile computing is the development of accurate vehicle tracking systems using historical trajectories [25]. However, the vehicle trajectory patterns differ according to time and date. In order to predict and filter a particular trajectory of interest more accurately, historical trajectories need to be clustered to allow the development of more accurate tracking systems which can select only the relevant trajectories. For the example dataset, we have collected the GPS trajectories of bus traveling times along line 46 in Dublin for one month (June 2011). The dataset consists of 1500 historical trajectories each with dimension 188 obtained by collecting GPS signals every 100 meters of movement. Before applying our proposed approach, we reduced the dimension of the data from 188 to 7 by principal component analysis (PCA) over 99.5% of the variance. Each row of Fig. 2 shows the performance of conventional adaptive Mean shift algorithms and variable BAMS on the USPS digital data (top), Yale Face-Database B data (centre) and Dublin bus trajectory data (bottom). The columns (a), (b), (c), (d) and (e) describe, respectively, the actual data with hidden labels, the number of clusters, Jaccard index, and F-measures of the estimated clusters and time elapsed during execution. From Fig. 2 considering USPS, we can observe that BAMS approaches including W-AMS outperform C-AMS by the Jaccard index and F-measure. In addition, FCP-NN and AFCP-NN are more accurate than conventional approaches when 30 < K < 50 and W-AMS works efficiently for 10 < K < 20. For the Yale dataset, unlike the USPS dataset, proper node weights result in higher clustering accuracy so FCP-EM and AFCP-EM perform the best. C-AMS fails as K increases, moreover, it fails to cluster the Dublin bus data, achieving a Jaccard index of only 0.2. However, the accuracy of the BAMS approaches is relatively high, scoring from 0.5 to 0.8 for 60 < K < 200. Note that the FCP-EM outperforms the other algorithms with this dataset. In addition, we compared the elapsed times with all three datasets. The elapsed time during execution was ordered as follows: W-AMS<C-AMS<FCP-NN<FCP-EM<AFCP-NN<AFPCP-EM. That is, W-AMS and C-AMS are computationally relatively cheap while other BAMS featuring adaptivity requires heavy computation. As we can see in this figure, there is a trade-off to be made between execution time and accuracy when selecting the type of BAMS to use.
From the Fig. 2 we can see that we need to design BAMSs carefully according to the data being analyzed since the performance of the clustering algorithms varies as different priors and weights are applied to different datasets. In this paper, we provide a formal and generalized framework to flexibly modify and specify adaptive mean shift algorithms to such varied datasets.
Discussion
This paper focuses on the generalization of the conventional AMS in a Bayesian framework. This generalized formulation provides a more flexible and richer model to developers and researchers using the mean shift than a conventional AMS. Indeed, it introduces a generalized methodological framework in Bayesian interpretation which can be useful when building variants of the efficient AMS algorithms for the further research. In addition, BAMS can be improved further by adopting two more techniques: truncated kernels and non-Euclidean distance measures. A truncated kernel is commonly used to speed up the AMS. It is straightforward to modify the algorithms featured in this paper with the truncated kernel by excluding away data points from
Conclusion
This paper introduces a new general framework for the adaptive mean shift (AMS) algorithm. The AMS is formally modified in a Bayesian framework, named BAMS (Bayesian Adaptive Mean Shift). In this new definition of AMS, three typically underlying and ignored components are taken into account. For instance, a node weighting technique is embedded in the BAMS to obtain geometric structures and to correct misled mean shifts. BAMS can also add prior bandwidth information to make the algorithm less sensitive to the number of neighbours selected, which is an open issue in conventional k-nearest neighbours based AMS.
Footnotes
Acknowledgment
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (NRF-2013R1A1A1012797).
