Bayesian hierarchical K -means clustering

Abstract

Clustering is a fundamental technique in data mining, and real-world data often has a hierarchical structure. Hierarchical clustering aims at constructing a cluster tree that reveals the underlying modal structure of a complex density. Due to the inherent complexity of this task, most existing hierarchical clustering algorithms are designed heuristically without an explicit objective function, which limits their use and analysis. $K$-means clustering, a well-known, simple yet effective algorithm, can be expressed from the view of probability distributions and has an inherent connection to the Mixture of Gaussians (MoG). This motivates us to combine Bayesian analysis with $K$-means and to develop a hierarchical clustering method based on $K$-means under a probabilistic framework, which differs from existing hierarchical $K$-means algorithms that process data in a single pass with heuristic strategies. To this end, we propose an explicit objective function for hierarchical clustering, termed Bayesian hierarchical $K$-means (BHK-means). In our method, a cascaded clustering tree is constructed in which all layers interact with each other in a network-like manner, so that the clustering result of each layer is influenced by its parent and child nodes. Therefore, the clustering result of each layer is dynamically improved in accordance with the global hierarchical clustering objective function. The objective function is optimized with the same procedure used for $K$-means, the Expectation-Maximization (EM) algorithm. Experimental results on both synthetic data and benchmark datasets demonstrate the effectiveness of our algorithm over existing related ones.

Keywords

$K$-means clustering, hierarchical clustering, Bayesian hierarchical probability

1. Introduction

Clustering is a basic and important technique for exploratory data analysis [26, 25, 31, 32]. However, flat (partition-based) clustering methods, e.g., $K$-means and spectral clustering, cannot always group data flexibly enough to meet users' needs. Unlike flat clustering, hierarchical clustering decomposes data into a tree structure, whose leaves correspond to data points and whose internal nodes denote nested clusters of various sizes. Accordingly, hierarchical clustering is flexible for exploratory data analysis due to its multiple levels of granularity and has great potential for use in computer vision, e.g., image segmentation [35], action recognition [36] and hierarchical dataset collection [41]. Hierarchical clustering is a very challenging task since the complex structure is usually unobserved and must be uncovered.

Although hierarchical structure is ubiquitous in the real world [15, 4, 10, 21, 28], it is usually difficult to explore the underlying hierarchical structure in practice due to its intrinsic complexity. Therefore, most existing methods employ heuristic and/or greedy strategies to construct the hierarchical structure. Generally, these methods can be roughly categorized into two lines, i.e., agglomerative [8, 11, 7] and divisive [12, 9, 13]. Both types usually construct a hierarchical structure in a single-pass manner, i.e., they construct a tree in one pass, and thus there is no guarantee that the hierarchical structure is globally reasonable, and it is generally difficult to improve it further. For example, agglomerative hierarchical clustering builds the tree by bottom-up merging: it starts by assigning each data point to a separate cluster and then recursively merges clusters that are similar according to a predefined metric. The merging rules are usually heuristic (e.g., Euclidean distance between cluster means or distance between nearest points). These existing methods may be effective in practice, but they are not yet well understood. The main reason is that these methods are specified procedurally rather than in terms of the objective functions they are trying to optimize [14]. With an explicit objective function, the problem definition is clearer and more convenient to analyze.

Figure 1.

Bayesian hierarchical $K$-means clustering. The data at the $l^{\text{th}}$ layer, i.e., $\{\mu_{j}^{l}\}$, should not only be the cluster centroids of the $(l-1)^{\text{th}}$ layer, but should also have good clustering properties (as input data) for the $(l+1)^{\text{th}}$ layer. The dashed red circle denotes the dummy node of the top layer, where all data points are merged into one cluster.

The recent work in [14] proposes a cost function which assigns a score to any possible tree given pairwise similarities between data points. The tree corresponds to the hierarchical decomposition of the data and its score reflects the quality of the solution. Unfortunately, the objective function of this method is NP-hard to optimize. Although the performance is improved by introducing ultrametrics [23], the computational complexity is still an issue. Instead of using complex objective functions, we introduce, under the hierarchical Bayesian model [20], a novel explicit objective function for hierarchical clustering from the view of probability distributions, which can be easily optimized. Bayesian hierarchical models have been widely used in many applications, including computer vision [1, 4, 33], decision making [3, 5] and signal processing [6]. Specifically, based on the well-known $K$-means algorithm, we develop an elegant hierarchical clustering model which links hierarchical $K$-means clustering and the Bayesian hierarchical model on a principled probabilistic foundation. Different from the traditional greedy manner, our method jointly considers different layers of a tree in a bidirectional, network-like manner: different layers are stacked together, so each layer can be updated iteratively while taking the other layers into consideration. It is worth noting that the terms "network-like" and "bidirectional" indicate that our model is layer-cascaded and can be updated in both directions, i.e., bottom-up and top-down. The model is shown in Fig. 1, and Fig. 2 demonstrates the clustering result of our model on real-world data.

Existing Bayesian methods (e.g., [2]) usually place a predefined prior over all possible trees and then sample from the posterior distribution given the observations. Thus, these methods tend to be rather complex and difficult to optimize, which has also been recognized in [14]. In this paper, we provide a simple, interpretable and effective global objective function for hierarchical clustering. Our model is derived from the Bayesian hierarchical model and takes the form of hierarchical $K$-means with interacting cascaded layers; it has an explicit global objective function, which makes inference much easier and more effective. Experiments on both synthetic data and benchmark datasets demonstrate the effectiveness of our model over state-of-the-art ones.

To summarize, the main contribution of this work includes:

We propose an explicit objective function for hierarchical clustering, termed as Bayesian hierarchical $K$ -means (BHK-means).

In our method, we construct a cascaded clustering tree in which all layers interact in a network-like manner.

According to the global hierarchical clustering objective function, the clustering results of each layer are improved dynamically, and the objective is solved with the Expectation-Maximization algorithm.

Figure 2.

Hierarchical clustering result on toy data (selected from Caltech-256 Object Category Dataset [30]). The non-leaf nodes are visualized by averaging the images contained in the corresponding clusters. The dashed green rectangles indicate clusters discovered, while the highlighted images with red borders are wrongly clustered.

2. Related work

The best-known hierarchical clustering models construct the tree structure in a bottom-up agglomerative way, i.e., single linkage (SL), average linkage (AL), and complete linkage (CL) [8], and their variants [42]. These methods usually rely on heuristic strategies instead of optimizing explicit objective functions, so their performance is often lower than that of clustering methods with global objective functions. On the other hand, the method in [14] proposes an explicit cost function for hierarchical clustering by considering the entire structure of the tree, and its computational complexity is further improved by HCSM [23] under a reasonable approximation. However, its computational cost is still high, which makes it difficult to handle large datasets. Some hierarchical clustering methods have also been proposed in the context of systematic genetics and taxonomy [37, 38, 39]. The LP (linear programming) relaxation of hierarchical clustering was studied in [40], where the goal is to fit a tree metric to given pairwise dissimilarities of data. Different from the above methods, our method uses an interpretable, effective and efficient global objective function for hierarchical clustering under the hierarchical probability framework. Our method achieves a balance between performance and computation: its computational complexity is comparable with that of $K$-means clustering. In the experiments, we run the code of all methods on a computer with an Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60 GHz.

3. Background

3.1 $K$ -means clustering with Mixture of Gaussians

$K$-means clustering can be expressed with a Mixture of Gaussians (MoG), which inspires us to extend traditional $K$-means from the view of probability distributions, specifically the hierarchical probability model, for hierarchical clustering. Firstly, let us recall the connection between $K$-means clustering and the MoG model. Suppose the data are generated from the following mixture of $K$ Gaussian distributions, where $\mathcal{N}(\cdot)$ denotes the Gaussian density of a data point $\mathbf{x}$:

$\displaystyle p(\mathbf{x})=\sum_{k=1}^{K}\pi_{k}\mathcal{N}(\mathbf{x}|{\mu}_{k},{\Sigma}_{k})\quad\text{s.t.}\quad\sum_{k=1}^{K}\pi_{k}=1,$ (1)

where ${\mu}_{k}$ and ${\Sigma}_{k}$ are the mean and covariance matrix of the $k^{\text{th}}$ component, respectively, and $\pi_{k}$ weights the $k^{\text{th}}$ Gaussian component. To connect $K$-means and MoG, we introduce a latent variable $\mathbf{z}$, a $K$-dimensional binary random variable with 1-of-$K$ representation. The joint distribution is then expressed as

$\displaystyle p(\mathbf{x},\mathbf{z})=p(\mathbf{z})p(\mathbf{x}|\mathbf{z})\quad\text{s.t.}\quad p(z_{k}=1)=\pi_{k}.$ (2)

Based on the above, the assignment probability can be derived as

$\displaystyle\gamma(z_{k})=p(z_{k}=1|\mathbf{x})=\frac{p(z_{k}=1)p(\mathbf{x}|z_{k}=1)}{\sum_{j=1}^{K}p(z_{j}=1)p(\mathbf{x}|z_{j}=1)}=\frac{\pi_{k}p(\mathbf{x}|{\mu}_{k},{\Sigma}_{k})}{\sum_{j=1}^{K}\pi_{j}p(\mathbf{x}|{\mu}_{j},{\Sigma}_{j})}$ (3)

$\displaystyle\text{with}\quad p(\mathbf{x}|{\mu}_{k},{\Sigma}_{k})=\frac{1}{(2\pi\epsilon)^{D/2}}\exp\bigg\{-\frac{1}{2\epsilon}\|\mathbf{x}-{\mu}_{k}\|^{2}\bigg\},$

where $D$ is the dimensionality of $\mathbf{x}$ and the constant $\pi$ in the normalizer is not to be confused with the mixing weights $\pi_{k}$. The covariance matrices of the mixture components are given by $\epsilon\mathbf{I}$, with $\epsilon$ being a variance parameter shared by all components and $\mathbf{I}$ the identity matrix. For a particular data point $\mathbf{x}_{n}$, the normalizer cancels and we have

$\displaystyle\gamma(z_{nk})=\frac{\pi_{k}\exp\{-\|\mathbf{x}_{n}-{\mu}_{k}\|^{2}/2\epsilon\}}{\sum_{j}\pi_{j}\exp\{-\|\mathbf{x}_{n}-{\mu}_{j}\|^{2}/2\epsilon\}},$ (4)

where $\gamma(z_{nk})$ indicates the probability of assigning the sample $\mathbf{x}_{n}$ to the $k^{\text{th}}$ component or cluster. By considering the limit case $\epsilon\rightarrow 0$ , we have

$\displaystyle\gamma(z_{nk})=\left\{\begin{array}{ll}1,&\text{if }k={\arg\min}_{j}\|\mathbf{x}_{n}-{\mu}_{j}\|^{2}\\ 0,&\text{otherwise}.\end{array}\right.$ (5)

Based on the above analysis, it is observed that there is a close connection between $K$ -means and MoG. Inspired by this observation, in this work we aim at extending $K$ -means to hierarchical clustering under the hierarchical probability framework.
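To make the limit in Eqs (4) and (5) concrete, the following is a minimal sketch in plain Python. The function name `responsibilities` and the two-component toy setup are illustrative assumptions, not part of the original method. It evaluates Eq. (4) for a single point and shows the soft assignment hardening toward the nearest mean as $\epsilon$ shrinks.

```python
import math

def responsibilities(x, means, weights, eps):
    """Evaluate Eq. (4) for one point x with shared isotropic variance eps."""
    # Log of each numerator term: log(pi_k) - ||x - mu_k||^2 / (2 * eps).
    logits = [
        math.log(w) - sum((xi - mi) ** 2 for xi, mi in zip(x, m)) / (2.0 * eps)
        for w, m in zip(weights, means)
    ]
    # Subtract the largest logit before exponentiating for numerical stability.
    top = max(logits)
    unnorm = [math.exp(v - top) for v in logits]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical toy setup: two equally weighted components, x closer to the first.
means = [[0.0, 0.0], [2.0, 0.0]]
weights = [0.5, 0.5]
x = [0.8, 0.0]

for eps in (10.0, 1.0, 0.1, 0.001):
    print(eps, [round(g, 4) for g in responsibilities(x, means, weights, eps)])
```

For large $\epsilon$ the two responsibilities are nearly equal, while for small $\epsilon$ almost all the mass moves to the closest mean, which is the hard assignment of Eq. (5).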

3.2 Bayesian hierarchical model

The connection between MoG and the $K$-means clustering algorithm inspires us to further investigate hierarchical clustering under the probability framework. Generally, a Bayesian hierarchical model uses two important concepts in deriving the posterior distribution [20], namely: (1) hyperparameter: a parameter of the prior distribution; (2) hyperprior: the distribution of a hyperparameter. Taking the 2-layer hierarchical probability model as an example, we have

$\displaystyle\begin{aligned}\mathbf{x}|{\theta},\phi&\sim p(\mathbf{x}|{\theta},\phi),\\ {\theta}|\phi&\sim p({\theta}|\phi),\\ \phi&\sim p(\phi),\end{aligned}$ (6)

where $p(\mathbf{x}|{\theta},\phi)$ is the likelihood and $p({\theta},\phi)$ is its prior distribution. $\phi$ is the hyperparameter with hyperprior distribution $p(\phi)$ . The prior distribution can be broken down into

$\displaystyle p({\theta},\phi)=p({\theta}|\phi)p(\phi).$ (7)

Accordingly, by using Bayes' theorem, we can maximize the following posterior to obtain the model parameters:

$\displaystyle p({\theta},\phi|\mathbf{x})=\frac{p(\mathbf{x}|{\theta},\phi)p({\theta},\phi)}{p(\mathbf{x})},$ (8)

$\displaystyle p({\theta},\phi|\mathbf{x})\propto p(\mathbf{x}|{\theta})p({\theta}|\phi)p(\phi).$ (9)

The likelihood $p(\mathbf{x}|{\theta})$ depends on $\phi$ only through ${\theta}$ . This is the key for constructing hierarchical structure in our model, which builds connections between different layers.
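As a short worked step (our reading of the derivation, not spelled out in the source), substituting Eq. (7) into Eq. (8) and using the fact that $\mathbf{x}$ depends on $\phi$ only through ${\theta}$ yields Eq. (9). Taking the negative logarithm then turns the product into a sum of per-level terms, which is the form that a layer-wise sum of squared errors can later take under Gaussian assumptions:

```latex
\begin{aligned}
p(\theta,\phi \mid \mathbf{x})
  &= \frac{p(\mathbf{x}\mid\theta,\phi)\,p(\theta\mid\phi)\,p(\phi)}{p(\mathbf{x})}
   \;\propto\; p(\mathbf{x}\mid\theta)\,p(\theta\mid\phi)\,p(\phi), \\
-\log p(\theta,\phi \mid \mathbf{x})
  &= -\log p(\mathbf{x}\mid\theta) \;-\; \log p(\theta\mid\phi) \;-\; \log p(\phi) \;+\; \text{const}.
\end{aligned}
```

Here the constant collects $\log p(\mathbf{x})$, which does not depend on ${\theta}$ or $\phi$.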

4. Bayesian hierarchical K-means clustering

In this section, we will introduce our Bayesian hierarchical $K$ -means clustering (BHK-means) from the view of hierarchical probability framework. Our method is derived from Bayesian hierarchical model which is based on Bayes’ theorem. In our model, the adjacent layers are linked under Bayesian hierarchical model, which is the key difference from the flat clustering model, e.g., $K$ -means.

4.1 Bayesian hierarchical Mixture of Gaussian

Our goal is multi-layer hierarchical clustering, so we extend the above two-stage model to multi-stage Bayesian hierarchical one. Specifically, we extend MoG to hierarchical MoG under Bayesian hierarchical model. By denoting the parameter of the $l^{\text{th}}$ layer as ${\theta}^{l}$ , the parameter of the $l^{\text{th}}$ layer has the following distribution

$\displaystyle{\theta}^{l}|{\theta}^{l+1}\sim p({\theta}^{l}|{\theta}^{l+1}),$ (10)

where the parameter of each layer (e.g., ${\theta}^{l}$) is generated from a common population whose distribution is governed by a hyperparameter (e.g., ${\theta}^{l+1}$).

Specifically, for multi-layer model we have

$\displaystyle\mathbf{x}|{\theta}^{1},{\theta}^{2},\ldots,{\theta}^{L}\sim p(\mathbf{x}|{\theta}^{1},{\theta}^{2},\ldots,{\theta}^{L}).$ (11)

Note that, similarly to the above mentioned two-layer hierarchical model, the likelihood depends on ${\theta}^{l}$ only through ${\theta}^{l-1}$ , thus, we break down the posterior distribution into

$\displaystyle p({\theta}^{1},{\theta}^{2},\ldots,{\theta}^{L}|\mathbf{x})=\frac{p(\mathbf{x}|{\theta}^{1})p({\theta}^{1}|{\theta}^{2})\ldots p({\theta}^{L-1}|{\theta}^{L})p({\theta}^{L})}{p(\mathbf{x})}.$ (12)

The parameter of the $l^{\text{th}}$ layer, i.e., ${\theta}^{l}$, is generated from a distribution governed by the hyperparameter ${\theta}^{l+1}$. Since, as mentioned above, the covariance matrices of the mixture components are given by $\epsilon\mathbf{I}$, this distribution has the following form

$\displaystyle p({\theta}^{l}|{\theta}^{l+1})=p({\mu}^{l}|\{{\mu}_{k}^{l+1}\}_{k=1}^{C^{l+1}})=\sum_{k=1}^{C^{l+1}}\pi_{k}^{l+1}\mathcal{N}({\mu}^{l}|{\mu}^{l+1}_{k},\epsilon\mathbf{I}),$ (13)

where $C^{l+1}$ is the number of Gaussian components corresponding to the ${(l+1)}^{\text{th}}$ layer. Therefore, the hyperparameters of each layer are actually the means of certain number of Gaussian components. Accordingly, each data point is basically generated from the hierarchical probability model with the distribution governed by the parameters $\{{\theta}^{l}\}_{l=1}^{L}$ lying on a hierarchical structure, as shown in Fig. 1.

4.2 Objective function

According to Eqs (3)–(5), we maximize the likelihood of the observed data and obtain the following $K$ -means clustering objective

$\displaystyle\min_{\gamma_{nk},{\mu}_{k}}\sum_{n=1}^{N}\sum_{k=1}^{K}\gamma_{nk}\|\mathbf{x}_{n}-{\mu}_{k}\|^{2}\quad\text{s.t.}\quad\gamma_{nk}\in\{0,1\},\ \sum_{k=1}^{K}\gamma_{nk}=1,$ (14)

where $\gamma_{nk}\in\{0,1\}$ indicates the assignment of the $n^{\text{th}}$ point to the $k^{\text{th}}$ cluster.

Similarly, by maximizing the posterior probability $p({\theta}^{1},{\theta}^{2},\ldots,{\theta}^{L}|\mathbf{x})$ in the hierarchical model, the following objective function of hierarchical $K$-means clustering is derived:

$\displaystyle\min_{\gamma_{ij},{\mu}_{j},\gamma_{jk}^{l},{\mu}_{k}^{l}}\sum_{i=1}^{N}\sum_{j=1}^{K}\gamma_{ij}\|\mathbf{x}_{i}-{\mu}_{j}\|^{2}+\sum_{l=2}^{L}\sum_{j=1}^{C^{l-1}}\sum_{k=1}^{C^{l}}\gamma_{jk}^{l}\|{\mu}_{j}^{l-1}-{\mu}_{k}^{l}\|^{2}.$ (15)

Intuitively, by directly summing the SSE (Sum of Squared Errors) of different layers as in Eq. (15), the adjacent layers become dependent and can interact with each other. This drives the hierarchical clustering result towards a globally reasonable structure, in contrast to the greedy manner.

For conciseness, we define $\mathbf{x}_{i}={\mu}_{i}^{0}$ , $C^{0}=N$ and therefore have the following objective function

$\displaystyle\min_{\gamma_{jk}^{l},{\mu}_{j}^{l}}\sum_{l=1}^{L}\lambda^{l}\sum_{j=1}^{C^{l-1}}\sum_{k=1}^{C^{l}}\gamma_{jk}^{l}\|{\mu}_{j}^{l-1}-{\mu}_{k}^{l}\|^{2}\quad\text{s.t.}\quad\gamma_{jk}^{l}\in\{0,1\},\ \sum_{k=1}^{C^{l}}\gamma_{jk}^{l}=1,$ (16)

where the weight factor $\lambda^{l}>0$ encodes the degree of prior belief in the $l^{\text{th}}$ layer. There are two reasons for setting weights for different layers: 1) Since the number of clusters decreases for the higher layers, the SSE of higher layers may become much smaller, and setting proper weights can avoid unbalanced SSE across layers; 2) In practice, the user can set different weights for different layers according to prior knowledge. For example, if we clearly know that there are $K$ categories on the $l^{\text{th}}$ layer, then we can give a much larger weight to the $l^{\text{th}}$ layer.
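The weighted objective of Eq. (16) can be evaluated directly once the centroids and hard assignments of every layer are given. The sketch below is one possible plain-Python rendering. The function names and the data layout (lists of lists, with each assignment stored as a parent index rather than a 0/1 matrix) are assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors given as lists."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def bhk_objective(data, centers, assign, weights):
    """Weighted sum of per-layer SSE, following Eq. (16).

    data    : list of points, treated as layer 0 (mu^0 = x)
    centers : centers[l-1] holds the centroids of layer l, for l = 1..L
    assign  : assign[l-1][j] is the index of the layer-l cluster that
              node j of layer l-1 belongs to (the nonzero gamma entry)
    weights : weights[l-1] is lambda^l
    """
    total = 0.0
    children = data
    for mu, a, lam in zip(centers, assign, weights):
        total += lam * sum(sq_dist(children[j], mu[k]) for j, k in enumerate(a))
        children = mu  # the centroids of this layer are the inputs of the next
    return total

# Hypothetical 1-D example: four points, two clusters, then a single root.
data = [[0.0], [1.0], [4.0], [5.0]]
centers = [[[0.5], [4.5]], [[2.5]]]
assign = [[0, 0, 1, 1], [0, 0]]
print(bhk_objective(data, centers, assign, [1.0, 1.0]))
print(bhk_objective(data, centers, assign, [1.0, 0.5]))
```

In this toy case the first layer contributes an SSE of 1 and the second an SSE of 8, so changing $\lambda^{2}$ from 1 to 0.5 rebalances the two layers in the way the weighting discussion above describes.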

We optimize our objective function in Eq. (16) layer by layer, alternately optimizing with respect to $\gamma_{jk}^{l}$ and ${\mu}^{l}_{j}$ (taking the $l^{\text{th}}$ layer as an example). Specifically, by taking the derivative of the objective function with respect to ${\mu}_{j}^{l}$ and setting it to zero, we have

$\displaystyle-2\lambda^{l}\sum_{i=1}^{C^{l-1}}\gamma_{ij}^{l}({\mu}_{i}^{l-1}-{\mu}_{j}^{l})+2\lambda^{l+1}\sum_{k=1}^{C^{l+1}}\gamma_{jk}^{l+1}({\mu}_{j}^{l}-{\mu}_{k}^{l+1})=0.$ (17)

Accordingly, we can update ${\mu}_{j}^{l}$ with the following rule

$\displaystyle{\mu}_{j}^{l}=\frac{2\lambda^{l}\sum_{i=1}^{C^{l-1}}\gamma_{ij}^{l}{\mu}_{i}^{l-1}+2\lambda^{l+1}\sum_{k=1}^{C^{l+1}}\gamma_{jk}^{l+1}{\mu}_{k}^{l+1}}{2\lambda^{l}\sum_{i=1}^{C^{l-1}}\gamma_{ij}^{l}+2\lambda^{l+1}\sum_{k=1}^{C^{l+1}}\gamma_{jk}^{l+1}}.$ (18)
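As a reading aid (our simplification, not stated in the source), the constraint $\sum_{k}\gamma_{jk}^{l+1}=1$ from Eq. (16) means that each centroid has exactly one parent. Writing $n_{j}^{l}=\sum_{i}\gamma_{ij}^{l}$ for its number of children, $\bar{\mu}_{j}^{l-1}$ for their mean, and $\mathrm{pa}(j)$ for the index of the parent (all three are notation introduced here, and $n_{j}^{l}>0$ is assumed), the common factor 2 cancels and Eq. (18) reduces to a weighted average of the children's mean and the parent centroid:

```latex
\mu_j^{l}
  = \frac{\lambda^{l}\, n_j^{l}\, \bar{\mu}_j^{l-1} + \lambda^{l+1}\, \mu_{\mathrm{pa}(j)}^{l+1}}
         {\lambda^{l}\, n_j^{l} + \lambda^{l+1}},
\qquad
\bar{\mu}_j^{l-1} = \frac{1}{n_j^{l}} \sum_{i=1}^{C^{l-1}} \gamma_{ij}^{l}\, \mu_i^{l-1}.
```

For instance, with $\lambda^{l}=\lambda^{l+1}=1$, three children whose mean is 2 and a parent located at 6 (scalars for simplicity), the update gives $(3\cdot 2+6)/4=3$. Dropping the parent term recovers the usual $K$-means centroid update.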

Note that, the updating of cluster centers of the current layer ${\mu}_{j}^{l}$ only depends on the adjacent layers, i.e., the ${(l-1)}^{\text{th}}$ and ${(l+1)}^{\text{th}}$ layers. Similar to $K$ -means clustering, the assignments can be updated with the following rule

$\displaystyle\gamma_{ij}^{l}=\left\{\begin{array}{ll}1,&\text{if }j={\arg\min}_{j'}\|{\mu}_{i}^{l-1}-{\mu}_{j'}^{l}\|^{2}\\ 0,&\text{otherwise}.\end{array}\right.$ (19)

In practice, we initialize our hierarchical model with $K$-means clustering in a bottom-up manner. Then, we optimize our objective function with the EM (Expectation-Maximization) algorithm, so convergence to a local optimum is guaranteed [27]. Our model is based on the Bayesian hierarchical model, which is powerful because the parameters of different levels are coupled, allowing the different layers to globally share statistical strength. For clarity, the proposed method is summarized in Algorithm 1.

Algorithm 1: Bayesian Hierarchical $K$ -means Clustering

Input: $N$ data vectors $\{\mathbf{x}_{i}\}_{i=1}^{N}$ and the number of clusters on the $l^{\text{th}}$ level: $C^{l}$ .

1: $\textit{iter}=0$

2: repeat

3: $\textit{iter}=\textit{iter}+1$

4: for $l=1:L$

5: Update $\{{\mu}_{j}^{l}\}_{j=1}^{C^{l}}$ according to Eq. (18)

6: end

7: for $l=1:L$

8: Update $\{\gamma_{ij}^{l}\}_{i=1,j=1}^{i=C^{l-1},j=C^{l}}$ according to Eq. (19)

9: end

10: until convergence reached.

Output: Clustering results ${\mu}_{j}^{l}$ and $\gamma_{ij}^{l}$ .
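The loop of Algorithm 1 can be sketched in plain Python as follows. This is an illustrative rendering under stated assumptions rather than the authors' implementation: the function names, the fixed iteration count in place of a convergence test, the parent-index storage of $\gamma$, and the handling of the top layer (which has no parent term) are all choices made here. Initialization uses single-pass bottom-up $K$-means, as described in the text.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centers):
    """Index of the closest centroid, i.e. the hard assignment of Eq. (19)."""
    return min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))

def kmeans(points, k, rng, iters=20):
    """Plain K-means, used only for the bottom-up initialization."""
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assign = [nearest(p, centers) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers

def objective(layers, assign, lam):
    """Eq. (16): weighted SSE summed over layers; layers[0] is the data."""
    return sum(
        lam[l] * sum(sq_dist(layers[l][i], layers[l + 1][k]) for i, k in enumerate(assign[l]))
        for l in range(len(assign))
    )

def bhk_means(data, sizes, lam, iters=10, seed=0):
    """sizes[l-1] = C^l and lam[l-1] = lambda^l for l = 1..L (sizes assumed decreasing)."""
    rng = random.Random(seed)
    L, dim = len(sizes), len(data[0])
    layers = [data]
    for c in sizes:  # single-pass bottom-up initialization
        layers.append(kmeans(layers[-1], c, rng))
    assign = [[nearest(p, layers[l + 1]) for p in layers[l]] for l in range(L)]
    history = [objective(layers, assign, lam)]
    for _ in range(iters):
        for l in range(1, L + 1):  # centroid update, Eq. (18)
            for j in range(len(layers[l])):
                num, w = [0.0] * dim, 0.0
                for i, a in enumerate(assign[l - 1]):  # pull from children
                    if a == j:
                        num = [n + lam[l - 1] * x for n, x in zip(num, layers[l - 1][i])]
                        w += lam[l - 1]
                if l < L:  # pull from the parent; the top layer has none
                    parent = layers[l + 1][assign[l][j]]
                    num = [n + lam[l] * x for n, x in zip(num, parent)]
                    w += lam[l]
                if w > 0:
                    layers[l][j] = [n / w for n in num]
        # assignment update, Eq. (19), for every layer
        assign = [[nearest(p, layers[l + 1]) for p in layers[l]] for l in range(L)]
        history.append(objective(layers, assign, lam))
    return layers[1:], assign, history

# Hypothetical usage: three 2-D blobs, a 6-3-1 tree, equal layer weights.
rng = random.Random(1)
blobs = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
data = [[rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)] for cx, cy in blobs for _ in range(20)]
centers, assign, history = bhk_means(data, sizes=[6, 3, 1], lam=[1.0, 1.0, 1.0])
print([len(c) for c in centers], history[0] >= history[-1])
```

Because each centroid update exactly minimizes the terms of Eq. (16) that involve that centroid, and each reassignment picks the nearest parent, the recorded objective values should never increase from one iteration to the next.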

Remarks

Firstly, our method is derived directly from the Bayesian hierarchical model. $K$-means and our method are approximations of the Mixture of Gaussians and of the Bayesian hierarchical model, respectively. This approximation gives the objective function an elegant form, makes optimization convenient, and provides a probabilistic interpretation of the objective. Secondly, although the number of clusters for each layer must be specified manually, this setting offers two kinds of flexibility: (1) the number of layers can be set according to the user's needs or the specific application; (2) prior knowledge about the number of clusters can be directly incorporated to guide the tree construction, which benefits the hierarchical clustering. Thanks to the non-increasing property of the optimization, the user can balance performance and computation time by manually setting the number of iterations. Thirdly, the complexity of our method in each iteration is $O(\textit{LNKD})$, where $L$, $N$, $K$ and $D$ are the number of layers, the size of the dataset, the number of clusters, and the dimensionality, respectively. It is worth noting that this is comparable with the per-iteration complexity of $K$-means clustering, i.e., $O(\textit{NKD})$.
5. Experiments

5.1 Experiment setting

Dataset

We use 4 datasets from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/), which have also been widely used to evaluate existing hierarchical clustering methods [7, 23]. Specifically, the 4 real-world datasets used in our experiments are: glass (214 examples, 6 classes, 9 attributes), wdbc (569 examples, 2 classes, 30 attributes), auto (205 examples, 6 classes, 25 attributes) and ionosphere (351 examples, 2 classes, 34 attributes). It is worth noting that, due to the high computational cost, the authors of [7, 23] sampled a small subset of data from each original dataset. We run each experiment 10 times and report the average performance and standard deviation.

Compared methods

We compare our proposed approach with several hierarchical clustering methods, including conventional agglomerative single linkage (SL), complete linkage (CL) and average linkage (AL), and binary $K$-means (BiKM) [34], which integrates conventional partitional clustering ($K$-means) with divisive hierarchical clustering. We also compare our method with the recently proposed HCSM [23], which improves the method in [14] with a global objective function. The naive hierarchical $K$-means clustering, termed SPHK (Single Pass Hierarchical $K$-means clustering), performs $K$-means clustering for each layer in a bottom-up manner in a single pass. Since it acts as the initialization method for the hierarchical structure in our experiments, we consider it the baseline for evaluating the effectiveness of our improvement.

Evaluation metrics

Since all compared methods except the one in [23] are based on Euclidean distance, SSE (Sum of Squared Errors) is a reasonable measure of cluster compactness. Different metrics favor different properties, so we employ three types of metrics: SSE as the internal criterion, normalized mutual information (NMI) and accuracy (ACC) as external criteria, and dendrogram purity as the external criterion for the hierarchical structure. For NMI, ACC and dendrogram purity, a higher value indicates better clustering performance, whereas for SSE a lower value is better. Note that, like the work in [7], we employ dendrogram purity (DPurity) as the main metric since it evaluates the hierarchical structure. We specify the definition of accuracy used in our experiments as follows. Given a sample $\mathbf{x}_{i}$, we denote its cluster label and class label as $\omega_{i}$ and $c_{i}$, respectively; then we have

$\displaystyle\textit{ACC}=\frac{\sum_{i=1}^{N}\delta(c_{i},\text{map}(\omega_{i}))}{N},$ (20)

where $\delta(a,b)=1$ when $a=b$, and $\delta(a,b)=0$ otherwise. $\text{map}(\omega_{i})$ is the permutation map function, which maps cluster labels to class labels. The best map can be obtained by the Kuhn-Munkres algorithm. Normalized mutual information (NMI) [24] is defined as

$\displaystyle\textit{NMI}(\Omega,\mathbb{C})=\frac{I(\Omega;\mathbb{C})}{(H(\Omega)+H(\mathbb{C}))/2}$ (21)

$\displaystyle\text{with}\quad I(\Omega;\mathbb{C})=\sum_{k}\sum_{j}P(\omega_{k}\cap c_{j})\log\frac{P(\omega_{k}\cap c_{j})}{P(\omega_{k})P(c_{j})}=\sum_{k}\sum_{j}\frac{|\omega_{k}\cap c_{j}|}{N}\log\frac{N|\omega_{k}\cap c_{j}|}{|\omega_{k}||c_{j}|},$

$\displaystyle H(\Omega)=-\sum_{k}P(\omega_{k})\log P(\omega_{k})=-\sum_{k}\frac{|\omega_{k}|}{N}\log\frac{|\omega_{k}|}{N},$

$\displaystyle H(\mathbb{C})=-\sum_{j}P(c_{j})\log P(c_{j})=-\sum_{j}\frac{|c_{j}|}{N}\log\frac{|c_{j}|}{N},$

where the probabilities $P(\omega_{k})$ , $P(c_{j})$ , and $P(\omega_{k}\cap c_{j})$ correspond to a data point being in cluster $\omega_{k}$ , class $c_{j}$ , and in the intersection of $\omega_{k}$ and $c_{j}$ , respectively. $\Omega=\{\omega_{1},\omega_{2},\ldots,\omega_{K}\}$ denotes the set of clusters and $\mathbb{C}=\{c_{1},c_{2},\ldots,c_{J}\}$ is the set of classes. For the definition of dendrogram purity, please refer to the work in [7].
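The two external metrics can be sketched in a few lines of plain Python. Note one deliberate substitution: the text obtains the best label map with the Kuhn-Munkres algorithm, whereas the sketch below searches all one-to-one maps exhaustively, which gives the same optimum but is only practical for a handful of clusters. It also assumes there are no more clusters than classes. The function names are illustrative.

```python
import math
from collections import Counter
from itertools import permutations

def accuracy(classes, clusters):
    """Eq. (20), with the best cluster-to-class map found by brute force.

    Assumption: the number of distinct clusters does not exceed the number
    of distinct classes. Kuhn-Munkres would find the same map much faster.
    """
    class_ids = sorted(set(classes))
    cluster_ids = sorted(set(clusters))
    joint = Counter(zip(clusters, classes))  # |omega_k ∩ c_j|
    best = 0
    for perm in permutations(class_ids, len(cluster_ids)):
        hits = sum(joint[(w, c)] for w, c in zip(cluster_ids, perm))
        best = max(best, hits)
    return best / len(classes)

def entropy(labels):
    """H = -sum_k P(k) log P(k), with P estimated from label counts."""
    n = len(labels)
    return -sum((m / n) * math.log(m / n) for m in Counter(labels).values())

def nmi(classes, clusters):
    """Eq. (21): mutual information normalized by the mean of the entropies."""
    n = len(classes)
    joint = Counter(zip(clusters, classes))
    size_w, size_c = Counter(clusters), Counter(classes)
    mi = sum(
        (m / n) * math.log(n * m / (size_w[w] * size_c[c]))
        for (w, c), m in joint.items()
    )
    denom = (entropy(clusters) + entropy(classes)) / 2
    return mi / denom if denom > 0 else 0.0

# Hypothetical labels: the first clustering is a relabeling of the classes,
# the second merges two of the three classes into one cluster.
classes = [0, 0, 1, 1, 2, 2]
print(accuracy(classes, [1, 1, 0, 0, 2, 2]), nmi(classes, [1, 1, 0, 0, 2, 2]))
print(accuracy(classes, [0, 0, 0, 0, 1, 1]), nmi(classes, [0, 0, 0, 0, 1, 1]))
```

Both metrics are invariant to how the cluster labels are named, which is why the relabeled clustering in the first call scores as a perfect match.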

5.2 Experiment results

Figure 3.

Visualization of clustering results on synthetic data with t-SNE [22]. The top row and the bottom row, with dashed borders, correspond to the results of SPHK and our BHK-means, respectively. The parts marked with solid rectangles highlight clear differences between the results of SPHK and our BHK-means.

5.2.1 Results on synthetic data

Firstly, we generate synthetic data from three different Gaussian distributions, with 100 data points sampled from each Gaussian, as shown in Fig. 3d. The means of these three Gaussians are [0.5, 0.5, 0.5], [1.5, 1.5, 1.5] and [$-1.25$, 1.25, $-1.25$], respectively, with the same covariance matrix [0.3, 0, 0; 0, 0.3, 0; 0, 0, 0.3]. Using $t$-distributed stochastic neighbor embedding (t-SNE) [22], Fig. 3a–c provides an intuitive comparison of the clustering results of different layers between SPHK and our method. The top and bottom rows correspond to SPHK and BHK-means, respectively. From Fig. 3a, we can observe that our clustering result at the first layer is not much better than that of SPHK, because SPHK greedily optimizes this layer. In contrast, according to Fig. 3b and c, the clustering results of our method in higher layers are more consistent with the ground truth than those of SPHK.
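The synthetic setup described above can be reproduced approximately with the standard library. The sketch below is illustrative only: the function name, the seed, and the use of Python's `random` module are assumptions, so the sampled points will not match those of Fig. 3. Because the covariance is $0.3\,\mathbf{I}$, each coordinate is drawn independently with standard deviation $\sqrt{0.3}$.

```python
import math
import random

def sample_synthetic(n_per_cluster=100, var=0.3, seed=0):
    """Draw points from the three isotropic 3-D Gaussians described in the text."""
    rng = random.Random(seed)
    means = [[0.5, 0.5, 0.5], [1.5, 1.5, 1.5], [-1.25, 1.25, -1.25]]
    std = math.sqrt(var)  # diagonal covariance var * I means independent coordinates
    points, labels = [], []
    for label, mean in enumerate(means):
        for _ in range(n_per_cluster):
            points.append([rng.gauss(m, std) for m in mean])
            labels.append(label)
    return points, labels

points, labels = sample_synthetic()
print(len(points), len(points[0]), sorted(set(labels)))
```

Note that the first two means are only about 1.7 apart in Euclidean distance, so those two Gaussians overlap noticeably at this variance.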

5.2.2 Results on real-world datasets

To specify the number of clusters at each layer, we first suppose that the ground-truth number of classes is $K$, which corresponds to the number of clusters at the ${(L-1)}^{\text{th}}$ layer (the dummy node acts as the $L^{\text{th}}$ layer, as shown in Fig. 1). Therefore, the number of clusters at each layer is limited to $\{K,K+1,\ldots,N\}$. Accordingly, we set the number of clusters at the $l^{\text{th}}$ layer to $C^{l}=\min(K\alpha^{L-1-l},N)$ with $l\in\{0,\ldots,L-1\}$, where $N$ is the total number of data points and $\alpha$ is a positive integer. That is, unless capped by $N$, the number of clusters at the $(l-1)^{\text{th}}$ layer is $\alpha$ times that at the $l^{\text{th}}$ layer. We ran experiments with different values of $\alpha$; our results were consistently better than those of the compared methods, so we report the results with $\alpha=6$ for conciseness.
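The layer-size schedule $C^{l}=\min(K\alpha^{L-1-l},N)$ is easy to tabulate. In the sketch below the function name is hypothetical, and the choice $L=3$ for the glass dataset is an assumption made purely for illustration, since the text does not state the depth used.

```python
def layer_sizes(K, alpha, L, N):
    """Return [C^0, C^1, ..., C^(L-1)] with C^l = min(K * alpha**(L-1-l), N).

    The dummy root (layer L, a single cluster) is not included.
    """
    return [min(K * alpha ** (L - 1 - l), N) for l in range(L)]

# glass: N = 214 examples, K = 6 classes, alpha = 6; L = 3 is an assumed depth.
print(layer_sizes(K=6, alpha=6, L=3, N=214))
```

With these values the bottom layer is capped at the 214 data points themselves, followed by 36 and then 6 clusters, so the layer with $K$ clusters sits directly below the dummy root.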

Evaluation with internal criterion

Firstly, we evaluate our method on the 4 UCI datasets in terms of the internal criterion, using two SSE metrics: cluster-oriented SSE and data-oriented SSE. The former measures the compactness of a cluster on the $l^{\text{th}}$ layer with the centroids of the clusters on the $(l-1)^{\text{th}}$ layer as inputs; it is consistent with the objective function of hierarchical $K$-means and is thus used for the comparison between SPHK and our method. The latter is measured in terms of the original data points and is applicable to all methods. Specifically, as the top row of Fig. 4 shows, different methods are compared at layers with the same number of clusters. The clustering results of our method are generally better than those of SPHK for most layers, and our method also achieves better performance than the other methods. Although SL is simple to implement, its performance is usually relatively low. Similarly, according to the bottom row of Fig. 4, our clustering result is not better than that of SPHK at the first layer due to the greedy property of SPHK, but our model performs much better than SPHK at the higher layers, which is consistent with the results on the synthetic data and with our global goal.

Figure 4.

Hierarchical clustering comparison in terms of SSE (normalized to [0, 1]) with respect to original data points and cluster centers, respectively. For the bottom row, dashed lines are the performance curves of 5 random runs and solid lines indicate their averages. Unlike SPHK, which works in a greedy manner, our model seeks the optimal hierarchical tree structure from a holistic view.

Table 1

Performance (mean $\pm$ standard deviation) of the compared methods. The red numbers indicate the best performers, and the blue numbers indicate the second-best performers. Our approach consistently outperforms the naive hierarchical $K$-means clustering method SPHK on all datasets. BiKM outperforms ours on wdbc and ionosphere in terms of NMI and ACC. The main reason is that these datasets have only 2 classes, which is rather advantageous to the (top-down) divisive BiKM. However, for the hierarchical metric DPurity, our approach performs much better than BiKM due to the global objective.

Method	Metric	Glass	Wdbc	Auto	Ionosphere
SL	NMI (%)	7.24 $\pm$ 0.00	0.52 $\pm$ 0.00	9.48 $\pm$ 0.00	0.87 $\pm$ 0.00
SL	Accuracy (%)	36.44 $\pm$ 0.00	62.91 $\pm$ 0.00	32.68 $\pm$ 0.00	64.38 $\pm$ 0.00
SL	DPurity (%)	45.02 $\pm$ 0.00	79.68 $\pm$ 0.00	22.28 $\pm$ 0.00	59.34 $\pm$ 0.00
CL	NMI (%)	37.89 $\pm$ 0.00	8.81 $\pm$ 0.00	9.32 $\pm$ 0.00	1.71 $\pm$ 0.00
CL	Accuracy (%)	48.59 $\pm$ 0.00	66.25 $\pm$ 0.00	32.19 $\pm$ 0.00	64.67 $\pm$ 0.00
CL	DPurity (%)	41.91 $\pm$ 0.00	78.74 $\pm$ 0.00	22.95 $\pm$ 0.00	57.00 $\pm$ 0.00
AL	NMI (%)	11.45 $\pm$ 0.00	8.81 $\pm$ 0.00	9.32 $\pm$ 0.00	0.87 $\pm$ 0.00
AL	Accuracy (%)	37.85 $\pm$ 0.00	66.25 $\pm$ 0.00	32.19 $\pm$ 0.00	64.38 $\pm$ 0.00
AL	DPurity (%)	47.27 $\pm$ 0.00	81.21 $\pm$ 0.00	23.29 $\pm$ 0.00	58.55 $\pm$ 0.00
BiKM	NMI (%)	38.26 $\pm$ 0.35	26.48 $\pm$ 0.00	9.35 $\pm$ 0.27	13.49 $\pm$ 0.00
BiKM	Accuracy (%)	42.67 $\pm$ 1.34	85.41 $\pm$ 0.00	32.68 $\pm$ 0.49	71.22 $\pm$ 0.00
BiKM	DPurity (%)	40.93 $\pm$ 0.12	74.93 $\pm$ 0.04	23.35 $\pm$ 0.05	47.99 $\pm$ 0.34
HCSM	NMI (%)	36.42 $\pm$ 0.61	3.32 $\pm$ 0.66	11.46 $\pm$ 0.85	3.26 $\pm$ 0.81
HCSM	Accuracy (%)	48.28 $\pm$ 1.38	67.09 $\pm$ 1.32	34.57 $\pm$ 0.13	67.50 $\pm$ 3.53
HCSM	DPurity (%)	31.26 $\pm$ 4.36	54.65 $\pm$ 2.23	25.21 $\pm$ 2.27	55.44 $\pm$ 3.38
SPHK	NMI (%)	37.18 $\pm$ 3.82	10.08 $\pm$ 2.40	11.33 $\pm$ 1.99	5.89 $\pm$ 3.10
SPHK	Accuracy (%)	48.57 $\pm$ 1.40	66.85 $\pm$ 1.14	35.36 $\pm$ 2.39	65.43 $\pm$ 1.12
SPHK	DPurity (%)	43.46 $\pm$ 0.96	68.58 $\pm$ 0.33	29.95 $\pm$ 0.94	66.88 $\pm$ 2.66
Ours	NMI (%)	38.42 $\pm$ 3.02	14.95 $\pm$ 2.39	11.64 $\pm$ 1.83	8.24 $\pm$ 4.71
Ours	Accuracy (%)	50.47 $\pm$ 1.63	69.17 $\pm$ 1.18	35.57 $\pm$ 2.41	68.00 $\pm$ 2.66
Ours	DPurity (%)	45.19 $\pm$ 1.46	79.96 $\pm$ 0.48	32.38 $\pm$ 0.48	67.52 $\pm$ 2.43

Evaluation with external criterion

With the ground-truth labels, we evaluate the clustering results at the layer with $K$ clusters in terms of external criteria, namely NMI and ACC, where $K$ is the number of ground-truth classes. We also measure performance in terms of dendrogram purity (DPurity). The detailed results are shown in Table 1. As expected, since our approach exploits a global objective, its clustering results improve over those of SPHK on every dataset and metric, which suggests the advantage of the proposed model for hierarchical clustering. The conventional approaches SL, CL and AL achieve promising dendrogram purity; however, their clustering results are unsatisfactory at the high level, i.e., the $(L-1)^{\text{th}}$ layer with $K$ clusters, where the NMI and ACC scores are relatively low. As shown in Fig. 5, the value of our objective function decreases monotonically over the iterations, which is consistent with the theoretical analysis. It is worth noting that, although the computational cost is high for large-scale data, our model can trade performance against computation time, because the objective converges quickly within the first few iterations.
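For readers who wish to reproduce the flat external criteria, a minimal sketch of NMI and ACC follows. It rests on assumptions that the text does not state: NMI is normalized by the geometric mean of the two entropies (other normalizations exist), and ACC is the best agreement over all one-to-one maps between clusters and classes, found here by brute force, which is practical only for small $K$ (the Hungarian algorithm is the usual choice otherwise). The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def nmi(truth, pred):
    """Normalized mutual information, normalized by sqrt(H(truth) * H(pred))."""
    n = len(truth)
    ct, cp, joint = Counter(truth), Counter(pred), Counter(zip(truth, pred))
    mi = sum(c / n * math.log(c * n / (ct[t] * cp[p]))
             for (t, p), c in joint.items())
    h_t = -sum(c / n * math.log(c / n) for c in ct.values())
    h_p = -sum(c / n * math.log(c / n) for c in cp.values())
    if h_t == 0.0 or h_p == 0.0:
        return 0.0  # convention for a degenerate (single-group) labelling
    return mi / math.sqrt(h_t * h_p)


def clustering_accuracy(truth, pred):
    """Best fraction of matched labels over all one-to-one cluster-to-class maps."""
    classes = sorted(set(truth))
    clusters = sorted(set(pred))
    # Pad with dummy classes so that surplus clusters can remain unmatched.
    slots = classes + [None] * max(0, len(clusters) - len(classes))
    joint = Counter(zip(truth, pred))
    best = 0
    for perm in permutations(slots, len(clusters)):
        hits = sum(joint[(cls, clu)] for cls, clu in zip(perm, clusters))
        best = max(best, hits)
    return best / len(truth)


truth = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]   # cluster ids are arbitrary; one point is misplaced
print(nmi(truth, pred), clustering_accuracy(truth, pred))
```

Both scores are invariant to a relabelling of the cluster ids, which is why they can be applied to the layer with $K$ clusters without aligning ids beforehand.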

Different from the above methods, our method utilizes an interpretable, effective and efficient global objective function for hierarchical clustering under the hierarchical probability framework, and it achieves a balance between performance and computation. The computational complexity of our method is comparable to that of $K$ -means clustering. In the experiments, our method processes more than 500 samples (e.g., wdbc) well within one minute (about 10.7 seconds on wdbc) while maintaining a satisfactory hierarchical clustering result. The running times are reported in Table 2.

Table 2

Running time (mean $\pm$ standard deviation) of the compared methods, in seconds

Method	Glass	Wdbc	Auto	Ionosphere
SL	0.0021 $\pm$ 0.0003	0.0089 $\pm$ 0.0031	0.0025 $\pm$ 0.0002	0.0029 $\pm$ 0.0002
CL	0.0016 $\pm$ 0.0002	0.0091 $\pm$ 0.0016	0.0017 $\pm$ 0.0003	0.0028 $\pm$ 0.0002
AL	0.0023 $\pm$ 0.0002	0.0092 $\pm$ 0.0022	0.0025 $\pm$ 0.0005	0.0029 $\pm$ 0.0005
BiKM	0.0184 $\pm$ 0.0003	0.0175 $\pm$ 0.0007	0.0277 $\pm$ 0.0055	0.0094 $\pm$ 0.0008
HCSM	103.04 $\pm$ 6.41	530.66 $\pm$ 13.51	165.89 $\pm$ 9.44	443.10 $\pm$ 10.24
SPHK	0.0223 $\pm$ 0.0011	0.0295 $\pm$ 0.0016	0.0222 $\pm$ 0.0014	0.0242 $\pm$ 0.0017
Ours	1.7545 $\pm$ 0.1128	10.6826 $\pm$ 0.6436	1.9769 $\pm$ 0.1222	8.1938 $\pm$ 0.1400

Table 3

Performance (mean $\pm$ standard deviation) of the compared methods on 2 relatively large datasets. In terms of ACC/NMI, BiKM performs best owing to its (top-down) divisive manner and greedy strategy. However, for the hierarchical metric DPurity, our approach outperforms BiKM on both datasets by a large margin owing to our global objective.

Method	Metrics	Scene	Caltech-256
SL	NMI (%)	0.52 $\pm$ 0.00	0.26 $\pm$ 0.00
	Accuracy (%)	14.98 $\pm$ 0.00	23.87 $\pm$ 0.00
	DPurity (%)	26.17 $\pm$ 0.00	45.17 $\pm$ 0.00
CL	NMI (%)	9.27 $\pm$ 0.00	23.53 $\pm$ 0.00
	Accuracy (%)	19.33 $\pm$ 0.00	38.73 $\pm$ 0.00
	DPurity (%)	28.04 $\pm$ 0.00	41.51 $\pm$ 0.00
AL	NMI (%)	0.89 $\pm$ 0.00	0.49 $\pm$ 0.00
	Accuracy (%)	15.17 $\pm$ 0.00	24.01 $\pm$ 0.00
	DPurity (%)	31.11 $\pm$ 0.00	52.73 $\pm$ 0.00
BiKM	NMI (%)	48.39 $\pm$ 1.18	50.35 $\pm$ 3.13
	Accuracy (%)	43.56 $\pm$ 1.53	54.75 $\pm$ 5.12
	DPurity (%)	25.76 $\pm$ 1.75	42.91 $\pm$ 2.51
SPHK	NMI (%)	39.15 $\pm$ 6.07	37.09 $\pm$ 9.25
	Accuracy (%)	29.81 $\pm$ 4.66	44.73 $\pm$ 8.81
	DPurity (%)	43.12 $\pm$ 3.89	71.60 $\pm$ 7.66
Ours	NMI (%)	41.33 $\pm$ 5.20	46.42 $\pm$ 8.50
	Accuracy (%)	32.07 $\pm$ 4.04	49.73 $\pm$ 7.56
	DPurity (%)	44.39 $\pm$ 3.95	73.57 $\pm$ 9.29

Figure 5.

Convergence curves on benchmark datasets.
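The monotone behaviour of such convergence curves can be illustrated with plain $K$ -means, whose alternating assignment and update steps never increase the SSE objective. The sketch below is not the BHK-means solver; it is ordinary Lloyd iteration on hypothetical one-dimensional data with a deliberately poor, fixed initialization, and the function name `kmeans_trace` is an assumption introduced for illustration.

```python
def kmeans_trace(points, init_centers, n_iter=5):
    """Run Lloyd's algorithm; return (labels, centers, SSE after each iteration)."""
    centers = [list(c) for c in init_centers]  # do not mutate the caller's list
    labels, trace = [], []
    for _ in range(n_iter):
        # Assignment step: nearest center under squared Euclidean distance.
        labels = [
            min(range(len(centers)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centers[k])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
        # Objective value after this iteration.
        trace.append(sum(sum((x - c) ** 2 for x, c in zip(p, centers[l]))
                         for p, l in zip(points, labels)))
    return labels, centers, trace


points = [[0.0], [1.0], [2.0], [9.0], [10.0], [11.0]]
labels, centers, trace = kmeans_trace(points, init_centers=[[0.0], [1.0]])
print(trace)
```

On this toy input most of the decrease happens in the first two iterations, after which the objective stays flat, which mirrors the early fast convergence that makes early stopping a reasonable way to save computation.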

5.2.3 Experiment on large dataset

Unlike existing hierarchical methods [23, 1], whose experiments were limited to small datasets by their high computational cost, we also use 2 relatively large image datasets in our experiments, i.e., the Natural Scene Category Dataset [1] (http://www-cvr.ai.uiuc.edu/ponce_grp/data/) and the Caltech-256 Object Category Dataset (http://www.vision.caltech.edu/Image_Datasets/Caltech256/). We use VGG-19 to extract deep features and sample subsets from these datasets: Scene (2762 examples, 8 classes, 4096 attributes) and Caltech-256 (3653 examples, 5 classes, 4096 attributes). The detailed results are shown in Table 3. Our approach achieves the best dendrogram purity on both datasets and improves over SPHK on all metrics, while BiKM attains the highest NMI and ACC. This further supports the potential of our model for hierarchical clustering on large datasets.
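Since dendrogram purity is the metric on which the hierarchical methods differ most in Table 3, a small sketch of how it can be computed may be helpful. It assumes the common definition: for every pair of leaves with the same class, take the smallest subtree containing both and record the fraction of its leaves that share that class, then average over all such pairs. Whether the experiments use exactly this variant is not stated here. Trees are represented as nested tuples of leaf indices, and all names are hypothetical.

```python
from itertools import combinations


def leaves(tree):
    """Leaf indices under a node; a leaf is an int, an internal node a tuple."""
    if isinstance(tree, int):
        return [tree]
    return [i for child in tree for i in leaves(child)]


def lca_leaves(tree, a, b):
    """Leaves of the smallest subtree that contains both leaf a and leaf b."""
    node = tree
    while True:
        for child in node:
            if isinstance(child, tuple):
                under = leaves(child)
                if a in under and b in under:
                    node = child      # descend: a smaller subtree still holds both
                    break
        else:
            return leaves(node)       # no child holds both, so node is the LCA


def dendrogram_purity(tree, labels):
    """Average purity of the LCA subtree over all same-class leaf pairs."""
    pairs = [(a, b) for a, b in combinations(range(len(labels)), 2)
             if labels[a] == labels[b]]
    total = 0.0
    for a, b in pairs:
        sub = lca_leaves(tree, a, b)
        total += sum(labels[i] == labels[a] for i in sub) / len(sub)
    return total / len(pairs)


labels = [0, 0, 1, 1]
good_tree = ((0, 1), (2, 3))   # classes are separated at the first split
bad_tree = ((0, 2), (1, 3))    # classes only meet at the root
print(dendrogram_purity(good_tree, labels), dendrogram_purity(bad_tree, labels))
```

This brute-force version recomputes leaf sets repeatedly and is meant for small trees only; on datasets of the size used here, same-class pairs are often sampled instead of enumerated.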

We also conduct an experiment on the SUN dataset [29] (https://groups.csail.mit.edu/vision/SUN/), which has a known hierarchical class structure. We employ the predefined subset with 324 well-sampled categories, comprising 5639 (single-labeled) images. The numbers of categories at the different layers (from bottom to top) are 324, 16 and 3, respectively. As shown in Fig. 6, our method performs better than all the other compared methods in terms of data-oriented SSE. Because of the scale of the plot, the curves of SPHK and ours appear similar at some layers. To be exact, the SSE values of SPHK (ours) from the first layer to the sixth layer are $2.18\times 10^{5}$ ($2.18\times 10^{5}$), $2.72\times 10^{6}$ ($1.87\times 10^{6}$), $1.66\times 10^{7}$ ($8.51\times 10^{6}$), $5.63\times 10^{7}$ ($3.05\times 10^{7}$), $3.93\times 10^{8}$ ($2.57\times 10^{8}$) and $7.43\times 10^{8}$ ($7.11\times 10^{8}$), respectively. HCSM is not compared on this dataset because of its high computational cost, as mentioned before. These results further validate empirically the potential of our method for hierarchical clustering.
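Because the per-layer SSE values on SUN span several orders of magnitude, a relative comparison is easier to read than the raw numbers. The short sketch below computes the relative reduction of our SSE with respect to SPHK at each layer. It assumes that, in each reported pair, the first value belongs to SPHK and the parenthesized value to our method, and the variable names are illustrative.

```python
# Reported data-oriented SSE on SUN, layers 1 to 6.
# Assumption: first value of each pair is SPHK, the parenthesized value is ours.
sphk = [2.18e5, 2.72e6, 1.66e7, 5.63e7, 3.93e8, 7.43e8]
ours = [2.18e5, 1.87e6, 8.51e6, 3.05e7, 2.57e8, 7.11e8]

# Relative reduction: fraction of SPHK's SSE removed by our method at each layer.
reduction = [(s - o) / s for s, o in zip(sphk, ours)]

for layer, r in enumerate(reduction, start=1):
    print(f"layer {layer}: {100 * r:.1f}% lower SSE than SPHK")
```

Under this reading the two methods tie at the first layer, the gap is largest at the intermediate layers, and it narrows again at the top layer, which is why the curves can look similar at some layers when plotted on a common scale.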

Figure 6.

Performance comparison on SUN.

6. Conclusion

In this paper, we have presented a simple yet effective model for hierarchical clustering. Our method has an explicit objective function, which makes the model more interpretable than traditional approaches, and it has a clear interpretation as a hierarchical probability model. Compared with the most recent method [23], which also provides an explicit objective function, the empirical results show that our method is more effective and efficient. The limitations of our approach include the bias towards spherical clusters and the computational complexity inherited from $K$ -means clustering. Owing to its simplicity, several directions deserve investigation in the future. First, side information [16, 17] and improved robustness [18, 19] could be incorporated into our model. Second, it will be interesting to apply kernel techniques to address the spherical bias inherited from $K$ -means. Last but not least, it is important to develop more efficient algorithms for very large-scale data such as ImageNet [41].

References

1. Fei-Fei and Perona, A Bayesian hierarchical model for learning natural scene categories, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), IEEE, 2005.
2. Neal R.M., Density modeling and clustering using Dirichlet diffusion trees, Bayesian Statistics, 2003, 619–629.
3. Schmid Christopher, Bayesian hierarchical models, in: Numerical Computer Methods, Part C, Methods in Enzymology, Volume 321, 2000, 305–330.
4. Wang and Grimson, Unsupervised Activity Perception by Hierarchical Bayesian Models, in: 2007 IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, 2007.
5. Hoef J.M.V. and Frost K.J., A Bayesian hierarchical model for monitoring harbor seal changes in Prince William Sound, Alaska, Environmental and Ecological Statistics 10(2) (2003), 201–219.
6. Chen and Varshney P.K., A Bayesian sampling approach to decision fusion using hierarchical models, IEEE Transactions on Signal Processing 50(8) (2002), 1809–1818.
7. Heller K.A. and Ghahramani, Bayesian hierarchical clustering, in: International Conference on Machine Learning, 2005.
8. Duda R.O. and Hart P.E., Pattern classification and scene analysis, Wiley, 1973.
9. Adams R.P., Ghahramani and Jordan M.I., Tree-structured stick breaking processes for hierarchical data, Advances in Neural Information Processing Systems 23 (2010), 19–27.
10. Blundell and Teh Y.W., Bayesian hierarchical community discovery, Advances in Neural Information Processing Systems, 2013, 1601–1609.
11. Friedman, Hastie and Tibshirani, The elements of statistical learning, New York: Springer series in statistics, 2001.
12. Guénoche, Hansen and Jaumard, Efficient algorithms for divisive hierarchical clustering with the diameter criterion, Journal of Classification 8(1) (1991), 5–30.
13. Charikar and Chatziafratis, Approximate hierarchical clustering via sparsest cut and spreading metrics, in: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, 2017, pp. 841–854.
14. Dasgupta, A cost function for similarity-based hierarchical clustering, in: Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, ACM, 2016, pp. 118–127.
15. Perou C.M., Sørlie, Eisen M.B. et al., Molecular portraits of human breast tumours, Nature 406(6797) (2000), 747.
16. Krishnamurthy, Balakrishnan et al., Efficient Active Algorithms for Hierarchical Clustering, Computer Science, 2012.
17. Vikram and Dasgupta, Interactive Bayesian hierarchical clustering, in: International Conference on Machine Learning, 2016, pp. 2081–2090.
18. Balcan M.F., Liang and Gupta, Robust hierarchical clustering, The Journal of Machine Learning Research 15(1) (2014), 3831–3871.
19. Eriksson, Dasarathy, Singh et al., Active clustering: Robust and efficient hierarchical clustering using adaptively selected similarities, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 260–268.
20. Allenby G.M., Rossi P.E. and McCulloch R.E., Hierarchical Bayes Models: A Practitioners Guide, SSRN Scholarly Paper ID 655541, Social Science Research Network, Rochester, NY, 2005.
21. Glazer, Weissbrod, Lindenbaum et al., Approximating hierarchical mv-sets for hierarchical clustering, Advances in Neural Information Processing Systems, 2014, 999–1007.
22. Maaten and Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9(Nov) (2008), 2579–2605.
23. Roy and Pokutta, Hierarchical clustering via spreading metrics, Advances in Neural Information Processing Systems, 2016, 2316–2324.
24. Schütze, Manning C.D. and Raghavan, Introduction to information retrieval, Cambridge: Cambridge University Press, 2008.
25. Wong J.A.H.A., Algorithm AS 136: A K-Means clustering algorithm, Journal of the Royal Statistical Society, Series C (Applied Statistics) 28(1) (1979), 100–108.
26. A.Y., Jordan M.I. and Weiss, On Spectral Clustering: Analysis and an algorithm, in: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, MIT Press, 2001.
27. Bengio, Convergence Properties of the K-Means Algorithms, in: International Conference on Neural Information Processing Systems, 1995.
28. Gunaratna, Thirunarayan and Sheth A.P., FACES: Diversity-Aware Entity Summarization using Incremental Hierarchical Conceptual Clustering, in: Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI Press, 2015.
29. Xiao, Hays, Ehinger K.A. et al., SUN database: Large-scale scene recognition from abbey to zoo, Proc. IEEE Conf. on Computer Vision and Pattern Recognition 23(3) (2010), 3485–3492.
30. Griffin, Holub and Perona, Caltech-256 Object Category Dataset, California Institute of Technology, Version: March 2007.
31. Ghasedi Dizaji, Herandi, Deng et al., Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5736–5745.
32. Law M.T., Urtasun and Zemel R.S., Deep spectral clustering learning, in: Proceedings of the 34th International Conference on Machine Learning, Volume 70, JMLR.org, 2017, pp. 1985–1994.
33. Yang, Yuan et al., Multi-feature max-margin hierarchical Bayesian model for action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1610–1618.
34. Steinbach, Karypis and Kumar, A comparison of document clustering techniques, KDD Workshop on Text Mining 400(1) (2000), 525–526.
35. Kim and Huang, A hierarchical image clustering cosegmentation framework, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 686–693.
36. Liu A.A., Y.T., Nie W.Z. et al., Hierarchical clustering multi-task learning for joint human action grouping and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 39(1) (2016), 102–114.
37. Jardine and Sibson, Mathematical taxonomy, Systematic Zoology, 1971.
38. Sneath P.H.A., Sokal R.R. and Williams W.T., Numerical taxonomy, the principles and practice of numerical classification, Taxon 12(5) (1963), 190–199.
39. Felsenstein, Inferring Phylogenies, 2004.
40. Ailon and Charikar, Fitting tree metrics: Hierarchical clustering and phylogeny, SIAM Journal on Computing 40(5) (2011), 1275–1291.
41. Deng, Dong, Socher et al., ImageNet: a Large-Scale Hierarchical Image Database, in: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20–25 June 2009, Miami, Florida, USA, IEEE, 2009.

42.

Yager