Kernelized evolutionary distance metric learning for semi-supervised clustering

Abstract

This study proposes a novel distance metric learning method, evolutionary distance metric learning (EDML), which improves clustering quality by evaluating inter- and intra-cluster structure simultaneously. We also provide an extension, kernelized evolutionary distance metric learning (K-EDML), which integrates a kernelization technique into the proposed method so that a non-linear transformation of the distance metric can be performed while maintaining all properties of EDML. The proposed methods can handle either class labels or pairwise constraints and can directly improve any clustering index used as the objective function. Both can be viewed as utilizing cluster-level soft constraints, unlike instance-level hard constraints, which sometimes collapse the clustering. In addition, maintaining the neighbor relations of clusters can lead to better visualization of the clustering result. To cope with the multimodality of the objective function, an evolutionary algorithm (EA), differential evolution with self-adapting control parameters and generalized opposition-based learning (GOjDE), is employed to optimize a metric transform matrix based on the Mahalanobis distance. We empirically demonstrate the drawback of EDML in a non-linearly separable input space and illustrate the benefit of the kernel function in K-EDML by showing that it outperforms other clustering algorithms in semi-supervised clustering on various real-world datasets.

Keywords

Clustering; neighbor graph; cluster validity index; distance metric learning; kernelization; differential evolution

1. Introduction

In data mining and machine learning, the definition of the distance between two data points substantially affects clustering and classification tasks. Recently, a variety of distance metric learning (DML) methods have been proposed to improve the accuracy of data mining tasks by learning a distance metric from a dataset [35, 15, 41]. Example applications include nearest neighbor classification [32, 11], clustering [14, 56, 6], image ranking [16, 44, 27, 42, 53], and data visualization [23, 50].

This research focuses on DML in the clustering domain, specifically semi-supervised clustering. Semi-supervised clustering [13, 14, 56] has been proposed to take advantage of background information, e.g., pairwise constraints, in clustering. Xing et al. attempted to learn a similarity metric from side information [14], such as constraints on which pairs of documents must or must not appear in the same cluster [33], so that the desired clustering can be produced. Semi-supervised clustering has emerged as an interesting alternative in recent years. These algorithms improve clustering quality through external knowledge conveyed in the form of constraints, which guide the clustering process and can be derived directly from the original data (using partially labeled data) or provided by a user trying to adapt the clustering results to his or her expectations [48]. Figure 1 illustrates the benefit of DML in clustering. Figure 1a shows data points of three classes (i.e., circles, squares, and stars) and three initial partitions (or clusters) in Euclidean space; one of the clusters contains data points of all three classes. The data space transformation stretches the partitions as shown in Fig. 1b such that every data point is correctly clustered. DML methods try to find an appropriate transformation of the data space, globally or locally, so that data points of the same class become neighbors and data points of different classes are placed far apart.

Figure 1.

Conceptual diagram of distance metric transformation.

However, conventional semi-supervised clustering methods have the following drawbacks. It has been reported that instance-level constraints sometimes degrade the clustering quality [21, 34, 16], depending on the relationship between the constraints and the data distribution. Moreover, there is no monotonicity with respect to the number of constraints; that is, adding constraints is not guaranteed to improve cluster quality. These drawbacks are critical issues in practice. Instead of including the pairwise constraints in the objective function as a penalty, we propose a methodology that directly improves a cluster validity index such as purity, F-measure, or entropy. The neighbor relation of clusters is expected to propagate class estimates to unlabeled data from neighboring labeled data, as in other DML methods such as neighborhood components analysis [23] or large-margin nearest neighbor classification [32]. In our work, the smoothed clustering validity index [28] is used as the objective function to evaluate the overall cluster structure by simultaneously evaluating inter- and intra-cluster quality. Thus, our method can be regarded as being based on cluster-level soft constraints.

In general, a clustering index, whether conventional or smoothed, is a massively multimodal function of the distance metric. Hence, the proposed evolutionary distance metric learning (EDML) framework uses an evolutionary algorithm to search for a sufficiently good metric transformation under this multimodality. The advantages of an evolutionary algorithm (EA) are as follows: (1) EAs can provide a solution even to problems that are hard to formulate using mathematical programming; (2) EAs sometimes heuristically discover unexpected solutions; and (3) EAs are highly parallelizable and can therefore make use of recent computational resources such as multicore CPUs or PC clusters. Classic optimization methods such as gradient descent require a differentiable objective function; we address this with the differential evolution (DE) [45] algorithm, which does not require the optimization problem to be differentiable. Specifically, our work uses differential evolution with self-adapting control parameters and generalized opposition-based learning (GOjDE) [18] for real-valued optimization, which has high search ability without requiring parameter adjustments. Koloseni et al. [9] proposed a DML method using DE [45] for classification. Our proposed EDML targets clustering, and their work supports the applicability of DE to DML.

EDML outperforms other semi-supervised clustering methods on many datasets [57, 60, 58]; however, like most DML techniques, it can only perform a linear transformation, which yields little benefit for non-linearly separable data. Many kernel-based distance metric learning approaches have been proposed to address non-linearly separable data [10, 37, 46, 11, 8]. Thus, this study proposes kernelized evolutionary distance metric learning (K-EDML), a DML method that integrates the kernelization technique with EDML to handle non-linearly separable data while maintaining all properties of EDML. Consequently, a non-linear transformation of the distance metric can be performed while the cluster validity index is still optimized by an evolutionary algorithm.

The contributions of this work are as follows:

Proposing a novel semi-supervised distance metric learning for clustering techniques, by optimizing a cluster validity index that can be seen as utilizing cluster-level soft constraints

Proposing an extension for semi-supervised distance metric learning for clustering techniques, by integrating the kernelization technique to address non-linearly separable data

The neighborhood smoothing of a clustering index is introduced for better visualization of clustering result and for propagating label information to unlabeled data

Because of multimodality of clustering validity index as the objective function, an evolutionary algorithm is utilized for searching the sufficiently optimal distance metric

The experiments in this paper show that (1) GOjDE discovers better solutions than other evolutionary algorithms, (2) the ratio of labeled samples affects the result, (3) smoothing the clustering index with DML refines neighboring clusters for better visualization, and (4) EDML and K-EDML outperform conventional semi-supervised clustering methods on benchmark datasets. Lastly, (5) the problem that EDML yields insignificant improvement on non-linearly separable data is addressed by K-EDML, which integrates the kernelization technique into EDML.

This paper is an extended version of our IEEE ICTAI2013 paper [29] and our AAAI2017 Student Abstract and Poster Program [59].

2. Related work

2.1 Distance metric learning

Distance metric learning (DML) [35] attempts to optimize a metric to improve classification or clustering. Example applications include nearest neighbor classification [32, 11], clustering [14, 56, 6], image ranking [16, 44, 27, 42, 53], and data visualization [23, 50]. DML methods can be divided into three categories according to the amount of label information:

•
Unsupervised DML attempts to identify geometric relationships in the Euclidean data space. Normally, unsupervised DML methods are viewed as dimensionality reduction, i.e., projection into a low-dimensional space while preserving the neighbor relations of data points. The classical method of multidimensional scaling (MDS) falls into this category. ISOMAP [26], local linear embedding (LLE) [49], and Laplacian eigenmaps [39] are also called manifold learning, an approach for learning the nonlinear structure of the data distribution.
•
Supervised DML attempts to learn a distance metric transform function based on auxiliary information, including class labels and pairwise must-link and cannot-link constraints. Many supervised DML algorithms have been proposed, such as the Mahalanobis distance learning model first proposed by Xing et al. [14], Distance Metric Learning for Large Margin Nearest Neighbor Classification (LMNN) [32], linear DML for ranking (LDMLR) [4], Online Algorithm for Scalable Image Similarity learning (OASIS) [16], DML using dropout [44], and Geometric Mean Metric Learning (GMML) [42].
•
Semi-supervised DML combines the advantages of supervised and unsupervised DML: it uses unlabeled data to help supervised metric learning when only limited auxiliary information is available, so as to learn an appropriate metric that satisfies the constraints. Example algorithms are MPC-Kmeans [38], Information-Theoretic Metric Learning (ITML) [25], Hierarchical Confidence-based Active Clustering with Metric Learning [6], and An Intrinsic Approach for Semi-supervised Distance Metric Learning [53].

Moreover, DML can be viewed from another perspective, as global or local DML. Global distance metric learning [14, 1, 23, 25, 32, 64, 16, 61, 44, 8, 42, 6, 53] applies a common metric transformation to the whole data space and attempts to learn the optimal transformation by keeping all elements of a class close to each other while separating different classes. Conversely, local distance metric learning [36, 32, 43] attempts to satisfy the constraints locally rather than satisfying all constraints simultaneously. This locality is particularly useful for information retrieval and $k$ -nearest neighbor classifiers. Although local DML methods have rich representation capability, they tend to over-fit owing to the high dimensionality to be learned [36], whereas global DML methods are more resistant to over-fitting, although they are relatively strongly constrained.

In addition, there are several nonlinear methods that learn more flexible metrics in order to fit non-linearly separable data [10]. For instance, the kernelization technique aids a linear learning algorithm through an implicit nonlinear mapping function, e.g., a nonparametric kernel matrix [37], Semi-supervised Kernel k-means (SS-K-KMN) [5], and Kernel-Based Distance Metric Learning for Ranking [8]. Moreover, a nonlinear distance metric can be learned via non-linear Gradient Boosting Regression Trees (GBRT) [11] or a deep feedforward neural network [62].

Our proposed EDML and K-EDML are both based on distance metric learning; they are categorized as semi-supervised linear global DML and semi-supervised non-linear global DML, respectively.

2.2 Semi-supervised clustering

Semi-supervised clustering tries to improve clustering quality with the aid of external knowledge, mostly pairwise constraints. COP-Kmeans [33] was the first attempt to introduce pairwise constraints into K-means clustering. The cluster assignments of data points are forcibly modified to satisfy the constraints (hard constraints), and the centroids are updated based on the modified assignments. MPC-Kmeans [38] uses soft constraints, allowing some constraints to be violated, and integrates DML as well. Information-Theoretic Metric Learning (ITML) [25] uses LogDet divergence regularization, which was later adopted by several other Mahalanobis distance learning methods. Hierarchical Confidence-based Active Clustering with Metric Learning (HCAC-ML) [6] is one of the successors of ITML; it generates constraints from hierarchical information and feeds them to ITML. In addition, there are several nonlinear methods that learn more flexible metrics, for example, a nonparametric kernel matrix [37] and Semi-Supervised Kernel k-means (SS-K-KMN) [5].

As mentioned earlier, instance-level constraints sometimes collapse the clustering. Davidson et al. [21] introduced coherence, the degree of agreement between constraints, to measure this property of a given set of constraints. Constraints with low coherence contradict each other in the data space. It is therefore difficult to satisfy them fully, and they can lead the clustering to an undesirable result. In contrast, our EDML and K-EDML utilize cluster-level constraints, trying to satisfy the constraints as much as possible under the guidance of a clustering validity index.

2.3 Kernel function

The kernel trick is a technique that implicitly maps the input space to a higher-dimensional feature space using a nonlinear function. Given a dataset $\mathcal{D}=\{{\bm{x}_{i}}=(x_{i,1},\ldots,x_{i,v})^{t}\in\mathbb{R}^{v}\}_{i=1}^{N}$ , the points are mapped to the feature space by a basis function $\phi({\bm{x}_{i}})$ , and the dot product $\phi({\bm{x}_{i}})\cdot\phi({\bm{x}_{j}})$ is replaced with a kernel function $K({\bm{x}_{i}},{\bm{x}_{j}})$ :

$\displaystyle K({\bm{x}_{i}},{\bm{x}_{j}})=\phi({\bm{x}_{i}})\cdot\phi({\bm{x}_{j}})$ (1)

For example the polynomial kernel function is as follows:

$\displaystyle K({\bm{x}_{i}},{\bm{x}_{j}})=({\bm{x}_{i}}^{t}{\bm{x}_{j}}+c)^{d}$ (2)

In two-dimensional space, given $c=0$ and $d=2$ , the basis function $\phi({\bm{x}_{i}})$ can be derived as follows:

$\displaystyle K({\bm{x}_{i}},{\bm{x}_{j}})=({\bm{x}_{i}}^{t}{\bm{x}_{j}})^{2}=(x_{i,1}^{2},\sqrt{2}x_{i,1}x_{i,2},x_{i,2}^{2})\cdot(x_{j,1}^{2},\sqrt{2}x_{j,1}x_{j,2},x_{j,2}^{2})=\phi({\bm{x}_{i}})\cdot\phi({\bm{x}_{j}})$ (3)

Thus, the mapping function is

$\displaystyle\phi({\bm{x}_{i}})=\phi(x_{i,1},x_{i,2})=(x_{i,1}^{2},\sqrt{2}x_{i,1}x_{i,2},x_{i,2}^{2})$ (4)
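
The identity in Eqs. (2) to (4) can be checked numerically. The following is a minimal sketch in plain Python; the function names `poly_kernel`, `phi`, and `dot`, as well as the two sample points, are illustrative assumptions and not part of the original method.

```python
import math


def poly_kernel(x, y, c=0.0, d=2):
    """Polynomial kernel of Eq. (2): (x^t y + c)^d."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** d


def phi(x):
    """Explicit feature map of Eq. (4) for 2-D input with c = 0 and d = 2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def dot(u, v):
    """Ordinary dot product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))


# Two arbitrary sample points (hypothetical data).
x_i, x_j = (1.0, 2.0), (3.0, -1.0)

k_value = poly_kernel(x_i, x_j)        # evaluated in the 2-D input space
explicit = dot(phi(x_i), phi(x_j))     # evaluated in the 3-D mapped space
print(k_value, explicit)
```

Both quantities agree up to floating-point error, which is why the kernel can replace the explicit mapping whenever only dot products are needed.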

Figure 2 visualizes the data space for synthetic data. Each point denotes a data point, and its color denotes its class. In Fig. 2a, the two classes are not linearly separable. In contrast, Fig. 2b shows the data space transformed by the polynomial mapping function in Eq. (4), in which the two classes are linearly separable.

Figure 2.

Example of the mapping function of polynomial kernel on synthetic data.

2.4 Kernel k-means clustering

Kernel k-means clustering (K-KMN) [22] is an enhancement of k-means clustering (KMN) that can extract clusters that are not linearly separable in the original data space by applying a suitable nonlinear mapping function (kernel) to a higher-dimensional feature space. Given a dataset $\mathcal{D}=\{{\bm{x}_{i}}=(x_{i,1},\cdots,x_{i,v})^{t}\in\mathbb{R}^{v}\}_{i=1}^{N}$ with cluster set ${\mathbf{C}}$ , let $C_{k}\in{\mathbf{C}}$ denote the $k^{th}$ cluster. Using the non-linear function $\phi(\bm{x})$ , the objective function of K-KMN is defined as:

$\displaystyle\operatorname*{Minimize}\sum_{C_{k}\in\mathbf{C}}\sum_{\bm{x}_{i}\in C_{k}}\parallel{\bm{\pi}}_{k}-\phi(\bm{x}_{i})\parallel^{2}_{2}$ (5)

Note that $\bm{\pi}_{k}$ denotes the centroid of cluster $C_{k}$ in the mapped space, given by:

$\displaystyle{\bm{\pi}}_{k}=\frac{\sum_{\bm{x}_{i}\in C_{k}}\phi(\bm{x}_{i})}{\mid C_{k}\mid}$ (6)

$\mid C_{k}\mid$ denotes the number of data points in cluster $C_{k}$ . Since the basis function $\phi(\bm{x}_{i})$ is hard to obtain explicitly, the kernel function $K(\bm{x}_{i},\bm{x}_{j})=\phi(\bm{x}_{i})\cdot\phi(\bm{x}_{j})$ is calculated instead:

$\displaystyle\parallel{\bm{\pi}}_{k}-\phi(\bm{x}_{i})\parallel^{2}_{2}=\left|\left|\frac{\sum_{\bm{x}_{j}\in C_{k}}\phi(\bm{x}_{j})}{\mid C_{k}\mid}-\phi(\bm{x}_{i})\right|\right|^{2}_{2}=\frac{\sum_{\bm{x}_{j},\bm{x}_{l}\in C_{k}}\phi(\bm{x}_{j})\cdot\phi(\bm{x}_{l})}{\mid C_{k}\mid^{2}}-\frac{2\sum_{\bm{x}_{j}\in C_{k}}\phi(\bm{x}_{i})\cdot\phi(\bm{x}_{j})}{\mid C_{k}\mid}+\phi(\bm{x}_{i})\cdot\phi(\bm{x}_{i})=\frac{\sum_{\bm{x}_{j},\bm{x}_{l}\in C_{k}}K(\bm{x}_{j},\bm{x}_{l})}{\mid C_{k}\mid^{2}}-\frac{2\sum_{\bm{x}_{j}\in C_{k}}K(\bm{x}_{i},\bm{x}_{j})}{\mid C_{k}\mid}+K(\bm{x}_{i},\bm{x}_{i})$ (7)
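
Eq. (7) shows that the distance to a centroid in the mapped space needs only kernel values. A minimal sketch follows, assuming a precomputed Gram matrix `K` stored as a list of lists; the function name `kernel_distance_sq` and the toy data are hypothetical. A linear kernel on one-dimensional points is used so that the result can be compared with the ordinary squared distance to the centroid.

```python
def kernel_distance_sq(K, i, cluster):
    """Squared distance of Eq. (7) between phi(x_i) and the centroid of `cluster`.

    K       : Gram matrix, K[a][b] = K(x_a, x_b)
    i       : index of the data point
    cluster : list of indices of the points assigned to the cluster
    """
    size = len(cluster)
    within = sum(K[j][l] for j in cluster for l in cluster) / (size * size)
    cross = 2.0 * sum(K[i][j] for j in cluster) / size
    return within - cross + K[i][i]


# Toy example: linear kernel K(x, y) = x * y on three 1-D points.
points = [0.0, 2.0, 10.0]
K = [[a * b for b in points] for a in points]

# Cluster {x_0, x_1} has centroid 1.0, so x_2 = 10.0 lies at squared distance 81.
d_far = kernel_distance_sq(K, 2, [0, 1])
d_near = kernel_distance_sq(K, 0, [0, 1])
print(d_far, d_near)
```

With a nonlinear kernel the same function applies unchanged; only the construction of `K` differs.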

2.5 Self-Organizing Map

A Self-Organizing Map (SOM) [55] performs clustering by grouping similar data together while gradually adjusting the map so as to preserve the neighborhood relationships that exist within the input dataset. Moreover, when visualizing high-dimensional data, SOM preserves the neighbor relations of clusters in a low-dimensional space to the extent possible.

2.6 Clustering index with neighbor relation

In general, an external clustering index evaluates the quality of individual clusters, whereas Fukui and Numao [28] introduced neighbor relations of clusters into the conventional external indices by adding a weighting function. In principle, to introduce neighbor relations, data points of the same class in neighboring clusters should have high weights, while those in distant clusters should have low weights, based on the inter-cluster distance. For example, optimizing the distance metric according to the pairwise F-measure encourages data points of the same class to belong to the same cluster (improving precision) and each class to be distributed over fewer clusters (improving recall). Moreover, by introducing the smoothing function into the index, data points of the same class tend to be located in neighboring clusters [58].

2.6.1 Set-based indices

Given dataset $\mathcal{D}$ with cluster set ${\mathbf{C}}$ and class set ${\mathbf{T}}$ , let $N_{s,i}$ be the number of data points with class $s\in{\mathbf{T}}$ in the $i^{th}$ cluster $C_{i}\in{\mathbf{C}}$ ; $N_{s,i}=\#\{{\bm{x}}_{k}|t(k)=s,c(k)=C_{i}\}$ , where $\#$ denotes the number of elements, and $c(k)$ and $t(k)$ denote the cluster/class assignment for ${\bm{x}}_{k}$ . $N_{i}$ denotes the number of data points in cluster $C_{i}$ ; $N_{i}=\#\{{\bm{x}}_{k}|c(k)=C_{i}\}$ , $N$ is the total number of data points; $N=\#\{{\bm{x}}_{k}|{\bm{x}}_{k}\in{\mathcal{D}}\}$ . These basic values are smoothed by a weighting function $h_{i,j}$ as follows:

$\displaystyle N^{\prime}_{s,i}=\sum_{C_{j}\in{\mathbf{C}}}h_{i,j}N_{s,j},$ (8)

$\displaystyle N^{\prime}_{i}=\sum_{s\in{\mathbf{T}}}N^{\prime}_{s,i}=\sum_{s\in{\mathbf{T}}}\sum_{C_{j}\in{\mathbf{C}}}h_{i,j}N_{s,j},$ (9)

$\displaystyle N^{\prime}=\sum_{C_{i}\in{\mathbf{C}}}N^{\prime}_{i}=\sum_{C_{i}\in{\mathbf{C}}}\sum_{s\in{\mathbf{T}}}\sum_{C_{j}\in{\mathbf{C}}}h_{i,j}N_{s,j}.$ (10)

Here, the smoothing function $h_{i,j}$ can be any monotonically decreasing function of the inter-cluster distance. We use a Gaussian function, $h_{i,j}=\exp(-d^{c}_{i,j}/\sigma^{2})$ , where $d^{c}_{i,j}$ denotes the inter-cluster distance, such as the distance between cluster centroids, and $\sigma(>0)$ is a smoothing (neighborhood) radius.
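
The smoothing of Eqs. (8) to (10) can be sketched as follows. This is only an illustration under stated assumptions: the class-by-cluster count table `counts`, the inter-cluster distance matrix `dist`, and the function names are hypothetical, and the weighting function is the one given in the text above.

```python
import math


def smoothing_weights(dist, sigma):
    """h[i][j] = exp(-d^c_{i,j} / sigma^2) for an inter-cluster distance matrix."""
    return [[math.exp(-d / sigma ** 2) for d in row] for row in dist]


def smooth_counts(counts, h):
    """Eq. (8): counts[s][j] = N_{s,j}; returns smoothed[s][i] = N'_{s,i}."""
    n_clusters = len(h)
    return [[sum(h[i][j] * row[j] for j in range(n_clusters))
             for i in range(n_clusters)]
            for row in counts]


# Hypothetical example: 2 classes, 3 clusters arranged on a line.
counts = [[4, 1, 0],
          [0, 2, 5]]
dist = [[0.0, 1.0, 2.0],
        [1.0, 0.0, 1.0],
        [2.0, 1.0, 0.0]]

smoothed = smooth_counts(counts, smoothing_weights(dist, sigma=1.0))
n_i = [sum(row[i] for row in smoothed) for i in range(3)]   # Eq. (9)
n_total = sum(n_i)                                          # Eq. (10)
print(smoothed, n_i, n_total)
```

As the radius shrinks toward zero, the weights approach the identity and the smoothed counts fall back to the original counts, so the conventional indices are recovered as a limiting case.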

By using Eq. (8) through Eq. (10) instead of the original values, any set-based clustering index, such as purity, F-measure, and entropy, can be extended as follows:

•

weighted purity (wPUR)

$\displaystyle\text{wPUR}({\mathbf{C},\mathbf{T}})=\frac{1}{N^{\prime}}\sum_{C_{i}\in{\mathbf{C}}}\max_{s\in{\mathbf{T}}}{N^{\prime}_{s,i}}$ (11)

•

weighted F-measure (wFME)

$\displaystyle\text{wFME}({\mathbf{C},\mathbf{T}})=\sum_{s\in{\mathbf{T}}}\frac{N_{s}}{N}\max_{C_{i}\in{\mathbf{C}}}F(s,C_{i}),$ (12)

$\displaystyle F(s,C_{i})=\frac{2\cdot\textit{Prec}(s,C_{i})\cdot\textit{Rec}(s,C_{i})}{\textit{Prec}(s,C_{i})+\textit{Rec}(s,C_{i})},$ (13)

where $\textit{Prec}(s,C_{i})=N^{\prime}_{s,i}/N^{\prime}_{i}$ , $\textit{Rec}(s,C_{i})=N^{\prime}_{s,i}/N_{s}$ , and $N_{s}=\#\{{\bm{x}}_{k}|t(k)=s\in{\mathbf{T}}\}$ .

•

weighted entropy (wENT)

$\displaystyle\text{wENT}({\mathbf{C},\mathbf{T}})=1-\frac{1}{|{\mathbf{C}}|}\sum_{C_{i}\in{\mathbf{C}}}\textit{Entropy}(C_{i}),$ (14)

$\displaystyle\textit{Entropy}(C_{i})=-\frac{1}{\log{N^{\prime}}}\sum_{s\in{\mathbf{T}}}\frac{N^{\prime}_{s,i}}{N^{\prime}_{i}}\log\frac{N^{\prime}_{s,i}}{N^{\prime}_{i}}.$ (15)
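
The three set-based indices of Eqs. (11) to (15) can be sketched directly from a table of smoothed counts. The code below is an illustrative assumption rather than the authors' implementation: `sm[s][i]` stands for $N^{\prime}_{s,i}$, `class_sizes[s]` for $N_{s}$, the function names are hypothetical, and the convention $0\log 0=0$ is assumed.

```python
import math


def weighted_purity(sm):
    """wPUR of Eq. (11)."""
    n_clusters = len(sm[0])
    total = sum(sum(row) for row in sm)                      # N'
    return sum(max(row[i] for row in sm) for i in range(n_clusters)) / total


def weighted_f_measure(sm, class_sizes):
    """wFME of Eqs. (12)-(13); recall is normalised by the unsmoothed N_s."""
    n = sum(class_sizes)
    n_clusters = len(sm[0])
    score = 0.0
    for s, row in enumerate(sm):
        best = 0.0
        for i in range(n_clusters):
            n_i = sum(r[i] for r in sm)                      # N'_i
            prec = row[i] / n_i if n_i > 0 else 0.0
            rec = row[i] / class_sizes[s]
            if prec + rec > 0:
                best = max(best, 2 * prec * rec / (prec + rec))
        score += class_sizes[s] / n * best
    return score


def weighted_entropy(sm):
    """wENT of Eqs. (14)-(15), assuming 0 * log(0) = 0 and N' > 1."""
    n_clusters = len(sm[0])
    total = sum(sum(row) for row in sm)
    acc = 0.0
    for i in range(n_clusters):
        n_i = sum(row[i] for row in sm)
        ent = 0.0
        for row in sm:
            p = row[i] / n_i if n_i > 0 else 0.0
            if p > 0:
                ent -= p * math.log(p)
        acc += ent / math.log(total)
    return 1.0 - acc / n_clusters


# Hypothetical count tables: a perfectly pure clustering and a mixed one.
pure = [[5.0, 0.0], [0.0, 5.0]]
mixed = [[3.0, 2.0], [2.0, 3.0]]
print(weighted_purity(pure), weighted_purity(mixed))
print(weighted_f_measure(pure, [5, 5]), weighted_entropy(pure))
```

All three functions return larger values for better clusterings, which matches their use as objectives to be maximized.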

2.6.2 Pairwise-based index

Let $t(i)$ and $c(i)$ denote the class and cluster assignments of a data point ${\bm{x}}_{i}$ . Table 1 shows the class and cluster confusion matrix of data pairs, where $a, b, c, d$ are the numbers of data pairs ${\bm{x}}_{i}$ and ${\bm{x}}_{j}$ that do or do not belong to the same class/cluster.

Here, Fukui and Numao [28] introduced $\textit{likelihood}(c(i)=c(j))$ , indicating the degree to which a data pair ${\bm{x}}_{i}$ and ${\bm{x}}_{j}$ belong to the same cluster, in place of the actual number of data pairs. The likelihood is given by a weighting function based on the inter-cluster distance of the data pair; $\textit{likelihood}(c(i)=c(j))=h_{c(i),c(j)}$ .

Table 1
Class and cluster confusion matrix of data pairs

	$t(i)=t(j)$	$t(i)\neq t(j)$
$c(i)=c(j)$	$a$	$b$
$c(i)\neq c(j)$	$c$	$d$

Then, $a, b, c, d$ are replaced by summation of the likelihoods as follows:

$\displaystyle a^{\prime}=\sum_{\{i,j|t(i)=t(j)\}}h_{c(i),c(j)},$ (16)

$\displaystyle b^{\prime}=\sum_{\{i,j|t(i)\neq t(j)\}}h_{c(i),c(j)},$ (17)

$\displaystyle c^{\prime}=\sum_{\{i,j|t(i)=t(j)\}}\Bigl(1-h_{c(i),c(j)}\Bigr)=a+c-a^{\prime},$ (18)

$\displaystyle d^{\prime}=\sum_{\{i,j|t(i)\neq t(j)\}}\Bigl(1-h_{c(i),c(j)}\Bigr)=b+d-b^{\prime}.$ (19)

With these extended $a^{\prime},b^{\prime},c^{\prime}$ and $d^{\prime}$ , weighted pairwise F-measure is defined as follows:

•

weighted pairwise F-measure (wPFM)

$\displaystyle\text{wPFM}({\mathbf{C},\mathbf{T}})=\frac{2\cdot P\cdot R}{P+R},$ (20)

where $P=a^{\prime}/(a^{\prime}+b^{\prime})$ is the weighted precision and $R=a^{\prime}/(a^{\prime}+c^{\prime})$ is the weighted recall. The conventional precision is the ratio of data pairs belonging to the same class among the pairs within the same cluster. Likewise, the conventional recall is the ratio of data pairs belonging to the same cluster among the pairs within the same class. The weighted precision and recall extend these to the degree of belonging to a cluster/class given by the neighborhood relation of clusters.
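
The weighted pairwise F-measure of Eqs. (16) to (20) can be sketched as below. The function name, the use of unordered pairs, and the toy labels are assumptions made for illustration; with an identity weight matrix the function reduces to the conventional pairwise F-measure.

```python
from itertools import combinations


def weighted_pairwise_f(classes, clusters, h):
    """wPFM of Eq. (20) from class labels, cluster indices, and weights h."""
    a = b = c = 0.0
    for i, j in combinations(range(len(classes)), 2):
        w = h[clusters[i]][clusters[j]]      # likelihood(c(i) = c(j))
        if classes[i] == classes[j]:
            a += w                           # Eq. (16)
            c += 1.0 - w                     # Eq. (18)
        else:
            b += w                           # Eq. (17)
    if a == 0.0:
        return 0.0                           # no same-class pair is clustered together
    precision = a / (a + b)
    recall = a / (a + c)
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical labels for four data points and two clusters.
classes = [0, 0, 1, 1]
identity = [[1.0, 0.0], [0.0, 1.0]]          # no smoothing
smooth = [[1.0, 0.5], [0.5, 1.0]]            # neighboring clusters share weight

f_perfect = weighted_pairwise_f(classes, [0, 0, 1, 1], identity)
f_partial = weighted_pairwise_f(classes, [0, 1, 1, 1], identity)
f_smooth = weighted_pairwise_f(classes, [0, 1, 1, 1], smooth)
print(f_perfect, f_partial, f_smooth)
```

The term $d^{\prime}$ of Eq. (19) is omitted because it does not enter precision or recall.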

2.7 Differential evolution

Differential evolution (DE) [45] is a population-based metaheuristic for solving real-valued optimization problems. DE requires little user interaction, needing only minimal gene selection operator(s) and control parameter adjustments. Furthermore, DE has been reported to perform better than real-valued genetic algorithms or evolution strategies and has therefore been applied to various optimization problems [30].

Our work makes use of a variant of DE, generalized opposition-based jDE (GOjDE) [18], which is an extension of self-adaptive differential evolution (jDE) [24]. Neither requires manual adjustment of the crossover rate CR and scale factor SF, because these parameters are randomized during the search. Because each individual has its own CR and SF values, GOjDE/jDE allows individuals with better parameter values to produce better individuals in the next generation owing to their higher survival rates. Also, jDE showed the best performance in “Evolutionary Computation in Dynamic and Uncertain Environments” at CEC2009 [7].

GOjDE employs generalized opposition-based learning (GOBL) for population initialization and for population jumping during the optimization. GOBL improves the search performance of DE particularly for functions whose global optimum is around the center of the search space [51, 19]. Because most non-diagonal elements of optimal solutions in EDML become zero, which implies that the solutions are located around the center of the search space, GOjDE is suitable for this problem. The GOjDE algorithm is summarized as follows:

Step 1:
Initialization

Randomly generate $N_{P}$ individuals with $m$ -dimensional vector ${\bm{p}}_{i,g}=\{p_{1,i,g},$ $p_{2,i,g},\ldots,p_{m,i,g}\}$ $(i=1,2,\ldots,N_{P})$ within each domain of the definition, and set the generation number as $g=0$ . Then, create an opposition population using GOBL by the following equations:

$\displaystyle p^{*}_{j,i,g}=k(a_{j,g}+b_{j,g})-p_{j,i,g}\ (j=1,2,\ldots,m)$ (21)

$\displaystyle a_{j,g}=\min_{i}(p_{j,i,g}),\quad b_{j,g}=\max_{i}(p_{j,i,g})$ (22)

where $p^{*}_{j,i,g}$ denotes the opposite point of $p_{j,i,g}$ calculated from a reference point $k(a_{j,g}+b_{j,g})$ . In initialization, $a_{j,0}$ and $b_{j,0}$ are taken to be the minimum and maximum of the defined range of the $j^{th}$ variable, respectively, and $k=1$ . Next, evaluate the fitness (i.e., one of the smoothed clustering indices mentioned above) of the individuals in the original and opposite populations, and select the top $N_{P}$ individuals for the next population.
Step 2:
Termination determination

When a termination condition is satisfied, the process terminates.
Step 3:
Operation selection

Select GOBL operation (go to Step 4) with the probability $\tau_{o}$ , otherwise perform jDE operations (go to Step 6).
Step 4:
GOBL

Randomly determine $k$ within the range $[0,1]$ . Then, create an opposition population by Eqs. (21) and (22). Varying $k$ produces various reference points, allowing the population to jump to another place in the search space: the closer $k$ is to 1, the more the opposition population is generated inside the population of the previous generation, whereas the closer $k$ is to 0, the farther the opposition population jumps from the previous population. If an opposite point $p^{*}_{j,i,g}$ exceeds the defined domain range, it is redetermined by $\textit{rand}(a_{j,g},b_{j,g})$ .
Step 5:
Evaluation and selection (GOBL)

Evaluate the fitness of individuals in the opposition population, and then select top $N_{P}$ individuals from a union of previous and opposite populations to the next generation. Update the generation no. $g\rightarrow g+1$ , and go back to Step 2.
Step 6:
Control parameter update

Update scale factor $SF_{i}$ and crossover rate $CR_{i}$ of $i^{th}$ individual by the following equations:

$\displaystyle\textit{SF}_{i,g}=\left\{\begin{array}{ll}\textit{SF}_{l}+\textit{rand}_{1}\cdot\textit{SF}_{u}&\text{if}\ \textit{rand}_{2}<\tau_{1}\\ \textit{SF}_{i,g-1}&\text{otherwise}\end{array}\right.$ (23)

$\displaystyle\textit{CR}_{i,g}=\left\{\begin{array}{ll}\textit{rand}_{3}&\text{if}\ \textit{rand}_{4}<\tau_{2}\\ \textit{CR}_{i,g-1}&\text{otherwise}\end{array}\right.$ (24)

where $\textit{rand}_{j}\ (j\in\{1,2,3,4\})$ are uniform random values in $[0,1]$ , $\tau_{1}$ and $\tau_{2}$ are the probabilities of changing $\textit{SF}_{i,g}$ and $\textit{CR}_{i,g}$ , respectively, and $\textit{SF}_{l}$ and $\textit{SF}_{u}$ determine the range of scale factor values.
Step 7:
Mutation

Let a target vector be the $i^{th}$ individual ${\bm{p}}_{i,g}$ to be operated on. Select a base vector ${\bm{p}}_{b,g}$ from individuals, and generate a mutant vector ${\bm{v}}_{i,g}$ by

$\displaystyle{\bm{v}}_{i,g}={\bm{p}}_{b,g}+SF_{i,g}\cdot({\bm{p}}_{r1,g}-{\bm{p}}_{r2,g}),$ (25)

where $b\neq r_{1}\neq r_{2}$ ( $r_{1},r_{2}\in\{1,\cdots,N_{P}\}$ are randomly selected), and scale factor $SF_{i,g}(0\leqslant SF_{i,g}\leqslant 1)$ is an important parameter to determine the search range.
Step 8:
Crossover

Generate trial vector ${\bm{u}}_{i,g}$ by a crossover operation between target vector ${\bm{p}}_{i,g}$ and mutant vector ${\bm{v}}_{i,g}$ . Equation (26) shows a binomial crossover. Each element of $u_{j,i,g}\in{\bm{u}}_{i,g}\ (j=1,2,\ldots,m)$ is determined with crossover rate $CR_{i,g}\ (0\leqslant CR_{i,g}\leqslant 1)$ and a randomly selected index $j_{\textit{rand}}\ (1\leqslant j_{\textit{rand}}\leqslant m)$ as:

$\displaystyle u_{j,i,g}=\left\{\begin{array}{ll}v_{j,i,g}&\text{if}\ \textit{rand}[0,1]\leqslant CR_{i,g}\ \text{or}\ j=j_{\textit{rand}},\\ p_{j,i,g}&\text{otherwise},\end{array}\right.$ (26)

where rand[0,1] is a uniformly distributed random number in the range [0,1].
Step 9:
Repair

When the trial vector violates a constraint, a repair operation is executed to bring the individual back into the feasible region, e.g., the repair operation for individuals in EDML and K-EDML described in Section 3.3.
Step 10:
Evaluation and selection

Evaluate the fitness of the trial vectors, compare each with its target vector, and select the fitter vector as an individual of the next generation. Update the generation no. $g\rightarrow g+1$ , and go back to Step 2.
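
The steps above can be sketched in a few dozen lines of plain Python. This is a simplified illustration under several assumptions, not the authors' implementation: the function name `gojde`, the default parameter values, the random choice of the base vector, the exclusion of the target index from the mutation indices, and the clipping used as a stand-in repair are all hypothetical choices, and a negated sphere function replaces the clustering index as the fitness to be maximized.

```python
import random


def gojde(fitness, bounds, n_pop=20, max_gen=200, tau_o=0.05,
          tau1=0.1, tau2=0.1, sf_l=0.1, sf_u=0.9, seed=0):
    """Maximize `fitness` over box `bounds`; returns (fitness, vector, SF, CR)."""
    rng = random.Random(seed)
    m = len(bounds)

    def opposite(vecs, k, lo, hi):
        # Eq. (21); out-of-domain values are redrawn from [a_j, b_j] (Step 4).
        result = []
        for p in vecs:
            q = []
            for j in range(m):
                value = k * (lo[j] + hi[j]) - p[j]
                if not bounds[j][0] <= value <= bounds[j][1]:
                    value = rng.uniform(lo[j], hi[j])
                q.append(value)
            result.append(q)
        return result

    def top(candidates):
        return sorted(candidates, key=lambda c: c[0], reverse=True)[:n_pop]

    # Step 1: random population plus its opposite population (k = 1).
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_pop)]
    opp = opposite(pop, 1.0, [b[0] for b in bounds], [b[1] for b in bounds])
    inds = top([(fitness(p), p, 0.5, 0.9) for p in pop + opp])

    for _ in range(max_gen):                       # Step 2: generation budget
        if rng.random() < tau_o:                   # Step 3: choose GOBL
            vecs = [ind[1] for ind in inds]
            lo = [min(v[j] for v in vecs) for j in range(m)]   # Eq. (22)
            hi = [max(v[j] for v in vecs) for j in range(m)]
            opp = opposite(vecs, rng.random(), lo, hi)         # Step 4
            inds = top(inds + [(fitness(q), q, ind[2], ind[3])
                               for q, ind in zip(opp, inds)])  # Step 5
            continue
        nxt = []
        for i, (fit, p, sf, cr) in enumerate(inds):
            if rng.random() < tau1:                # Eq. (23)
                sf = sf_l + rng.random() * sf_u
            if rng.random() < tau2:                # Eq. (24)
                cr = rng.random()
            b, r1, r2 = rng.sample([n for n in range(n_pop) if n != i], 3)
            v = [inds[b][1][j] + sf * (inds[r1][1][j] - inds[r2][1][j])
                 for j in range(m)]                # Eq. (25)
            j_rand = rng.randrange(m)
            u = [v[j] if rng.random() <= cr or j == j_rand else p[j]
                 for j in range(m)]                # Eq. (26)
            # Stand-in repair (assumption): clip to the box constraints.
            u = [min(max(u[j], bounds[j][0]), bounds[j][1]) for j in range(m)]
            fu = fitness(u)
            nxt.append((fu, u, sf, cr) if fu >= fit
                       else (fit, p, inds[i][2], inds[i][3]))
        inds = nxt
    return max(inds, key=lambda c: c[0])


def neg_sphere(x):
    """Toy fitness with its optimum at the center of the search space."""
    return -sum(v * v for v in x)


best = gojde(neg_sphere, [(-5.0, 5.0)] * 3)
print(best[0], best[1])
```

The updated SF and CR values are kept only when the trial vector survives, which is how parameter settings that produce good offspring propagate through the population.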

3. Proposed evolutionary distance metric learning

Evolutionary Distance Metric Learning (EDML) is an efficient framework that applies an evolutionary algorithm (EA) to search for a sufficiently good distance metric transformation matrix. EDML is based on a clustering index with neighbor relations that simultaneously evaluates inter- and intra-cluster quality to improve clustering. In contrast to other semi-supervised clustering methods, which incorporate a penalty function for constraints into the objective function, EDML directly improves a cluster validity index, such as purity, F-measure, or entropy, chosen according to the clustering purpose and used as the objective function when class information is available. Moreover, the cluster validity index is smoothed by neighbor relations, which refines neighboring clusters for better visualization, so that data points of the same class tend to be located in neighboring clusters.

3.1 Global distance metric learning

In this work, a Mahalanobis-based distance is used just as in the case of many global DML methods. Given a dataset $\mathcal{D}=\{{\bm{x}_{i}}=(x_{i,1},\ldots,x_{i,v})^{t}\in\mathbb{R}^{v}\}_{i=1}^{N}$ , the Mahalanobis-based distance can be defined as:

$\displaystyle d_{i,j}^{2}=({\bm{x}}_{i}-{\bm{x}}_{j})^{t}{\mathbf{M}}({\bm{x}}_{i}-{\bm{x}}_{j}),$ (27)

where ${\mathbf{M}}=(m_{k,l})$ is a $v\times v$ matrix. In the original Mahalanobis distance, ${\mathbf{M}}$ is given by the inverse of the variance-covariance matrix of the input data, i.e., ${\mathbf{M}}={\bm{\Sigma}}^{-1}$ . In DML, by contrast, the elements of ${\mathbf{M}}$ are variables to be learned that represent a transformation of the input data; in this case, ${\mathbf{M}}$ must be a symmetric positive semi-definite matrix to satisfy the properties of a distance. For further understanding, we can rewrite Eq. (27) as follows:

$\displaystyle d_{i,j}^{2}=({\bm{x}}_{i}-{\bm{x}}_{j})^{t}{\mathbf{M}}({\bm{x}}_{i}-{\bm{x}}_{j})=\sum_{k,l}m_{k,l}(x_{i,k}-x_{j,k})(x_{i,l}-x_{j,l}),$ (28)

in which the diagonal elements of ${\mathbf{M}}$ (where $k=l$ ) indicate the scaling of each dimension, whereas the non-diagonal elements indicate the correlation between different dimensions. When ${\mathbf{M}}$ is the identity matrix, the Mahalanobis-based distance is equivalent to the Euclidean distance. This Mahalanobis-based distance is used as the distance metric for the clustering algorithm in EDML.
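
Eq. (28) translates into a short double sum. The sketch below is illustrative only; the function name `mahalanobis_sq` and the sample points and matrices are assumptions.

```python
def mahalanobis_sq(x_i, x_j, M):
    """Squared Mahalanobis-based distance of Eq. (28)."""
    diff = [a - b for a, b in zip(x_i, x_j)]
    v = len(diff)
    return sum(M[k][l] * diff[k] * diff[l] for k in range(v) for l in range(v))


x_i, x_j = (1.0, 2.0), (4.0, 6.0)

identity = [[1.0, 0.0], [0.0, 1.0]]      # reduces to squared Euclidean distance
correlated = [[1.0, 0.5], [0.5, 1.0]]    # off-diagonal terms couple the dimensions
scaled = [[1.0, 0.0], [0.0, 0.25]]       # second dimension is down-weighted

d_euclid = mahalanobis_sq(x_i, x_j, identity)
d_corr = mahalanobis_sq(x_i, x_j, correlated)
d_scaled = mahalanobis_sq(x_i, x_j, scaled)
print(d_euclid, d_corr, d_scaled)
```

Changing the diagonal rescales individual dimensions, while the off-diagonal entries add cross terms, which is exactly the freedom the learned matrix exploits.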

3.2 Objective function

Our metric learning approach optimizes a clustering index Eval as follows:

$\displaystyle\text{Maximize}\ \textit{Eval}(\textit{Clustering}(d_{i,j}^{2})),\quad\text{s.t.}\ \ |m_{k,k}|\geqslant\sum_{l(k\neq l)}|m_{k,l}|,\ 0<m_{k,k}\leqslant 1,\ -1\leqslant m_{k,l}\leqslant 1\ (k\neq l),$ (29)

where $\textit{Clustering}(d_{i,j}^{2})$ denotes the clustering result obtained with the distance metric $d_{i,j}^{2}$, i.e., $\textit{Clustering}():{\bm{x}}\mapsto c\in{\mathbf{C}}$, where ${\mathbf{C}}$ is a set of cluster identifiers. The clustering result is evaluated by external criteria; however, the amount of available label information is limited in a semi-supervised setting. Thus, neighborhood smoothing in the cluster validity index is used to capture the overall inter- and intra-cluster structure simultaneously. The parameters for neighborhood smoothing are as follows. The weighting function $h_{i,j}$ is a Gaussian function, $h_{i,j}=\exp(-{\mbox{$\bm{r}$}}_{i,j}/\sigma)$, where ${\mbox{$\bm{r}$}}_{i,j}$ denotes the inter-cluster distance between $C_{i}$ and $C_{j}$, and $\sigma(>0)$ is a smoothing (neighborhood) radius.

For the constraints, in order to satisfy the properties of a distance metric, we require the matrix ${\mathbf{M}}$ to be weakly diagonally dominant, i.e., $|m_{i,i}|\geqslant\sum_{j(i\neq j)}|m_{i,j}|$, with positive diagonal elements, which ensures that ${\mathbf{M}}$ is a positive semi-definite matrix.
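
The constraint check of Eq. (29) can be sketched as follows. The helper names and the two example matrices are assumptions made for illustration. The eigenvalue computation is included only to show numerically that a symmetric, weakly diagonally dominant matrix with positive diagonal has no negative eigenvalues.

```python
import numpy as np

def is_weakly_diag_dominant(M):
    """|m_kk| >= sum_{l != k} |m_kl| for every row, with positive diagonal."""
    d = np.diag(M)
    off = np.abs(M).sum(axis=1) - np.abs(d)
    return bool(np.all(d > 0) and np.all(d >= off))

def satisfies_constraints(M):
    """All conditions of Eq. (29): symmetry, value ranges, diagonal dominance."""
    d = np.diag(M)
    off = M - np.diag(d)
    in_range = np.all(d <= 1) and np.all(np.abs(off) <= 1)
    return bool(np.allclose(M, M.T) and in_range and is_weakly_diag_dominant(M))

# Hypothetical candidates: one feasible, one violating diagonal dominance.
feasible = np.array([[1.0, 0.5], [0.5, 0.6]])
infeasible = np.array([[0.2, 0.9], [0.9, 0.2]])

min_eig_feasible = np.linalg.eigvalsh(feasible).min()      # non-negative
min_eig_infeasible = np.linalg.eigvalsh(infeasible).min()  # negative here
```

The infeasible example has a negative eigenvalue, so the corresponding quadratic form could return a negative "squared distance", which is what the constraint is meant to rule out.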

3.3 Evolutionary algorithm

EDML targets real-world problems, which typically involve large, high-dimensional data, so that the search for ${\mathbf{M}}$ becomes a high-dimensional global optimization problem of high complexity. Therefore, differential evolution with self-adapting control parameters and generalized opposition-based learning (GOjDE) [18], which is explained in Section 2.7, is used to optimize the objective in Eq. (29) in order to manage the quality of candidate solutions. Here an upper bound on the number of generations or on the number of fitness evaluations is used as the termination condition in GOjDE. The matrix ${\mathbf{M}}$ is an individual, and the elements of the triangular part of ${\mathbf{M}}$ correspond to genes; for example, in the two-dimensional case, the individual vector for ${\mathbf{M}}$ is $(m_{1,1},m_{1,2},m_{2,2})$. The repair process in Step 9 of GOjDE takes place during mutation: the variables in the vector are uniformly rescaled whenever the diagonal dominance condition in Eq. (29) is not satisfied. The following equation gives the repair process for an individual in both EDML and K-EDML.

$\displaystyle m_{i,j}^{\textit{repair}}=\frac{m_{i,i}}{\sum_{j}|m_{i,j}|}m_{i,j},\ (i\neq j).$ (30)
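
The encoding of an individual and the repair step can be sketched as below. Two points are assumptions of this sketch rather than statements from the text: the sum in Eq. (30) is taken over the off-diagonal entries of row $i$, and, to keep ${\mathbf{M}}$ symmetric, each off-diagonal entry $m_{i,j}$ is scaled by the smaller of the row-$i$ and row-$j$ factors. The function names `to_matrix` and `repair` are hypothetical, and the diagonal is assumed to be positive already.

```python
import numpy as np

def to_matrix(p, v):
    """Build the symmetric v x v matrix M from the upper-triangular gene vector p."""
    M = np.zeros((v, v))
    M[np.triu_indices(v)] = p          # row-major order: (m11, m12, m22) for v = 2
    return M + np.triu(M, k=1).T       # mirror the strict upper triangle

def repair(M):
    """Rescale off-diagonals of rows that violate weak diagonal dominance."""
    d = np.diag(M)
    off = M - np.diag(d)
    row = np.abs(off).sum(axis=1)
    s = np.ones_like(d)
    bad = row > d                      # rows violating |m_ii| >= sum |m_ij|
    s[bad] = d[bad] / row[bad]         # scaling factor in the spirit of Eq. (30)
    # Assumption: use min(s_i, s_j) so that the repaired matrix stays symmetric.
    return np.diag(d) + off * np.minimum.outer(s, s)

p = np.array([0.5, 0.9, 0.4])          # (m11, m12, m22), a hypothetical individual
M_raw = to_matrix(p, 2)
M_fixed = repair(M_raw)
```

For this individual the off-diagonal value 0.9 is reduced to 0.4, so both rows satisfy the dominance condition and the matrix remains symmetric.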

3.4 Evolutionary distance metric learning framework

The proposed EDML framework is summarized in Fig. 3. First, candidates for the metric matrix ${\mathbf{M}}$ are generated by an evolutionary algorithm, e.g., DE, jDE, or GOjDE. Next, the cluster structure, i.e., clusters with neighbor relations, is obtained with the distance metric given by Eq. (27). Here the cluster structure can be obtained by any partition-based clustering technique with neighborhood relations, such as $k$-means with a $k$-nearest neighbor graph of cluster centroids, or vector quantization with topology preservation by a Self-Organizing Map (SOM). After obtaining the cluster structure with the transformed distance metric, the quality of the clusters and neighbor relations is evaluated with class labels or pairwise constraints via the smoothed clustering index; one of the weighted purity, weighted F-measure, weighted entropy, or weighted pairwise F-measure indices is used. Next, the evaluation value is fed back into GOjDE as the fitness of the candidate metric matrix. GOjDE selects individuals for the next generation on the basis of the fitness and generates the next candidates by mutation and crossover with certain probabilities. These steps are repeated until the termination condition is satisfied. The output is the best metric matrix ${\mathbf{M}}^{\ast}$ in terms of the smoothed clustering index among all generations of candidates. Algorithm 1 shows the pseudocode of EDML with the following configuration: evolutionary algorithm, GOjDE; clustering algorithm, k-means clustering; cluster validity index, wPFM.

Figure 3.

Diagram of the evolutionary distance metric learning (EDML) framework.

Algorithm 1

Evolutionary Distance Metric Learning (EDML)

Input: $\mathcal{D}$ : dataset, $\mathbf{T}$ : class labels or pairwise constraints
Output: $\mathbf{M^{*}}$ : best metric matrix
1:
$g\leftarrow 0$ .
2:
$\textit{maxEval}\leftarrow 0$
3:
Initialize candidate metric population ${\bm{P}}_{g}$ via GOjDE.
4:
while $g<itr_{\textit{max}}$ and $\textit{maxEval}<1$ do
5:
for $\forall{\bm{p}}_{c,g}\in{\bm{P}}_{g}$ do
6:
$\mathbf{M}\leftarrow{\bm{p}}_{c,g}$
7:
if not ( $|m_{k,k}|\geqslant\sum_{l(k\neq l)}|m_{k,l}|,\ 0<m_{k,k}\leqslant 1,\ -1\leqslant m_{k,l}\leqslant 1\ (k\neq l)$ ) then // weak diagonally dominant matrix condition violated
8:
Repair $\mathbf{M}$ using Eq. (30).
9:
$d_{i,j}^{2}\leftarrow({\bm{x}}_{i}-{\bm{x}}_{j})^{t}\mathbf{M}({\bm{x}}_{i}-{\bm{x}}_{j})$ using Eq. (28).
10:
$\textit{Clustering}(d_{i,j}^{2})\leftarrow$ k-means clustering using $d_{i,j}^{2}$
11:
$\textit{eval}\leftarrow$ wPFM( $\textit{Clustering}(d_{i,j}^{2})$ , $\mathbf{T}$ ) using Eq. (20).
12:
if $\textit{eval}>\textit{maxEval}$ then
13:
$\textit{maxEval}\leftarrow\textit{eval}$
14:
$\mathbf{M^{*}}\leftarrow\mathbf{M}$
15:
Crossover and Mutation ${\bm{P}}_{g+1}$ using Eqs (25) and (26) via GOjDE.
16:
$g\leftarrow g+1$ .
17:
return $\mathbf{M^{*}}$
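
The loop of Algorithm 1 can be illustrated with a heavily simplified Python sketch. It deliberately departs from the configuration above in several ways, all of which are assumptions of this sketch: a plain DE/rand/1/bin scheme replaces GOjDE, only the diagonal matrix case is searched (so no repair step is needed beyond clipping to $0<m_{k,k}\leqslant 1$), and the standard pairwise F-measure replaces the smoothed wPFM of Eq. (20). All function names and parameter values are hypothetical.

```python
import numpy as np

def kmeans_labels(X, m, k, iters=20, seed=0):
    """Lloyd's k-means under d^2 = sum_k m_k (x_k - c_k)^2, i.e. a diagonal M."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = (((X[:, None, :] - C[None, :, :]) ** 2) * m).sum(axis=2)
        z = d2.argmin(axis=1)
        for c in range(k):
            if np.any(z == c):
                C[c] = X[z == c].mean(axis=0)
    return z

def pairwise_f(z, y):
    """Standard (unweighted) pairwise F-measure between clusters z and classes y."""
    iu = np.triu_indices(len(y), k=1)
    same_cluster = (z[:, None] == z[None, :])[iu]
    same_class = (y[:, None] == y[None, :])[iu]
    tp = np.sum(same_cluster & same_class)
    if tp == 0:
        return 0.0
    precision, recall = tp / same_cluster.sum(), tp / same_class.sum()
    return 2 * precision * recall / (precision + recall)

def edml_diagonal(X, y, k, pop=10, gens=15, F=0.5, CR=0.9, seed=0):
    """Search the diagonal of M with plain DE; returns (best M, best fitness)."""
    rng = np.random.default_rng(seed)
    v = X.shape[1]
    fitness = lambda p: pairwise_f(kmeans_labels(X, p, k), y)
    P = rng.uniform(1e-3, 1.0, size=(pop, v))         # initial population
    fit = np.array([fitness(p) for p in P])
    for _ in range(gens):
        for c in range(pop):
            a, b, d = P[rng.choice(pop, size=3, replace=False)]
            cross = rng.random(v) < CR
            cross[rng.integers(v)] = True              # at least one mutated gene
            trial = np.clip(np.where(cross, a + F * (b - d), P[c]), 1e-3, 1.0)
            f = fitness(trial)
            if f >= fit[c]:                            # greedy DE selection
                P[c], fit[c] = trial, f
    best = fit.argmax()
    return np.diag(P[best]), fit[best]
```

A call such as `edml_diagonal(X, y, k=2)` returns a diagonal metric matrix together with its fitness on the labeled data; in the full method the fitness would be the smoothed index and the search would cover the triangular part of ${\mathbf{M}}$.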

4. Kernelized evolutionary distance metric learning

As mentioned at the end of Section 1, the proposed EDML is theoretically expected to yield poor results on non-linearly separable data. Therefore, this study proposes an improvement of EDML, namely kernelized evolutionary distance metric learning (K-EDML), which integrates the kernelization technique into EDML. The proposed method maintains all the properties of EDML, unlike other kernelized DML methods [41], which incorporate a penalty function for constraints, i.e., must-link and cannot-link, into the objective function. Note that K-EDML behaves like EDML when the linear kernel function $K(\bm{x}_{i},\bm{x}_{j})=(\bm{x}_{i}^{t}\bm{x}_{j})$ is used. Although the framework of K-EDML is similar to that of EDML, it needs a modification in order to integrate the kernelization technique.

4.1 Integrating kernelization technique to EDML

The kernelization technique is integrated into the cluster structure learning process. K-EDML can be applied to any partition-based kernel clustering method with neighbor relations. In this study, kernel K-means clustering (K-KMN) [22] is used as the base clustering method. In order to integrate the kernelization technique into EDML, the symmetric positive semi-definite matrix ${\mathbf{M}}$ in Eq. (28) is decomposed into ${\mathbf{M}}={\mathbf{L}}^{t}{\mathbf{L}}$ by Cholesky decomposition, where ${\mathbf{L}}$ denotes an upper triangular matrix. Equation (27) can therefore be rewritten as:

$\displaystyle d_{i,j}^{2}=({\bm{x}}_{i}-{\bm{x}}_{j})^{t}{\mathbf{M}}({\bm{x}}_{i}-{\bm{x}}_{j})=({\bm{x}}_{i}-{\bm{x}}_{j})^{t}({\mathbf{L}}^{t}{\mathbf{L}})({\bm{x}}_{i}-{\bm{x}}_{j})=({\mathbf{L}}{\bm{x}}_{i}-{\mathbf{L}}{\bm{x}}_{j})^{t}({\mathbf{L}}{\bm{x}}_{i}-{\mathbf{L}}{\bm{x}}_{j})=\|{\mathbf{L}}{\bm{x}}_{i}-{\mathbf{L}}{\bm{x}}_{j}\|^{2}_{2}.$ (31)

Hence, the Mahalanobis-based distance can be viewed as the Euclidean distance after a linear transformation by ${\mathbf{L}}$. Each candidate ${\mathbf{M}}$ is therefore decomposed, and the resulting ${\mathbf{L}}$ is multiplied with the original data, so that the input data ${\bm{x}}_{i}$ is replaced by ${\mathbf{L}}{\bm{x}}_{i}$. The transformed input is then used in the K-KMN objective function in Eq. (5) as follows:

$\displaystyle\operatorname*{Minimize}\sum_{C_{k}\in\mathbf{C}}\sum_{{\bm{x}}_{i}\in C_{k}}\parallel\bm{\pi}_{k}-\phi({\mathbf{L}}{\bm{x}}_{i})\parallel^{2}_{2}$ (32)
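
The identity in Eq. (31) can be checked numerically with the short sketch below. The matrix `M` and the two points are toy values, and `upper_factor` is a hypothetical helper. Note that `numpy.linalg.cholesky` returns the lower-triangular factor, so it is transposed to obtain the upper-triangular ${\mathbf{L}}$; the sketch also assumes a strictly positive definite ${\mathbf{M}}$, since the routine may fail on a singular matrix.

```python
import numpy as np

def upper_factor(M):
    """Upper-triangular L with M = L^t L (assumes M is positive definite)."""
    # numpy returns the lower-triangular factor C with M = C C^t, so L = C^t.
    return np.linalg.cholesky(M).T

M = np.array([[1.0, 0.3], [0.3, 0.5]])   # toy metric matrix (hypothetical)
L = upper_factor(M)

x_i, x_j = np.array([1.0, 2.0]), np.array([3.0, -1.0])
diff = x_i - x_j

d2_mahalanobis = diff @ M @ diff                       # left-hand side of Eq. (31)
d2_transformed = np.sum((L @ x_i - L @ x_j) ** 2)      # right-hand side of Eq. (31)
```

Both quantities agree, so clustering the transformed points ${\mathbf{L}}{\bm{x}}_{i}$ with an ordinary (or kernelized) Euclidean method is equivalent to using the Mahalanobis-based distance on the original points.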

Figure 4.

Diagram of the kernelized evolutionary distance metric learning (K-EDML) framework.

4.2 Kernelized evolutionary distance metric learning framework

Although the framework of K-EDML is similar to that of EDML, two steps are added, i.e., candidate decomposition and input data transformation, together with a modification of the cluster structure learning. Figure 4 summarizes the K-EDML framework. First, candidates for the metric transform matrix ${\mathbf{M}}$ are generated using GOjDE. Then, the symmetric PSD matrix M is decomposed into $\textbf{M}=\textbf{L}^{t}\textbf{L}$ by Cholesky decomposition, where L denotes an upper triangular matrix, so that the Mahalanobis distance in Eq. (27) can be rewritten as Eq. (31). Next, each resulting ${\mathbf{L}}$ is multiplied with the original data, so that the input data $\bm{x}_{i}$ is replaced by ${\mathbf{Lx}}_{i}$. This transformed input is used in the K-KMN objective function in Eq. (5), which serves as Clustering() in Eq. (29). Then, class labels are used to evaluate the quality of the cluster structure through the neighborhood smoothing in the clustering index. The evaluated values are fed back into GOjDE as the fitness of each candidate ${\mathbf{M}}$. GOjDE selects candidates based on the fitness and generates the next candidates by mutation and crossover with certain probabilities. These steps are repeated until the termination condition (e.g., the iteration limit) is satisfied. Finally, the best metric transform matrix ${\mathbf{M}}^{\ast}$ in terms of the smoothed clustering index is obtained among all generations of candidates. Algorithm 2 shows the pseudocode of K-EDML with the following configuration: evolutionary algorithm, GOjDE; clustering algorithm, kernel k-means clustering; cluster validity index, wPFM.

Algorithm 2
Kernelized Evolutionary Distance Metric Learning

Input: $\mathcal{D}$ : dataset, $K(\bm{x}_{i},\bm{x}_{j})$ : kernel function, $\mathbf{T}$ : class labels or pairwise constraints
Output: $\mathbf{M^{*}}$ : best metric matrix
1:
$g\leftarrow 0$ .
2:
$\textit{maxEval}\leftarrow 0$
3:
Initialize candidate metric population ${\bm{P}}_{g}$ via GOjDE.
4:
while $g<itr_{\textit{max}}$ and $\textit{maxEval}<1$ do
5:
for $\forall{\bm{p}}_{c,g}\in{\bm{P}}_{g}$ do
6:
$\mathbf{M}\leftarrow{\bm{p}}_{c,g}$
7:
if not ( $|m_{k,k}|\geqslant\sum_{l(k\neq l)}|m_{k,l}|,\ 0<m_{k,k}\leqslant 1,\ -1\leqslant m_{k,l}\leqslant 1\ (k\neq l)$ ) then // weak diagonally dominant matrix condition violated
8:
Repair $\mathbf{M}$ using Eq. (30).
9:
${\mathbf{L}}^{t}{\mathbf{L}}\leftarrow\mathbf{M}$
10:
$d_{i,j}^{2}\leftarrow\|{\mathbf{L}}{\bm{x}}_{i}-{\mathbf{L}}{\bm{x}}_{j}\|^{2}% _{2}$ using Eq. (31).
11:
$\textit{Clustering}(d_{i,j}^{2})\leftarrow$ kernel k-means clustering using $d_{i,j}^{2}$ and $K(\bm{x}_{i},\bm{x}_{j})$
12:
$\textit{eval}\leftarrow$ wPFM( $\textit{Clustering}(d_{i,j}^{2})$ , $\mathbf{T}$ ) using Eq. (20).
13:
if $\textit{eval}>\textit{maxEval}$ then
14:
$\textit{maxEval}\leftarrow\textit{eval}$
15:
$\mathbf{M^{*}}\leftarrow\mathbf{M}$
16:
Crossover and Mutation ${\bm{P}}_{g+1}$ using Eqs (25) and (26) via GOjDE.
17:
$g\leftarrow g+1$ .
18:
return $\mathbf{M^{*}}$
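
The clustering step of Algorithm 2 (steps 9 to 11) can be sketched as follows. This is an illustrative sketch under several assumptions: an RBF kernel with a hypothetical parameter `gamma` stands in for the selected kernel, the feature-space distance is computed with the usual kernel trick for kernel k-means, and initialization is a random label assignment. None of the function names come from the original work.

```python
import numpy as np

def rbf_kernel(Z, gamma):
    """RBF Gram matrix exp(-gamma * ||z_i - z_j||^2) for the rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    return np.exp(-gamma * d2)

def feature_space_dist2(K, z, k):
    """||phi(x_i) - pi_c||^2 for every point i and cluster c, from the Gram matrix K."""
    n = K.shape[0]
    dist = np.full((n, k), np.inf)         # empty clusters stay at infinity
    for c in range(k):
        idx = np.flatnonzero(z == c)
        if idx.size:
            dist[:, c] = (np.diag(K) - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
    return dist

def kernel_kmeans(K, k, iters=50, seed=0):
    """Kernel k-means by alternating reassignment, starting from random labels."""
    z = np.random.default_rng(seed).integers(0, k, size=K.shape[0])
    for _ in range(iters):
        new_z = feature_space_dist2(K, z, k).argmin(axis=1)
        if np.array_equal(new_z, z):
            break
        z = new_z
    return z

def k_edml_clustering(X, L, k, gamma=1.0):
    """Transform the data by L (x_i -> L x_i), then run kernel k-means."""
    return kernel_kmeans(rbf_kernel(X @ L.T, gamma), k)
```

With the linear kernel, i.e. the Gram matrix of the transformed data itself, `feature_space_dist2` reduces to the squared Euclidean distance to the cluster mean, which is consistent with the remark that K-EDML with a linear kernel behaves like EDML.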

4.3 Simple example of K-EDML

Figure 5 illustrates the concept of the proposed K-EDML. The original data are shown in Fig. 2a, where each color denotes a class. To show the difference in the kernel data space, we map the original data to the kernel data space using Eq. (4), as shown in Fig. 2b. Figure 5a and b present the results of K-KMN and K-EDML, respectively. K-KMN cannot make use of the provided class labels and only minimizes the distance between centroids and data points in Eq. (5). Thus, K-KMN cannot cluster the data correctly even though a linear separation exists, as can be seen from the mixing of two clusters on the outer circle. In contrast, K-EDML takes advantage of the class labels to keep data with the same class label in the same clusters by stretching the data space, i.e., data of the same class move closer together and data of different classes move apart, as can be seen in Fig. 5b.

Figure 5.

Visualization of clustering results on kernel space.

5. Experiments

5.1 Experimental settings

We used the following ten open datasets from the well-known UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) [12]: Glass, Iris, Wine, Vehicle, Segment, Ionosphere, Pima, Musk, Balance, and Yeast. Basic statistics of these datasets are summarized in Table 2. The attribute values were normalized such that each attribute has zero mean and unit standard deviation. Datasets with more than 10 attributes are marked with an asterisk (*) after their names in Table 2; for these datasets, principal component analysis (PCA) was adopted to reduce the input dimension to 10.

Table 2

The basic statistics of UCI datasets

Dataset	# samples	# attributes	# classes
Glass	214	9	6
Iris	150	4	3
Wine*	178	13	3
Vehicle*	846	18	4
Segment*	2310	19	7
Ionosphere*	351	34	2
Pima	768	8	2
Musk*	625	166	2
Balance	625	4	3
Yeast	1484	8	10

Table 3

Settings for SOM, K-means with K-nearest neighbor graph (KMN-KNN) and Kernel K-means kernel function, and the evolutionary algorithms (EA) (“diagonal” is ${\bf M}$ as a diagonal matrix case, and “full” is a full matrix case)

Dataset	SOM # nodes	KMN-KNN # clusters, # neighbors	# individuals in EA diagonal, full	# generations in EA
Glass	10 $\times$ 10	20, 5	36, 90	6,000
Iris	10 $\times$ 10	20, 5	20, 30	2,000
Wine	10 $\times$ 10	20, 5	39, 182	4,000
Vehicle	13 $\times$ 13	50, 5	36, 171	3,000
Segment	13 $\times$ 13	50, 5	38, 190	2,000
Ionosphere	$-$	20, 5	$-$ , 165	2,000
Pima	$-$	20, 5	$-$ , 108	2,000
Musk	$-$	20, 5	$-$ , 165	2,000
Balance	$-$	20, 5	$-$ , 30	2,000
Yeast	$-$	20, 5	$-$ , 108	2,000

We employed k-means together with a k-nearest neighbor graph, or SOM [55], to obtain clusters with neighbor relations in EDML. We also used SOM to visualize the cluster structure because SOM is often useful for intuitively understanding learning results. The standard batch-type SOM with a regular grid and a Gaussian neighbor function was employed. To avoid dependency on initial values, we set unique initial weights for each dataset by using PCA alignment [40], where the initial weights are aligned on the plane of the first and second principal components so as to be equally distributed among the input data.

Table 3 shows the parameter settings for each dataset. The number of SOM nodes and the number of K-means clusters were set depending on the dataset size. In this experiment, it is not necessary to determine an appropriate number of clusters. The number of individuals in the evolutionary algorithm (EA) was determined depending on the size of the search space. We used $1D$ individuals, where $D$ is the number of variables, for the cases with a larger number of variables (Wine, Vehicle, and Segment with a full matrix), and $2D$ to $5D$ for the remaining cases. The generation limit in the EA was determined by checking convergence and the computational time. As mentioned earlier, the parameters CR and SF in GOjDE and jDE are automatically adjusted during the search. Moreover, the optimal neighborhood radius in the smoothing function $h_{i,j}$ within a cluster validity index was also determined automatically, before performing EDML, for each dataset and each type of validity index. We assumed that the optimal smoothing radius is the one that maximizes the difference between Eval and its value under randomized neighbor relations: $\sigma^{\ast}=\arg\max_{\sigma}|\textit{Eval}-\overline{\textit{Eval}}_{\textit{rnd(n)}}|$, where $\overline{\textit{Eval}}_{\textit{rnd(n)}}$ denotes the average of the Eval values obtained when the inter-cluster distances $\{{\bm{r}}_{i,j}\}$ are shuffled $n$ times.
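
The selection rule for $\sigma^{\ast}$ can be sketched as below. The sketch assumes that the smoothed index is available as a callable `eval_fn` taking the weight matrix $h_{i,j}=\exp(-{\bm{r}}_{i,j}/\sigma)$, that shuffling permutes the off-diagonal inter-cluster distances symmetrically, and that candidate radii are given as a list; these details, the toy index `toy_eval`, and all names are hypothetical.

```python
import numpy as np

def select_sigma(eval_fn, R, sigmas, n_shuffles=20, seed=0):
    """Pick sigma maximizing |Eval - mean Eval over shuffled inter-cluster distances|."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices(R.shape[0], k=1)
    gaps = []
    for sigma in sigmas:
        actual = eval_fn(np.exp(-R / sigma))
        shuffled = []
        for _ in range(n_shuffles):
            S = np.zeros_like(R)
            S[iu] = rng.permutation(R[iu])   # shuffle the pairwise distances
            S = S + S.T                      # keep the distance matrix symmetric
            shuffled.append(eval_fn(np.exp(-S / sigma)))
        gaps.append(abs(actual - np.mean(shuffled)))
    return sigmas[int(np.argmax(gaps))], gaps

# Toy setting: four clusters on a line, two classes, and a toy smoothed index
# that measures how much neighborhood weight falls on same-class cluster pairs.
R = np.abs(np.subtract.outer(np.arange(4.0), np.arange(4.0)))
classes = np.array([0, 0, 1, 1])
A = (classes[:, None] == classes[None, :]).astype(float)
toy_eval = lambda H: float((H * A).sum() / H.sum())

sigma_star, gaps = select_sigma(toy_eval, R, sigmas=[0.5, 1.0, 2.0, 4.0])
```

The intent is that a radius for which real and shuffled neighbor relations give nearly the same index value carries little neighborhood information, so the radius with the largest gap is preferred.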

5.2 Landscape of the fitness function

Figure 6.

The landscape of the clustering index (weighted F-measure) as a fitness function on Iris dataset.

This section shows why the cluster validity index is hard to optimize by illustrating the landscape of the fitness function. Figure 6 shows the fitness landscape of the Iris dataset using the weighted F-measure in the neighborhood of the best solution obtained by EDML. This landscape analysis is based on the idea in [54]. At each distance from the solution, measured as the Euclidean distance in the variable vector space, 100 points were sampled, and the average and the maximum of their fitness values (weighted F-measure) are plotted in the graph.

The figure clearly shows many spikes; a greedy search is therefore easily trapped in a poor local optimum, and a stochastic search is necessary to obtain a better solution. We can also see gradual global trends in both the average and the maximum of the fitness, shown as linear approximations, which implies that EDML was able to find a sufficiently good solution. However, such trends are difficult to find locally at some distances, as shown in the enlarged graphs, which can make it difficult for an optimizer to find the optimal solution because it becomes trapped in these shelf regions.
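
The sampling procedure behind such a landscape plot can be sketched as follows. The sketch assumes that points at a given Euclidean distance are drawn uniformly on a sphere around the solution and that the fitness is available as a callable; the function names and the treatment of constraints (none here) are assumptions.

```python
import numpy as np

def sample_shell(center, radius, n, rng):
    """n points at exactly the given Euclidean distance from center."""
    u = rng.normal(size=(n, center.size))
    u /= np.linalg.norm(u, axis=1, keepdims=True)   # uniform directions
    return center + radius * u

def landscape(fitness, center, radii, n=100, seed=0):
    """(radius, mean fitness, max fitness) for n samples at each radius."""
    rng = np.random.default_rng(seed)
    rows = []
    for r in radii:
        f = np.array([fitness(p) for p in sample_shell(center, r, n, rng)])
        rows.append((r, f.mean(), f.max()))
    return rows
```

Plotting the mean and maximum columns against the radius gives curves of the kind described above; for the real fitness, each sampled vector would first be converted into a metric matrix and repaired before clustering.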

5.3 Improvement of the clustering index

Figures 7 and 8 show transitions of the fitness (Eval) of the best individuals averaged over five trials with random initial values. These figures illustrate the improvement and convergence of the clustering index along with the evolution of the generation of the evolutionary algorithms.

Figure 7.

Transitions of the fitness (in various indices) of the best individuals on Iris dataset.

Figure 8.

Transitions of the fitness (weighted F-measure) of the best individuals on various datasets.

We compared a diagonal matrix case (i.e., in which $m_{i,j}=0$ $(i\neq j)$ in ${\mathbf{M}}$ ) to a full matrix case. The diagonal matrix case only handles scaling of each dimension, but no correlation of different dimensions is taken into account. A random search with the same number of trials – i.e., the population size multiplied by the generation limit – is also compared to validate search performance. “Euclidean” is the baseline evaluation using Euclidean distance.

We also partially compared representative real-valued evolutionary algorithms, particle swarm optimization (PSO) [47], and real-coded genetic algorithm (RGA) [52]. For the parameter settings of PSO, we used generally recommended values of inertia weight $w=$ 0.729, personal best weight $c_{1}=$ 1.49445, and global best weight $c_{2}=$ 1.49445. RGA employed simplex crossover, Gaussian mutation, and minimal generation gap (MGG) [17] for generation alternation. The number of fitness evaluations was set to be the same as that of GOjDE.

First, comparing our approach to the one using Euclidean distance, even a random search outperforms Euclidean, which shows the capability of a Mahalanobis-based distance. Second, the performance of GOjDE using a full matrix was better than that using a diagonal matrix in most cases, except for the Wine data with the weighted F-measure (wFME) (Fig. 8a), which shows that GOjDE was able to search appropriately even in the high-dimensional full matrix case. Third, although the performance of GOjDE is almost equal to that of PSO for the Iris dataset when using the weighted purity (wPUR) or the weighted F-measure (wFME) indices (Fig. 7a and b), the performance of GOjDE is much better than that of PSO and RGA in most cases. (In RGA, we needed to adjust the parameters depending on the dataset: the number of parents and children for crossover, and the expanding rate.) For higher dimensional data, i.e., Wine, Vehicle, and Segment, RGA obtained lower results than random search.

5.4 Effect of the amount of constraints

Figure 9.

Effect of the amount of labeled samples.

Figure 10.

Comparison on visualization of the clustering results by representative class (Segment data).

In this section, we study the effect of the number of constraints on learning. We varied the ratio of labeled samples used for generating constraints in the range of 0.01 to 1.0 on the Iris, Wine, and Glass data (Fig. 9). The wPFM was used for fitness evaluation. As the labeled sample ratio increases, wPFM rapidly increases until around 0.3. The Glass data shows overfitting when the labeled sample ratio is larger than 0.5, as the score on the test data decreases.

Figure 11.

Distribution of each class by standard SOM (Segment dataset).

Figure 12.

Distribution of each class by EDML-SOM without smoothing (Segment data).

From the result, we recommend using around 30% of labeled samples in order to avoid overfitting on some data; the contribution of the remaining labeled samples is much smaller than that of the first 30%. Moreover, when the ratio of labeled samples is lower than 0.1, there is no difference between EDML with a full matrix and with a diagonal matrix. This suggests that when less than 10% of the samples are labeled, a diagonal matrix can be used to reduce the computational time.

5.5 Effect of the neighborhood smoothing

This section shows how metric learning affects SOM cluster visualization. SOM captures the data space by the topology of neurons and maps it onto a low-dimensional space. The clustering results on the Segment data are compared in Figs 10–13. We compared the standard SOM with EDML using SOM (denoted as EDML-SOM), and also compared EDML-SOM with and without neighborhood smoothing. The wFME was used for EDML training.

Figure 13.

Distribution of each class by EDML-SOM with smoothing (Segment dataset).

First, Fig. 10 shows the majority class of each micro-cluster, where the x-y coordinate planes in the subfigures show the mapped planes. Clearly, EDML-SOM with neighbor smoothing (Fig. 10c) organizes the same classes well into neighboring nodes in the low-dimensional space, while the one without neighbor smoothing (Fig. 10b) shows no clear difference from the standard SOM (Fig. 10a).

Next, the distributions of samples in each class in the mapped space are shown in Fig. 11 using standard SOM, Fig. 12 using EDML-SOM without smoothing, and Fig. 13 using EDML-SOM with smoothing. The z-axis indicates the number of samples assigned to each class, which should be neighbors in the mapped plane.

Standard SOM failed to keep samples of the same class in neighboring nodes for most of the classes, especially classes 1, 3, and 5 (Fig. 11), which is consistent with Fig. 10. EDML-SOM without smoothing obtained clusters in which the samples of each class are much closer to each other, but for some classes, namely classes 3 and 5, most of the samples are assigned to identical clusters (Fig. 12). Meanwhile, the distribution in EDML-SOM with smoothing (Fig. 13) is more spread out than that without smoothing; however, the samples are still in neighboring nodes, and more importantly, there are only small overlaps among classes. Overall, EDML-SOM with smoothing shows the best visualization in terms of the density of each class and the separation among classes.

5.6 Comparison with other semi-supervised clustering methods

To measure the generalization performance of our proposed algorithms, we compared the proposed methods, EDML and K-EDML, with well-known methods in each category. As standard unsupervised clustering methods, K-means clustering (KMN) and kernel K-means clustering (K-KMN) [22] were selected as baselines. Information-Theoretic Metric Learning (ITML) [25], Distance Metric Learning for Large Margin Nearest Neighbor Classification (LMNN) [32], and Geometric Mean Metric Learning (GMML) [42] were chosen as representative DML methods. As a non-linear DML technique, we selected Gradient Boosted Large Margin Nearest Neighbors (GB-LMNN) [11]. Note that we omitted some popular semi-supervised clustering methods, i.e., COP-Kmeans [33], and clustering methods with distance metric learning, i.e., DML [14] and MPC-Kmeans [38], since the other comparison methods have been reported to outperform these baseline clustering and DML methods [25, 32, 11].

For a fair comparison, the experiments were performed under five-fold cross-validation. In the training process, each method produces five metric matrices, one for each fold, with the number of clusters and neighbors set to 20 and 5, respectively. The label sampling rate for the training data is set to 30%. For kernel selection and hyper-parameter tuning, a grid search with five-fold cross-validation was performed to find a suitable kernel and suitable hyper-parameters. The kernel was selected among the polynomial kernel, the radial basis function (RBF) kernel, the Laplacian kernel, and the sigmoid kernel. The hyper-parameter ranges were as follows: degree $d=[-10,-9,\ldots,9,10]$ for the polynomial kernel; $\alpha=10^{i}$ where $i=[-10,-9,\ldots,9,10]$ for the polynomial kernel, while $\alpha$ equals the inverse of the number of attributes for the sigmoid kernel; $\gamma=10^{i}$ where $i=[-10,-9,\ldots,9,10]$ for the RBF kernel; $\sigma=10^{i}$ where $i=[-10,-9,\ldots,9,10]$ for the Laplacian kernel; and the coefficient $c$, searched in the same range as $\alpha$, $\gamma$, and $\sigma$, for the polynomial and sigmoid kernels. We omitted the linear kernel from K-EDML because K-EDML with a linear kernel is analogous to EDML. Therefore, the results of EDML can be viewed as those of K-EDML with a linear kernel.

The output of each method, i.e., the cluster centroids in KMN and K-KMN and ${\mathbf{M}^{*}}$ in ITML, LMNN, GMML, GB-LMNN, EDML, and K-EDML, was then carried over to the evaluation process.

Each method is evaluated based on its category, i.e., linear or non-linear distance metric learning. We adopted k-means with a k-nearest neighbor graph for the linear techniques (KMN, ITML, LMNN, GMML, EDML) and kernel k-means clustering with a k-nearest neighbor graph and an identical trained kernel for the non-linear methods (K-KMN, GB-LMNN, K-EDML). The weighted pairwise F-measure (wPFM), with the same configuration as in the training process, i.e., 20 clusters and 5 nearest neighbors, was used to evaluate the clustering results.

Table 4
The average wPFM@20/5 with standard deviation of EDML, K-EDML and their comparison clustering methods on training data

	KMN	ITML	LMNN	GMML	K-KMN	GBLMNN	EDML	K-EDML
Glass	0.416 $\pm$ 0.01	0.418 $\pm$ 0.01	0.416 $\pm$ 0.01	0.422 $\pm$ 0.01	0.471 $\pm$ 0.02	0.443 $\pm$ 0.02	0.425 $\pm$ 0.02	0.477 $\pm$ 0.03
Iris	0.568 $\pm$ 0.08	0.619 $\pm$ 0.09	0.644 $\pm$ 0.09	0.641 $\pm$ 0.09	0.675 $\pm$ 0.09	0.606 $\pm$ 0.09	0.642 $\pm$ 0.09	0.737 $\pm$ 0.04
Wine	0.520 $\pm$ 0.02	0.638 $\pm$ 0.13	0.628 $\pm$ 0.12	0.552 $\pm$ 0.07	0.556 $\pm$ 0.07	0.688 $\pm$ 0.16	0.553 $\pm$ 0.07	0.574 $\pm$ 0.09
Vehicle	0.395 $\pm$ 0.00	0.406 $\pm$ 0.00	0.396 $\pm$ 0.00	0.396 $\pm$ 0.00	0.399 $\pm$ 0.00	0.396 $\pm$ 0.00	0.400 $\pm$ 0.01	0.399 $\pm$ 0.00
Segment	0.261 $\pm$ 0.01	0.294 $\pm$ 0.02	0.271 $\pm$ 0.02	0.286 $\pm$ 0.05	0.450 $\pm$ 0.06	0.318 $\pm$ 0.04	0.338 $\pm$ 0.06	0.514 $\pm$ 0.07
Ionosphere	0.654 $\pm$ 0.01	0.683 $\pm$ 0.01	0.655 $\pm$ 0.01	0.690 $\pm$ 0.02	0.591 $\pm$ 0.03	0.662 $\pm$ 0.02	0.689 $\pm$ 0.01	0.704 $\pm$ 0.01
Pima	0.642 $\pm$ 0.01	0.664 $\pm$ 0.01	0.646 $\pm$ 0.01	0.577 $\pm$ 0.15	0.496 $\pm$ 0.12	0.630 $\pm$ 0.01	0.661 $\pm$ 0.02	0.570 $\pm$ 0.21
Musk	0.552 $\pm$ 0.06	0.614 $\pm$ 0.01	0.546 $\pm$ 0.06	0.565 $\pm$ 0.05	0.666 $\pm$ 0.02	0.560 $\pm$ 0.06	0.585 $\pm$ 0.04	0.639 $\pm$ 0.05
Balance	0.561 $\pm$ 0.00	0.601 $\pm$ 0.01	0.583 $\pm$ 0.01	0.570 $\pm$ 0.00	0.507 $\pm$ 0.02	0.591 $\pm$ 0.01	0.569 $\pm$ 0.01	0.546 $\pm$ 0.04
Yeast	0.366 $\pm$ 0.00	0.369 $\pm$ 0.00	0.363 $\pm$ 0.01	0.370 $\pm$ 0.00	0.364 $\pm$ 0.00	0.371 $\pm$ 0.02	0.383 $\pm$ 0.03	0.396 $\pm$ 0.01

Table 5

The average wPFM@20/5 with standard deviation of EDML, K-EDML and their comparison clustering methods on test data

	KMN	ITML	LMNN	GMML	K-KMN	GBLMNN	EDML	K-EDML
Glass	0.410 $\pm$ 0.02	0.406 $\pm$ 0.02	0.410 $\pm$ 0.02	0.413 $\pm$ 0.02	0.458 $\pm$ 0.04	0.419 $\pm$ 0.02	0.417 $\pm$ 0.02	0.469 $\pm$ 0.04
Iris	0.553 $\pm$ 0.08	0.622 $\pm$ 0.10	0.655 $\pm$ 0.11	0.604 $\pm$ 0.08	0.647 $\pm$ 0.1	0.567 $\pm$ 0.08	0.659 $\pm$ 0.11	0.704 $\pm$ 0.08
Wine	0.527 $\pm$ 0.05	0.563 $\pm$ 0.07	0.562 $\pm$ 0.07	0.540 $\pm$ 0.05	0.489 $\pm$ 0.05	0.565 $\pm$ 0.07	0.531 $\pm$ 0.04	0.538 $\pm$ 0.07
Vehicle	0.392 $\pm$ 0.00	0.403 $\pm$ 0.00	0.392 $\pm$ 0.00	0.393 $\pm$ 0.00	0.388 $\pm$ 0.02	0.393 $\pm$ 0.00	0.393 $\pm$ 0.01	0.386 $\pm$ 0.02
Segment	0.262 $\pm$ 0.01	0.301 $\pm$ 0.03	0.274 $\pm$ 0.02	0.298 $\pm$ 0.06	0.450 $\pm$ 0.06	0.293 $\pm$ 0.03	0.306 $\pm$ 0.05	0.504 $\pm$ 0.07
Ionosphere	0.652 $\pm$ 0.03	0.686 $\pm$ 0.04	0.652 $\pm$ 0.03	0.687 $\pm$ 0.04	0.621 $\pm$ 0.12	0.656 $\pm$ 0.03	0.685 $\pm$ 0.04	0.671 $\pm$ 0.06
Pima	0.655 $\pm$ 0.03	0.665 $\pm$ 0.02	0.657 $\pm$ 0.03	0.659 $\pm$ 0.03	0.500 $\pm$ 0.17	0.639 $\pm$ 0.03	0.665 $\pm$ 0.02	0.617 $\pm$ 0.18
Musk	0.558 $\pm$ 0.06	0.612 $\pm$ 0.01	0.561 $\pm$ 0.06	0.571 $\pm$ 0.05	0.570 $\pm$ 0.10	0.570 $\pm$ 0.05	0.589 $\pm$ 0.04	0.530 $\pm$ 0.11
Balance	0.557 $\pm$ 0.01	0.593 $\pm$ 0.02	0.583 $\pm$ 0.01	0.569 $\pm$ 0.01	0.499 $\pm$ 0.06	0.586 $\pm$ 0.01	0.564 $\pm$ 0.01	0.531 $\pm$ 0.07
Yeast	0.367 $\pm$ 0.02	0.370 $\pm$ 0.02	0.364 $\pm$ 0.02	0.370 $\pm$ 0.02	0.361 $\pm$ 0.03	0.363 $\pm$ 0.01	0.371 $\pm$ 0.02	0.397 $\pm$ 0.02

Table 6

The rank of average wPFM@20/5 of proper kernel selection on K-EDML and their comparison clustering methods on training/test data

	KMN	ITML	LMNN	GMML	K-KMN	GBLMNN	K-EDML
Glass	6/5	5/7	7/6	4/4	2/2	3/3	1/1
Iris	7/7	5/4	3/2	4/5	2/3	6/6	1/1
Wine	7/6	2/2	3/3	6/4	5/7	1/1	4/5
Vehicle	7/6	1/1	6/5	5/3	3/7	4/4	2/2
Segment	7/7	4/3	6/6	5/4	2/2	3/5	1/1
Ionosphere	6/6	3/2	5/5	2/1	7/7	4/4	1/3
Pima	4/5	1/2	3/4	6/3	7/7	5/6	2/1
Musk	6/7	3/1	7/6	4/3	1/4	5/5	2/2
Balance	6/6	1/1	4/3	5/5	7/7	2/2	3/4
Yeast	5/4	4/2	7/5	3/3	6/7	2/6	1/1
Rank	6.1/5.9	2.9/2.5	5.1/4.5	4.4/3.5	4.2/5.3	3.5/4.2	1.8/2.1

Tables 4 and 5 present the five-fold cross-validation results, as the average and standard deviation of each clustering algorithm over 2000 trials, on training and test data. We observe the following. First, ITML, LMNN, GMML, GBLMNN, and the proposed methods improve the clustering performance over the baseline clustering methods, i.e., KMN and K-KMN, owing to distance metric learning. Second, the benefit of the kernelization technique can be seen in each pair of algorithms, i.e., KMN and K-KMN, LMNN and GBLMNN, and EDML and K-EDML: because of the non-linearly separable nature of the data, the kernelized technique yields a higher clustering score than its counterpart without kernelization. Third, although EDML does not use a non-linear kernel, it still achieves acceptably high results compared to both linear and non-linear DML methods, which is a benefit of directly improving the cluster validity index. Fourth, the proposed K-EDML outperforms or is at least comparable to the other clustering methods on eight datasets and obtains the highest clustering score on five datasets, which is the benefit of combining kernelization with a cluster validity index used directly as the objective function. Fifth, although other methods obtain the highest results on some datasets, they tend to perform well only on specific data, while the proposed method performs well on six datasets.

Then, we ranked the clustering evaluation results in Tables 4 and 5. Because EDML can be viewed as a special case of K-EDML, EDML is merged into K-EDML, and the resulting ranking is presented in Table 6. As a result, K-EDML ranks in the top 5 on all datasets and outperforms all other unsupervised and semi-supervised clustering methods in this paper, with average rankings of 1.8 and 2.1 on training and test data, respectively. These results illustrate the performance and robustness of the proposed method.
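
As a small worked example of this ranking step, the following sketch ranks the Segment row of Table 4 (training data). It assumes that merging EDML into K-EDML means taking the higher of the two scores, which is consistent with the ranks in Table 6 for this row; the function name `rank_row` is hypothetical.

```python
def rank_row(scores):
    """Rank methods by score (1 = best) after merging EDML into K-EDML."""
    merged = dict(scores)
    # Assumption: the merged K-EDML entry keeps the better of the two scores.
    merged["K-EDML"] = max(merged.pop("EDML"), merged["K-EDML"])
    order = sorted(merged, key=merged.get, reverse=True)
    return {method: rank for rank, method in enumerate(order, start=1)}

# Segment row of Table 4 (average wPFM@20/5 on training data).
segment_train = {
    "KMN": 0.261, "ITML": 0.294, "LMNN": 0.271, "GMML": 0.286,
    "K-KMN": 0.450, "GBLMNN": 0.318, "EDML": 0.338, "K-EDML": 0.514,
}
segment_ranks = rank_row(segment_train)
```

The resulting ranks (KMN 7, ITML 4, LMNN 6, GMML 5, K-KMN 2, GBLMNN 3, K-EDML 1) agree with the training entries of the Segment row in Table 6. Rows containing tied rounded scores cannot be reproduced this way, since the tie order depends on unrounded values.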

Table 7

The rank of average standard PFM of proper kernel selection on K-EDML and their comparison clustering methods, with the number of clusters equal to the number of classes, on training/test data

	KMN	ITML	LMNN	GMML	K-KMN	GBLMNN	K-EDML
Glass	5/5	6/2	2/7	4/4	7/6	1/1	3/3
Iris	7/6	3/3	1/2	4/4	6/7	5/5	2/1
Wine	7/6	2/2	1/1	6/5	5/7	3/3	4/4
Vehicle	7/7	1/1	6/6	4/4	3/2	5/5	2/3
Segment	7/7	4/3	6/6	5/5	3/2	2/4	1/1
Ionosphere	7/6	1/1	6/7	2/3	4/4	5/5	3/2
Pima	5/4	2/3	6/6	4/5	1/1	7/7	3/2
Musk	4/5	3/3	6/6	5/4	1/1	7/7	2/2
Balance	7/7	4/3	3/2	2/4	5/5	1/1	6/6
Yeast	5/3	7/7	4/5	2/2	3/4	6/6	1/1
Rank	6.1/5.6	3.3/2.8	4.1/4.8	3.8/4.0	3.8/3.9	4.2/4.4	2.7/2.5

5.7 Evaluation via standard evaluation criteria

Lastly, to make this experiment more practical, the number of clusters is set equal to the number of classes in the evaluation process. The standard pairwise F-measure (PFM), which is equivalent to wPFM@#class/0, is used as the measurement criterion, and the distance metric $\mathbf{M^{*}}$ obtained from the training process is evaluated again. Table 7 shows the ranking of the selected K-EDML and the other methods. The selected K-EDML achieves results similar to those of the previous evaluation at wPFM@20/5. The proposed method attains the lowest average rank, 2.7 on training data and 2.5 on test data, and it places in the top 4 on all datasets except Balance. Thus, the proposed method demonstrates its generalization capability by outperforming the competing methods under a standard cluster evaluation, even though the evaluation criterion differs from the training scheme. These results affirm the performance and robustness of the proposed method, as well as the benefit of neighborhood smoothing in the cluster validity index used as the objective function and of the kernelization technique.
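For concreteness, the standard pairwise F-measure can be sketched as follows. This is a minimal illustration that assumes the common pair-counting definition (precision and recall over pairs of data points that share a cluster and a class); it is not the evaluation code used in the experiments, and the function name `pairwise_f_measure` is hypothetical.

```python
from itertools import combinations


def pairwise_f_measure(classes, clusters):
    """Pairwise F-measure between class labels and cluster assignments.

    A pair of points counts as a true positive when both points share
    a class AND share a cluster. Cluster ids need not match class ids.
    """
    if len(classes) != len(clusters):
        raise ValueError("classes and clusters must have the same length")

    same_both = same_cluster = same_class = 0
    for i, j in combinations(range(len(classes)), 2):
        in_class = classes[i] == classes[j]
        in_cluster = clusters[i] == clusters[j]
        same_class += in_class
        same_cluster += in_cluster
        same_both += in_class and in_cluster

    if same_both == 0:
        return 0.0  # no correctly grouped pair: precision and recall are 0

    precision = same_both / same_cluster  # correct pairs among clustered pairs
    recall = same_both / same_class       # correct pairs among same-class pairs
    return 2 * precision * recall / (precision + recall)


# Toy example: six points, two classes, one point placed in the wrong cluster.
toy_score = pairwise_f_measure([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
```

In the toy example, 4 pairs agree in both class and cluster, 7 pairs share a cluster, and 6 pairs share a class, so the score is 2·(4/7)·(4/6)/((4/7)+(4/6)) = 8/13 ≈ 0.615.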

6. Conclusion

We proposed evolutionary distance metric learning (EDML) and kernelized evolutionary distance metric learning (K-EDML) for semi-supervised clustering, wherein any set-based or pairwise-based cluster validity index can be optimized using a variant of the differential evolution (DE) algorithm. The experiments show, first, that GOjDE achieves better search performance than other evolutionary algorithms such as PSO and RGA. Second, K-EDML addresses the drawback of EDML in non-linearly separable input spaces, and the benefit of the kernel function is illustrated by its superior results compared with other clustering algorithms for semi-supervised clustering. Third, smoothing over neighboring clusters in a cluster validity index can improve the visualization of neighboring micro-clusters.

Lastly, this research can proceed in several directions. One is to improve the computational efficiency of EDML and K-EDML in higher-dimensional problems, e.g., via eigenvalue optimization [63], reinforcement learning [2, 3], and aggregated DML [27]. Another is to utilize more label information, e.g., through multiple-kernel learning [8] and hierarchical information [6], and to integrate deep learning techniques to learn a distance metric [20, 31, 62]. Moreover, applying the proposed methods to other fields, e.g., classification and image retrieval, is also a candidate goal.

References

1. Bar-Hillel, Hertz, Shental and Weinshall, Learning distance functions using equivalence relations, in: Proc. the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 11–18.

2. Ali, Fukui, Kalintha, Moriyama and Numao, Reinforcement learning based distance metric filtering approach in clustering, in: 2017 IEEE Symposium Series on Computational Intelligence (SSCI), 2017, pp. 1328–1335.

3. Ali, Kalintha, Moriyama, Numao and Fukui, Reinforcement learning based distance metric filtering approach in clustering, in: 2018 The Genetic and Evolutionary Computation Conference (GECCO), 2018, pp. 155–156.

4. Xiao, Yang and Zha, Learning distance metric for regression by semidefinite programming with application to human age estimation, in: Proceedings of the 17th ACM International Conference on Multimedia, MM ’09, ACM, New York, NY, USA, 2009, pp. 451–460. doi: 10.1145/1631272.1631334.

5. Kulis, Basu, Dhillon and Mooney, Semi-supervised graph clustering: A kernel approach, Machine Learning 74(1) (2009), 1–22.

6. Nogueira B.M., Tomas Y.K.B. and Marcacini R.M., Integrating distance metric learning and cluster-level constraints in semi-supervised clustering, in: 2017 International Joint Conference on Neural Networks (IJCNN), 2017, pp. 4118–4125. doi: 10.1109/IJCNN.2017.7966376.

7. Yang, Nguyen T.T., E.L., Yao, Jin, Beyer H.G. and Suganthan P.N., Benchmark generator for CEC’2009 competition on dynamic optimization (2008).

8. Liu and Lu, Ordinal distance metric learning for image ranking, IEEE Transactions on Neural Networks and Learning Systems 26(7) (2015), 1551–1559. doi: 10.1109/TNNLS.2014.2339100.

9. Koloseni, Lampinen and Luukka, Optimized distance metrics for differential evolution based nearest prototype classifier, Expert Systems with Applications 39(12) (2012), 10564–10570.

10. Yeung D.Y. and Chang, A kernel approach for semisupervised metric learning, IEEE Transactions on Neural Networks 18 (2007), 141–149.

11. Kedem, Tyree, Sha, Lanckriet G.R. and Weinberger K.Q., Non-linear metric learning, in: Advances in Neural Information Processing Systems 25, 2012, pp. 2573–2581.

12. Dheeru and Karra Taniskidou, UCI machine learning repository (2017). URL http://archive.ics.uci.edu/ml

13. Bair, Semi-supervised clustering methods, Wiley Interdisciplinary Reviews: Computational Statistics 5(5) (2013), 349–361.

14. Xing E.P., A.Y., Jordan M.I. and Russell S.J., Distance metric learning with application to clustering with side-information, in: Advances in Neural Information Processing Systems (NIPS), 2002, pp. 505–512.

15. Wang and Sun, Survey on distance metric learning and dimensionality reduction in data mining, Data Min Knowl Discov 29(2) (2015), 534–564.

16. Chechik, Sharma, Shalit and Bengio, Large scale online learning of image similarity through ranking, J Mach Learn Res 11 (2010), 1109–1135. URL http://dl.acm.org/citation.cfm?id=1756006.1756042.

17. Satoh, Yamamura and Kobayashi, Minimal generation gap model for GAs considering both exploration and exploitation, in: Proceedings of the 4th International Conference on Soft Computing, Vol. 2, 1996, pp. 494–497.

18. Wang, Rahnamayan and Wu, Parallel differential evolution with self-adapting control parameters and generalized opposition-based learning for solving high-dimensional optimization problems, Journal of Parallel and Distributed Computing 73(1) (2013), 62–73.

19. Wang and Rahnamayan, Enhanced opposition-based differential evolution for solving high-dimensional continuous optimization problems, Soft Computing 15(11) (2011), 2127–2140.

20. Oh Song, Xiang, Jegelka and Savarese, Deep metric learning via lifted structured feature embedding, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

21. Davidson, Wagstaff K.L. and Basu, Measuring constraint-set utility for partitional clustering algorithms, in: Proc. The 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-06), 2006, pp. 115–126.

22. Dhillon I.S., Guan and Kulis, Kernel k-means: Spectral clustering and normalized cuts, in: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, ACM, 2004, pp. 551–556.

23. Goldberger, Roweis, Hinton and Salakhutdinov, Neighbourhood components analysis, in: Advances in Neural Information Processing Systems, 2004, pp. 513–520.

24. Brest, Greiner, Boskovic, Mernik and Zumer, Self-adapting control parameters in differential evolution: A comparative study on numerical benchmark problems, IEEE Transactions on Evolutionary Computation 10(6) (2006), 646–657.

25. Davis J.V., Kulis, Jain, Sra and Dhillon I.S., Information-theoretic metric learning, in: Proceedings of the 24th International Conference on Machine Learning, ICML ’07, ACM, 2007, pp. 209–216.

26. Tenenbaum J.B., de Silva and Langford J.C., A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000), 2319–2323.

27. Lin, Rui and Tao, A distributed approach toward discriminative distance metric learning, IEEE Transactions on Neural Networks and Learning Systems 26 (2015), 2111–2122.

28. Fukui and Numao, Neighborhood-based smoothing of external cluster validity measures, in: Proc. the 16th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD ’12, Springer, 2012, pp. 354–365.

29. Fukui, Ono, Megano and Numao, Evolutionary distance metric learning approach to semi-supervised clustering with neighbor relations, in: Proc. of 2013 IEEE 25th International Conference on Tools with Artificial Intelligence, ICTAI ’13, 2013, pp. 398–403.

30. Price K.V., Storn R.M. and Lampinen J.A., Differential Evolution: A Practical Approach to Global Optimization, Natural Computing Series, Springer-Verlag, Berlin, Germany, 2005.

31. Sohn, Improved deep metric learning with multi-class n-pair loss objective, in: Lee D.D., Sugiyama, Luxburg U.V., Guyon and Garnett, eds, Advances in Neural Information Processing Systems 29, Curran Associates, Inc., 2016, pp. 1857–1865.

32. Weinberger K.Q., Blitzer and Saul L.K., Distance metric learning for large margin nearest neighbor classification, Journal of Machine Learning Research (JMLR) 10 (2009), 207–244.

33. Wagstaff, Cardie, Rogers and Schrödl, Constrained k-means clustering with background knowledge, in: Proc. of the International Conference on Machine Learning (ICML-01), 2001, pp. 577–584.

34. Wagstaff K.L., Value, cost, and sharing: Open issues in constrained clustering, in: Proc. the Fifth International Workshop on Knowledge Discovery in Inductive Databases (KDID 2006), 2007, pp. 1–10.

35. Yang, Distance metric learning: A comprehensive survey, Tech. Rep. 16, Michigan State University (2006).

36. Yang, Jin, Sukthankar and Liu, An efficient algorithm for local distance metric learning, in: Proc. the National Conference on American Association for Artificial Intelligence (AAAI-06), 2006, pp. 543–548.

37. Soleymani Baghshah and Bagheri Shouraki, Kernel-based metric learning for semi-supervised clustering, Neurocomput 73(7–9) (2010), 1352–1361.

38. Bilenko, Basu and Mooney R.J., Integrating constraints and metric learning in semi-supervised clustering, in: Proc. of the International Conference on Machine Learning (ICML-04), 2004, pp. 81–88.

39. Belkin and Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, in: Advances in Neural Information Processing Systems (NIPS), 2002, pp. 585–591.

40. Attik, Bougrain and Alexandre, Self-organizing map initialization, in: Proc. International Conference on Artificial Neural Networks (ICANN-05), 2005, pp. 357–362.

41. Moutafis, Leng and Kakadiaris I.A., An overview and empirical comparison of distance metric learning methods, IEEE Transactions on Cybernetics 47(3) (2016), 612–625.

42. Zadeh P.H., Hosseini and Sra, Geometric mean metric learning, in: Balcan M.F. and Weinberger K.Q., eds, Proceedings of The 33rd International Conference on Machine Learning, Vol. 48 of Proceedings of Machine Learning Research, PMLR, New York, New York, USA, 2016, pp. 2464–2471.

43. Wang, Wan and Yuan, Locality constraint distance metric learning for traffic congestion detection, Pattern Recognition 75 (2018), 272–281, Distance Metric Learning for Pattern Recognition. doi: 10.1016/j.patcog.2017.03.030.

44. Qian, Jin, Pei and Zhu, Distance metric learning using dropout: A structured regularization approach, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, ACM, New York, NY, USA, 2014, pp. 323–332. doi: 10.1145/2623330.2623678.

45. Storn and Price, Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces, Journal of Global Optimization 11 (1997), 341–359.

46. Chatpatanasiri, Korsrilabutr, Tangchanachaianan and Kijsirikul, A new kernelization framework for Mahalanobis distance learning algorithms, Neurocomputing 73(10–12) (2010), 1570–1579. doi: 10.1016/j.neucom.2009.11.037.

47. Eberhart and Shi, Comparing inertia weights and constriction factors in particle swarm optimization, in: Proc. the 2000 Congress on Evolutionary Computation, Vol. 1, 2000, pp. 84–88.

48. Dasgupta and Ng, Which clustering do you want? Inducing your ideal clustering with minimal feedback, CoRR abs/1401.5389.

49. Roweis and Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000), 2323–2326.

50. Kaski and Sinkkonen, Principle of learning metrics for exploratory data analysis, The Journal of VLSI Signal Processing Systems for Signal Image and Video Technology 37 (2004), 177–188.

51. Rahnamayan, Tizhoosh H.R. and Salama M.M.A., Opposition-based differential evolution, IEEE Transactions on Evolutionary Computation 12(1) (2008), 64–79.

52. Tsutsui, Yamamura and Higuchi, Multi-parent recombination with simplex crossover in real coded genetic algorithms, in: Proc. the 1999 Genetic and Evolutionary Computation Conference (GECCO-99), 1999, pp. 657–664.

53. Ying, Wen, Shi, Peng and Qiao, Manifold preserving: An intrinsic approach for semisupervised distance metric learning, IEEE Transactions on Neural Networks and Learning Systems 29(7) (2018), 2731–2742.

54. Jones and Forrest, Fitness distance correlation as a measure of problem difficulty for genetic algorithms, in: Proc. the 6th International Conference on Genetic Algorithms (ICGA-95), 1995, pp. 184–192.

55. Kohonen, Self-Organizing Maps, Springer-Verlag, 1995.

56. Hertz, Bar-Hillel and Weinshall, Boosting margin based distance functions for clustering, in: Proc. the 21st International Conference on Machine Learning (ICML-04), 2004, pp. 393–400.

57. Kalintha, Fukui, Ono, Megano, Moriyama and Numao, Semi-supervised evolutionary distance metric learning for clustering, in: The 29th Annual Conference of the Japanese Society for Artificial Intelligence, JSAI ’15, 2015.

58. Kalintha, Fukui, Ono, Megano, Moriyama and Numao, Integrating class information and features in cluster analysis based on evolutionary distance metric learning, in: Intelligent and Evolutionary Systems, Springer International Publishing, 2017, pp. 165–181.

59. Kalintha, Ono, Numao and Fukui, Kernelized evolutionary distance metric learning for semi-supervised clustering, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), AAAI-17, 2017.

60. Kalintha, Megano, Ono, Fukui and Numao, Cluster analysis of face images and literature data by evolutionary distance metric learning, in: Proc. of the 35th SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence, AI ’15, Springer, 2015, pp. 301–315.

61. Bian and Tao, Learning a distance metric by empirical loss minimization, in: Proc. International Joint Conference on Artificial Intelligence (IJCAI-11), 2011, pp. 1186–1191.

62. Wang, Chen, Rai and Carin, Deep metric learning with data summarization, in: European Conference on Machine Learning and Knowledge Discovery in Databases – Volume 9851, ECML PKDD 2016, Springer-Verlag, Berlin, Heidelberg, 2016, pp. 777–794.

63. Ying and Li, Distance metric learning with eigenvalue optimization, J Mach Learn Res 13 (2012), 1–26.

64. Zha Z.-J., Mei, Wang and Hua X.-S., Robust distance metric learning with auxiliary knowledge, in: Proc. International Joint Conference on Artificial Intelligence (IJCAI-09), 2009, pp. 1327–1332.
