Abstract
Data clustering is one of the most important tasks in machine learning and data mining, which aims to discover natural structure of the data, identify relationships between observations inside data sets, or detect outliers. Clustering is traditionally seen as part of unsupervised learning, but in many situations, side information about the clusters may be available in addition to the values of the features. For example, the cluster labels of some observations may be known (called seeds) or certain observations may be known to belong (or not) to the same cluster (pairwise constraints). Clustering algorithms using such information are called semi-supervised algorithms. A problem is that although many semi-supervised clustering algorithms have been presented in literature over the last decades, each of them usually uses one kind of side information. In this work, we aim to propose a new semi-supervised density based clustering which integrates effectively both kinds of side information, and embeds an active learning strategy in the process of finding clusters, named MCSSDBS. In order to evaluate our proposed method and demonstrate its effectiveness compared with a state-of-the-art semi-supervised density-based clustering (SSDBSCAN), a series of experiments is carried out on both synthetic and real world data sets. First is experiments primarily conducted on 6 data sets from UCI repository. Then, especially for the facial expression recognition task, our tests are performed on two facial data sets: A popular one in literature – the extended Cohn Kanade Data set (CK+), and our own new facial data set collected from volunteers in Vietnam – named ITI facial expression data set. Comparative results conducted show that our method can boost the performance of clustering process.
Keywords
Introduction
Clustering is the problem of partitioning a data set
In the past two decades, clustering with side information (known as semi-supervised clustering) has received a great deal of attention [1]. We can note here the semi-supervised K-means [2, 3], semi-supervised Fuzzy-C means [4], semi-supervised spectral clustering [5], semi-supervised density based clustering [6, 7, 8], semi-supervised graph based clustering [11], to mention a few. They showed that clustering can benefit from some users’ knowledge to boost the performance of clustering. Such information can generally be either constraints or seeds. Constraints involve must-link and cannot-link constraints in which the must-link constraint (ML) between two observations
To begin with, clustering with constraints can be divided into two main families: either 1) metric learning-based methods: the constraints are used to learn a metric/objective function or 2) constraint-based method: the constraints are used as hints to guide the algorithm to a find useful solutions [1].
Given a constraint set, the metric learning methods are first trained to satisfy the constraints so that, after training step, data objects associated by a must-link constraint should be close and data objects linked by a cannot-link constraint should be well separated in the learning space.
In constraint-based approaches, two families of methods can be found: on the one hand, algorithms with a strict enforcement, which find the best feasible clustering respecting all the constraints, and, on the other hand, algorithms with partial enforcement, which find the best clustering while maximally respecting the constraints. To this aim, several techniques have been proposed so far in the literature: modifying the clustering objective function so that it includes a term of constraint satisfiability, enforcing all constraints to be satisfied during the assignment step in the clustering process, or initializing clusters and inferring clustering constraints based on neighborhoods derived from labeled example.
Secondly, in seed based clustering, a set of seed can be used for initializing cluster centers, such as in K-Means and Fuzzy C-Means, automatically evaluating parameters in semi-supervised density-based clustering (SSDBSCAN) [6], helping in the label propagation process and learning distance in seed based hierarchical clustering (HISSCLU) [8], or identifying connected components in semi-supervised graph based clustering (SSGC) [11].
Pursing this further, the active learning algorithm aims to obtain high quality using as few labeled objects as possible and minimizes the queried users to get label. Active learning is very benefit in many modern machine learning problems where data may be abundant but labels are expensive or time consuming to obtain [9]. While active learning algorithms for supervised classification has a long history, the problem of active learning for semi-supervised clustering only appears when the research of integrating prior knowledge in clustering proposed in 2001 [2]. For constraint clustering, there have been many works proposed recently. The idea principle is to choose the most useful constraints such that they not only boost the clustering performance but also minimize user questions. In [15], Basu et al. proposed a method based on min-max strategy to constrained K-Means clustering. In [12], Vu et al. proposed a graph-based method to collect constraints for any kind of semi-supervised clustering algorithms, in [16, 17], Abin and Beigy introduced a method based a Kernel, sequential approach for K-Means, DBSCAN. The research of active learning for seed-based clustering, we mention here the work in [13, 10]. This is a method to collect the seed based on min-max strategy (named S-Min-Max). The idea of of S-Min-Max is to build a set of seed which can cover the distribution of input data.
To wrap it all up, in this work, we integrate both pairwise constraints and seeds information, additionally embed an active learning phase in the clustering process to establish a novel algorithm, named MCSSDBS. This work is an extension of our preliminary research [14], presenting a new application in facial expression recognition topic. It is noteworthy that while there exists a number of constraint-based clustering and seed-based clustering algorithm, there is no semi-supervised clustering using both kinds of such information. Besides, as a part of the current direction research in evaluation of citizens’ satisfaction in public services, we collect our own facial expression data set (named ITI facial expression data set) from local volunteers, and then apply our new algorithm to facial expression recognition task. Comparative results conducted from a series of data sets show the effectiveness of our new method.
The rest of this paper is organized as follows. A review of related semi-supervised density-based clustering algorithms is introduced in Section 2. Section 3 presents our proposed semi-supervised density-based clustering algorithm, named MCSSDBS. Section 4 introduces our experiment setups, presents the experiments that have been conducted on benchmark data sets from UCI as well as facial expression data sets. Finally, Section 5 concludes the paper and discusses several lines of future research.
Related work
Density-based clustering
The notion of density based clustering is introduced by Ester et al. in 1996 when they propose algorithm DBSCAN [18]. Given a data set
.
(core object) An object
.
(density reachable) A point
.
(density connected) A point
.
(cluster) Let X be a database of points. A cluster C w.r.t.
The advantage of DBSCAN is that it can detect clusters with arbitrary shape and identify noises. However, as these parameters are set once for all clusters, DBSCAN can only separate clusters well for the data sets with the same density. In some cases where the density among clusters differ widely and the clusters are not totally separated by sparse regions (take one as shown in Fig. 1 as an example), DBSCAN fails to discover exactly the cluster structure. In Fig. 2, DBSCAN extracts only the two clusters: one merges the two denser clusters A and B, and another is C with MinPts
Three clusters: A, B, and C in a data set sample.
DBSCAN clustering with MinPts 
In this case, there is no global set of parameters MinPts and 
SSDBSCAN extends the original DBSCAN algorithm by using a small set of labeled data to cope with the problem of finding clusters in distinct densities data. SSDBSCAN overcomes this problem by using seeds to compute an adapted radius
The rDist(p,q) measure indicates the smallest radius value for which
where
Given a set of seeds
The complexity of SSDBSCAN is
This section presents our proposed method, namely MCSSDBS. Contrary to other semi-supervised clustering algorithms in literature, we try to integrate seed, must-link, and cannot-link constraints in the process of finding clusters. Moreover, we argue that choosing the largest edge in the set of all density-connection paths as separation between two clusters (as in SSDBSCAN) may not be the best solution. Our proposal, instead, uses an active learning phase to tackle with this problem. The main steps of MCSSDBS are presented in Algorithm 3 as follows.
[h]MCSSDBS[1] A set of data X, a set of seed D, and a set of constraints (must-link and cannot-link) Clusters and outliers (option)
Construct the graph G from data w.r.t new_rDist()
Constructing final clusters by using must-link constraints (Option) w.r.t new_rDist() add outliers to the nearest clusters
The first step of MCSSDBS (Line 1) is also the construction of graph with the weights calculated by rDist(), but with embedded must-link and cannot-link constraints, we have the new_rDist(), in which the cDist(o) in Eq. (1) now is the minimal radius such that
Explain steps 2–5 in Algorithm 1: MCSSDBS enlarging process on our data set sample. In our proposed algorithm, the cut point is detected as the edge 7 instead of the edge 16 in SSDBSCAN. The point A, therefore, belongs to the same cluster as Seed 1.
[h] Identifying the cut point[1] a path in graph the cut point Classify all the edges of the path in descending order according to their new_rDist() value
The second step (Lines 2–5) is the enlarging clusters from seeds. The important key here is identifying the cut point. In contrast with SSDBSCAN, we use the active learning phase (Line 3 in Algorithm 3) expressed in Algorithm 4. Our hypothesis is that the longest edge may not be the an appropriate cut point. For clarity of these steps, we explain our MCSSDBS enlarging process on our data sample in Fig. 4. Starting from the Seed 1, the algorithm constructs a graph according to new_rDist() values. The Seed 1 connects to other observations in the following edge sequence as {1, 2, 3, …}. The process stops when another seed with different label (Seed 2 in our example) is added to the tree. Here, SSDBSCAN looks for the currently largest edge, that is the edge 16, so that it cut this edge, and put the point A into the same cluster as the Seed 1. However, in fact the point A should belong to the same cluster as the Seed 2. In our algorithm, MCSSDBS, the cut point should be the edge 7. It is also note that in case we cannot get any CL label from users (the response of users may be “We do not know” for all questions), the largest cut will be used to decide the cut point as in SSDBSCAN.
In the third step (Line 6), ML constraints are used one more time at the last step to merge isolate points such that those have a link with clusters. Finally, the remaining points which do not belong to any clusters can be seen as outliers. Line 7 is an option that the outliers can be added to their nearest clusters with respect to new_rDist() values.
Experiment setup
To evaluate the effectiveness of our new method, a series of experiments are conducted. Firstly, we use data sets from UCI that is the traditional data sets for evaluation of clustering algorithm in literature. We then apply MCSSDBS to facial expression problem on the extend Cohn-Kanada (CK+) data set and our new facial expression data set collected from Vietnam citizens. At each time of running, constraints and seeds are randomly generated and reported results are averaged over 20 times.
Comparing the pairs
Comparing the pairs
Then, the Rand Index is defined as:
The value of RI is a number between 0 and 1 for the original version, in which, RI
Description of UCI datasets
To evaluate the effectiveness of our algorithm, we compare MCSSDBS with SSDBSCAN using 6 data sets from UCI machine learning [19], named: Iris, Haberman, Soybean, Glass, Zoo, and Thyroid. The detail of these data sets are shown in Table 2.
Details of UCI data sets
Details of UCI data sets
We fixed MinPts
Comparison results between MCSSDBS and SSDBSCAN on 6 widely-used data sets from UCI w.r.t Rand Index (the higher, the better).
The results for MCSSDBS and SSDBSCAN are an average over 20 runs, and depicted in Fig. 5a–f in which the blue lines with square markers show results of our proposed method MCSSDBS using both must-link (ML) and cannot-link (CL) constraints, while the pink dash lines with right-pointing triangle markers are results of MCSSDBS in case of not using constraints and the last one represents results of the algorithm compared, SSDBSCAN. Based on that, the following observations can be made:
A glance at the graphs reveals that MCSSDBS with or without using constraints (ML Results in case of not using Constraints indicate that MCSSDBS validates our hypothesis that the longest edge may not be the best criterion in the case of density based clustering algorithms. MCSSDBS considers active learning in the iteration process, answering example-level queries whether or not the two observations belong to the same cluster, so that the more sufficient cut point is detected. Thus, even without using Constraints, MCSSDBS works pretty well compared to SSDBSCAN. Take Iris as an example, the given graph (Fig. 5a) illustrates the advantage of active learning phase in the iteration clustering process. With 5% of the whole data set are seeds (corresponding to 8 seeds), SSDBSCAN method reaches an RI accuracy of 89%. Contrary to SSDBSCAN, MCSSDBS-Without Constraints involves further approximately an average of 11.5 queries, and as the result, it gives an 3% improvement, obtaining the RI accuracy of 92%.
Introduction to facial expression data sets
ITI facial data set is collected in 2017 in Vietnam, aimed for automated facial image analysis and synthesis and for perceptual studies. Three hundred and fifty-four photos are collected in the real condition with different illumination, pose and resolution, with and without glasses from 28 volunteers: 7 women and 21 men. The expression in each photo was ratings over 2 subjects, then is given an emotion label. We use 354 labeled expression images, containing 6 different facial expressions: anger, disgust, happy, neutral, sadness, and surprise for experiments. The number of images belonging to each category is shown in Table 4. Figure 6 shows some examples of collected images in ITI data set. We also note that, in facial expression task, it is not difficult to get label from users for seeds and constraints compared with other domains such as annotating gene and disease, speech recognition task, or media labeling [9].
Distribution of CK+ data set
Distribution of CK+ data set
Distribution of ITI data set
Examples of ITI data set.
After eye region localization, the left eye center
The direction of the rotation (clock-wise or counter clock-wise) depends on the difference
We would also note that histogram equalization transforming the values in an intensity image so that the histogram of the output image is approximately flat. Median filtering is expected to (more or less) keep the edges well maintained while doing image smoothing. It is particularly useful for salt-and-pepper noise where it is highly probable that these noisy pixels will appear the beginning and at the end when sorting pixel neighborhoods.
Comparison results between MCSSDBS and SSDBSCAN on CK+ data set.
Comparison results between MCSSDBS and SSDBSCAN on ITI facial data set.
Conclusion
In this paper, we have presented a density-based clustering algorithm, namely MCSSDBS, that extends our preliminary research [14] for semi-supervised clustering tasks. The special aspect of our algorithm is that it can integrate both kinds of side information: seeds and pairwise constraints. Moreover, the active learning phase was applied to help the algorithm in the iteration process detect the cut point.
Experiment results on UCI data sets show that our algorithm can boost the quality of clustering, compared with SSDBSCAN. To the problem of facial expression recognition, our proposed method is verified on the extend Cohn-Kanade data set and on our new facial expression data set collected from volunteers among Vietnam citizens. Given pretty good results prove that such the semi-supervised clustering MCSSDBS is a good choice for facial expression clustering task. Finally, in near future, we will continue to apply MCSSDBS for other problems in Vietnam.
Footnotes
Acknowledgments
This research is funded by Vietnam National University, Hanoi (VNU) under project number QG.17.43.
