Abstract
This paper describes a new semi-supervised clustering algorithm as part of a more general framework of interactive exploratory clustering, that favors the exploration of possible clustering solutions so that an expert tailors the best clustering according to her domain knowledge and preferences. Contrary to most existing approaches, the novel algorithm considers the feature space as a first class citizen for the exploration of alternative solutions. Our proposal represents and integrates quantitative preferences on attributes that will guide the exploration of possible solutions by learning an appropriate space metric. It also achieves a compromise clustering based on expert confidence, between a data-driven and a user-driven solution and converges with a good complexity. We show experimentally that our method is also able to deal with irrelevant user preferences and correct those choices in order to achieve a better solution. Experiments show that the best results may be achieved only with the addition of preferences to traditional metric learning algorithms and that our approach performs better than state-of-the-art algorithms.
Introduction
Data clustering is a well studied domain of unsupervised machine learning that aims at partitioning datasets into homogeneous groups of items sharing similar properties [37, 1]. As such, it is generally used as part of a more general data exploration process where an expert wants to summarize information or to discover categories, iteratively interacting with data to answer a business question. In Customer Relation Management (CRM), clustering is used to determine meaningful customers profiles from sales records [35, 36, 61].
Traditionally, experts have been able to take advantage of the exploratory nature of clustering algorithms and to compare alternative clustering solutions via the algorithms’ parameters, such as the number or the initial position of clusters centers in k-means [53], the number of nearest neighbors and the size of neighborhood in DB-SCAN [30] or the aggregate criterion in agglomerative hierarchical approaches [54]. However, in this case, there are no mechanism that can easily help the expert finding the parameters that best suit her exploration’s preferences.
Other works have shown that the space of attributes (features) can be a first class citizen to explore new clustering solutions, either by tailoring the metric space using metric learning [70, 38], or by weighting/selecting the most likely interesting features [21, 48, 44], or finally by producing meaningful subspaces [56, 40, 27]. Note that bi-clustering approaches [33, 16] aim to perform a simultaneous clustering of both the objects and attributes of a data matrix. As a consequence, the attributes are used as the objects to be clustered, and not used to define the exploration space. However, these approaches rely solely on data but are agnostic of the expert and are thus not really interactive, and may not be easy to understand during the process of an exploration. For example, in the case of subspace clustering, experts can be left with several partitions of the same dataset, while increasing simultaneously the cost of interpretation.
Therefore, it is was necessary to provide to the users more interactive approaches, allowing them to explore efficiently appropriated clustering solutions. The research studies on interactivity for clustering exploration has gain much attention at the beginning of last decade, with the advent of semi-supervised methods [11, 23, 49, 18]. These approaches introduce expert knowledge as constraints into the clustering algorithm. These constraints help guiding the clustering process in order to improve the quality and the coherence of the built partition according to expert needs. These constraints can be defined on many levels: the instance level [11, 13, 69, 59, 46, 6, 51, 50], the cluster level [9, 34, 10, 71, 31, 47], or the partition level [8, 24, 58].
This paper illustrates the concept of exploratory clustering that builds upon semi-supervised clustering as a way to favor the exploration of possible clustering solutions, so that an expert tailors the best clustering according to domain knowledge and preferences. Following the IDEA principles [17], exploratory clustering should rely on semi-supervised approaches, as well as visualization and appropriate interaction mechanisms. In this context, we describe a new semi-supervised clustering algorithm that considers the feature space as a support for the exploration of alternative solutions, by integrating quantitative user preferences on attributes.
Customer segmentation using the purchase of products 
Let us illustrate the objective of our work by considering a toy example in the field of customer segmentation task. In order to motivate our contribution, we first consider the example of three experts
First, consider an expert
Consider now an expert
Our proposal
In order to tackle the problems and challenges illustrated by our motivating example, we propose in this paper a new semi-supervised clustering algorithm that integrates user preferences on attributes during the construction of clusters. To the best of our knowledge, only the works in [60] addresses the same problem, but with a different model of user preferences. More precisely, the main contributions of this paper are the following:
We propose to use a quantitative model of preferences to represent the user preferences on attributes. Comparing to [60], the authors propose a qualitative model where the user has to express – for a couple of attributes – which is the preferable than the other. However, regardless of this expressivity, the authors underline that their model of preferences can be very difficult to set. By comparison, our quantitative model is easier to use by an expert. Moreover, we show that it guarantees a lower time complexity when we build the clusters, in particular when the user preferences define a total pre-order on the set of descriptive attributes. We present the new clustering algorithm MAPK-means (Metric Attribute Preferential K-means) that minimizes our clustering objective function. This objective function combines the intra-cluster distance weighted by a learned metric with quantitative preferences while considering the confidence level. We prove the convergence of this algorithm and we analyze its complexity. We demonstrate finally that when no user preferences are used, MAPK-means is similar to MPCK-Means [13] without constraints. We provide a large set of experiments on UCI benchmarks to prove the ease of use and the efficiency our approach. In particular, these experiments show that user preferences on attributes improve the quality of clustering and that our approach is robust against erroneous preferences due to the confidence level (as illustrated with expert C in the motivating example). Moreover, our approach achieves a compromise clustering, based on expert confidence, between a data-driven and a user-driven solution. Interestingly, MAPK-means outperforms the state-of-the-art algorithms (CFP [60], K-Means and K-Means with a weighted distance). We detail a use case taking up the scenario of the motivating example on a real dataset that shows in practice the interest of user preferences for guiding efficiently to explore the appropriate customer segmentation. This use case stems from a collaboration with Group Up1
Note that we make the complete source code of MAPK-means and the experiences implementations available2
GitHub:
This paper extends the work presented in [29] that introduced a first version of the algorithm MAPK-means. Compared to this first version, the initialization of the algorithm has been modified to be consistent with user preferences. More importantly, a theoretical analysis of MAPK-means is devised in this paper and the experimental section presents completely new results that provide accurate insights on the behavior of MAPK-means and its robustness to parameter settings. Indeed, reported results concern more UCI benchmark datasets and our protocol now exploits statistical tests to ensure more clearly the significance of results. Furthermore, the use case is a new and important addition illustrating the practical use of MAPK-means by a company, in the context of customer segmentation. Finally, we note that this paper provides an extensive literature survey on semi-supervised clustering approaches that studies the constraints by their types, considering their level of application, their resolution approaches and the impact of their quality on the clustering.
The remainder of the paper is organized as follows. In Section 2, we discuss some related works about semi-supervised clustering. Then, we formulate the problem and show in Section 3 how user preferences on attributes can be integrated in a clustering objective function. The new clustering algorithm MAPK-means that minimizes our new clustering objective function is derived, explained and analyzed in Section 4. Then, in Section 5, we evaluate and compare our algorithm to other approaches using several UCI benchmarks. Section 6 reports a use case relying on a real-world dataset. Finally, we conclude the paper and discuss some perspectives in Section 7.
Related work
The main objective of our work is to let an expert express preferences about features that might be interesting in the context of data clustering, in the sense that the preferred subspace should allow for a better distinction between the clusters, and should help finding a view of the data that is appropriate to answer the business question at hand. This problem can be either considered as a feature selection or weighting process [3, 44] or as a subspace clustering problem [41, 56]. However, both these processes are completely data-driven and none of these allows for the expression of user preferences on the features of interest. Indeed, feature selection approaches are based on some internal criteria that is design to improve clustering accuracy or interpretability without any user interaction: once chosen, the features (weights) are kept during the whole analysis, leaving no room for an expert to provide an alternative view of the problem. In this respect, this selection (or weighting) process can be seen as hard constraints and not preferences on the features. Similarly, subspace clustering algorithms conduct a purely data-driven exploration of different subsets of features where clusters are relevant according to a criterion, for example density as in CLIQUE [2]. As a consequence, and even if there exists several alternative features subsets to choose from, the expert is left with no proper way to select the one that best fit her expert knowledge.
In the rest of this section, we present the different types of user constraints existing in the literature. Those are mainly distinguished by their level of application : on instance level, on cluster level, on partition or on attributes. We then explain how these constraints are resolved in the clustering and we provide a short overview on that the main families of constraints resolution approaches. We finally discuss the sensitivity of the clustering method to the quality of the expert knowledge, how this issue was treated in existing work and in our proposed algorithm.
Expressing expert knowledge as constraints
The solution to our problem can be expressed as a semi-supervised clustering algorithm that improves the relevance of a k-means like clustering algorithm based on an expert knowledge represented as user preferences on the features. As first introduced in [66], expert knowledge is generally provided at the instance level as labels [11], also called seeds, as in supervised classification, or as pairwise instances constraints, sometimes also called equivalence constraints [13]. These pairwise constraints indicate that two objects should be in the same cluster (Must-Link or ML constraints) or not (Cannot Link or CL constraints). Recently, a new form of constraints at the instance level as relative distance constraints, was introduced in [43, 50, 57], but they are more adapted to ranking and instance order preferences.
Numerous clustering approaches like k-means or DB-SCAN have been improved to benefit from side information provided as instances constraints [23, 11, 18, 69, 59]. For example, SS-DBSCAN [46] improves the initial DBSCAN algorithm with an automatic setting of the density parameters based on labels, while in [11] the authors use constraints to define the initial centers of K-means to avoid the convergence to an inappropriate solution. In this case, the difficulty of exploring the initial parameter settings space is replaced by a prior knowledge of the user on the data she wants to partition, which may be in some cases, easier to formulate.
More recent works express cluster level constraints to avoid contradictions at the instance level [15, 28, 71, 31]. In [15, 14] authors propose to use constraints on the clusters size, with the addition of pairwise instance constraints, to avoid ill-formed k-means output with less than k clusters. [25] have introduced cluster compactness and separation constraints to improve the quality of built clusters or to help to satisfy other types of constraints such ML and CL.
Other algorithms express constraints on the whole partition of a dataset. Some works build an equi-sized partition, with the objective of balancing the size of clusters [9, 10, 39]. Other proposals find an alternative partition that is as dissimilar as possible from a reference partition, but with a comparable or better quality [8, 24, 58, 19].
To the best of our knowledge, only very few works have explored the semi-supervised clustering problem with hard constraints or preferences on the features rather than on the instances. An alternative approach to our proposal in [60, 68] consists in expressing qualitative attributes preferences by mean of a triple
Contrary to this approach, our model of quantitative preferences only requires a linear number of preferences, i.e. 1 per attribute. Moreover, in addition to simplifying the interaction with the user, our model leads to a better complexity in learning the metric that corresponds to the best subset of features.
Constraints resolution in clustering
Semi-supervised clustering methods can be categorized depending on the approach used to resolve the constraints. That is how the constraints are expressed and resolved in the method. Hence, the exiting works in the literature can be divided into three main families of resolution methods: ad-hoc, exact and soft. We note that the literature sometimes distinguishes semi-supervised methods depending on the satisfaction strategy used (hard or soft): hard constraints, where all constraints must be satisfied, and soft constraints where the constraint violation is permitted [32]. This distinction between hard and soft constraints is done independently of any particular resolution method.
Ad hoc approaches
In this family of approaches, the constraints are expressed outside the objective function of the clustering method used. The resolution is done by extending an existing algorithm, like K-Means, and modifying or adding some instructions to ensure the satisfaction of the constraints [64, 11, 8, 14, 31]. The COPK-means [64] derives from a k-means algorithm and use ML constraints to define initial cluster centers, while CL constraints are injected during the assignment step to avoid any grouping of objects that should not belong to the same cluster. Authors enforce a hard constraints satisfaction but without a guarantee of the final solution because of contradictions that may be observed between constraints. In [11], the authors propose two variants of K-Means that consider seeds constraint. In the first, the seeds are used only to compute the initial centers, in the second the authors add an instruction to the convergence phase of K-Means, in order to ensures that the seeds remain in the same clusters. [15, 14] modify the assignment step, in a K-Means like approach, to consider minimum cluster size constraints. They use to this aim the linear programming technique. The algorithm presented in [31] forces the constraints satisfaction on the maximum size of clusters during the assignment step. More specifically, a data instance is assigned to the closest cluster satisfying the size constraint.
Exact approaches
The goal of this family of approaches is to find clustering solutions that satisfy all the constraints specified by an expert. To this aim, exiting methods use Constraint Programming (CP) techniques that separate the modeling of the problem from its resolution. If a solution exists, this solution is also optimal with respect to the objective function introduced. In the clustering domain, to our knowledge [20] are the first to use the CP paradigm to solve semi-supervised clustering problems. Their method deals with the problem of expressing constraint at the cluster level with features constraints. This method has the nice property of producing an optimum solution if it can be found, but at the price of a NP-complete complexity. In this respect, the more constrained the problem is, the more efficient is the algorithm. However, the proposed solution is limited to hard constraints that does not match our objective of soft constraints to represent expert preferences. However, it is to be noticed that it would be possible to improve this approach with soft constraints by prioritizing them following the idea of [55].
Soft approaches
Approaches of this family share the general principle to transform the constraints specified by the user in terms of a set of penalties in addition to the objective function to minimize (to find a clustering solution regardless of any constraints). In most of approaches, these penalty terms should be convex. The goal of the methods in this class is thus to minimize the cost of violating the constraints, without guaranteeing that they will be fully satisfied [70, 12, 13, 39, 34, 43, 19]. The solution found is locally optimal in the sense that it minimizes the objective function integrating the constraints by penalty terms. Therefore, the method converges to a local optimum and not a global one. For example, the PCK-Means algorithm introduced in [12], allows a flexible satisfaction of the constraints by assigning a cost of violating ML and CL constraints. The cost of each constraint evolves during the algorithm so as to find the expected compromise between intra-cluster distance minimization and the cost of violation. Another solution to find a compromise is to adjust the topology of the space so as to decrease the distance between points in ML constraints and to increase the distance between points in CL constraints [70]. In [13], authors describe the MPCK-Means algorithm that uses the constraints to learn a weighted Euclidean distance indicating the relative importance of each attributes. This metric is updated during the clustering process while minimizing the intra-cluster distance and the cost of violation, which represents an additional step compared to PCK-Means. A significant bias present in their method is related to the order of points assignment. The assignment is determined by the distance to the clusters and the respect of the ML and CL constraints, it thus influences the centers recalculation step and makes it dependent on the assignment order.
In [60], the authors propose to express a set of attribute order constraints by adding a penalty term to the intra-clusters distance, defined by:
It is straightforward to verify that this term is null if all the constraints are satisfied. The authors add also a regularization term in order to avoid solutions where the learned vector of weight
Our approach falls into this family of constrained clustering methods that achieves a soft enforcement via the learning of a metric space, by taking into account user preferences on attributes. Following our previous example in Fig. 1, we propose to use quantitative preferences on attributes as the expert knowledge paired with an adequate penalty term.
Sensitivity to the expert knowledge
Finally, although semi-supervised algorithms improve significantly the performance or the stability over the traditional algorithms they are derived from, they are still very sensitive to the quality of the expert knowledge that is fed as input [65, 55, 45]. As a solution, some early works [26, 65] try to estimate the quality of a constraints set as an information quantity and an agreement between constraints, to avoid any posterior decrease in the performance of the algorithm. In [63, 62], authors introduce a pairwise constraint utility measure that relies on a k-nearest neighbors graph to determine first, how well connected to their respective neighborhood are the points in the constraint to identify outliers or isolated points, and second how well points of the constraint are connected together. This measure is paired with a propagation mechanism to limit the number of queries proposed to experts for annotation while still improving clustering quality. In [55], the authors propose a method for learning a priority order on constraints, which is then used and updated in their algorithm based on COPK-Means of [64]. The objective is to satisfy in the first place the most important constraints according to a criterion of relevance. Other recent works [5] are interested in determining how certain fuzzy clustering methods make it possible to take into account erroneous or uncertain constraints expressed most often, in the form of partial matrices of membership.
To avoid any unintentional decrease of performance, our proposal provides to the expert the possibility to indicate explicitly the level of confidence in her preferences on attributes. Thus, in the case where the expert has little confidence in her preferences, the algorithm will propose a solution that relies more heavily on the data. Conversely, if the expert is very confident in her preferences, the algorithm will propose a solution guided primarily by the expert choices. The resolution of these preferences is done at the level of the objective function, in a soft resolution approach with a metric learning step in a K-Means like algorithm. The use of a soft approach finally has an advantage in terms of ease of extending the method with other forms of constraints. The learning step allows to modify the space to better satisfy the preferences. Thus, this step is done while respecting the structures naturally present in the data.
Problem statement
Our objective is to propose a new semi-supervised clustering algorithm that can handle quantitative user preferences on attributes. To this aim, we introduce a K-means like algorithm that learns the attribute weights that are the best compromise between the weights provided by the user preferences and the attribute weights that would best fit the natural distribution of data and depending on the user confidence on her preferences.
Notations In the following, a dataset
Quantitative user preferences
The originality of our approach is to incorporate user preferences on attributes to construct the right partition. More precisely, we model user preferences with a quantitative model of preferences in which the end-user assigns to each attribute a weight proportional to his/her interest for this attribute in clustering analysis.
More formally, we use a preference vector
For instance, the preferences
The objective function that will guide the clustering algorithm requires comparing preference vectors on the attributes: between the preferences expressed by
The constraints presented in the Eqs (1) and (2) allow us to assimilate the distributions of preferences on the attributes to probability distributions. For probabilistic distributions, a “divergence” measure is usually used to measure dissimilarity, rather than a metric distance (such as Euclidean or Manhattan). A distance requires checking the properties of symmetry and triangular inequality, whereas it is not necessary for divergence.
Therefore, to measure the dissimilarity between two preference vectors, we propose to use the Kullback-Leibler divergence [42] which measures the dissimilarity between two distributions. This is an asymmetric non-negative measure where 0 indicates that the distributions are equals, a value close to 0 means a similar behavior of the distributions.
In our case, the two compared distributions correspond to the learned vector
This divergence is used later in the objective function, to help learning the weights that respect the user preferences. Note that the asymmetry property of this measure has the advantage that it allows to avoid the trivial solution where one attribute gets all the weight. Indeed, for
In the following, we manipulate two reference vectors
Our clustering objective function consists in three terms that are detailed in the following paragraphs.
Intra-cluster distance
First, as with any K-means like algorithm, we want to minimize the intra-cluster distance of the clusters
where
Consequently, we propose to learn a vector
where
Second, we want that the learned vector
Regularization term
Finally, when learning a Mahalanobis based distance, it is agreed to set the volume of clusters to a constant to avoid convergence to a solution where all attributes receive a weight equal to 0, which minimizes the objective function. Another trivial solution, while learning Mahalanobis distance, consists in assigning the maximum weight to the attribute on which the intra-cluster distances are minimal. The resulting statistical optimal solution has no interest in real use cases because of its low expression related to the use of a single attribute.
Consequently, we add a regularization term that prevents the vector
The overall objective function
By combining these three terms, it is possible to define an attribute preferential clustering objective function that expresses a compromise between the three previous terms:
where
At this point, there are two parameters that have a considerable influence on the objective function:
Intra-cluster distance weight Confidence level
Given a set of data points
Our goal is to find a
Reformulation with a Lagrange multiplier
As mentioned in Section 3.1 in Eqs (1) and (2), several constraints apply to our optimization problem: all the preference vectors of
Our proposal is to solve this constrained minimization problem by using the method of Lagrange multiplier as strategy to obtain normalized weights. This is possible because
If
Assuming that
The update of weight
Our algorithm follows the optimization scheme introduced by MPCK-means [13] which consists in three phases after initialization:
Cluster assignment Centroid re-estimation Metric learning
The metric-learning phase is crucial because it takes into account the preference vector and the confidence level. More specifically, for a given dataset
The weights of attributes for
Algorithm 1 describe the three phases of MAPK-means after the initialization step:
Cluster assignment The assignment step is the same as K-means (lines 5–6), with the only difference that the distances between points and centroid are parameterized with a vector Centroid re-estimation Once all points are assigned to a cluster, we update the center of each cluster by calculating the centroid for each attribute Metric learning In this step, MAPK-means (Algorithm 1) learns the right metric by updating the vector
We use a dichotomic search to determine an approximate solution to this equation (line 9) with a maximum error of
(Upper bound).
The upper bound of the Lagrange multiplier
Proof..
We have
(Lower bound).
The lower bound of the Lagrange multiplier
Proof..
The weights on the attributes must be positive, we have
Algorithm 2 describes the dichotomic search which starts from the midpoint of
Let us illustrate MAPK-means with the running example provided in Fig. 1 (page 3).
With the preference vector Interestingly, with the expert C and and preference vector
MAPK-means algorithm inherits good properties from K-means. Indeed, Properties 3 and 4 show respectively that it converges and that its time complexity is reasonable in comparison with the state-of-the-art methods:
(Termination).
MAPK-means converges to a locally optimal solution in a finite number of steps.
Proof..
Cluster assignment and centroid re-estimation decrease the intra-cluster distance without changing preferential and regularization terms. Besides, the metric learning minimizes the objective function
(Complexity).
The time complexity of MAPK-means is
Note that the search interval of the dichotomic search is
Proof..
The time complexity of cluster assignment and centroid re-estimation steps are respectively O(NKD) and O(ND). The time complexity of metric learning mainly relies on a Lagrange multiplier resolution benefiting from a dichotomic search. Its time complexity is O(m D). ∎
The time complexity of MAPK-means is less than the complexity of [60] where the computation of weights optimization is quadratic with
Interestingly, when the user has no preferences, the minimization of Eq. (6) minimizes the objective function of MPCK-means without constraints [13]:
(MCPK-means equivalence).
If the user has no preferences (i.e.,
Proof..
In the case where a single metric is learned, and where no constraints ML or CL are considered, note first that the MPCK-means weight update equation simply becomes
Furthermore, for
The MAPK-Means weight update equation thus becomes:
and it is easy to verify that the sum of weights is normalized when
which is equivalent to the MPCK-Means weight update equation (and having a coefficient close to the normalization). ∎
Experiments on UCI benchmarks
Extensive experiments were conducted to evaluate the performance of our new semi-supervised algorithm MAPK-means. After presenting the protocol and the datasets from UCI benchmarks used in Section 5.1, we present the following experiments:
First, in Section 5.2, we address the problem of the tuning of the parameter Then, in Section 5.3, we study the impact of using preferences on the attribute preferences on the clustering results. In particular, we study the sensibility of our algorithm to the user preferences in order to answer three main questions: (i) is there a positive impact for a relevant user-specified preferences? (ii) can MAPK-means correct a bad choice of preferences? (iii) who better guides the exploration of the solution: user preferences or data? Finally, in Section 5.4, we compare the performance of MAPK-means with the state-of-the-art algorithms. In particular, we observe that our method can be more efficient than [60], while setting the parameters is easier.
In order to evaluate the performance of our approach, we mainly performed experiments on multivariate attributes datasets from UCI repository3
For all datasets, we normalize the attributes using the Min-Max normalization method, which scales each attribute between 0 and 1. This provides an easy way to compare values that are measured using different scales or different units of measure. Furthermore, this does not favor attributes with large scale variation over others. However, this transformation does not give equal contributions of attributes in the clustering process. Indeed, the distributions of the different normalized attributes are not necessarily similar.
Note that in our pre-processing phase of the datasets, no dependency checking is performed. In fact, our approach can be used without any prior knowledge about the datasets. Furthermore, we show experimentally that our approach is able to correct the resulting clustering when irrelevant choices of preferences are given (cf. Section 5.3.2). Finally, for noisy data or missing attributes values, as our approach is K-means based, note that the same strategies of data pre-processing used for K-means algorithm can be applied [67].
Statistics of benchmark datasets
In the following, we evaluate the quality of the clustering built by our algorithm MAPK-means with the ground truth clustering described in the datasets of the UCI repository. More precisely, in order to compare a clustering or partition
where
Preference initialization
In practice, user preferences on attributes would be specified by domain experts. However, for the ease of reproducibility, we propose in our experiments a method to automatically generate preferences by using the ground truth class information. Intuitively, we assume that a descriptive attribute is most probably relevant to build a clustering if it is strongly correlated with the ground truth partition. In order to evaluate the degree of correlation between a descriptive attribute and the class attribute, we propose to use the Fisher’s ratio of the analysis of variance (ANOVA F-test) [52].
More precisely, given a set of clusters
Note that
Based on the Fisher’s ratio, we introduce two preference vectors, namely “relevant” and “irrelevant”, to automatically define user preferences:
the relevant weights vector
Our assumption is that this preference vector will help our algorithm MAPK-means to find better clusterings. conversely, the irrelevant weights vector
Our assumption is that using this preference vector, our algorithm will have to correct this vector to find good clusterings.
Finally, note that we run a large number of tests (
The aim of this first set of experiments is to show that for
Experimental setting For two initialization scenarios of preferences (
Results Using this experimental setting, we present the results of our experiments in Fig. 2. For each dataset and initialization scenarios (
The best average of NMI scores with the corresponding standard deviation for values of The best average of NMI scores with the corresponding standard deviation for values of
These results show that it is almost always possible to obtain good clustering results with
Comparison of the NMI averages of the clusterings obtained with 
Summary To sum up, the results of the experiments presented in this Section 5.2 mainly show that good clusterings can always be achieved with
This experimental section has two main objectives. First, we show that if the user’s preference vector is relevant (
Experimental setting We consider the same protocol presented in Section 5.1 and already used in Section 5.2. However, all the experiments are conducted with the default value of the intra-cluster distance weight, i.e.
Impact of relevant preferences on the quality of clustering
We first present the results obtained with a relevant choice of preferences in Fig. 3. For all the experiments presented in this figure, the preference vector
In order to do this, we consider two different configurations of MAPK-means:
when the confidence of the user in his/her preferences is maximal, i.e. using when the preference vector of the user is not considered, i.e. using
For these two configurations, we present in Fig. 3 the averages of the NMI scores and the corresponding standard deviation for each dataset.
Comparison of the NMI averages obtained using MAPK-means with 
These results demonstrate that when the preference vector of the user is relevant and the confidence in his/her preferences is high (
In fact, there is only one exception with the Wine dataset, which gets the best NMI score when preferences are ignored in contrast to other datasets. This may be due to the fact that for Wine all attributes are equally important (
We note that when the normalized divergence
Finally, these results show a better stability of the NMI (w.r.t to standard deviation values) when the confidence in preferences is high (
Summary A relevant choice of preferences has a positive impact on the quality of clusterings obtained with our algorithm MAPK-means. It also improves the stability of the results.
The main objective of this experiment is to study the ability of MAPK-means to correct the clustering obtained and improve its quality, when the preference vector
Comparison of the NMI averages obtained using MAPK-means with 
First, we can see in Fig. 4 that when the preference vector of the user is irrelevant, then the best clustering quality is always obtained with
The same figure shows that when the preference vector of the user is not relevant, then the quality of the clusterings always increases if the confidence level of the user in his/her preferences is reduced. For most of the datasets, the quality of the clusterings are already significantly improved with
For some datasets, in particular Ionosphere, Waveform and Wdbc, we can see that the quality of the clusterings are only improved for very small values of
Finally, we note a high instability of the results obtained when the confidence level
Summary The results obtained in this section demonstrate that our algorithm MAPK-means is able to correct a bad choice of preferences to obtain a clustering with better quality. The user has only to reduce the confidence
The experiments in Section 5.3.1 show that relevant preferences have a positive impact on the quality of clustering. In contrast, when the preferences are irrelevant (Section 5.3.2), the results show that the clustering driven only by the data has better quality than the clustering built using the irrelevant preferences.
When the user preferences are relevant, our aim is now to determine what is the best between tree alternatives: (i) the solution guided completely by the preferences (i.e. with
In order to do this, we consider the same scenario of preferences as in Section 5.3.1, i.e. with
For this configuration of MAPK-means and for each dataset, we keep the best value of
The results show that for most of the datasets (6 out of 12), the best confidence value is ranging in [0.25, 0.75], which means that the best quality of clustering is obtained with a compromise between the clustering guided by the relevant user preferences (
Summary Our experiments show that in general the best clustering solution is not a completely data-driven solution, neither a fully user-driven solution, even when the preferences are relevant, but it is a solution of compromise. (i.e.
Variation of the best value of
corresponding to the best average of the NMI scores, with
Variation of the best value of
We now compare the performances of MAPK-means with CFP algorithm introduced in [60]. Compared with our approach, note that CFP uses a qualitative model of preferences on attributes rather than a quantitative model using weights on attributes. However, it is important to note that, similarly to our proposal, [60] learns a metric parametrized by an attribute feature weight vector, i.e. the most appropriate weight vector with respect to the dataset and the user preferences.
Experimental setting
For the purpose of this experiment, we replicate the same protocol as [60]: we first compute a weight vector
Results The clustering results on all the datasets are shown in Table 3. This table compares the clustering results in terms of NMI of the algorithms K-means, K-means with a weighted distance (WK-means), CFP, for which we present only the best result obtained in [60] (using different values of their parameters
NMI values of clusterings obtained using K-means, K-means with a weighted distance, CFP [60] and our algorithm MAPK-means
NMI values of clusterings obtained using K-means, K-means with a weighted distance, CFP [60] and our algorithm MAPK-means
As can be seen in Table 3, the best NMI values obtained with CFP and MAPK-means are very similar on Iris and Pgblocks datasets. With the Optdigits dataset, K-means gives the best NMI value; however, our algorithm MAPK-means outperforms CFP. For this dataset, we notice that the divergence
For all other datasets, the quality of the clusterings produced by MAPK-means is better than the quality of the clusterings produced by CFP, MPCK-means without instance constraints (i.e. MAPK-means with
Summary The results show that our algorithm MAPK-means compete with existing approaches in the literature. Indeed, for most of the datasets, MAPK-means shows a better performance on the clustering quality.
As motivated at the beginning of this paper, the initial purpose of our algorithm MAPK-means is to provide a solid foundation for a new interactive tool for clustering marketing related data sets. For this purpose, a new software prototype has been developed from scratch in collaboration with the French company Group Up to help marketing experts in the time consuming task of customers segmentation. Compared to the UCI Benchmark datasets used in the previous section, it is important to emphasize that in the context of our real use case, alternative interesting clusterings exist, according to the specific needs of the experts. It means that different reference clusterings exist for the same dataset and that, expressing preferences is likely to provide distinguishable results in this specific context.
Dataset and reference clusterings
The experiments presented in this section consider a dataset POS of 1,282 Points Of Sales. This dataset represents the quantities of product sold or returned for the top-3 profitable products in the year 2016 of a French company. More precisely, for each product
Reference partitions construction
In the context of our business use case, different categories of marketing experts can be identified: (i) some analysts are particularly interested in the quantities sold and returned for one specific product, (ii) others analysts are only interested in the quantities sold for all products, (iii) finally others analysts prefer to use only the quantities returned for all the products.
Following these categories of experts, we built 5 different reference clusterings that represent typical results that can be exploited by the company. These clusterings are built using MAPK-means with
Table 4 summarizes how those reference partitions were built.
The attributes used to built the reference partitions (1: used, 0: not used)
In the following experiments, these 5 clusterings are used as ground truth clusterings. In order to evaluate if these 5 clusterings are really different or not, we compute the NMI scores between each pair of clusterings. The average value of the NMI scores is equal to 0.283 (
In our experiments, we consider analysts that put a high preference value on their attributes of interest, and lower preference values on the other descriptive attributes. More precisely, we consider preference vectors such that the sum of the preference weights on the the attributes of interests represents 90% of the total sum, whereas the sum of the preference weights on the other attributes represents only 10% of the total sum. The preference vectors that we consider in our experiments are detailed in Table 5.
Analysts’ preference vectors (with strong values on the preferred attributes)
Analysts’ preference vectors (with strong values on the preferred attributes)
For each preference vectors presented in Table 5, we perform 100 runs of MAPK-means taking
Variation of the average of the NMI scores between the obtained partitions and the different ground truth partitions.
We first notice that when
We can also see that when
This use case illustrates the importance of a method that integrates the expert knowledge to identify the appropriate customer segmentation fitting the expert needs. The results demonstrate that our approach allows to guide efficiently the clustering exploration process.
Finally, we note that without preferences, the customer segmentation would be a compromise between several segmentations and may not be representative of a real need, losing the ease of interpretation gained from the expression of preferences. Moreover, it has to be noticed that we intentionally reported an experiment with only 3 products for the sake of simplicity. Of course, any increase in the number of products would result in an increase in the number of attributes combinations and in the number of potential customer segmentations, which emphasizes the interest of our approach.
Conclusion
We propose in this paper a semi-supervised clustering method that allows the user to express preferences on attributes, relying on the learning of a distance metric. We present an extensive literature survey on the semi-supervised clustering, while we focus on the types of constraints and their resolution approaches. We also outline the sensitivity of the methods to the constraints quality. Then, we formulate the user preferences as a simple vector which is taken into account into the objective function, in conjunction with a confidence parameter used to handle the sensitivity problem. This parameter allows the user to express his/her degree of confidence in his/her choices. We demonstrate that this quantitative model of preferences leads to an efficient metric learning step iterated by our algorithm MAPK-means. We also prove that MAPK-means has several interesting properties such as the convergence guarantee and a lower time complexity compared to the state of the art.
Furthermore, the extensive experimental results illustrate the importance and the good performance of our approach on datasets from the UCI benchmark:
We first show the simplicity of parameter setting of MAPK-means, where the parameter We illustrate the positive impact of user preferences on clustering quality and on helping the method finding the right alternative clustering for the user. We also observe that when the confidence level expressed by the end user is not too high ( We demonstrate that the clustering with best quality is generally a compromise between a user-driven solution and a data-driven solution. We finally show that MAPK-means generally performs better than other algorithms of the literature on UCI benchmarks.
Interestingly, the experiment on the use case illustrates perfectly the need for our proposal in the context of Customer Relationship Management where several distinct customer segmentations coexist. Attribute preferences in this case can effectively guide the exploration of alternative segmentations.
Future work will intend to integrate our algorithm MAPK-means into an exploratory system using OLAP operations to navigate between sets of attributes. At first, it would be pertinent to integrate classical operations such as rollup and drilldown so that the user can navigate between different alternative clusterings. More ambitiously, we would like to deduce the preference vector from the operations performed by the user. For now, the learned vector is used in a descriptive way to explain what attributes are significant to construct the K-partition. This vector could also be used in a prescriptive way to recommend the user to explore alternative subspaces. Finally, we think that the use of constraints at the attribute level offers important perspectives to guarantee other relevant properties on the partition. For example, our approach could be extended to ensure that all clusters have the same distribution for a given attribute (for instance, each customer segment has the same proportion of men and women).
Footnotes
Acknowledgments
We would like to thank KALIDEA – Groupe UP and the National Association of Research and Technology (ANRT) for financial support as part of the thesis (2014/0658).
