MAPK-means: A clustering algorithm with quantitative preferences on attributes

Abstract

This paper describes a new semi-supervised clustering algorithm as part of a more general framework of interactive exploratory clustering, that favors the exploration of possible clustering solutions so that an expert tailors the best clustering according to her domain knowledge and preferences. Contrary to most existing approaches, the novel algorithm considers the feature space as a first class citizen for the exploration of alternative solutions. Our proposal represents and integrates quantitative preferences on attributes that will guide the exploration of possible solutions by learning an appropriate space metric. It also achieves a compromise clustering based on expert confidence, between a data-driven and a user-driven solution and converges with a good complexity. We show experimentally that our method is also able to deal with irrelevant user preferences and correct those choices in order to achieve a better solution. Experiments show that the best results may be achieved only with the addition of preferences to traditional metric learning algorithms and that our approach performs better than state-of-the-art algorithms.

1. Introduction

Data clustering is a well-studied domain of unsupervised machine learning that aims at partitioning datasets into homogeneous groups of items sharing similar properties [37, 1]. As such, it is generally used as part of a more general data exploration process where an expert wants to summarize information or to discover categories, iteratively interacting with data to answer a business question. For example, in Customer Relationship Management (CRM), clustering is used to determine meaningful customer profiles from sales records [35, 36, 61].

Traditionally, experts have been able to take advantage of the exploratory nature of clustering algorithms and to compare alternative clustering solutions via the algorithms’ parameters, such as the number or the initial position of cluster centers in k-means [53], the number of nearest neighbors and the size of the neighborhood in DB-SCAN [30], or the aggregation criterion in agglomerative hierarchical approaches [54]. However, in this case, there is no mechanism that can easily help the expert find the parameters that best suit her exploration preferences.

Other works have shown that the space of attributes (features) can be a first class citizen to explore new clustering solutions, either by tailoring the metric space using metric learning [70, 38], or by weighting/selecting the most likely interesting features [21, 48, 44], or finally by producing meaningful subspaces [56, 40, 27]. Note that bi-clustering approaches [33, 16] aim to perform a simultaneous clustering of both the objects and attributes of a data matrix. As a consequence, the attributes are used as the objects to be clustered, and not used to define the exploration space. However, these approaches rely solely on data but are agnostic of the expert and are thus not really interactive, and may not be easy to understand during the process of an exploration. For example, in the case of subspace clustering, experts can be left with several partitions of the same dataset, while increasing simultaneously the cost of interpretation.

Therefore, it is necessary to provide users with more interactive approaches that allow them to efficiently explore appropriate clustering solutions. Research on interactivity for clustering exploration gained much attention at the beginning of the last decade, with the advent of semi-supervised methods [11, 23, 49, 18]. These approaches introduce expert knowledge as constraints into the clustering algorithm. These constraints help guide the clustering process in order to improve the quality and the coherence of the built partition according to expert needs. They can be defined at several levels: the instance level [11, 13, 69, 59, 46, 6, 51, 50], the cluster level [9, 34, 10, 71, 31, 47], or the partition level [8, 24, 58].

This paper illustrates the concept of exploratory clustering that builds upon semi-supervised clustering as a way to favor the exploration of possible clustering solutions, so that an expert tailors the best clustering according to domain knowledge and preferences. Following the IDEA principles [17], exploratory clustering should rely on semi-supervised approaches, as well as visualization and appropriate interaction mechanisms. In this context, we describe a new semi-supervised clustering algorithm that considers the feature space as a support for the exploration of alternative solutions, by integrating quantitative user preferences on attributes.

Figure 1. Customer segmentation using the purchase of products $X$, $Y$ and $Z$.

1.1 Motivation example

Let us illustrate the objective of our work with a toy example of a customer segmentation task. We consider three experts $A$, $B$ and $C$ from a marketing agency who each search for a customer segmentation that corresponds to their specific needs, based on customer purchases of various categories of products named $X$, $Y$ and $Z$ (Fig. 1). The dataset naturally contains four clusters (Fig. 1a), but for business reasons the experts want to build only two customer segments. In this case, multiple partitions exist for this dataset and the exploration of these partitions depends on the attribute space. Now, as the experts have different professional experiences, we consider that their degrees of interest in, or preferences on, the product categories are not the same.

First, consider an expert $A$ who is more interested in purchases of product $X$ than in purchases of products $Y$ and $Z$. In this paper, we use a quantitative model to represent the preferences of $A$. More precisely, we assume that each expert assigns to each descriptive attribute a weight proportional to his/her interest in this attribute for the clustering analysis. Thus, the preferences of expert $A$ will be represented by the preference vector $W_{A}=\langle 0.8,0.1,0.1\rangle$. If this expert wants to build a customer segmentation with two clusters, using a K-means algorithm with a weighted distance, he/she will obtain the clustering result presented in Fig. 1b. From another point of view, an expert $B$ with a preference vector $W_{B}=\langle 0.1,0.1,0.8\rangle$ will obtain the clustering presented in Fig. 1c. It is important to note that the two clusterings obtained by experts $A$ and $B$ are two interesting views of the same dataset. Only the preferences of the experts make it possible to select one over the other.

Consider now an expert $C$ with a preference vector $W_{C}=\langle 0.1,0.6,0.3\rangle$. Using a simple K-means with a weighted distance, this expert will obtain the segmentation given by Fig. 1d, which is not satisfactory. Indeed, expert $C$ formulates a high degree of preference on product $Y$, while this attribute does not separate the set of customers well. In this paper, to avoid such a problem, we propose a new approach that can take into account the confidence level of an expert in his/her preferences. The confidence level $\kappa$ of an expert is represented by a real value in $[0,1]$. Thus, if expert $C$ has a very high confidence in his/her preferences ($\kappa=1.0$), he/she will still obtain the segmentation depicted in Fig. 1d. However, with a lower confidence level in his/her preferences ($\kappa=0.6$), he/she will obtain the segmentation presented in Fig. 1c. Indeed, as the attribute $Y$ does not separate the clusters well, our method will tend to favor the other attributes (especially the attribute $Z$, whose preference is higher than that of $X$).

1.2 Our proposal

In order to tackle the problems and challenges illustrated by our motivating example, we propose in this paper a new semi-supervised clustering algorithm that integrates user preferences on attributes during the construction of clusters. To the best of our knowledge, only the work in [60] addresses the same problem, but with a different model of user preferences. More precisely, the main contributions of this paper are the following:

• We propose to use a quantitative model of preferences to represent the user preferences on attributes. In comparison, the authors of [60] propose a qualitative model where the user has to state, for a pair of attributes, which one is preferable to the other. However, despite this expressivity, the authors underline that their model of preferences can be very difficult to set. By comparison, our quantitative model is easier for an expert to use. Moreover, we show that it guarantees a lower time complexity when we build the clusters, in particular when the user preferences define a total pre-order on the set of descriptive attributes.
• We present the new clustering algorithm MAPK-means (Metric Attribute Preferential K-means) that minimizes our clustering objective function. This objective function combines the intra-cluster distance weighted by a learned metric with quantitative preferences while considering the confidence level. We prove the convergence of this algorithm and we analyze its complexity. Finally, we demonstrate that when no user preferences are used, MAPK-means is similar to MPCK-Means [13] without constraints.
• We provide a large set of experiments on UCI benchmarks to demonstrate the ease of use and the efficiency of our approach. In particular, these experiments show that user preferences on attributes improve the quality of clustering and that our approach is robust against erroneous preferences thanks to the confidence level (as illustrated with expert $C$ in the motivating example). Moreover, our approach achieves a compromise clustering, based on expert confidence, between a data-driven and a user-driven solution. Interestingly, MAPK-means outperforms the state-of-the-art algorithms (CFP [60], K-Means and K-Means with a weighted distance).
• We detail a use case that takes up the scenario of the motivating example on a real dataset and shows in practice how user preferences help to efficiently guide the exploration toward an appropriate customer segmentation. This use case stems from a collaboration with Group Up (https://www.up.coop/en.html), a French company.

Note that we make the complete source code of MAPK-means and the implementations of the experiments available to the community on GitHub (https://github.com/adnan-el-moussawi/MAPK-Means), for reproducibility of the experiments and for ease of comparison.

This paper extends the work presented in [29] that introduced a first version of the algorithm MAPK-means. Compared to this first version, the initialization of the algorithm has been modified to be consistent with user preferences. More importantly, a theoretical analysis of MAPK-means is developed in this paper and the experimental section presents completely new results that provide accurate insights on the behavior of MAPK-means and its robustness to parameter settings. Indeed, the reported results cover more UCI benchmark datasets and our protocol now exploits statistical tests to establish more clearly the significance of the results. Furthermore, the use case is a new and important addition illustrating the practical use of MAPK-means by a company, in the context of customer segmentation. Finally, we note that this paper provides an extensive literature survey on semi-supervised clustering approaches that studies the constraints by their types, considering their level of application, their resolution approaches and the impact of their quality on the clustering.

1.3 Paper organization

The remainder of the paper is organized as follows. In Section 2, we discuss some related works about semi-supervised clustering. Then, we formulate the problem and show in Section 3 how user preferences on attributes can be integrated in a clustering objective function. The new clustering algorithm MAPK-means that minimizes our new clustering objective function is derived, explained and analyzed in Section 4. Then, in Section 5, we evaluate and compare our algorithm to other approaches using several UCI benchmarks. Section 6 reports a use case relying on a real-world dataset. Finally, we conclude the paper and discuss some perspectives in Section 7.

2. Related work

The main objective of our work is to let an expert express preferences about features that might be interesting in the context of data clustering, in the sense that the preferred subspace should allow for a better distinction between the clusters, and should help find a view of the data that is appropriate to answer the business question at hand. This problem can be considered either as a feature selection or weighting process [3, 44] or as a subspace clustering problem [41, 56]. However, both these processes are completely data-driven and neither of them allows for the expression of user preferences on the features of interest. Indeed, feature selection approaches are based on an internal criterion that is designed to improve clustering accuracy or interpretability without any user interaction: once chosen, the features (weights) are kept during the whole analysis, leaving no room for an expert to provide an alternative view of the problem. In this respect, this selection (or weighting) process can be seen as hard constraints and not preferences on the features. Similarly, subspace clustering algorithms conduct a purely data-driven exploration of different subsets of features where clusters are relevant according to a criterion, for example density as in CLIQUE [2]. As a consequence, and even if there exist several alternative feature subsets to choose from, the expert is left with no proper way to select the one that best fits her expert knowledge.

In the rest of this section, we present the different types of user constraints existing in the literature. These are mainly distinguished by their level of application: on instances, on clusters, on the partition or on attributes. We then explain how these constraints are resolved in the clustering and we provide a short overview of the main families of constraint resolution approaches. We finally discuss the sensitivity of the clustering method to the quality of the expert knowledge, and how this issue is treated in existing work and in our proposed algorithm.

2.1 Expressing expert knowledge as constraints

The solution to our problem can be expressed as a semi-supervised clustering algorithm that improves the relevance of a k-means like clustering algorithm based on expert knowledge represented as user preferences on the features. As first introduced in [66], expert knowledge is generally provided at the instance level as labels [11], also called seeds, as in supervised classification, or as pairwise instance constraints, sometimes also called equivalence constraints [13]. These pairwise constraints indicate that two objects should be in the same cluster (Must-Link or ML constraints) or not (Cannot-Link or CL constraints). Recently, a new form of instance-level constraints, namely relative distance constraints, was introduced in [43, 50, 57], but they are better suited to ranking and instance order preferences.

Numerous clustering approaches like k-means or DB-SCAN have been improved to benefit from side information provided as instances constraints [23, 11, 18, 69, 59]. For example, SS-DBSCAN [46] improves the initial DBSCAN algorithm with an automatic setting of the density parameters based on labels, while in [11] the authors use constraints to define the initial centers of K-means to avoid the convergence to an inappropriate solution. In this case, the difficulty of exploring the initial parameter settings space is replaced by a prior knowledge of the user on the data she wants to partition, which may be in some cases, easier to formulate.

More recent works express cluster level constraints to avoid contradictions at the instance level [15, 28, 71, 31]. In [15, 14], the authors propose to use constraints on the cluster sizes, in addition to pairwise instance constraints, to avoid ill-formed k-means outputs with fewer than k clusters. The authors of [25] have introduced cluster compactness and separation constraints to improve the quality of the built clusters or to help satisfy other types of constraints such as ML and CL.

Other algorithms express constraints on the whole partition of a dataset. Some works build an equi-sized partition, with the objective of balancing the size of clusters [9, 10, 39]. Other proposals find an alternative partition that is as dissimilar as possible from a reference partition, but with a comparable or better quality [8, 24, 58, 19].

To the best of our knowledge, only very few works have explored the semi-supervised clustering problem with hard constraints or preferences on the features rather than on the instances. An alternative approach to our proposal, described in [60, 68], consists in expressing qualitative attribute preferences by means of a triple $(A_{j},A_{k},\delta)$ which indicates that attribute $A_{j}$ is preferred over $A_{k}$ with a degree $\delta>0$. Note that if a user wishes to indicate that two attributes have approximately equal importance, then the user has to specify a pair of constraints $(A_{j},A_{k},-\epsilon)$ and $(A_{k},A_{j},-\epsilon)$ where $\epsilon$ is a small positive number. However, when a total pre-order is expected over the set of all attributes, as in our example, the expert has to set a number of preferences that is quadratic in the number of attributes: 6 constraints are needed even in the example of Fig. 1, where there are only 3 attributes. This work assumes that the distance used to build the clusters is also part of the clustering process. Given the Euclidean distance $d_{W}$ parametrized with a vector of weights $W=\langle w_{1},\ldots,w_{D}\rangle$, such that $d_{W}(x,y)=\sqrt{\sum_{i=1}^{D}w_{i}(x[i]-y[i])^{2}}$, the constraint $(A_{j},A_{k},\delta)$ is satisfied if $(w_{j}-w_{k})>\delta$. For attributes with the same importance, the pair of constraints is satisfied if $|w_{j}-w_{k}|\leqslant\epsilon$.
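As a minimal sketch of the definitions recalled in this paragraph, the weighted distance $d_{W}$ and the two satisfaction tests can be written in Python as follows. The function names are hypothetical and are introduced here for illustration only; they are not taken from [60, 68].

```python
import math

def weighted_distance(x, y, w):
    """Weighted Euclidean distance d_W(x, y), one weight per attribute."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)))

def preference_satisfied(w, j, k, delta):
    """Constraint (A_j, A_k, delta): satisfied when w_j - w_k > delta."""
    return (w[j] - w[k]) > delta

def equal_importance_satisfied(w, j, k, eps):
    """Pair of constraints for equal importance: |w_j - w_k| <= eps."""
    return abs(w[j] - w[k]) <= eps

# Illustrative weights (same values as the preference vector of expert A).
w = [0.8, 0.1, 0.1]
d = weighted_distance([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], w)
first_over_second = preference_satisfied(w, 0, 1, 0.5)
second_equals_third = equal_importance_satisfied(w, 1, 2, 0.01)
```

With these weights, a difference on the first attribute dominates the distance, while the second and third attributes are treated as equally important.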

Contrary to this approach, our model of quantitative preferences only requires a linear number of preferences, i.e. 1 per attribute. Moreover, in addition to simplifying the interaction with the user, our model leads to a better complexity in learning the metric that corresponds to the best subset of features.

2.2 Constraints resolution in clustering

Semi-supervised clustering methods can be categorized depending on the approach used to resolve the constraints, that is, on how the constraints are expressed and resolved in the method. Hence, the existing works in the literature can be divided into three main families of resolution methods: ad hoc, exact and soft. We note that the literature sometimes distinguishes semi-supervised methods depending on the satisfaction strategy used: hard constraints, where all constraints must be satisfied, and soft constraints, where constraint violation is permitted [32]. This distinction between hard and soft constraints is made independently of any particular resolution method.

2.2.1 Ad hoc approaches

In this family of approaches, the constraints are expressed outside the objective function of the clustering method used. The resolution is done by extending an existing algorithm, like K-Means, and modifying or adding some instructions to ensure the satisfaction of the constraints [64, 11, 8, 14, 31]. The COPK-means algorithm [64] derives from k-means and uses ML constraints to define the initial cluster centers, while CL constraints are injected during the assignment step to avoid any grouping of objects that should not belong to the same cluster. The authors enforce hard constraint satisfaction, but without any guarantee on the final solution because of contradictions that may be observed between constraints. In [11], the authors propose two variants of K-Means that consider seed constraints. In the first, the seeds are used only to compute the initial centers; in the second, the authors add an instruction to the convergence phase of K-Means in order to ensure that the seeds remain in the same clusters. The authors of [15, 14] modify the assignment step, in a K-Means like approach, to consider minimum cluster size constraints. To this end, they use linear programming. The algorithm presented in [31] enforces constraints on the maximum size of clusters during the assignment step. More specifically, a data instance is assigned to the closest cluster satisfying the size constraint.
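To make the last rule concrete, the following Python sketch assigns each instance to its closest cluster that still has room. It only illustrates the rule as stated in this paragraph, under our own assumptions: the function names, the plain squared Euclidean distance and the processing of instances in input order are hypothetical choices, and the code is not the implementation of [31].

```python
def sq_dist(x, c):
    """Squared Euclidean distance between a point and a center."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def assign_with_max_size(points, centers, max_size):
    """Assign each point to the closest center whose cluster is not full."""
    sizes = [0] * len(centers)
    labels = []
    for x in points:
        # Candidate clusters, from the closest center to the farthest one.
        order = sorted(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
        for j in order:
            if sizes[j] < max_size:
                labels.append(j)
                sizes[j] += 1
                break
        else:
            raise ValueError("no cluster has room left for this point")
    return labels

points = [[0.0], [0.1], [0.2], [5.0]]
centers = [[0.0], [5.0]]
labels = assign_with_max_size(points, centers, max_size=2)
```

In this toy run, the third point is closest to the first center but is sent to the second cluster because the first one is already full. The result therefore depends on the order in which the instances are processed.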

2.2.2 Exact approaches

The goal of this family of approaches is to find clustering solutions that satisfy all the constraints specified by an expert. To this aim, existing methods use Constraint Programming (CP) techniques that separate the modeling of the problem from its resolution. If a solution exists, this solution is also optimal with respect to the objective function introduced. In the clustering domain, to our knowledge, the authors of [20] are the first to use the CP paradigm to solve semi-supervised clustering problems. Their method deals with the problem of expressing constraints at the cluster level together with feature constraints. This method has the nice property of producing an optimal solution if one can be found, but at the price of NP-complete complexity. In this respect, the more constrained the problem is, the more efficient the algorithm is. However, the proposed solution is limited to hard constraints, which do not match our objective of using soft constraints to represent expert preferences. Nevertheless, it would be possible to extend this approach with soft constraints by prioritizing them, following the idea of [55].

2.2.3 Soft approaches

Approaches of this family share the general principle of transforming the constraints specified by the user into a set of penalties added to the objective function to minimize (the function used to find a clustering solution regardless of any constraints). In most approaches, these penalty terms should be convex. The goal of the methods in this class is thus to minimize the cost of violating the constraints, without guaranteeing that they will be fully satisfied [70, 12, 13, 39, 34, 43, 19]. The solution found is locally optimal in the sense that it minimizes the objective function integrating the constraints through penalty terms. Therefore, the method converges to a local optimum and not a global one. For example, the PCK-Means algorithm introduced in [12] allows a flexible satisfaction of the constraints by assigning a cost to the violation of ML and CL constraints. The cost of each constraint evolves during the algorithm so as to find the expected compromise between intra-cluster distance minimization and the cost of violation. Another solution to find a compromise is to adjust the topology of the space so as to decrease the distance between points in ML constraints and to increase the distance between points in CL constraints [70]. In [13], the authors describe the MPCK-Means algorithm that uses the constraints to learn a weighted Euclidean distance indicating the relative importance of each attribute. This metric is updated during the clustering process while minimizing the intra-cluster distance and the cost of violation, which represents an additional step compared to PCK-Means. A significant bias present in their method is related to the order in which points are assigned. The assignment is determined by the distance to the clusters and the respect of the ML and CL constraints; it thus influences the center recalculation step and makes it dependent on the assignment order.

In [60], the authors propose to express a set of attribute order constraints by adding a penalty term to the intra-clusters distance, defined by:

$\displaystyle\sum_{(A_{j},A_{k},\delta)}\max\Big(0,\delta-(w_{j}-w_{k})\Big)$

It is straightforward to verify that this term is null if all the constraints are satisfied. The authors also add a regularization term in order to avoid solutions where the learned vector of weights $W$ is too far from the uniform distribution. They then propose the algorithm CFP, based on an alternating optimization scheme, as in MPCK-Means [13].
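The behavior of this penalty term can be checked numerically with the short Python sketch below. It only evaluates the sum given above; the function name and the encoding of each constraint as a tuple `(j, k, delta)` of attribute indices and degree are assumptions made for illustration, and the regularization term of [60] is not included.

```python
def order_penalty(w, constraints):
    """Sum of max(0, delta - (w_j - w_k)) over the constraints (j, k, delta)."""
    return sum(max(0.0, delta - (w[j] - w[k])) for j, k, delta in constraints)

w = [0.6, 0.3, 0.1]

# Both constraints hold for w, so every term of the sum is clipped to 0.
satisfied = [(0, 1, 0.2), (1, 2, 0.1)]
penalty_ok = order_penalty(w, satisfied)

# Third attribute preferred over the first one: violated by w.
violated = [(2, 0, 0.1)]
penalty_bad = order_penalty(w, violated)
```

The penalty grows linearly with the amount by which a constraint is violated, which is what allows a soft trade-off with the intra-cluster distance.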

Our approach falls into this family of constrained clustering methods that achieves a soft enforcement via the learning of a metric space, by taking into account user preferences on attributes. Following our previous example in Fig. 1, we propose to use quantitative preferences on attributes as the expert knowledge paired with an adequate penalty term.

2.3 Sensitivity to the expert knowledge

Finally, although semi-supervised algorithms significantly improve the performance or the stability of the traditional algorithms they are derived from, they are still very sensitive to the quality of the expert knowledge that is fed as input [65, 55, 45]. As a solution, some early works [26, 65] try to estimate the quality of a constraint set in terms of information quantity and agreement between constraints, to avoid any subsequent decrease in the performance of the algorithm. In [63, 62], the authors introduce a pairwise constraint utility measure that relies on a k-nearest neighbors graph to determine, first, how well the points in the constraint are connected to their respective neighborhoods, in order to identify outliers or isolated points, and second, how well the points of the constraint are connected together. This measure is paired with a propagation mechanism to limit the number of queries proposed to experts for annotation while still improving clustering quality. In [55], the authors propose a method for learning a priority order on constraints, which is then used and updated in their algorithm based on COPK-Means [64]. The objective is to satisfy first the most important constraints according to a relevance criterion. Other recent works [5] study how certain fuzzy clustering methods can take into account erroneous or uncertain constraints, most often expressed in the form of partial membership matrices.

To avoid any unintentional decrease of performance, our proposal gives the expert the possibility to indicate explicitly the level of confidence in her preferences on attributes. Thus, in the case where the expert has little confidence in her preferences, the algorithm will propose a solution that relies more heavily on the data. Conversely, if the expert is very confident in her preferences, the algorithm will propose a solution guided primarily by the expert's choices. The resolution of these preferences is done at the level of the objective function, in a soft resolution approach with a metric learning step in a K-Means like algorithm. The use of a soft approach also has the advantage of making it easy to extend the method with other forms of constraints. The learning step makes it possible to modify the space to better satisfy the preferences, while respecting the structures naturally present in the data.

3. Problem statement

Our objective is to propose a new semi-supervised clustering algorithm that can handle quantitative user preferences on attributes. To this aim, we introduce a K-means like algorithm that learns attribute weights achieving the best compromise between the weights provided by the user preferences and the attribute weights that would best fit the natural distribution of the data. This compromise depends on the confidence of the user in her preferences.

Notations. In the following, a dataset ${\cal X}$ is a set of $N$ data objects described by $D$ attributes. Clustering analysis aims at finding a partition of $K$ clusters, denoted by $\{{\cal X}_{j}\}_{j=1}^{K}$. The centroid of cluster ${\cal X}_{j}$ is denoted by $c_{j}$. Given a data object $x\in{\cal X}$, $x[i]$ returns the value of the $i$th attribute.

3.1 Quantitative user preferences

The originality of our approach is to incorporate user preferences on attributes to construct the right partition. More precisely, we model user preferences with a quantitative model of preferences in which the end-user assigns to each attribute a weight proportional to his/her interest for this attribute in clustering analysis.

More formally, we use a preference vector $\textbf{W}^{*}$ to model preferences where each weight $w^{*}_{i}$ represents the weight expressed by the user on the $i$ th attribute. Without loss of generality, we consider that all weights $w^{*}_{i}$ are positive or zero and that the overall weight of user preferences needs to be normalized:

$\displaystyle w^{*}_{i}\geqslant 0,\quad\forall i\in\{1,\ldots,D\}$ (1)

$\displaystyle\sum_{i=1}^{D}w^{*}_{i}=1$ (2)

For instance, the preferences $\textbf{W}^{*}_{A}=\langle 0.8,0.1,0.1\rangle$ mean that the user thinks that the first attribute is much more important than the other two. The set of all preference vectors is denoted by ${\cal P}$. Note that in a real application, the user can define only a part of these preferences, and the remaining part of the preference vector can be calculated automatically. For example, if the user gives a weight $w^{*}_{i}$ to the $i$th attribute, we distribute the rest of the weight, i.e. $(1-w^{*}_{i})$, equally over the other attributes in order to respect the constraint $\sum_{i=1}^{D}w^{*}_{i}=1$.
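The automatic completion described above can be sketched in Python as follows. The function name and the dictionary encoding of the partial preferences are hypothetical. The paragraph only describes the case of a single user-given weight; handling several given weights by spreading the remaining mass equally is our assumption.

```python
def complete_preferences(partial, n_attributes):
    """Complete a partial preference vector so that Eqs (1) and (2) hold.

    partial maps an attribute index to the weight chosen by the user.
    The remaining mass is shared equally among the unspecified attributes.
    """
    given = sum(partial.values())
    if any(v < 0 for v in partial.values()) or given > 1:
        raise ValueError("weights must be non-negative and sum to at most 1")
    missing = [i for i in range(n_attributes) if i not in partial]
    if not missing:
        if abs(given - 1.0) > 1e-9:
            raise ValueError("a fully specified vector must sum to 1")
        return [partial[i] for i in range(n_attributes)]
    share = (1.0 - given) / len(missing)
    return [partial.get(i, share) for i in range(n_attributes)]

# The user only states a weight of 0.8 on the first of three attributes.
w_star = complete_preferences({0: 0.8}, 3)
```

For three attributes and a single weight of 0.8, the two remaining attributes each receive half of the remaining 0.2, which corresponds to the vector $\textbf{W}^{*}_{A}$ of the example.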

3.1.1 Comparing preferences

The objective function that will guide the clustering algorithm requires comparing preference vectors on the attributes: between the preferences expressed by $\textbf{W}^{*}$ on the one hand and the weights currently used by the clustering algorithm on the other hand. It is then necessary to measure the dissimilarity between two preference vectors of ${\cal P}$ .

The constraints presented in Eqs (1) and (2) allow us to treat the distributions of preferences on the attributes as probability distributions. For probability distributions, a “divergence” measure is usually used to measure dissimilarity, rather than a metric distance (such as Euclidean or Manhattan). A distance must satisfy the properties of symmetry and triangle inequality, whereas this is not necessary for a divergence.

Therefore, to measure the dissimilarity between two preference vectors, we propose to use the Kullback-Leibler divergence [42], which measures the dissimilarity between two distributions. This is an asymmetric non-negative measure where 0 indicates that the distributions are equal, and a value close to 0 means that the distributions behave similarly.

In our case, the two compared distributions correspond to the learned vector $\textbf{W}=\langle w_{1},\dots,w_{D}\rangle\in{\cal P}$ and a reference vector ${\bf P}=\langle p_{1},\ldots,p_{D}\rangle\in{\cal P}$. The Kullback-Leibler divergence is expressed by Eq. (3).

$\displaystyle D_{\textit{KL}}({\bf P}||\textbf{W})=\sum_{i=1}^{D}p_{i}\log\left(\frac{p_{i}}{w_{i}}\right)$ (3)

This divergence is used later in the objective function to help learn the weights that respect the user preferences. Note that the asymmetry of this measure has the advantage of avoiding the trivial solution where one attribute gets all the weight. Indeed, for $D_{\textit{KL}}(\textbf{W}||\textbf{P})$, the optimal solution of minimizing this measure is to set a weight equal to 1 to the attribute corresponding to $\max_{i\in[1..D]}(p_{i})$ and weights equal to 0 to the other attributes. Furthermore, for $\textbf{W}^{*}=\textbf{U}$, we can obtain an objective function similar to a basic K-means with metric learning. We note that $D_{\textit{KL}}(.||.)$ is unbounded and strictly positive for $\textbf{W}^{*}\neq\textbf{U}$, which may give it more importance in the clustering objective function relative to the other terms. To solve this problem, we define a term ${\cal Z}$ that normalizes this divergence (cf. Section 3.2) with respect to the other terms used in the clustering objective function (Section 4.2.1).
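The properties of Eq. (3) discussed above (asymmetry, a value of 0 for identical vectors, and unboundedness) can be observed with the following Python sketch. The function name is hypothetical, and the treatment of zero entries follows the usual convention for this divergence: a term with $p_{i}=0$ contributes 0, and the divergence is infinite as soon as $w_{i}=0$ while $p_{i}>0$.

```python
import math

def kl_divergence(p, w):
    """Kullback-Leibler divergence D_KL(p || w) of Eq. (3), natural logarithm."""
    total = 0.0
    for pi, wi in zip(p, w):
        if pi == 0.0:
            continue  # a term with p_i = 0 contributes 0 by convention
        if wi == 0.0:
            return math.inf  # p puts mass where w puts none
        total += pi * math.log(pi / wi)
    return total

w_star = [0.8, 0.1, 0.1]          # user preferences of the running example
uniform = [1.0 / 3.0] * 3         # uniform reference vector

d_forward = kl_divergence(w_star, uniform)
d_backward = kl_divergence(uniform, w_star)
d_one_hot = kl_divergence(w_star, [1.0, 0.0, 0.0])
```

The two directions give different values, and the divergence from the preferences to a vector that puts all the weight on one attribute is infinite, so such a vector can never minimize a penalty of the form $D_{\textit{KL}}({\bf P}||\textbf{W})$ when ${\bf P}$ has several non-zero weights.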

In the following, we manipulate two reference vectors ${\bf P}$ to express our objective function: the user preferences $\textbf{W}^{*}$ and the uniform vector $\textbf{U}=\langle 1/D,\ldots,1/D\rangle$ . The latter gives an equal weight to each attribute.

3.2 Attribute preferential clustering objective function

Our clustering objective function consists of three terms that are detailed in the following paragraphs.

3.2.1 Intra-cluster distance

First, as with any K-means-like algorithm, we want to minimize the intra-cluster distance of the clusters $\{{\cal X}_{j}\}_{j=1}^{K}$ . A naive solution could be to directly use the preference vector $\textbf{W}^{*}$ to parameterize the Euclidean distance as follows:

$\displaystyle||x-c_{j}||_{\textbf{W}^{*}}=\sqrt{\sum_{i=1}^{D}w^{*}_{i}\times(x[i]-c_{j}[i])^{2}}$ (4)

where $x$ and $c_{j}$ are respectively a data object and the centroid of the $j$ th cluster. Unfortunately, in this case our solution would only rely on the user expertise and would not take into account the natural distribution of the data. As a side effect, we could output a poor clustering if the user preference vector does not sufficiently discriminate between the data objects (see Fig. 1d as a typical example).

Consequently, we propose to learn a vector $\textbf{W}\in{\cal P}$ that performs a projection of the initial data space so that the clusters $\{{\cal X}_{j}\}_{j=1}^{K}$ are more compact and well separated in the new space. Thus, we want to minimize the following intra-cluster distance:

$\displaystyle\sum_{j=1}^{K}\sum_{x\in{\cal X}_{j}}||x-c_{j}||^{2}_{\textbf{W}}$ (5)

where $\textbf{W}$ represents the weights on the different attributes used for clustering. This vector $\textbf{W}$ is learned according to the user-defined preferences $\textbf{W}^{*}$ , but also according to the data.
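The weighted distance of Eq. (4) and the intra-cluster distance of Eq. (5) can be sketched as follows. The function names and the toy data are assumptions made for illustration only.

```python
def weighted_sq_dist(x, c, w):
    """Squared weighted Euclidean distance ||x - c||^2_W (square of Eq. (4))."""
    return sum(wi * (xi - ci) ** 2 for wi, xi, ci in zip(w, x, c))

def intra_cluster_distance(clusters, centroids, w):
    """Eq. (5): sum over clusters j and points x in X_j of ||x - c_j||^2_W."""
    return sum(
        weighted_sq_dist(x, c, w)
        for points, c in zip(clusters, centroids)
        for x in points
    )

# Toy data (hypothetical): two clusters in D = 2 dimensions whose points
# only vary along the second attribute.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (4.0, 2.0)]]
centroids = [(0.0, 1.0), (4.0, 1.0)]

uniform = intra_cluster_distance(clusters, centroids, (0.5, 0.5))
first_attr = intra_cluster_distance(clusters, centroids, (1.0, 0.0))
print(uniform, first_attr)
```

On this toy data, putting all the weight on the attribute with no intra-cluster spread drives Eq. (5) to 0, which illustrates why this term cannot be minimized alone when the weights are learned.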

3.2.2 Deviation from attribute preferences

Second, we want the learned vector $\textbf{W}$ to deviate as little as possible from $\textbf{W}^{*}$ in order to respect user preferences. Thus, it is necessary to introduce a penalty term to reduce the dissimilarity between $\textbf{W}$ and $\textbf{W}^{*}$ . Using the Kullback-Leibler divergence, we want to minimize: $D_{\textit{KL}}(\textbf{W}^{*}||\textbf{W})$ .

3.2.3 Regularization term

Finally, when learning a Mahalanobis-based distance, it is customary to fix the volume of the clusters to a constant in order to avoid convergence to a solution where all attributes receive a weight equal to 0, which minimizes the objective function. Another trivial solution, while learning a Mahalanobis distance, consists in assigning the maximum weight to the attribute on which the intra-cluster distances are minimal. The resulting statistically optimal solution is of no interest in real use cases because of its low expressiveness, as it relies on a single attribute.

Consequently, we add a regularization term that prevents the vector $\textbf{W}$ from deviating too much from a traditional K-means where all attributes have equal weights. This idea can be formulated as the divergence between the vector to learn and a uniform vector $\textbf{U}=\langle 1/D,\ldots,1/D\rangle$ : $D_{\textit{KL}}(\textbf{U}||\textbf{W})$ .

3.2.4 The overall objective function

By combining these three terms, it is possible to define an attribute preferential clustering objective function that expresses a compromise between the three previous terms:

$\displaystyle{\cal I}_{\textit{map}}=\alpha\Big{(}{\cal Z}\sum_{j=1}^{K}\sum_{x\in{\cal X}_{j}}||x-c_{j}||^{2}_{\textbf{W}}\Big{)}+(1-\alpha)\Big{(}\kappa D_{\textit{KL}}(\textbf{W}^{*}||\textbf{W})+(1-\kappa)D_{\textit{KL}}(\textbf{U}||\textbf{W})\Big{)}$ (6)

where ${\cal Z}$ is a normalizing constant greater than 0 and the parameters $\alpha$ and $\kappa$ lie between 0 and 1. These parameters weight the relative importance of the different terms of the objective function (see below). The constant ${\cal Z}$ balances the intra-cluster distance against the other terms of the objective function. Indeed, the parameterized Euclidean distance and the Kullback-Leibler divergence have significantly different ranges. Section 4 will discuss how to set this constant such that a median value of $\alpha$ corresponds to a default value guaranteeing a trade-off between the two parts.

At this point, there are two parameters that have a considerable influence on the objective function:

•

Intra-cluster distance weight $\alpha$ : This parameter controls the importance of the data relative to that of the user preferences. It is clearly difficult for the end-user to set this technical parameter. Fortunately, we will see in the experimental section that it is possible to set it to an appropriate default value without degrading performance. Setting $\alpha=1$ removes the preference and regularization terms, as in a traditional K-means. Note, however, that if $\alpha=1$ , all weights are equal to zero because there is no more regularization.

•

Confidence level $\kappa$ : The user-specified parameter $\kappa$ corresponds to the user's level of confidence in her choices. It gives the importance of her preferences compared to traditional data-driven metric learning, where all dimensions have equal weights. When $\kappa=1$ , the regularization term is not used: the user forces the method to meet her preferences. In this case, minimizing ${\cal I}_{\textit{map}}$ learns the vector $\textbf{W}$ by taking into account only the user preferences and the intra-cluster distances. In contrast, when $\kappa=0$ , user preferences are ignored and minimizing Eq. (6) is equivalent to minimizing the objective function of MPCK-means without constraints [13], i.e. the metric learning is only data-driven (cf. Section 4.3, Property 5).

Given a set of data points ${\cal X}$ , a number of clusters $K\geqslant 1$ , a preference vector $\textbf{W}^{*}\in{\cal P}$ , $\alpha\in[0,1]$ and $\kappa\in[0,1]$ , our goal is to find a partition of the data into $K$ clusters $\zeta=\{{\cal X}_{j}\}_{j=1}^{K}$ that minimizes the objective function ${\cal I}_{\textit{map}}$ while learning a vector $\textbf{W}\in{\cal P}$ .
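As a rough illustration of how the three terms of Eq. (6) combine, the sketch below evaluates the objective for a fixed partition. The function names, the argument layout and the toy values (including ${\cal Z}=1$ ) are assumptions chosen for the example.

```python
import math

def kl(p, w):
    """Kullback-Leibler divergence D_KL(P || W) of Eq. (3)."""
    return sum(pi * math.log(pi / wi) for pi, wi in zip(p, w) if pi > 0)

def objective(clusters, centroids, w, w_star, alpha, kappa, z):
    """Eq. (6) for a given partition, centroids and weight vector w."""
    d = len(w)
    uniform = [1.0 / d] * d
    # Intra-cluster distance of Eq. (5), weighted attribute by attribute.
    intra = sum(
        wi * (xi - ci) ** 2
        for points, c in zip(clusters, centroids)
        for x in points
        for wi, xi, ci in zip(w, x, c)
    )
    # Preference deviation term and regularization term.
    penalty = kappa * kl(w_star, w) + (1 - kappa) * kl(uniform, w)
    return alpha * z * intra + (1 - alpha) * penalty

# Toy configuration (hypothetical values).
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (4.0, 2.0)]]
centroids = [(0.0, 1.0), (4.0, 1.0)]
w_star = [0.8, 0.2]

user_only = objective(clusters, centroids, w_star, w_star, 0.0, 1.0, 1.0)
data_only = objective(clusters, centroids, w_star, w_star, 1.0, 1.0, 1.0)
compromise = objective(clusters, centroids, w_star, w_star, 0.5, 0.5, 1.0)
print(user_only, data_only, compromise)
```

With $\alpha=0$ and $\kappa=1$ , the objective vanishes when the learned vector equals the preferences; with $\alpha=1$ , only the normalized intra-cluster distance remains.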

4. MAPK-means algorithm

Our goal is to find a $K$ -partition that minimizes our objective function while learning the metric. For this purpose, we use the method of Lagrange multipliers (Section 4.1) to add a metric learning step to the K-means algorithm (Section 4.2). Our algorithm efficiently converges to a locally optimal solution (Section 4.3).

4.1 Reformulation with a Lagrange multiplier

As mentioned in Section 3.1 with Eqs (1) and (2), several constraints apply to our optimization problem: all the preference vectors of ${\cal P}$ are such that each weight is positive and the sum of the weights equals 1. In particular, our minimization problem has to take into account that the learned vector $\textbf{W}$ in the objective function ${\cal I}_{\textit{map}}$ satisfies these constraints:

$\displaystyle\min_{\textbf{W}}{\cal I}_{\textit{map}}\quad\text{subject to}\quad\sum^{D}_{i=1}w_{i}-1=0;\;w_{i}>0\textrm{ for all }i\in\{1,\ldots,D\}$ (7)

Our proposal is to solve this constrained minimization problem by using the method of Lagrange multipliers to obtain normalized weights. This is possible because ${\cal I}_{\textit{map}}$ and $\sum^{D}_{i=1}w_{i}-1$ have continuous first partial derivatives. We introduce a Lagrange multiplier $\lambda$ and consider the following function:

$\displaystyle{\cal I}^{\prime}_{\textit{map}}={\cal I}_{\textit{map}}+\lambda\Big{(}\sum^{D}_{i=1}w_{i}-1\Big{)}$ (8)

If $\textbf{W}$ minimizes ${\cal I}_{\textit{map}}$ while satisfying the constraints of Eq. (7), then there exists a value of $\lambda$ such that $\textbf{W}$ is a stationary point of ${\cal I}^{\prime}_{\textit{map}}$ , i.e. a point where all the partial derivatives of ${\cal I}^{\prime}_{\textit{map}}$ are zero:

$\displaystyle\frac{\partial{\cal I}^{\prime}_{\textit{map}}}{\partial w_{i}}=\alpha{\cal Z}\overbrace{\sum_{j=1}^{K}\sum_{x\in{\cal X}_{j}}||{x[i]-c_{j}[i]}||^{2}}^{S_{i}}-(1-\alpha)\Big{(}\kappa\frac{w^{*}_{i}}{w_{i}}+(1-\kappa)\frac{1}{Dw_{i}}\Big{)}+\lambda=0$ (9)

Here, $S_{i}=\sum_{j=1}^{K}\sum_{x\in{\cal X}_{j}}||{x[i]-c_{j}[i]}||^{2}$ denotes the total intra-cluster distance on the $i$ -th attribute. Solving the above equation for $w_{i}$ gives the weight update:

$\displaystyle w_{i}=\frac{(1-\alpha)\left(\kappa w^{*}_{i}+(1-\kappa)/D\right)}{\alpha{\cal Z}S_{i}+\lambda}$ (10)

The update of the weight $w_{i}$ is central to our algorithm, described in the next section, for learning the metric. It is easy to see that the lower the variance $S_{i}$ , the higher the weight $w_{i}$ of the attribute. Moreover, when $\kappa$ is set to 1, only the preferences are used. Conversely, when $\kappa$ is zero, user preferences are not considered.
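The update of Eq. (10) can be sketched as below for a given value of $\lambda$ . The function name `update_weights` and the numerical values are illustrative assumptions; $\lambda$ is simply passed in here, whereas the algorithm searches for the value that normalizes the weights.

```python
def update_weights(s, w_star, alpha, kappa, z, lam):
    """Eq. (10): one weight per attribute, for a given multiplier lam.

    s[i] is the total intra-cluster distance S_i on attribute i.
    """
    d = len(s)
    return [
        (1 - alpha) * (kappa * ws + (1 - kappa) / d) / (alpha * z * si + lam)
        for si, ws in zip(s, w_star)
    ]

# Hypothetical example: equal preferences, but the first attribute has a
# lower intra-cluster spread than the second one.
w = update_weights(s=[1.0, 4.0], w_star=[0.5, 0.5],
                   alpha=0.5, kappa=1.0, z=1.0, lam=0.0)
print(w)  # the attribute with the lower S_i receives the higher weight
```

The weights returned here do not sum to 1 because $\lambda$ was fixed arbitrarily; normalization is obtained by solving for $\lambda$ .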

4.2 Algorithm derivation

Our algorithm follows the optimization scheme introduced by MPCK-means [13], which consists of three phases after initialization:

1.
Cluster assignment
2.
Centroid re-estimation
3.
Metric learning

The metric-learning phase is crucial because it takes into account the preference vector and the confidence level. More specifically, for a given dataset ${\cal X}$ , a number of clusters $K\geqslant 1$ , a preference vector $\textbf{W}^{*}\in{\cal P}$ , a confidence level $\kappa\in[0,1]$ and an intra-cluster distance weight $\alpha\in[0,1]$ , the algorithm MAPK-means (Metric Attribute Preferential K-means), given in Algorithm 1, returns a $K$ -partition $\{{\cal X}_{j}\}_{j=1}^{K}$ minimizing the objective function ${\cal I}_{\textit{map}}$ by learning a vector $\textbf{W}$ .

Algorithm 1: MAPK-means
input a dataset ${\cal X}$ , a number of clusters $K$ , a preference vector $\textbf{W}^{*}$ , $\kappa$ , $\alpha$
output a partition $\zeta=\{{\cal X}_{j}\}_{j=1}^{K}$ and a learned vector $\textbf{W}$
1: Initialize $w_{i}:=\kappa w^{*}_{i}+(1-\kappa)/D$ for $i\in\{1,\ldots,D\}$
2: Get $K$ centers $\{c_{j}\}_{j=1}^{K}$ with K-means++ using the distance weighted by $\textbf{W}$
3: Initialize ${\cal Z}:=\sum_{i=1}^{D}\frac{\kappa w^{*}_{i}+(1-\kappa)/D}{S_{i}}$
4: repeat
5: {update the partition $\{{\cal X}_{j}\}_{j=1}^{K}$ }
6: ${\cal X}_{j}:=\left\{x\in{\cal X}:\arg\min_{l\in\{1,\ldots,K\}}||{x-c_{l}}||^{2}_{\textbf{W}}=j\right\}$ for $j\in\{1,\ldots,K\}$
7: $c_{j}[i]:=\frac{\sum_{x\in{\cal X}_{j}}x[i]}{|{\cal X}_{j}|}$ for $i\in\{1,\ldots,D\}$ and $j\in\{1,\ldots,K\}$
8: {update the vector $\textbf{W}$ }
9: Update $\inf_{\lambda}$ and $\sup_{\lambda}$ , the bounds for searching the optimal value of $\lambda$
10: Compute $\lambda$ using the dichotomic search (Algorithm 2)
11: $w_{i}:=\frac{(1-\alpha)\big{(}\kappa w^{*}_{i}+(1-\kappa)/D\big{)}}{\alpha{\cal Z}S_{i}+\lambda}$ for $i\in\{1,\ldots,D\}$
12: until $\{{\cal X}_{j}\}_{j=1}^{K}$ remains unchanged
13: return $\{{\cal X}_{j}\}_{j=1}^{K}$ and $\textbf{W}$

4.2.1 Algorithm initialization

The attribute weights $\textbf{W}$ are initialized using the user preferences $\textbf{W}^{*}$ and the confidence parameter (line 1). We use the same initialization as K-means++ [7] (line 2): the first center is randomly selected from the data objects, then each of the $K-1$ other initial centers is randomly selected with a probability proportional to the sum of the distances to the centers that have already been chosen. The only difference is that the distance is weighted by $\textbf{W}$ , which ensures that the centers are chosen taking into account the preferences and guarantees that the initialization of the centers and the cluster assignment are done in the same metric space. Finally, the constant ${\cal Z}$ is initialized such that the intra-cluster distance and the other terms have a similar impact during the update of a weight $w_{i}$ (see Eq. (10)) when $\alpha=$ 0.5. For this, we choose a ${\cal Z}$ value that is identical to that of MPCK-means [13] when $\alpha=$ 0.5 (line 3). More precisely, this value is defined by:

$\displaystyle{\cal Z}=\sum_{i=1}^{D}{\frac{\kappa w^{*}_{i}+(1-\kappa)/D}{S_{i}}}$ (11)
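A small numerical check helps to see what this choice of ${\cal Z}$ achieves: substituting Eq. (11) into Eq. (10) with $\alpha=0.5$ , the weights already sum to 1 for $\lambda=0$ . The sketch below verifies this on hypothetical values of $S_{i}$ , $\textbf{W}^{*}$ and $\kappa$ ; it assumes every $S_{i}$ is strictly positive.

```python
def normalizing_constant(s, w_star, kappa):
    """Eq. (11): Z = sum_i (kappa * w*_i + (1 - kappa) / D) / S_i."""
    d = len(s)
    return sum((kappa * ws + (1 - kappa) / d) / si for si, ws in zip(s, w_star))

# Hypothetical intra-cluster distances and preferences (D = 3).
s = [2.0, 0.5, 1.0]
w_star = [0.6, 0.3, 0.1]
kappa, alpha = 0.7, 0.5

z = normalizing_constant(s, w_star, kappa)

# Eq. (10) evaluated at alpha = 0.5 with lambda = 0.
weights = [
    (1 - alpha) * (kappa * ws + (1 - kappa) / len(s)) / (alpha * z * si)
    for si, ws in zip(s, w_star)
]
print(z, weights, sum(weights))
```

For other values of $\alpha$ , the sum generally differs from 1 at $\lambda=0$ , and the multiplier has to be searched for.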

Algorithm 1 describes the three phases of MAPK-means after the initialization step:

Cluster assignment

The assignment step is the same as in K-means (lines 5–6), with the only difference that the distances between points and centroids are parameterized by the vector ${\bf W}$ . Each point is assigned to the nearest cluster (line 6). This assignment reduces the intra-cluster distance and therefore also decreases the objective function ${\mathcal{I}}_{\textit{map}}$ .

Centroid re-estimation

Once all points are assigned to a cluster, we update the center of each cluster by calculating the centroid for each attribute $i$ (line 7). It is important to note that the calculation of the centers in our approach is insensitive to the order of assignment of the points. We differ in this respect from the MPCK-means approach [13], in which the order of assignment of the points is determined by the distance to the clusters and by the ML and CL constraints, and thus influences the computed coordinates of the new centers.
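The first two phases can be sketched as follows. The function names `assign` and `recompute_centroids` are illustrative, the data are a toy example, and empty clusters are deliberately not handled (an assumption of this sketch).

```python
def assign(points, centroids, w):
    """Cluster assignment: index of the nearest centroid under the
    W-weighted squared Euclidean distance."""
    def dist(x, c):
        return sum(wi * (xi - ci) ** 2 for wi, xi, ci in zip(w, x, c))
    return [
        min(range(len(centroids)), key=lambda j: dist(x, centroids[j]))
        for x in points
    ]

def recompute_centroids(points, labels, k):
    """Centroid re-estimation: per-attribute mean of each cluster.

    Assumption: every cluster has at least one point.
    """
    d = len(points[0])
    centroids = []
    for j in range(k):
        members = [x for x, lab in zip(points, labels) if lab == j]
        centroids.append(
            tuple(sum(x[i] for x in members) / len(members) for i in range(d))
        )
    return centroids

# Four hypothetical points on the unit square and two initial centers.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
centers = [(0.0, 0.0), (1.0, 1.0)]

labels_attr1 = assign(points, centers, w=(0.9, 0.1))  # first attribute dominates
labels_attr2 = assign(points, centers, w=(0.1, 0.9))  # second attribute dominates
new_centers = recompute_centroids(points, labels_attr1, k=2)
print(labels_attr1, labels_attr2, new_centers)
```

The same points and centers yield two different partitions depending on the weight vector, which is the mechanism through which the learned metric steers the clustering.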

Metric learning

In this step, MAPK-means (Algorithm 1) learns the right metric by updating the vector ${\bf W}$ that minimizes the objective function ${\mathcal{I}}_{\textit{map}}$ (lines 8–11). As explained in Section 4.1, the update of ${\bf W}$ is obtained by setting the derivative $\frac{\partial{\mathcal{I}}^{\prime}_{\textit{map}}}{\partial w_{i}}$ to 0. In order to get the exact update of ${\bf W}$ , we have to compute the Lagrange multiplier $\lambda$ (see Eq. (10)). We introduce $p_{i}=(1-\alpha)\big{(}\kappa w_{i}^{*}+(1-\kappa)/D\big{)}$ as the numerator and $q_{i}=\alpha{\mathcal{Z}}S_{i}$ as the denominator (excluding $\lambda$ ). Then, Eq. (10) becomes $w_{i}=\frac{p_{i}}{q_{i}+\lambda}$ and the calculation of $\lambda$ consists in solving the following equation:

$\displaystyle\sum_{i=1}^{D}\frac{p_{i}}{q_{i}+\lambda}=1$ (12)

We use a dichotomic search to determine an approximate solution to this equation (line 10) with a maximum error of $\epsilon$ . Consequently, it is necessary to bound $\lambda$ in order to initialize this search:

Property 1 (Upper bound).

The upper bound of the Lagrange multiplier $\lambda$ is given by:

$\displaystyle{\sup}_{\lambda}=D\times{\underset{i\in[1..D]}{\max}(p_{i})}-\underset{i\in[1..D]}{\min}(q_{i})$ (13)

Proof.

We have $1=\sum_{i=1}^{D}w_{i}\leqslant\sum_{i=1}^{D}\underset{i\in[1..D]}{\max}(w_{i})\leqslant D\times\underset{i\in[1..D]}{\max}(w_{i})$ . Furthermore, $\underset{i\in[1..D]}{\max}(w_{i})\leqslant\frac{\underset{i\in[1..D]}{\max}(p_{i})}{\underset{i\in[1..D]}{\min}(q_{i})+\lambda}$ . This implies that $1\leqslant D\times\frac{\underset{i\in[1..D]}{\max}(p_{i})}{\underset{i\in[1..D]}{\min}(q_{i})+\lambda}$ . Consequently, we get $\underset{i\in[1..D]}{\min}(q_{i})+\lambda\leqslant D\times\underset{i\in[1..D]}{\max}(p_{i})$ , which concludes the proof that $\lambda\leqslant D\times\underset{i\in[1..D]}{\max}(p_{i})-\underset{i\in[1..D]}{\min}(q_{i})$ . ∎

Property 2 (Lower bound).

The lower bound of the Lagrange multiplier $\lambda$ is given by:

$\displaystyle{\inf}_{\lambda}=-\underset{i\in[1..D]}{\min}(q_{i})$ (14)

Proof.

Since the weights on the attributes must be positive, we have $w_{i}=\frac{p_{i}}{q_{i}+\lambda}\geqslant 0$ . Moreover, as the user’s preferences ( $w_{i}^{*}$ ) are always positive and the confidence level $\kappa\in[0,1]$ , we know that $p_{i}\geqslant 0$ for all $i\in[1..D]$ . Consequently, $\forall i\in[1..D]$ , we necessarily have $q_{i}+\lambda\geqslant 0$ . This implies that $\underset{i\in[1..D]}{\min}(q_{i})+\lambda\geqslant 0$ , which concludes the proof that $\lambda\geqslant-\underset{i\in[1..D]}{\min}(q_{i})$ . ∎

Algorithm 2: Dichotomic Search
input $\inf_{\lambda}$ , $\sup_{\lambda}$ and $\epsilon$
output $\lambda$
1: repeat
2: $\lambda:=\frac{\inf_{\lambda}+\sup_{\lambda}}{2}$
3: if $(\sum_{i=1}^{D}{w_{i}}-1<0)$
4: $\sup_{\lambda}:=\lambda$
5: else if $(\sum_{i=1}^{D}{w_{i}}-1>0)$
6: $\inf_{\lambda}:=\lambda$
7: end if
8: until $|\sum_{i=1}^{D}{w_{i}}-1|\leqslant\epsilon$
9: return $\lambda$

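Algorithm 2 describes the dichotomic search, which starts from the midpoint of $[\inf_{\lambda},\sup_{\lambda}]$ and repeatedly halves the interval until converging to a value of $\lambda$ satisfying $|\sum_{i=1}^{D}{w_{i}}-1|\leqslant\epsilon$ , where each $w_{i}=\frac{p_{i}}{q_{i}+\lambda}$ is recomputed from the current value of $\lambda$ .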
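A possible rendering of the dichotomic search, with the bounds of Eqs (13) and (14), is sketched below. The function name `solve_lambda`, the iteration cap `max_iter` and the example values of $p_{i}$ and $q_{i}$ are assumptions added for illustration; the cap is only a safeguard and is not part of Algorithm 2.

```python
def solve_lambda(p, q, eps=1e-9, max_iter=200):
    """Dichotomic search for lambda such that sum_i p_i / (q_i + lambda) = 1.

    The search interval is [inf_lambda, sup_lambda] from Eqs (14) and (13).
    """
    lo = -min(q)                   # Eq. (14)
    hi = len(p) * max(p) - min(q)  # Eq. (13)
    lam = (lo + hi) / 2
    for _ in range(max_iter):
        lam = (lo + hi) / 2
        total = sum(pi / (qi + lam) for pi, qi in zip(p, q))
        if abs(total - 1) <= eps:
            break
        if total < 1:
            hi = lam  # weights too small: lambda must decrease
        else:
            lo = lam  # weights too large: lambda must increase
    return lam

# Hypothetical numerators p_i and denominators q_i for D = 2 attributes.
p = [0.4, 0.1]
q = [0.5, 2.0]
lam = solve_lambda(p, q)
weights = [pi / (qi + lam) for pi, qi in zip(p, q)]
print(lam, weights, sum(weights))
```

The search relies on the fact that the sum of the weights decreases monotonically as $\lambda$ increases over the search interval, so halving the interval always moves towards the solution.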

4.2.2 Illustrative example

Let us illustrate MAPK-means with the running example provided in Fig. 1 (page 3).

•
With the preference vector ${\bf W}^{*}=\langle 0.8,0.1,0.1\rangle$ and a high confidence level $\kappa=1$ , MAPK-means returns the clustering of Fig. 1b with ${\bf W}=\langle 0.874,0.078,0.048\rangle$ . Similarly, with ${\bf W}^{*}=\langle 0.1,0.1,0.8\rangle$ and $\kappa=1$ , it returns the clustering of Fig. 1c with ${\bf W}=\langle 0.040,0.075,0.885\rangle$ . Note that in both cases, metric learning follows the expert’s recommendations. Whatever the level of confidence, the clustering for these two preference vectors remains roughly the same because the preferences make sense with the data distribution.
•
Interestingly, with the expert C and the preference vector ${\bf W}^{*}=\langle 0.1,0.6,0.3\rangle$ , even for $\kappa=1$ , MAPK-means favors the clustering of Fig. 1c where ${\bf W}=\langle 0.032,0.458,0.510\rangle$ . This case illustrates the compromise performed by MAPK-means: on the one hand, the algorithm follows the preferences of the expert C and retains a high weight on the second attribute; on the other hand, MAPK-means favors the third attribute, which fits the data distribution better and had a stronger preference than the first attribute.

4.3 Theoretical analysis of MAPK-means

MAPK-means algorithm inherits good properties from K-means. Indeed, Properties 3 and 4 show respectively that it converges and that its time complexity is reasonable in comparison with the state-of-the-art methods:

Property 3 (Termination).

MAPK-means converges to a locally optimal solution in a finite number of steps.

Proof.

Cluster assignment and centroid re-estimation decrease the intra-cluster distance without changing the preferential and regularization terms. Besides, the metric learning step minimizes the objective function ${\mathcal{I}}_{\textit{map}}$ with respect to ${\bf W}$ . As the three steps of MAPK-means decrease ${\mathcal{I}}_{\textit{map}}$ (which is bounded below by 0), it converges to a locally optimal solution in a finite number of steps. ∎

Property 4 (Complexity).

The time complexity of MAPK-means is $O(l(\textit{NKD}+\textit{ND}+\textit{mD}))$ where $l$ is the number of iterations and $m$ the number of dichotomic search iterations.

Note that the search interval of the dichotomic search is $[\inf_{\lambda},\sup_{\lambda}]$ . Let $\epsilon_{0}=\sup_{\lambda}-\inf_{\lambda}$ be the interval size, i.e. the maximal initial search error, and $\epsilon$ a given tolerance threshold (precision). The number of iterations needed to achieve the tolerance $\epsilon$ is: $m=\frac{\log(\epsilon_{0})-\log(\epsilon)}{\log(2)}$ [22].

Proof.

The time complexities of the cluster assignment and centroid re-estimation steps are respectively $O(\textit{NKD})$ and $O(\textit{ND})$ . The time complexity of metric learning mainly depends on the resolution of the Lagrange multiplier, which benefits from a dichotomic search. Its time complexity is $O(\textit{mD})$ . ∎
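To give an order of magnitude for $m$ , the sketch below evaluates the iteration count for a hypothetical search interval. Rounding up to the next integer is an assumption of this sketch, and the tolerance is interpreted here as a precision on $\lambda$ .

```python
import math

def dichotomic_iterations(lam_inf, lam_sup, eps):
    """m = (log(eps0) - log(eps)) / log(2), rounded up, with
    eps0 = lam_sup - lam_inf the size of the initial interval."""
    eps0 = lam_sup - lam_inf
    return max(0, math.ceil((math.log(eps0) - math.log(eps)) / math.log(2)))

# Hypothetical interval of size 0.8 and tolerance of 1e-6.
m = dichotomic_iterations(-0.5, 0.3, 1e-6)
print(m)
```

Since $m$ grows only logarithmically with the ratio $\epsilon_{0}/\epsilon$ , the $O(\textit{mD})$ metric learning step stays small compared with the $O(\textit{NKD})$ assignment step in typical settings.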

The time complexity of MAPK-means is less than the complexity of [60] where the computation of weights optimization is quadratic with $P+D$ (where $P$ is the number of preferences which is upper bounded by $D^{2}$ ).

Interestingly, when the user has no preferences, the minimization of Eq. (6) minimizes the objective function of MPCK-means without constraints [13]:

Property 5 (MPCK-means equivalence).

If the user has no preferences (i.e., ${\bf W}^{*}$ equals ${\bf U}$ ) and $\alpha$ equals $0.5$ , then the algorithm MAPK-means is equivalent to the algorithm MPCK-means [13] without ML and CL constraints, when a single metric is learned for all clusters.

Proof.

In the case where a single metric is learned, and where no ML or CL constraints are considered, note first that the MPCK-means weight update equation simply becomes $w_{i}=\frac{{|{{\mathcal{X}}}|}}{S_{i}}$ , $\forall i\in[1..D]$ . On the other hand, in the case of our algorithm MAPK-means, the weight update equation is: $w_{i}=\frac{(1-\alpha)\big{(}\kappa w_{i}^{*}+(1-\kappa)/D\big{)}}{\alpha{\mathcal{Z}}S_{i}+\lambda}$ . When $\alpha=0.5$ and $w_{i}^{*}=\frac{1}{D}$ for all $i\in[1..D]$ , this equation can be simply rewritten as follows:

$\displaystyle w_{i}={\displaystyle\frac{0.5\big{(}\kappa/D+(1-\kappa)/D\big{)}}{0.5{\mathcal{Z}}S_{i}+\lambda}}={\displaystyle\frac{0.5/D}{0.5{\mathcal{Z}}S_{i}+\lambda}}.$

Furthermore, by Eq. (11), we have:

$\displaystyle{\mathcal{Z}}=\sum_{d=1}^{D}\frac{\kappa w_{d}^{*}+(1-\kappa)/D}{S_{d}}=\sum_{d=1}^{D}\frac{\kappa/D+(1-\kappa)/D}{S_{d}}=\frac{1}{D}\sum_{d=1}^{D}\frac{1}{S_{d}}.$

The MAPK-means weight update equation thus becomes:

$\displaystyle w_{i}={\displaystyle\frac{0.5/D}{(0.5/D)\big{(}\sum_{d=1}^{D}\frac{1}{S_{d}}\big{)}\times S_{i}+\lambda}}.$

and it is easy to verify that the sum of the weights is normalized when $\lambda=0$ . We then have:

$\displaystyle w_{i}={\displaystyle\frac{0.5/D}{(0.5/D)\big{(}\sum_{d=1}^{D}\frac{1}{S_{d}}\big{)}\times S_{i}}}={\displaystyle\frac{1/S_{i}}{\sum_{d=1}^{D}\frac{1}{S_{d}}}}$

which is equivalent to the MPCK-means weight update equation, up to a normalization coefficient. ∎

5. Experiments on UCI benchmarks

Extensive experiments were conducted to evaluate the performance of our new semi-supervised algorithm MAPK-means. After presenting the protocol and the datasets from UCI benchmarks used in Section 5.1, we present the following experiments:

•
First, in Section 5.2, we address the problem of the tuning of the parameter $\alpha$ . More precisely, we answer the following question: is 0.5 a good default value for $\alpha$ ?
•
Then, in Section 5.3, we study the impact of attribute preferences on the clustering results. In particular, we study the sensitivity of our algorithm to the user preferences in order to answer three main questions: (i) do relevant user-specified preferences have a positive impact? (ii) can MAPK-means correct a bad choice of preferences? (iii) which better guides the exploration of the solution: user preferences or data?
•
Finally, in Section 5.4, we compare the performance of MAPK-means with the state-of-the-art algorithms. In particular, we observe that our method can be more efficient than [60], while setting the parameters is easier.

5.1 Datasets and protocol

In order to evaluate the performance of our approach, we mainly performed experiments on multivariate datasets from the UCI repository (archive.ics.uci.edu/ml/datasets.html) for ease of reproducibility and comparison with other approaches like [60]. Moreover, the presence of ground truth partitions for the datasets used makes it possible to evaluate the quality of the clusterings obtained by the different methods. Finally, in our experiments the ground truth helps us to simulate two different scenarios of preferences: relevant and irrelevant (cf. Section 5.1.2).

For all datasets, we normalize the attributes using the Min-Max normalization method, which scales each attribute between 0 and 1. This provides an easy way to compare values that are measured using different scales or different units of measure. Furthermore, this does not favor attributes with large scale variation over others. However, this transformation does not give equal contributions of attributes in the clustering process. Indeed, the distributions of the different normalized attributes are not necessarily similar.
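The Min-Max normalization can be sketched as follows. The function name `min_max_normalize` and the handling of constant attributes (mapped to 0) are assumptions of this sketch, since the text does not specify that case.

```python
def min_max_normalize(rows):
    """Scale each attribute (column) of the dataset to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            # Assumption: a constant attribute is mapped to 0.0.
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]

# Hypothetical dataset with two attributes measured on different scales.
data = [(1.0, 100.0), (2.0, 300.0), (3.0, 200.0)]
normalized = min_max_normalize(data)
print(normalized)
```

After this transformation both attributes span the same range, although their distributions within that range may still differ.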

Note that in our pre-processing phase of the datasets, no dependency checking is performed. In fact, our approach can be used without any prior knowledge about the datasets. Furthermore, we show experimentally that our approach is able to correct the resulting clustering when irrelevant choices of preferences are given (cf. Section 5.3.2). Finally, for noisy data or missing attributes values, as our approach is K-means based, note that the same strategies of data pre-processing used for K-means algorithm can be applied [67].

Table 1

Statistics of benchmark datasets ( $N$ : number of data objects, $M$ : number of attributes, denoted $D$ in the text, $K$ : number of clusters)

	Abalone	Glass	Ionosphere	Iris	Optdigits	Pendigits
$N$	4177	214	351	150	5620	10992
$M$	8	9	32	4	62	16
$K$	28	6	2	3	10	10
	Pgblocks	Pima	Vowel	Waveform	Wdbc	Wine
$N$	5473	768	990	5000	569	178
$M$	10	8	10	21	30	13
$K$	5	2	11	3	2	3

5.1.1 Quality evaluation

In the following, we evaluate the quality of the clustering built by our algorithm MAPK-means with the ground truth clustering described in the datasets of the UCI repository. More precisely, in order to compare a clustering or partition $X$ with a reference partition $Y$ , we use the Normalized Mutual Information (NMI) [4, 60]. This measure is a quality index that evaluates the agreement between two partitions based on the amount of knowledge we gain about one partition knowing the other. Its value ranges from $0$ to $1$ : $0$ indicates that the two partitions are completely independent, whereas $1$ means that the two partitions are identical. Formally, given two partitions $X$ and $Y$ , we recall that the NMI measure is defined by:

$\displaystyle\textit{NMI}(X,Y)=\frac{I(X,Y)}{\sqrt{H(X)H(Y)}}$

where $I(X,Y)$ is the mutual information between two data partitions $X$ and $Y$ , and $H(X)$ denotes the entropy of $X$ .
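For reference, the NMI defined above can be computed from two label vectors as sketched below. The function names are illustrative, natural logarithms are used, and returning 0 when one partition has a single cluster is a convention assumed here.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy H(X) of a partition given as a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(x, y):
    """Mutual information I(X, Y) between two partitions of the same data."""
    n = len(x)
    cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (nxy / n) * math.log(n * nxy / (cx[a] * cy[b]))
        for (a, b), nxy in cxy.items()
    )

def nmi(x, y):
    """NMI(X, Y) = I(X, Y) / sqrt(H(X) * H(Y))."""
    hx, hy = entropy(x), entropy(y)
    if hx == 0 or hy == 0:
        return 0.0  # assumed convention for a single-cluster partition
    return mutual_information(x, y) / math.sqrt(hx * hy)

truth = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]    # same partition with permuted labels
independent = [0, 1, 0, 1]  # carries no information about `truth`

score_same = nmi(truth, relabeled)
score_indep = nmi(truth, independent)
print(score_same, score_indep)
```

The measure is invariant to a permutation of the cluster labels, which is why it can compare a clustering with a ground truth partition directly.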

5.1.2 Preference initialization

In practice, user preferences on attributes would be specified by domain experts. However, for the ease of reproducibility, we propose in our experiments a method to automatically generate preferences by using the ground truth class information. Intuitively, we assume that a descriptive attribute is most probably relevant to build a clustering if it is strongly correlated with the ground truth partition. In order to evaluate the degree of correlation between a descriptive attribute and the class attribute, we propose to use the Fisher’s ratio of the analysis of variance (ANOVA F-test) [52].

More precisely, given a set of clusters ${\zeta}=\{{\mathcal{X}}_{j}\}_{j=1}^{K}$ , let $c_{j}$ be the centroid of each ${\mathcal{X}}_{j}$ and $\bar{c}$ be the centroid of the whole dataset. With respect to the descriptive attribute $i$ , the Fisher’s ratio $F_{i}$ is defined by:

$\displaystyle F_{i}=\frac{(N-K)\sum_{j=1}^{K}(c_{j}[i]-\bar{c}[i])^{2}}{(K-1)\sum_{j=1}^{K}\sum_{x\in{\mathcal{X}}_{j}}(x[i]-c_{j}[i])^{2}}$ (15)

Note that $F_{i}=$ 0 when the descriptive attribute $i$ is completely independent from the reference clustering ${\zeta}$ . Conversely, the higher the value of $F_{i}$ , the greater the probability that the descriptive attribute $i$ is correlated with the reference clustering ${\zeta}$ .

Based on the Fisher’s ratio, we introduce two preference vectors, namely “relevant” and “irrelevant”, to automatically define user preferences:

•

the relevant weights vector ${\bf W^{\textit{Rel}}}=\langle w^{Rel}_{1},\ldots,w^{Rel}_{i},\ldots,w^{Rel}_{D}\rangle$ is defined for $i\in[1..D]$ by:

$\displaystyle w_{i}^{\textit{Rel}}=\frac{F_{i}}{\sum_{d=1}^{D}F_{d}}$ (16)

Our assumption is that this preference vector will help our algorithm MAPK-means to find better clusterings.

•

conversely, the irrelevant weights vector ${\bf W^{\textit{Irr}}}=\langle w^{Irr}_{1},\ldots,w^{Irr}_{i},\ldots,w^{Irr}_{D}\rangle$ is defined for $i\in[1..D]$ by:

$\displaystyle w_{i}^{\textit{Irr}}=\frac{\frac{1}{F_{i}}}{\sum_{d=1}^{D}\frac{1}{F_{d}}}$ (17)

Our assumption is that using this preference vector, our algorithm will have to correct this vector to find good clusterings.
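The generation of the two preference vectors can be sketched as follows, implementing Eqs (15) to (17) as written. The function names and the toy ground truth are assumptions; the sketch also assumes a non-zero within-cluster spread on every attribute and $F_{i}>0$ , so that Eq. (17) is defined.

```python
def attribute_mean(points, i):
    """Mean of attribute i over a list of points."""
    return sum(x[i] for x in points) / len(points)

def fisher_ratios(clusters):
    """Eq. (15): one ratio F_i per attribute, from a reference partition."""
    points = [x for cluster in clusters for x in cluster]
    n, k, d = len(points), len(clusters), len(points[0])
    ratios = []
    for i in range(d):
        overall = attribute_mean(points, i)
        between = sum((attribute_mean(c, i) - overall) ** 2 for c in clusters)
        within = sum(
            (x[i] - attribute_mean(c, i)) ** 2 for c in clusters for x in c
        )
        ratios.append((n - k) * between / ((k - 1) * within))
    return ratios

def relevant_preferences(f):
    """Eq. (16): weights proportional to the Fisher ratios."""
    total = sum(f)
    return [fi / total for fi in f]

def irrelevant_preferences(f):
    """Eq. (17): weights proportional to the inverse Fisher ratios."""
    inverse = [1.0 / fi for fi in f]
    total = sum(inverse)
    return [v / total for v in inverse]

# Hypothetical ground truth: attribute 0 separates the two classes well,
# attribute 1 hardly does.
truth = [[(0.0, 0.0), (1.0, 4.0)], [(10.0, 2.0), (11.0, 4.0)]]
F = fisher_ratios(truth)
w_rel = relevant_preferences(F)
w_irr = irrelevant_preferences(F)
print(F, w_rel, w_irr)
```

On this toy example the relevant vector concentrates almost all the weight on the discriminative attribute, while the irrelevant vector does the opposite.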

Finally, note that we run a large number of tests ( $n=$ 100) to ensure the significance of the results. Thus, we can compute the average and the standard deviation of the NMI scores and use the ANOVA statistical test to ensure the significance of the results (for more details see Annex Test of significance).

5.2 0.5 as a good default value of $\alpha$

The aim of this first set of experiments is to show that for $\alpha=$ 0.5 (the default value of the intra-cluster distance weight), it is always possible to achieve good clustering results on benchmark datasets. More precisely, we show that a user can always obtain a good clustering considering different values of the confidence level parameter $\kappa$ , without adjusting the default value of the intra-cluster distance weight $\alpha$ . Note that in similar approaches such as [60], the authors do not provide equivalent experiments and default values in order to help the user to set the parameters of their method.

Experimental setting For two initialization scenarios of preferences ( ${\bf W^{\textit{Rel}}}$ and ${\bf W^{\textit{Irr}}}$ ), we apply our algorithm MAPK-means for different values of intra-cluster distance weight $\alpha$ and confidence level $\kappa$ . More precisely, for each value of $\alpha$ and $\kappa$ in $[0,1]$ (with a step of 0.05), we run $n=$ 100 tests to ensure the significance of the results.

Results Using this experimental setting, we present the results of our experiments in Fig. 2. For each dataset and initialization scenarios ( ${\bf W}^{*}={\bf W^{\textit{Rel}}}$ or ${\bf W}^{*}={\bf W^{\textit{Irr}}}$ ), Fig. 2 shows:

•
The best average of NMI scores with the corresponding standard deviation for values of $\alpha$ and $\kappa$ ranging between 0 and 1 (with steps of 0.05).
•
The best average of NMI scores with the corresponding standard deviation for values of $\kappa$ ranging between $0$ and $1$ , while $\alpha$ is set to its default value, i.e. $\alpha=$ 0.5.

These results show that it is almost always possible to obtain good clustering results with $\alpha=$ 0.5. Even if the average NMI score with $\alpha=$ 0.5 can be lower than the best average NMI score obtained with $\alpha\in[0,1]$ , the differences are generally not significant: the quality of the clusterings is comparable for most of the datasets (10 out of 12 datasets).

Figure 2.
Comparison of the NMI averages of the clusterings obtained with $\alpha=0.5$ and $\kappa$ varying in $[0,1]$ with those of the clusterings obtained for all possible values of the parameters $\kappa$ and $\alpha$ .

Summary To sum up, the results of the experiments presented in this Section 5.2 mainly show that good clusterings can almost always be achieved with $\alpha=$ 0.5. Thus, MAPK-means avoids a frequent drawback of clustering algorithms that require the user to set several parameters, a configuration process that is in general complex and non-trivial. In the following, all the experiments are conducted using the default value of the parameter $\alpha=$ 0.5.

5.3 Sensitivity of MAPK-means to user preferences

This experimental section has two main objectives. First, we show that if the user’s preference vector is relevant ( ${\bf W}^{*}={\bf W^{\textit{Rel}}}$ ) and the user has a high confidence in his/her preferences ( $\kappa=$ 1), then our algorithm MAPK-means can compute a clustering that is better than a clustering obtained using only metric learning ( $\kappa=$ 0). Then, in a second experimental setting, we study the behavior of our algorithm MAPK-means when the user’s preference vector is not relevant ( ${\bf W}^{*}={\bf W^{\textit{Irr}}}$ ). In that case, we show that the quality of the clusterings increases if the confidence of the user in his/her preferences is reduced. Thus, these experiments show that our algorithm can correct the user’s preference vector when his/her intuitions are not appropriate.

Experimental setting We consider the same protocol presented in Section 5.1 and already used in Section 5.2. However, all the experiments are conducted with the default value of the intra-cluster distance weight, i.e. $\alpha=$ 0.5.

5.3.1 Impact of relevant preferences on the quality of clustering

We first present the results obtained with a relevant choice of preferences in Fig. 3. For all the experiments presented in this figure, the preference vector ${\bf W}^{*}$ is set to ${\bf W^{\textit{Rel}}}$ . In this context, we want to compare the results when considering or ignoring the user preferences. In other words, we want to study the impact of using a metric emphasizing the relevant attributes.

In order to do this, we consider two different configurations of MAPK-means:

1. when the confidence of the user in his/her preferences is maximal, i.e. using $\kappa=$ 1.
2. when the preference vector of the user is not considered, i.e. using $\kappa=$ 0. In this case, the algorithm relies only on data to learn the metric.

For these two configurations, we present in Fig. 3 the averages of the NMI scores and the corresponding standard deviation for each dataset.

Figure 3.
Comparison of the NMI averages obtained using MAPK-means with ${\bf W}^{*}={\bf W^{\textit{Rel}}}$ , $\kappa=$ 0 and $\kappa=$ 1.

These results demonstrate that when the preference vector of the user is relevant and the confidence in his/her preferences is high ( $\kappa=$ 1), the quality of the clustering obtained is, with few exceptions, significantly better than when no preferences are expressed or taken into account, i.e. MAPK-means with $\kappa=$ 0.

The only dataset for which ignoring the preferences gives the best NMI score is Wine. This may be due to the fact that for Wine all attributes are nearly equally important ( $\overline{D}_{\textit{KL}}({{\bf W}^{*}}\|{{\bf U}})=$ 0.124) and this balance must be respected as much as possible to ensure the best performance.

We note that when the normalized divergence $\overline{D}_{\textit{KL}}({{\bf W}^{*}}\|{{\bf U}})$ between the preference vector ${\bf W}^{*}$ and the uniform vector ${\bf U}$ is close to zero, the quality of the clustering obtained with $\kappa=$ 0 or $\kappa=$ 1 is not significantly different. See, in particular, the averages of the NMI scores obtained for the datasets Abalone, Pendigits and Waveform ( $\overline{D}_{\textit{KL}}({{\bf W}^{*}}\|{{\bf U}})=$ 0.008, 0.045 and 0.068 respectively). This behavior is due to our heuristic initialization of the best weight ${\bf W^{\textit{Rel}}}$ , which is not able to adapt the distribution of weights to the distribution of these datasets.
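
To make the role of this divergence concrete, the following minimal sketch computes a normalized Kullback-Leibler divergence between a weight vector and the uniform vector. It assumes that the normalization divides $D_{\textit{KL}}({\bf W}\|{\bf U})$ by its maximal value $\log D$ (reached when all the weight is on one attribute); the exact definition used in the paper is given in an earlier section and may differ. The function name `normalized_kl_to_uniform` is hypothetical.

```python
import math


def normalized_kl_to_uniform(weights):
    """Return D_KL(W || U) / log(D) for a weight vector W over D attributes.

    Assumption: the divergence is normalized by log(D), its maximum value,
    so that the result lies in [0, 1].
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    d = len(weights)
    if d < 2:
        raise ValueError("need at least two attributes")
    probs = [w / total for w in weights]
    # D_KL(W || U) = sum_j w_j * log(w_j * D); zero weights contribute nothing.
    kl = sum(p * math.log(p * d) for p in probs if p > 0)
    return kl / math.log(d)


# A uniform vector carries no preference; a one-hot vector is maximally peaked.
print(normalized_kl_to_uniform([1.0] * 6))
print(normalized_kl_to_uniform([1.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
```

Under this assumption, a value close to zero means that the preference vector hardly departs from uniform weights, which is consistent with the small difference observed between $\kappa=$ 0 and $\kappa=$ 1 on such datasets.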

Finally, these results show a better stability of the NMI (w.r.t. the standard deviation values) when the confidence in preferences is high ( $\kappa=$ 1).

Summary A relevant choice of preferences has a positive impact on the quality of clusterings obtained with our algorithm MAPK-means. It also improves the stability of the results.
5.3.2 The ability of MAPK-means to correct irrelevant preferences

The main objective of this experiment is to study the ability of MAPK-means to correct the clustering obtained and improve its quality, when the preference vector ${\bf W}^{*}$ is set to ${\bf W^{\textit{Irr}}}$ . To this aim, we evaluate and compare the quality of the clusterings obtained by varying the confidence level of the user in his/her preferences between $0$ and $1$ . We present in Fig. 4 the average values of the NMI scores with ${\bf W}^{*}={\bf W^{\textit{Irr}}}$ and $\kappa\in\{0,0.5,1\}$ .

Figure 4.

Comparison of the NMI averages obtained using MAPK-means with ${\bf W}^{*}={\bf W^{\textit{Irr}}}$ , $\kappa\in\{0,0.5,1\}$ .

First, we can see in Fig. 4 that when the preference vector of the user is irrelevant, then the best clustering quality is always obtained with $\kappa\approx 0$ , which means that MAPK-means ignores the irrelevant preferences of the user, and that MAPK-means manages to learn a better metric using the characteristics of the datasets.

The same figure shows that when the preference vector of the user is not relevant, the quality of the clusterings always increases if the confidence level of the user in his/her preferences is reduced. For most of the datasets, the quality of the clusterings is already significantly improved with $\kappa=$ 0.5.

For some datasets, in particular Ionosphere, Waveform and Wdbc, we can see that the quality of the clusterings is improved only for very small values of $\kappa$ (less than 0.2). This can be explained by the large number of dimensions of these three datasets ( $D=$ 32, 21 and 30, respectively), and by preference vectors ${\bf W^{\textit{Irr}}}$ that are particularly unfavorable. Concerning this remark, it is interesting to see that the normalized divergence $\overline{D}_{\textit{KL}}({{\bf W^{\textit{Irr}}}}\|{{\bf U}})$ between the preference vector ${\bf W^{\textit{Irr}}}$ and the uniform vector ${\bf U}$ is very high for the datasets Ionosphere, Waveform and Wdbc ( $\overline{D}_{\textit{KL}}({{\bf W^{\textit{Irr}}}}\|{{\bf U}})=$ 0.820, 0.779 and 0.694 respectively).

Finally, we note a high instability of the results obtained when the confidence level $\kappa$ is strictly greater than 0. This is due to the fact that MAPK-means is trying to build the clustering with the irrelevant preferences, which are highly inconsistent with the data structure.

Summary The results obtained in this section demonstrate that our algorithm MAPK-means is able to correct a bad choice of preferences to obtain a clustering with better quality. The user has only to reduce the confidence $\kappa$ in his/her choices.

5.3.3 Data-driven clustering vs user-driven clustering

The experiments in Section 5.3.1 show that relevant preferences have a positive impact on the quality of clustering. In contrast, when the preferences are irrelevant (Section 5.3.2), the results show that the clustering driven only by the data has better quality than the clustering built using the irrelevant preferences.

When the user preferences are relevant, our aim is now to determine which is the best of three alternatives: (i) the solution guided completely by the preferences (i.e. with $\kappa=$ 1), (ii) the solution guided only by the data (i.e. with $\kappa=$ 0), or (iii) a compromise between the two solutions.

In order to do this, we consider the same scenario of preferences as in Section 5.3.1, i.e. with ${\bf W}^{*}={\bf W^{\textit{Rel}}}$ , while varying the confidence $\kappa$ in [0, 1] (with a step $=$ 0.05).

For this configuration of MAPK-means and for each dataset, we keep the best value of $\kappa$ that corresponds to the best average of the NMI scores over all values of $\kappa\in$ [0, 1]. Table 2 presents the values of the best $\kappa$ for each dataset obtained with ${\bf W}^{*}={\bf W^{\textit{Rel}}}$ .

The results show that for most of the datasets (7 out of 12 in Table 2), the best confidence value lies in [0.25, 0.75], which means that the best quality of clustering is obtained with a compromise between the clustering guided by the relevant user preferences ( ${\bf W}^{*}={\bf W^{\textit{Rel}}}$ ) and the solution guided by the data (expressed only by the intra-cluster distance). We can also see that the average value of the best $\kappa$ over all datasets (0.61) is closer to 1 than to 0, which supports the conclusion of Section 5.3.1 on the positive impact of relevant preferences.

Summary Our experiments show that, in general, the best clustering solution is neither a completely data-driven solution nor a fully user-driven solution, even when the preferences are relevant, but a compromise solution (i.e. $\kappa\in]0,1[$ ). However, it is important to note that a user-driven solution is more subjective than a data-driven solution. Therefore, the adoption of one of these solutions (data-driven or user-driven) may depend on the user's knowledge about the data and on his/her interest in exploring new solutions. In general, a user-driven clustering will be more suitable when the user has prior knowledge of the dataset and a high confidence in his/her preferences. Conversely, a data-driven solution can be used to obtain clusterings with good quality, fitting the data structure, independently of the user's knowledge. Thus, if a user is interested in exploring new clustering solutions, he/she can opt for a solution relying more on the data. Otherwise, he/she can favor the construction of clusterings relying completely on his/her preferences.

Table 2
Variation of the best value of $\kappa$ corresponding to the best average of the NMI scores, with ${\bf W}^{*}={\bf W^{\textit{Rel}}}$

Data set	$\kappa_{\textit{best}}$ with ${\bf W}^{*}={\bf W^{\textit{Rel}}}$
Abalone	1
Glass	0.55
Ionosphere	1
Iris	1
Optdigits	0.55
Pendigits	0.45
Pgblocks	1
Pima	0.7
Vowel	0.5
Waveform	0.25
Wdbc	0.3
Wine	0
	AVG $=$ 0.61
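
As a sanity check of the protocol described above, the sketch below rebuilds the grid of confidence values (step 0.05), picks the value with the best average NMI, and recomputes the average of the $\kappa_{\textit{best}}$ column of Table 2. The helper names `kappa_grid` and `select_best_kappa` are hypothetical, and the NMI values passed to `select_best_kappa` in the usage line are made up for illustration only; the $\kappa_{\textit{best}}$ values are copied from Table 2.

```python
from statistics import mean

# Best kappa per dataset, copied from Table 2.
KAPPA_BEST = {
    "Abalone": 1.0, "Glass": 0.55, "Ionosphere": 1.0, "Iris": 1.0,
    "Optdigits": 0.55, "Pendigits": 0.45, "Pgblocks": 1.0, "Pima": 0.7,
    "Vowel": 0.5, "Waveform": 0.25, "Wdbc": 0.3, "Wine": 0.0,
}


def kappa_grid(step=0.05):
    """Return the confidence values 0, step, 2*step, ..., 1."""
    n = round(1.0 / step)
    # Rounding removes floating-point noise such as 0.15000000000000002.
    return [round(i * step, 10) for i in range(n + 1)]


def select_best_kappa(mean_nmi_by_kappa):
    """Return the kappa whose average NMI is the highest (first one on ties)."""
    return max(mean_nmi_by_kappa, key=mean_nmi_by_kappa.get)


average_kappa_best = mean(list(KAPPA_BEST.values()))
print(f"average kappa_best = {average_kappa_best:.2f}")

# Illustrative values only, not results from the paper.
print(select_best_kappa({0.0: 0.41, 0.5: 0.58, 1.0: 0.52}))
```

The recomputed average agrees with the value AVG $=$ 0.61 reported in Table 2.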

5.4 Comparative experiments

We now compare the performance of MAPK-means with that of the CFP algorithm introduced in [60]. Note that, unlike our approach, CFP uses a qualitative model of preferences on attributes rather than a quantitative model using weights on attributes. However, similarly to our proposal, [60] learns a metric parametrized by an attribute weight vector, i.e. the most appropriate weight vector with respect to the dataset and the user preferences.

Experimental setting

For the purpose of this experiment, we replicate the same protocol as [60]: we first compute a weight vector $\tilde{W}$ based on the intra-cluster distortion and we use it to initialize our preference vector ${\bf W}^{*}$ , i.e. ${\bf W}^{*}$ = $\tilde{W}$ . Then, similarly to [60], in order to select the best clustering over different runs and different values of $\kappa\in[0,1]$ , we consider the one that minimizes the value of our objective function (see Eq. (6)). Finally, we set $\alpha=$ 0.5 and run 100 tests to ensure the significance of the results.
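
The selection step of this protocol can be sketched as follows. This is only an illustration: `run_once` is a hypothetical callback standing for one run of the clustering algorithm that returns the value of the objective function and the cluster labels, and `toy_run` is a stand-in with a made-up objective, not MAPK-means. We also assume here that the runs are repeated for every value of $\kappa$, which the text does not state explicitly.

```python
import random


def select_by_objective(run_once, kappas, n_runs, seed=0):
    """Return (objective, kappa, labels) of the run with the lowest objective.

    run_once(kappa, rng) is a hypothetical callback performing one run of the
    clustering algorithm and returning (objective_value, labels).
    """
    rng = random.Random(seed)
    best = None
    for kappa in kappas:
        for _ in range(n_runs):
            objective, labels = run_once(kappa, rng)
            if best is None or objective < best[0]:
                best = (objective, kappa, labels)
    return best


def toy_run(kappa, rng):
    """Stand-in for one clustering run; the objective is invented."""
    objective = (kappa - 0.7) ** 2 + 0.01 * rng.random()
    return objective, [0, 1, 0, 1]


objective, kappa_best, labels = select_by_objective(
    toy_run, kappas=[0.0, 0.35, 0.7, 1.0], n_runs=5
)
print(kappa_best)
```

Selecting on the objective function rather than on the NMI keeps the protocol unsupervised: the ground truth is used only afterwards, to score the selected clustering.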

Results The clustering results on all the datasets are shown in Table 3. This table compares the clustering results in terms of NMI of the algorithms K-means, K-means with a weighted distance (WK-means), CFP, for which we present only the best result obtained in [60] (using different values of their parameters $m$ ) and finally MAPK-means for which we provide the best result obtained while $\kappa\in[0,1]$ . More precisely, we show the value of NMI of the clustering that minimizes the objective function (see Eq. (6)) while varying the confidence $\kappa$ . We also give the associated value of the confidence parameter $\kappa_{\textit{best}}$ .

Table 3
NMI values of clusterings obtained using K-means, K-means with a weighted distance, CFP [60] and our algorithm MAPK-means

Data set	$\overline{D}_{\textit{KL}}({\tilde{{\bf W}}}\|{{\bf U}})$	K-means (NMI)	WK-means (NMI)	CFP (NMI)	MAPK-means with $\alpha=$ 0.5 and $\kappa\in[0,1]$ ( $\max(\textit{NMI})$ )	MAPK-means ( $\kappa_{\textit{best}}$ )
Iris	0.310	0.742	0.864	0.864	0.864	1
Optdigits	0.009	0.756	0.754	0.715	0.719	0.95
Pendigits	0.014	0.682	0.716	0.707	0.735	0.8
Pgblocks	0.005	0.152	0.145	0.204	0.204	0.7
Vowel	0.039	0.386	0.368	0.424	0.457	1
Wdbc	0.018	0.623	0.667	0.628	0.677	0.75

As can be seen in Table 3, the best NMI values obtained with CFP and MAPK-means are very similar on the Iris and Pgblocks datasets. On the Optdigits dataset, K-means gives the best NMI value; however, our algorithm MAPK-means outperforms CFP. For this dataset, we notice that the divergence $\overline{D}_{\textit{KL}}({\tilde{W}}\|{{\bf U}})$ is very small (0.009), so $\tilde{W}$ is almost uniform. However, the NMI score obtained with WK-means is also higher than those of CFP and MAPK-means, so we assume that the correction of the metric (see Eq. (10)) due to learning is less efficient with a high number of attributes (62 attributes in the case of Optdigits).

For all other datasets, the quality of the clusterings produced by MAPK-means is better than the quality of the clusterings produced by CFP, MPCK-means without instance constraints (i.e. MAPK-means with $\kappa=$ 0) and a basic K-means. Finally, our experimental results show that even a K-means whose metric is set with the relevant weights cannot compete with MAPK-means.

Summary The results show that our algorithm MAPK-means competes with existing approaches in the literature. Indeed, for most of the datasets, MAPK-means achieves a better clustering quality.

6. Real use case application

As motivated at the beginning of this paper, the initial purpose of our algorithm MAPK-means is to provide a solid foundation for a new interactive tool for clustering marketing-related datasets. For this purpose, a new software prototype has been developed from scratch in collaboration with the French company Group Up to help marketing experts in the time-consuming task of customer segmentation. Compared to the UCI benchmark datasets used in the previous section, it is important to emphasize that in the context of our real use case, alternative interesting clusterings exist, according to the specific needs of the experts. This means that different reference clusterings exist for the same dataset and that expressing preferences is likely to provide distinguishable results in this specific context.

6.1 Dataset and reference clusterings

The experiments presented in this section consider a dataset POS of 1,282 Points Of Sales. This dataset represents the quantities sold or returned for the top-3 profitable products of a French company in the year 2016. More precisely, for each product $P_{i}$ ( $i\in\{1,2,3\}$ ), the descriptive attributes $P_{i}^{+}$ and $P_{i}^{-}$ denote the sold quantity and the returned quantity of the product $P_{i}$ , respectively.

6.1.1 Reference partitions construction

In the context of our business use case, different categories of marketing experts can be identified: (i) some analysts are particularly interested in the quantities sold and returned for one specific product, (ii) other analysts are only interested in the quantities sold for all products, (iii) finally, other analysts prefer to use only the quantities returned for all the products.

Following these categories of experts, we built 5 different reference clusterings that represent typical results that can be exploited by the company. These clusterings are built using MAPK-means with $K=$ 3 and $\kappa=$ 0, without any user preferences. In this case, clusterings are built by learning the metrics guided by the data. We define our 5 reference partitions, by restricting the starting attributes that are actually used to build the partition, as follows:

• ${\zeta}_{P_{i}}^{*}$ ( $i\in\{1,2,3\}$ ) denotes the clustering obtained taking into account only the 2 descriptive attributes $P_{i}^{+}$ and $P_{i}^{-}$ . These $3$ clusterings consider only the sold and returned quantities of one specific product.
• ${\zeta}_{P_{S}}^{*}$ denotes the clustering obtained using only the 3 descriptive attributes that represent sold quantities, i.e. $P_{1}^{+}$ , $P_{2}^{+}$ and $P_{3}^{+}$ .
• ${\zeta}_{P_{R}}^{*}$ denotes the clustering obtained using only the 3 descriptive attributes that represent returned quantities, i.e. $P_{1}^{-}$ , $P_{2}^{-}$ and $P_{3}^{-}$ .

Table 4 summarizes how those reference partitions were built.

Table 4
The attributes used to build the reference partitions (1: used, 0: not used)

Reference partition	$P_{1}^{+}$	$P_{1}^{-}$	$P_{2}^{+}$	$P_{2}^{-}$	$P_{3}^{+}$	$P_{3}^{-}$
${\zeta}_{P_{1}}^{*}$	1	1	0	0	0	0
${\zeta}_{P_{2}}^{*}$	0	0	1	1	0	0
${\zeta}_{P_{3}}^{*}$	0	0	0	0	1	1
${\zeta}_{P_{S}}^{*}$	1	0	1	0	1	0
${\zeta}_{P_{R}}^{*}$	0	1	0	1	0	1

In the following experiments, these 5 clusterings are used as ground truth clusterings. In order to evaluate if these 5 clusterings are really different or not, we compute the NMI scores between each pair of clusterings. The average value of the NMI scores is equal to 0.283 ( $\pm$ 0.099). This confirms that the sales profiles are not the same for the different products, and not the same if we consider the quantities sold or returned.
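
The pairwise comparison described above can be sketched in a few lines of plain Python. The sketch assumes that the NMI is normalized by the arithmetic mean of the two entropies and that the reported deviation is the population standard deviation over the pairs; the paper may use another normalization (for instance the geometric mean) or the sample standard deviation. The function names are hypothetical and the toy partitions are illustrative, not the POS data.

```python
import math
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev


def entropy(labels):
    """Shannon entropy (natural log) of a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(a, b):
    """Normalized mutual information between two labelings of the same items.

    Assumption: normalization by the arithmetic mean of the two entropies.
    """
    if len(a) != len(b):
        raise ValueError("labelings must have the same length")
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    mi = sum(
        (n_xy / n) * math.log(n_xy * n / (count_a[x] * count_b[y]))
        for (x, y), n_xy in joint.items()
    )
    denom = (entropy(a) + entropy(b)) / 2
    # Two single-cluster labelings are treated as identical.
    return mi / denom if denom > 0 else 1.0


def pairwise_nmi_summary(partitions):
    """Return (mean, standard deviation) of the NMI over all pairs."""
    scores = [nmi(p, q) for p, q in combinations(partitions, 2)]
    return mean(scores), pstdev(scores)


toy_partitions = [
    [0, 0, 1, 1, 2, 2],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 1],
]
print(pairwise_nmi_summary(toy_partitions))
```

A low average, as reported above, indicates that the reference partitions describe genuinely different segmentations of the same points of sale.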
6.2 Experimental setting

In our experiments, we consider analysts that put a high preference value on their attributes of interest, and lower preference values on the other descriptive attributes. More precisely, we consider preference vectors such that the sum of the preference weights on the attributes of interest represents 90% of the total sum, whereas the sum of the preference weights on the other attributes represents only 10% of the total sum. The preference vectors that we consider in our experiments are detailed in Table 5.

Table 5
Analysts’ preference vectors (with strong values on the preferred attributes)

Weight	$P_{1}^{+}$	$P_{1}^{-}$	$P_{2}^{+}$	$P_{2}^{-}$	$P_{3}^{+}$	$P_{3}^{-}$
$W_{P_{1}}$	90%		5%		5%
$W_{P_{2}}$	5%		90%		5%
$W_{P_{3}}$	5%		5%		90%
$W_{P_{S}}$	30%	3.33%	30%	3.33%	30%	3.33%
$W_{P_{R}}$	3.33%	30%	3.33%	30%	3.33%	30%
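
The construction of these vectors can be illustrated with a short sketch. It assumes that the 90% and 10% masses are each spread uniformly over the attributes of their group, which reproduces the rows $W_{P_{S}}$ and $W_{P_{R}}$ exactly; for $W_{P_{1}}$ to $W_{P_{3}}$, Table 5 only gives one value per product, read here as the combined weight of $P_{i}^{+}$ and $P_{i}^{-}$, so the equal split within each pair is an assumption. The function name and the attribute labels are hypothetical.

```python
ATTRIBUTES = ["P1+", "P1-", "P2+", "P2-", "P3+", "P3-"]


def preference_vector(attributes, preferred, preferred_mass=0.9):
    """Give preferred_mass to the preferred attributes and the rest to the others.

    Assumption: each mass is spread uniformly inside its group.
    """
    preferred = set(preferred)
    unknown = preferred - set(attributes)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    others = [a for a in attributes if a not in preferred]
    if not preferred or not others:
        raise ValueError("both groups must be non-empty")
    w_preferred = preferred_mass / len(preferred)
    w_other = (1.0 - preferred_mass) / len(others)
    return {a: (w_preferred if a in preferred else w_other) for a in attributes}


# Analyst interested in the sold quantities of all products (row W_{P_S}).
w_sold = preference_vector(ATTRIBUTES, ["P1+", "P2+", "P3+"])
# Analyst interested in product P1 only (row W_{P_1}).
w_p1 = preference_vector(ATTRIBUTES, ["P1+", "P1-"])

for name, w in (("W_PS", w_sold), ("W_P1", w_p1)):
    print(name, {a: round(100 * v, 2) for a, v in w.items()})
```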

For each preference vector presented in Table 5, we perform 100 runs of MAPK-means taking $\alpha=$ 0.5 and $\kappa\in\{0,0.5,1\}$ . Then, we compute the averages and standard deviations of the NMI scores comparing the clusterings built by our algorithm MAPK-means and the ground truth partitions introduced in Section 6.1. The results of these experiments are presented in Fig. 5 and discussed in the next section.

6.3 Results

Figure 5.

Variation of the average of the NMI scores between the obtained partitions and the different ground truth partitions.

We first notice that when $\kappa=$ 0, we always obtain the same set of NMI scores. Indeed, when $\kappa=$ 0, the user preferences are not taken into account, and the clustering obtained is only data-driven. This clustering is the same whatever the user preferences or specific needs. By contrast, when $\kappa>0$ , the solutions built by MAPK-means are very different from each other, and represent a compromise between a data-driven and a user-driven solution. More precisely, we can see that when the user’s preference vector is set to ${\bf W}^{*}=W_{X}$ (with $X\in\{P_{1},P_{2},P_{3},P_{S},P_{R}\}$ ), MAPK-means always converges to a solution similar to the ground truth clustering ${\zeta}_{X}^{*}$ .

We can also see that when $\kappa=$ 1, the average of the NMI scores between the clustering obtained using MAPK-means with $W_{X}$ and the ground truth clustering ${\zeta}_{X}^{*}$ is approximately 0.6. Furthermore, the difference between this value and all the other values obtained for ${\zeta}_{Y}^{*}$ (with $Y\neq X$ ) is always greater than 0.2. This difference remains large even for $\kappa=$ 0.5. On the one hand, this shows that compromise solutions are exploitable. On the other hand, it points out that the expert does not need to be extremely precise in his/her choice of preferences, or in the evaluation of his/her level of confidence, to find the alternative partition he/she is looking for.

6.3.1 Summary

This use case illustrates the importance of a method that integrates expert knowledge to identify the customer segmentation that fits the expert's needs. The results demonstrate that our approach makes it possible to guide the clustering exploration process efficiently.

Finally, we note that without preferences, the customer segmentation would be a compromise between several segmentations and may not be representative of a real need, losing the ease of interpretation gained from the expression of preferences. Moreover, note that we intentionally reported an experiment with only 3 products for the sake of simplicity. Of course, any increase in the number of products would result in an increase in the number of attribute combinations and in the number of potential customer segmentations, which emphasizes the interest of our approach.

7. Conclusion

We propose in this paper a semi-supervised clustering method that allows the user to express preferences on attributes, relying on the learning of a distance metric. We present an extensive literature survey on semi-supervised clustering, focusing on the types of constraints and their resolution approaches. We also outline the sensitivity of the methods to the quality of the constraints. Then, we formulate the user preferences as a simple vector which is taken into account in the objective function, in conjunction with a confidence parameter used to handle the sensitivity problem. This parameter allows the user to express his/her degree of confidence in his/her choices. We demonstrate that this quantitative model of preferences leads to an efficient metric learning step iterated by our algorithm MAPK-means. We also prove that MAPK-means has several interesting properties such as a convergence guarantee and a lower time complexity compared to the state of the art.

Furthermore, the extensive experimental results illustrate the importance and the good performance of our approach on datasets from the UCI benchmark:

• We first show the simplicity of parameter setting of MAPK-means, where the parameter $\alpha$ has 0.5 as a good default value.
• We illustrate the positive impact of user preferences on clustering quality and on helping the method find the right alternative clustering for the user.
• We also observe that when the confidence level expressed by the end user is not too high ( $\kappa\leqslant$ 0.5), MAPK-means is able to partially correct the clustering suggested by erroneous preferences.
• We demonstrate that the clustering with best quality is generally a compromise between a user-driven solution and a data-driven solution.
• We finally show that MAPK-means generally performs better than other algorithms of the literature on UCI benchmarks.

Interestingly, the experiment on the use case illustrates perfectly the need for our proposal in the context of Customer Relationship Management where several distinct customer segmentations coexist. Attribute preferences in this case can effectively guide the exploration of alternative segmentations.

Future work will aim to integrate our algorithm MAPK-means into an exploratory system using OLAP operations to navigate between sets of attributes. First, it would be pertinent to integrate classical operations such as rollup and drilldown so that the user can navigate between different alternative clusterings. More ambitiously, we would like to deduce the preference vector from the operations performed by the user. For now, the learned vector is used in a descriptive way to explain which attributes are significant to construct the K-partition. This vector could also be used in a prescriptive way to recommend alternative subspaces for the user to explore. Finally, we think that the use of constraints at the attribute level offers important perspectives to guarantee other relevant properties on the partition. For example, our approach could be extended to ensure that all clusters have the same distribution for a given attribute (for instance, each customer segment has the same proportion of men and women).

Acknowledgments

We would like to thank KALIDEA – Groupe UP and the National Association of Research and Technology (ANRT) for financial support as part of the thesis (2014/0658).

References

Aggarwal

C.C.

and Reddy

C.K.

, Data clustering: algorithms and applications, CRC press, 2013.

Agrawal

Gehrke

Gunopulos

and Raghavan

, Automatic subspace clustering of high dimensional data for data mining applications, SIGMOD Rec 27(2) (June 1998), 94–105.

Alelyani

Tang

and Liu

, Feature selection for clustering: A review, In Data Clustering: Algorithms and Applications, Chapman and Hall/CRC, 2013, pp. 29–60.

Alexander

and Ghosh

, Cluster ensembles – a knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research 3 (2003), 583–617.

Antoine

and Labroche

, Semi-supervised fuzzy c-means variants: A study on noisy label supervision, In Information Processing and Management of Uncertainty, volume 2, 2018, pp. 51–62.

Antoine

Labroche

and Vu

V.-V.

, Evidential seed-based semi-supervised clustering, In Proc. of SCIS-ISIS, 2014.

Arthur

and Vassilvitskii

, k-means++: The advantages of careful seeding, In Proc. Symp. Discrete Algorithms, 2007, pp. 1027–1035.

Bae

and Bailey

, Coala: A novel approach for the extraction of an alternate clustering of high quality and high dissimilarity, In Proceedings of the Sixth International Conference on Data Mining, ICDM ’06, 2006, pp. 53–62.

Banerjee

and Ghosh

, Scalable clustering algorithms with balancing constraints, Data Mining and Knowledge Discovery 13(3) (2006), 365–395.

10.

Banerjee

and Ghosh

, Clustering with balancing constraints, In Constrained Clustering, Chapman and Hall/CRC, 08 2008, pp. 171–200.

11.

Basu

Banerjee

and Mooney

R.J.

, Semi-supervised clustering by seeding, In Proc. of the 19th ICML, ICML ’02, Morgan Kaufmann Publishers Inc., 2002, pp. 27–34.

12.

Basu

Banerjee

and Mooney

R.J.

, Active semi-supervision for pairwise constrained clustering, In Proc. of the 2004 SIAM Inter, Conference on Data Mining, 2004, pp. 333–344.

13.

Bilenko

Basu

and Mooney

R.J.

, Integrating constraints and metric learning in semi-supervised clustering, In In Proc. of the 21st ICML, ACM, 2004, p. 11.

14.

Bradley

P.S.

Bennett

K.P.

and Bhattacharya

, Using assignment constraints to avoid empty clusters in k-means clustering, In Constrained Clustering, Chapman and Hall/CRC, 08 2008, pp. 201–220.

15.

Bradley

P.S.

Bennett

K.P.

and Demiriz

, Constrained k-means clustering, Technical Report MSR-TR-2000-65, Microsoft Research, 5 2000.

16.

Charrad

and Ben Ahmed

, Simultaneous clustering: A survey, In Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), volume 6744 LNCS, Springer, Berlin, Heidelberg, 2011, pp. 370–375.

17.

Chau

D.H.

Vreeken

van Leeuwen

and Faloutsos

, editors, IDEA ’13: Proceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics, New York, NY, USA, 2013, ACM.

18.

Covoes

T.F.

Hruschka

E.R.

and Ghosh

, A study of k-means-based algorithms for constrained clustering, Intelligent Data Analysis 17(3) (2013), 485–505.

19.

Dang

X.-H.

and Bailey

, A hierarchical information theoretic technique for the discovery of non linear alternative clusterings, In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’10, 2010, pp. 573–582.

20.

Dao

T.B.H.

Vrain

Duong

K.C.

and Davidson

, A Framework for Actionable Clustering using Constraint Programming, In 22nd ECAI, Aug. 2016.

21.

Dash

Choi

Scheuermann

and Liu

, Feature selection for clustering – a filter solution, In Proc. of the 2nd ICDM, 2002, pp. 115–122.

22.

David Kincaid

E.W.C.

and Kincaid

D.R.

, Numerical Analysis: Mathematics of Scientific Computing, American Mathematical Soc., 2009.

23.

Davidson

and Basu

, A survey of clustering with instance level constraints, ACM Transactions on Knowledge Discovery from data, 2007, pp. 1–41.

24.

Davidson

and Qi

, Finding alternative clusterings using constraints, In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM ’08, 2008, pp. 773–778.

25.

Davidson

and Ravi

, Clustering with constraints: Feasibility issues and the k-means algorithm, In Proceedings of the 2005 SIAM international conference on data mining, SIAM, 2005, pp. 138–149.

26.

Davidson

Wagstaff

K.L.

and Basu

, Measuring constraint-set utility for partitional clustering algorithms, In Proc. of the 10th ECML PKDD, Springer-Verlag, 2006, pp. 115–126.

27.

Deng

Choi

K.-S.

Jiang

Wang

and Wang

, A survey on soft subspace clustering, Inf Sci 348(C) (June 2016), 84–106.

28.

Dubey

Bhattacharya

and Godbole

, A cluster-level semi-supervision model for interactive clustering, In ECML PKDD, Berlin, Heidelberg, 2010, pp. 409–424.

29.

El Moussawi

Cheriat

Giacometti

Labroche

and Soulet

, Clustering with quantitative user preferences on attributes, In 2016 IEEE 28th International Conference on Tools with Artificial Intelligence, Nov 2016, pp. 383–387.

30.

Ester

Kriegel

H.P.

Sander

and Xu

, A density-based algorithm for discovering clusters in large spatial databases with noise, AAAI Press, 1996, pp. 226–231.

31.

Ganganath

Cheng

C.T.

and Tse

C.K.

, Data clustering with cluster size constraints using a modified k-means algorithm, In 2014 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, 2014, pp. 158–161.

32.

Han

Pei

and Kamber

, Data mining: concepts and techniques, Elsevier, 2011.

33.

Hanczar

and Nadif

, Using the bagging approach for biclustering of gene expression data, Neurocomputing 74(10) (May 2011), 1595–1605.

34.

Höppner

and Klawonn

, Clustering with size constraints, In Computational Intelligence Paradigms, Springer, 2008, pp. 167–180.

35.

Huang

J.-J.

Tzeng

G.-H.

and Ong

C.-S.

, Marketing segmentation using support vector clustering, Expert Systems with Applications 32(2) (2007), 313–317.

36.

Hung

and Tsai

C.-F.

, Market segmentation based on hierarchical self-organizing map for markets of multimedia on demand, Expert Systems with Applications 34(1) (2008), 780–787.

37.

Jain

A.K.

, Data clustering: 50 years beyond k-means, Pattern Recognition Letters 31(8) (2010), 651–666.

38.

Jing

M.K.

and Huang

J.Z.

, An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data, IEEE Trans on Knowl and Data Eng 19(8) (Aug 2007), 1026–1041.

39.

Klawonn

and Höppner

, Equi-sized, homogeneous partitioning, In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, Springer, 2006, pp. 70–77.

40.

Kriegel

H.-P.

Kröger

and Zimek

, Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering and correlation clustering, ACM Trans Knowl Discov Data 3(1) (Mar. 2009), 1:1–1:58.

41.

Kriegel

H.-P.

and Zimek

, Subspace clustering, ensemble clustering, alternative clustering, multiview clustering: what can we learn from each other, In Proc. ACM SIGKDD Workshop MultiClust, 2010.

42.

Kullback

and Leibler

R.A.

, On information and sufficiency, The Annals of Mathematical Statistics 22(1) (1951), 79–86.

43.

Kumar

and Kummamuru

, Semisupervised clustering with metric learning using relative comparisons, IEEE Transactions on Knowledge and Data Engineering 20(4) (2008), 496–503.

44.

Kumar

and Minz

, Feature selection: A literature review, Smart CR 4(3) (2014), 211–229.

45.

Lampert

Dao

T.-B.-H.

Lafabregue

Serrette

Forestier

Crémilleux

Vrain

and Gançarski

, Constrained distance based clustering for time-series: a comparative and experimental study, Data Mining and Knowledge Discovery, May 2018.

46.

Lelis

and Sander

, Semi-supervised density-based clustering, In Proc. of the 9th IEEE ICDM, 2009, pp. 842–847.

47.

and Zhang

, Clustering with diversity, In International Colloquium on Automata, Languages, and Programming, Springer, 2010, pp. 188–200.

48.

Dong

and Hua

, Localized feature selection for clustering, Pattern Recognition Letters 29(1) (2008), 10–18.

49.

Liu

E.Y.

Guo

Zhang

Jojic

and Wang

, Metric learning from relative comparisons by minimizing squared residual, In Proc. IEEE 12th ICDM, 2012, pp. 978–983.

50.

Liu

E.Y.

Zhang

and Wang

, Clustering with relative constraints, In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2011, pp. 947–955.

51. Liu and Fu, Clustering with partition level side information, In IEEE ICDM, Nov 2015, pp. 877–882.

52. Lomax R.G. and Hahs-Vaughn D.L., Statistical concepts: A second course, Routledge, 2013.

53. MacQueen J.B., Some methods for classification and analysis of multivariate observations, In Cam L.M.L. and Neyman, editors, Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, 1967, pp. 281–297.

54. Murtagh and Contreras, Algorithms for hierarchical clustering: an overview, II, Wiley Interdiscip Rev Data Min Knowl Discov 7(6) (Nov 2017), e1219.

55. Okabe and Yamada, Clustering by learning constraints priorities, In 2013 IEEE 13th ICDM, Volume 0, IEEE Computer Society, 2012, pp. 1050–1055.

56. Parsons, Haque and Liu, Subspace clustering for high dimensional data: A review, SIGKDD Explor Newsl 6(1) (June 2004), 90–105.

57. Pei, Fern X.Z., Rosales and Tjahja T.V., Discriminative clustering with relative constraints, arXiv preprint arXiv:1501.00037, 2014.

58. and Davidson, A principled and flexible framework for finding alternative clusterings, In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, 2009, pp. 717–726.

59. Ruiz, Spiliopoulou and Menasalvas, Density-based semi-supervised clustering, Data Mining and Knowledge Discovery 21(3) (2010), 345–370.

60. Sun, Zhao, Xue, Shen and Shen, Clustering with feature order preferences, Intelligent Data Analysis 14 (2010), 479–495.

61. Tsai C.-F., Y.-H. and Lu Y.-H., Customer segmentation issues and strategies for an automobile dealership with two clustering techniques, Expert Sys: J Knowl Eng 32(1) (Feb. 2015), 65–76.

62. V.V., Labroche and Bouchon-Meunier, Boosting clustering by active constraint selection, In Proc. of the 2010 19th ECAI, 2010, pp. 297–302.

63. V.V., Labroche and Bouchon-Meunier, An efficient active constraint selection algorithm for clustering, In 2010 20th ICPR, 2010, pp. 2969–2972.

64. Wagstaff, Cardie, Rogers and Schroedl, Constrained k-means clustering with background knowledge, In Proc. of the 18th ICML, 2001, pp. 577–584.

65. Wagstaff K.L., When is constrained clustering beneficial and why, In AAAI, 2006.

66. Wagstaff K.L. and Cardie, Clustering with instance-level constraints, In Proc. of the 17th ICML, 2000, pp. 1103–1110.

67. Wang and Su, An improved K-Means clustering algorithm, In 2011 IEEE 3rd Int. Conf. Commun. Softw. Networks, IEEE, May 2011, pp. 44–46.

68. Wang and Li, Clustering with instance and attribute level side information, Int Journal of Computational Intelligence Systems 3(6) (2010), 770–785.

69. Wang and Davidson, Flexible constrained spectral clustering, In Proc. of KDD, 2010, pp. 563–572.

70. Xing E.P., A.Y., Jordan and Russell, Distance metric learning, with application to clustering with side-information, In Proc. of NIPS, 2002, pp. 505–512.

71. Zhu, Wang and Li, Data clustering with size constraints, Knowledge-Based Systems 23(8) (2010), 883–889.

