Collaborative DCA: An intelligent collective optimization scheme, and its application for clustering

Abstract

This paper deals with a new and efficient collective optimization approach based on DC (Difference of Convex functions) programming and DCA (DC Algorithm), powerful tools of nonconvex programming. Exploiting the efficiency and the flexibility of DCA, we develop the so-called collaborative DCA, in which diverse DCA-based algorithms cooperate in an effective way. Two versions of collaborative DCA are proposed and their applications to clustering, a fundamental problem in unsupervised learning, are studied. Numerical experiments are performed on several datasets. The comparison with the three DCA component algorithms shows that the collaborative DCA outperforms them in solution quality and achieves a good trade-off between solution quality and running time.

Keywords

Collective optimization, DC programming, DCA, Collaborative DCA, Clustering.

1 Introduction

Nonconvex (differentiable/nondifferentiable) programming and global optimization have seen dramatic developments around the world over the past three decades. The absence of convexity creates difficulties of all kinds, in particular the distinction between local and global minima and the nonexistence of verifiable characterizations of global solutions. Many current approaches for nonconvex programming are not generally effective on large-scale problems from real-world applications. A great challenge of nonconvex programming is to find efficient algorithms that achieve a compromise between solution quality and scalability in order to tackle large-scale problems.

DC (Difference of Convex functions) programming and DCA (DC Algorithms) constitute the backbone of nonconvex programming and global optimization. With more than thirty years of history, these efficient tools have been exploited by many researchers and practitioners around the world to model and solve nonconvex programs from many fields of applied sciences (see e.g. [13, 16–18] and references quoted therein). DC programming and DCA address the problem of minimizing a function f which is a difference of two convex functions on the whole space $ℝ^{p}$ or on a convex set $Δ \subset ℝ^{p}$. A standard DC program takes the form $α = \inf \{ f(x) := g(x) - h(x) : x \in ℝ^{p} \} \quad (P_{dc})$ where g, h are lower semicontinuous proper convex functions on $ℝ^{p}$. Such a function f is called a DC function, g - h a DC decomposition of f, and g and h the DC components of f. The main idea of DCA is simple: each iteration of DCA approximates the convex function h by its affine minorant defined by $y^{k} \in \partial h(x^{k})$ (∂h denotes the subdifferential of the convex function h), and solves the resulting convex program: $\begin{matrix} y^{k} & \in \partial h(x^{k}), \\ x^{k+1} & \in \arg\min_{x \in ℝ^{p}} \{ g(x) - h(x^{k}) - \langle x - x^{k}, y^{k} \rangle \}. \quad (P_{k}) \end{matrix}$
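As an illustration of the iteration above, here is a minimal Python sketch on a one-dimensional toy problem of our own choosing (it does not come from the cited works). We assume f(x) = x⁴/4 − x²/2 with the DC decomposition g(x) = x⁴/4 and h(x) = x²/2, so that y^k = x^k and the convex subproblem has the closed-form solution x^{k+1} = (y^k)^{1/3}. The function names are hypothetical.

```python
import math


def f(x):
    """Toy DC objective f = g - h with g(x) = x^4/4 and h(x) = x^2/2."""
    return 0.25 * x ** 4 - 0.5 * x ** 2


def dca_step(x):
    """One DCA iteration for the toy decomposition above."""
    y = x  # y^k = h'(x^k), since h(x) = x^2/2 is differentiable
    # x^{k+1} minimizes g(x) - y*x, i.e. solves x^3 = y (real cube root)
    return math.copysign(abs(y) ** (1.0 / 3.0), y)


def dca(x0, eps=1e-12, max_iter=200):
    """Iterate until two successive points are closer than eps."""
    x = x0
    values = [f(x)]
    for _ in range(max_iter):
        x_new = dca_step(x)
        values.append(f(x_new))
        done = abs(x_new - x) <= eps
        x = x_new
        if done:
            break
    return x, values


x_star, values = dca(0.1)
```

Starting from x⁰ = 0.1, the iterates increase towards the critical point x = 1 and the recorded objective values do not increase, which is the descent behaviour of DCA described in the text.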

The construction of DCA involves the DC components g and h but not the function f itself. Hence, for a DC program, each DC decomposition corresponds to a different version of DCA. Since a DC function f has an infinite number of DC decompositions, which have crucial impacts on the qualities (speed of convergence, robustness, efficiency, globality of computed solutions, etc.) of DCA, the search for a “good” DC decomposition is important from an algorithmic point of view. Exploiting the effect of different DC decompositions of the same DC function, as well as that of various equivalent DC formulations of the same problem, is a challenging issue in DC programming and DCA. For this purpose, we propose in this paper a collective optimization scheme named Collaborative DCA, or CoDCA for short. The idea is simple but attractive: at each running cycle of CoDCA we run in parallel some iterations (typically one iteration) of several DCA-based algorithms (called DCA components) from their best current solutions, and then exchange the obtained solutions between the component algorithms to update the best solution of each of them. CoDCA terminates when all DCA components have stopped. Such a collective intelligent optimization scheme strongly contributes to the transfer of knowledge and power from the individual to the collective.

Collective optimization has been investigated in the literature, for instance Particle Swarm Optimization (see the recent survey [19]) or the Bees Algorithm (see e.g. [15]). However, these approaches are completely different from our collaborative framework, in which diverse algorithms collaborate while being applied to different formulations of the same problem. A recent collaborative approach (in the same sense as our work) is metastorming [22], which makes several metaheuristic methods collaborate. Nonetheless, collaboration between iterative deterministic algorithms has never been considered, perhaps because there are very few efficient iterative deterministic algorithms for the same nonconvex problem. The idea of CoDCA is motivated by the efficiency and the flexibility of DCA, more precisely:

DCA is a descent method which often gives a global solution from a good starting point.

There are several DC formulations, and consequently several DCA schemes for the same DC problem.

CoDCA aims to exploit the efficiency of all DCA components at each running cycle to get a good starting point for each of them.

Various versions of the collaborative DCA can be investigated for several applications. The following questions are crucial for designing CoDCA:

Which DC decompositions of the same objective function, or which equivalent DC formulations of the considered problem, should be used?

How should the corresponding DCA schemes be designed?

For how long (or for how many iterations) should each DCA component be run in each running cycle?

How to define “the best current solution” for each DCA component?

These questions should be considered in the particular context of the considered problem so as to exploit its special structure.

In this paper, we propose two collaborative DCA schemes for hard clustering. Our work is motivated by the fact that there are some efficient, fast and scalable DCA-based algorithms for clustering. We will address all the above questions while designing CoDCA.

The paper is organized as follows. The generic collaborative DCA schemes are presented in Section 2, while collaborative DCA for clustering is discussed in Section 3. We start that section by introducing different formulations of the clustering problem and their corresponding DCA, and then indicate how CoDCA works for clustering. The computational experiments are reported in Section 4. Finally, Section 5 concludes the paper.

2 Collaborative DCA: a generic scheme

Consider now the optimization problem of the form $\min \{ f(x) : x \in Ω \},$ (1) where f is a DC function and Ω is a convex set, i.e. (1) is a standard DC program. Note that the idea of CoDCA can also be applied to general DC programs where Ω is a DC set (a set defined by DC constraints). Suppose that there are p DCA components, denoted DCA_i for i = 1, …, p, for solving the p equivalent DC formulations of Problem (1) that have the form $\min \{ f_{i}(x) : x \in Ω_{i} \}.$ (2)

The philosophy of the collaborative DCA is the following: starting with a suitable initial solution for each DCA component, each running cycle of CoDCA performs the following cooperative procedure:

execute in parallel a certain number of iterations of each DCA component;

exchange and update the best current solution to restart each DCA component in the next cycle.

The efficiency of each scheme depends mainly on the three following elements:

the efficiency of the DCA components: this is the most important element of CoDCA, since it determines the quality of the shared solution;

the number of iterations of each component algorithm at each step (i.e. the moment when information is exchanged);

the way the exchange of information is managed, more precisely, the way the best current solution is updated for restarting each DCA component.

By changing at least one of the above elements we get various CoDCA algorithms for problem (1).

We propose here two versions of CoDCA that differ from each other in the way “the best solution” is updated for each DCA component. In the first scheme, named CoDCA1, the same best solution is taken for all DCA component algorithms: using a common objective function, denoted f_C, to evaluate the solutions given by DCA_i, we define the “winner” at the end of the (k - 1)-th cycle as the one having the smallest value of f_C. More precisely, $x_{win}^{k} := \arg\min \{ f_{C}(x_{i}^{k}) : i = 1, \dots, p \},$ where $x_{i}^{k}$ denotes the solution given by DCA_i at the end of the (k - 1)-th cycle. In the second scheme, named CoDCA2, the best solution of each DCA component DCA_i is defined as the one having the smallest objective function value of the corresponding DC optimization problem (2), say $x_{i}^{k} := \arg\min \{ f_{i}(x_{j}^{k}) : j = 1, \dots, p \}.$ CoDCA2 promotes the acceleration of the convergence of each DCA component, while CoDCA1 may change the trajectory of a DCA component and may therefore avoid a premature convergence of DCA components to a bad solution.
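The two update rules can be sketched in a few lines of Python. This is only an illustrative sketch under our own assumptions, not the authors' implementation: solutions are scalars, every component is given as a function performing one DCA iteration, the components are run one after the other rather than in parallel, and the stopping test is a safeguarded relative decrease. The toy problem (f(x) = x⁴/4 − x²/2 with two hypothetical components `step_a` and `step_b`) is ours; `step_b` is the gradient-type step obtained from g(x) = (3/2)x², which is a valid DC decomposition on the interval |x| ≤ 2/√3 where the iterates stay.

```python
import math
from typing import Callable, List, Optional

Step = Callable[[float], float]        # one DCA iteration: x -> next x
Objective = Callable[[float], float]   # objective f_i of formulation i


def codca(steps: List[Step], objectives: List[Objective], x0: float,
          scheme: str = "CoDCA1", f_common: Optional[Objective] = None,
          iters_per_cycle: int = 1, eps: float = 1e-10,
          max_cycles: int = 1000) -> List[float]:
    """Run collaborative cycles; return the restart point of each component."""
    if scheme == "CoDCA1" and f_common is None:
        raise ValueError("CoDCA1 needs a common objective f_common")
    starts = [x0] * len(steps)
    for _ in range(max_cycles):
        outputs, converged = [], []
        # (a) run a few iterations of every component from its restart point
        for step, f_i, x in zip(steps, objectives, starts):
            y = x
            for _ in range(iters_per_cycle):
                y = step(y)
            outputs.append(y)
            # safeguarded relative decrease test (assumption of this sketch)
            converged.append(abs(f_i(x) - f_i(y)) <= eps * max(1.0, abs(f_i(y))))
        # (b) exchange: choose the restart point of every component
        if scheme == "CoDCA1":
            winner = min(outputs, key=f_common)   # same winner for everybody
            starts = [winner] * len(steps)
        else:  # CoDCA2: each component keeps the best point for its own f_i
            starts = [min(outputs, key=f_i) for f_i in objectives]
        if all(converged):
            break
    return starts


def f(x: float) -> float:
    return 0.25 * x ** 4 - 0.5 * x ** 2


def step_a(x: float) -> float:
    # DCA with g = x^4/4, h = x^2/2: solve x_next^3 = x
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


def step_b(x: float) -> float:
    # DCA with g = (3/2) x^2: gradient-type step x - f'(x)/3
    return x - (x ** 3 - x) / 3.0


sol1 = codca([step_a, step_b], [f, f], 0.5, scheme="CoDCA1", f_common=f)
sol2 = codca([step_a, step_b], [f, f], 0.5, scheme="CoDCA2")
```

In this toy run both components share the same objective, so the two schemes coincide; they differ as soon as the formulations f_i differ, as in the clustering application.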

The two CoDCAs are described in Figure 1.

Fig. 1. Generic scheme of CoDCA.

3 Collaborative DCA for clustering

Clustering, which aims at dividing a dataset into groups or clusters containing similar data, is a fundamental problem in unsupervised learning and has many applications in various domains. In recent years, there has been significant interest in developing clustering algorithms for massive datasets (see e.g. [1–5, 20] and the references therein).

The general term “clustering” covers many different types of problems. All consist of subdividing a dataset into groups of similar elements, but there are many measures of similarity, many ways of measuring, and various concepts of subdivision (see [6, 7] for a survey). DCA has been extensively investigated for various kinds of clustering problems: hard/fuzzy/partitional/hierarchical/weighted clustering (refer to [14] for a short review). In the sequel, we will use the term “clustering” to indicate the hard partitional clustering.

An instance of the clustering problem consists of a dataset $X := \{x_{1}, \dots, x_{n}\}$ of n entities in $ℝ^{d}$, a distance measure, and an integer k with 2 ≤ k ≤ n; we are to choose k centroids (or centers) $v_{l}$ (l = 1, …, k) and assign each member of $X$ to its closest centroid. Among the many criteria used for cluster analysis, the minimum sum-of-squares clustering (MSSC for short) is one of the most popular since it expresses both homogeneity and separation. MSSC consists in partitioning the set $X$ into k clusters in order to minimize the sum of squared distances from the entities to the centroid of their cluster. Since k is given, one supposes that clusters should not be empty. This problem may be formulated mathematically in several ways, which suggest different possible algorithms. In this paper, we consider the two most widely used models, namely a bilevel MSSC problem and a mixed integer program. The Gaussian kernel version of the bilevel MSSC is also considered.

3.1 DC formulations and DCA based algorithms for clustering

The formulation of MSSC based on bilevel programming was first introduced in [21]: $\min F_{B}(V) := \frac{1}{2} \sum_{i=1}^{n} \min_{l = 1, \dots, k} ‖ v_{l} - x_{i} ‖^{2}$ (3) $\text{s.t. } V \in ℝ^{d \times k},$ where ‖·‖ is the Euclidean norm and V is the (d × k)-matrix whose l-th column is $V_{:l} = v_{l} \in ℝ^{d}$, the center of the l-th cluster. This is a nonconvex nonsmooth optimization problem and is very hard to solve. While several heuristic methods based on the k-means algorithm have been proposed for solving (3), there are few deterministic approaches that directly address this bilevel problem.

The kernel version of (3) via the Gaussian kernel function $κ(x, y) := \exp\left(- \frac{‖ x - y ‖^{2}}{2 σ^{2}}\right)$ was introduced in [12]; it takes the form $\min F_{K}(V) := \sum_{i=1}^{n} \min_{l = 1, \dots, k} \frac{-2}{\exp\left(\frac{‖ v_{l} - x_{i} ‖^{2}}{2 σ^{2}}\right)}$ (4) $\text{s.t. } V \in ℝ^{d \times k}.$

Another equivalent formulation of MSSC, which is a DC integer problem, was proposed in [12]. Let $U = (u_{li}) \in ℝ^{k \times n}$ with l = 1, …, k and i = 1, …, n be the matrix defined by $u_{li} := \begin{cases} 1 & \text{if } x_{i} \text{ belongs to the } l\text{-th cluster}, \\ 0 & \text{otherwise}. \end{cases}$

Then a straightforward mixed integer formulation of MSSC is $\min F_{I}(U, V) := \sum_{i=1}^{n} \sum_{l=1}^{k} u_{li} ‖ v_{l} - x_{i} ‖^{2}$ (5) $\begin{matrix} \text{s.t. } \sum_{l=1}^{k} u_{li} = 1 & \forall i = 1, \dots, n, \\ u_{li} \in \{0, 1\} & \forall l = 1, \dots, k; \ i = 1, \dots, n. \end{matrix}$
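The link between the mixed integer formulation (5) and the bilevel formulation (3) can be made explicit with a short calculation. The following derivation is our own sketch of the standard argument and is not taken from [12].

```latex
% For fixed centers V, problem (5) separates over the points x_i:
% each column (u_{1i}, ..., u_{ki}) is binary and sums to 1,
% so it selects exactly one cluster for x_i, and the cheapest choice
% is the closest center.
\min_{U} F_{I}(U, V)
  = \sum_{i=1}^{n} \min_{l = 1, \dots, k} \| v_{l} - x_{i} \|^{2}
  = 2 \, F_{B}(V).
```

Hence minimizing F_I over (U, V) and minimizing F_B over V give the same optimal centers; the factor 2 only reflects the 1/2 in (3).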

The problem (5) was reformulated as a continuous optimization problem in [12] thanks to an exact penalty technique, where the resulting problem was proved to be a DC program.

DC programming and DCA were developed for (3) in Le Thi et al. [10] and for (4) as well as (5) in Le Thi et al. [12]. The proposed DCA schemes are original and very inexpensive because either DCA is explicitly computed [10, 12] or it amounts to computing, at each iteration, the projection of points onto a simplex and/or a ball and/or a box, all of which are available in explicit form [12]. In this work we investigate collaborative DCA schemes using these DCA-based algorithms as components. We briefly describe the DCA component algorithms below; the reader is referred to [10, 12] for more details about the DC formulations of these problems and the construction of DCA for them.

Algorithm 1 B-DCA: DCA for solving (3)

initialization: Choose an initial solution $V^{0} \in ℝ^{d \times k}$. Let ɛ > 0 be sufficiently small, t ← 0.

repeat

1. For i = 1, …, n, choose $j(i) \in \arg\min \{ ‖ v_{j}^{t} - x_{i} ‖^{2} : j = 1, \dots, k \}.$

2. For l = 1, …, k, compute $v_{l}^{t+1} = \left(1 - \frac{|\{1 \leq i \leq n : j(i) = l\}|}{n}\right) v_{l}^{t} + \frac{1}{n} \sum_{1 \leq i \leq n : j(i) = l} x_{i}.$

3. t ← t + 1

until $‖V^{t-1} - V^{t}‖ \leq ɛ$ or $F_{B}(V^{t-1}) - F_{B}(V^{t}) < ɛ$.
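Algorithm 1 translates almost line by line into code. The following pure-Python sketch is an illustration under our own assumptions (points and centers stored as lists of floats, hypothetical function names, an added `max_iter` safeguard); it is not the authors' C++ implementation.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def f_b(centers, points):
    """Objective F_B of problem (3)."""
    return 0.5 * sum(min(sq_dist(v, x) for v in centers) for x in points)


def b_dca_step(centers, points):
    """Steps 1 and 2 of Algorithm 1: assign points, then update centers."""
    n, k, d = len(points), len(centers), len(points[0])
    counts = [0] * k
    sums = [[0.0] * d for _ in range(k)]
    for x in points:
        # step 1: index j(i) of a closest center
        j = min(range(k), key=lambda l: sq_dist(centers[l], x))
        counts[j] += 1
        for c in range(d):
            sums[j][c] += x[c]
    # step 2: v_l <- (1 - |cluster l| / n) v_l + (1/n) * sum of its points
    return [[(1.0 - counts[l] / n) * centers[l][c] + sums[l][c] / n
             for c in range(d)] for l in range(k)]


def b_dca(centers, points, eps=1e-5, max_iter=10000):
    """Repeat the step until one of the two stopping tests of Algorithm 1 holds."""
    for _ in range(max_iter):
        new_centers = b_dca_step(centers, points)
        shift = math.sqrt(sum(sq_dist(a, b) for a, b in zip(centers, new_centers)))
        decrease = f_b(centers, points) - f_b(new_centers, points)
        centers = new_centers
        if shift <= eps or decrease < eps:
            break
    return centers
```

For instance, with the four points (0, 0), (0, 1), (10, 10), (10, 11) and initial centers (1, 1) and (9, 9), each update moves every center halfway towards the mean of its cluster, so the centers approach (0, 0.5) and (10, 10.5).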

Algorithm 2 GK-DCA: DCA for solving (4)

initializations: Let ɛ > 0 be small enough and V⁰ be given. Set t ← 0;

repeat

1. For i = 1, …, n, choose $l (i) \in arg min {∥ v_{l}^{t} - x_{i} ∥^{2} : l = 1, \dots, k} .$

2. Compute $V^{t+1}$ by setting: ∀l = 1, …, k, $v_{l}^{t+1} = v_{l}^{t} - \sum_{i = 1, \, l(i) = l}^{n} \frac{2 (v_{l}^{t} - x_{i})}{n ρ σ^{2}} \exp\left(- \frac{‖ v_{l}^{t} - x_{i} ‖^{2}}{2 σ^{2}}\right).$

3. t ← t + 1.

until $‖V^{t-1} - V^{t}‖ \leq ɛ$ or $F_{K}(V^{t-1}) - F_{K}(V^{t}) < ɛ$.
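One iteration of Algorithm 2 can be sketched in the same way. The parameter ρ appears in the update formula but is not defined in this text (see [12]); in the sketch below it is simply a positive constant supplied by the caller, and the value used in the usage example is our own choice for illustration only. Function names are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def f_k(centers, points, sigma):
    """Objective F_K of problem (4)."""
    return sum(min(-2.0 * math.exp(-sq_dist(v, x) / (2.0 * sigma ** 2))
                   for v in centers) for x in points)


def gk_dca_step(centers, points, sigma, rho):
    """Steps 1 and 2 of Algorithm 2 (rho: assumed user-supplied constant)."""
    n, k, d = len(points), len(centers), len(points[0])
    new_centers = [list(v) for v in centers]
    for x in points:
        # step 1: index l(i) of a closest center
        l = min(range(k), key=lambda j: sq_dist(centers[j], x))
        # step 2: contribution of x to the update of its closest center
        weight = (2.0 * math.exp(-sq_dist(centers[l], x) / (2.0 * sigma ** 2))
                  / (n * rho * sigma ** 2))
        for c in range(d):
            new_centers[l][c] -= weight * (centers[l][c] - x[c])
    return new_centers


# Usage on a tiny one-dimensional example (values chosen for illustration).
points = [[0.0], [1.0]]
old = [[3.0]]
new = gk_dca_step(old, points, sigma=1.0, rho=2.0)
```

In this example the single center moves from 3 towards the two data points and the value of F_K decreases.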

Remark 1. The algorithms B-DCA and GK-DCA require only elementary operations on vectors and can therefore handle large-scale clustering problems.

Let Δ_k be the (k - 1)-simplex in $ℝ^{k}$ defined by $Δ_{k} := \{ u \in [0, 1]^{k} : \sum_{l=1}^{k} u_{l} = 1 \}$ and let $C$ be the Euclidean ball centered at the origin with radius r in $ℝ^{d}$. The problem (5) can be rewritten as $\begin{matrix} \min F_{I}(U, V) \\ \text{s.t. } U = [U_{:1}, \dots, U_{:n}] \in Δ_{k}^{n} \cap \{0, 1\}^{k \times n}, \ V \in C^{k}. \end{matrix}$

Considering the penalty function P defined on $ℝ^{k \times n}$ by $P(U) := \sum_{l=1}^{k} \sum_{i=1}^{n} u_{li} (1 - u_{li})$ and the penalty parameter τ > 0 large enough, the corresponding continuous reformulation is written as $\min F_{I}^{τ}(U, V) := F_{I}(U, V) + τ P(U)$ (6) $\text{s.t. } (U, V) \in Δ_{k}^{n} \times C^{k}.$

Algorithm 3 IP-DCA: DCA for solving (6)

initialization: Choose the memberships U⁰ and the cluster centers V⁰. Let ɛ > 0 be sufficiently small, t ← 0.

repeat

1. Define $(Y^{t}, Z^{t})$ by setting: ∀i = 1, …, n; l = 1, …, k, $\begin{matrix} Y_{li}^{t} & = & (ρ - 2 ‖ v_{l}^{t} - x_{i} ‖^{2}) u_{li}^{t} - τ, \\ Z_{:l}^{t} & = & ρ v_{l}^{t} - 2 \sum_{i=1}^{n} (u_{li}^{t})^{2} (v_{l}^{t} - x_{i}). \end{matrix}$

2. Define $(U^{t+1}, V^{t+1})$ by setting: ∀i = 1, …, n; l = 1, …, k, $U_{:i}^{t+1} = \mathrm{Proj}_{Δ_{k}}\left(\frac{Y_{:i}^{t}}{ρ}\right), \quad v_{l}^{t+1} = \mathrm{Proj}_{C}\left(\frac{Z_{:l}^{t}}{ρ}\right) = \begin{cases} \frac{Z_{:l}^{t}}{ρ} & \text{if } ‖ Z_{:l}^{t} ‖ \leq ρ r, \\ \frac{r Z_{:l}^{t}}{‖ Z_{:l}^{t} ‖} & \text{otherwise}. \end{cases}$

3. t ← t + 1

until $‖(U^{t-1}, V^{t-1}) - (U^{t}, V^{t})‖ \leq ɛ$ or $F_{I}^{τ}(U^{t-1}, V^{t-1}) - F_{I}^{τ}(U^{t}, V^{t}) < ɛ$.

Remark 2. Each iteration of IP-DCA consists of computing projections of points onto a simplex [8] and/or onto a ball, all of which are computed explicitly. Hence IP-DCA does not require an iterative method for the convex subproblem at each iteration.
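To illustrate why these projections are cheap, here is a short Python sketch of both operations. The simplex projection uses the well-known sort-based procedure, which is not necessarily the variant described in [8]; the ball projection is the formula of step 2 of Algorithm 3 applied to a point z. Function names are hypothetical.

```python
import math


def project_simplex(y):
    """Euclidean projection of y onto {u >= 0, sum(u) = 1} (sort-based)."""
    u = sorted(y, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        t = (cumulative - 1.0) / j
        if uj - t > 0:
            theta = t  # keep the threshold of the last index passing the test
    return [max(yi - theta, 0.0) for yi in y]


def project_ball(z, r):
    """Euclidean projection of z onto the ball of radius r centered at 0."""
    norm = math.sqrt(sum(c * c for c in z))
    if norm <= r:
        return list(z)
    return [r * c / norm for c in z]
```

For example, `project_simplex([2.0, 0.0])` gives the vertex (1, 0), `project_simplex([0.5, 0.5, 0.5])` gives the barycenter (1/3, 1/3, 1/3), and `project_ball([3.0, 4.0], 1.0)` gives (0.6, 0.8). The simplex projection costs one sort per column of U, and the ball projection one norm per center.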

3.2 Collaborative DCA schemes for clustering

We now show how CoDCA is performed for clustering by indicating the parameters to be used in the CoDCA schemes. As we have three DCA components, p is equal to 3. The common objective function in CoDCA1 is taken as the clustering cost, say $f_{C} := F_{B}$. The functions f₁, f₂ and f₃ in this scheme are the objective functions of (3), (4) and (6) respectively: $f_{1} := F_{B}$, $f_{2} := F_{K}$ and $f_{3} := F_{I}^{τ}$. DCA_i, for i = 1, …, 3, are, respectively, Algorithm 1, Algorithm 2 and Algorithm 3. Note that when $f_{C} := F_{B}$ the behavior of DCA₁ is the same in the two schemes CoDCA1 and CoDCA2. CoDCA terminates when all DCA components converge, say $[f_{i}(x_{i}^{k-1}) - f_{i}(x_{i}^{k})] / f_{i}(x_{i}^{k}) \leq ɛ$.

From the convergence properties of DCA we have

Theorem 1. i) CoDCA generates the three sequences $\{V^{i,k}\}$, i = 1, 2, 3, and the sequence $\{U^{3,k}\}$ such that the sequences $\{f_{i}(V^{i,k})\}$, i = 1, 2, and $\{f_{3}(U^{3,k}, V^{3,k})\}$ are decreasing. ii) The sequences $\{V^{i,k}\}$, i = 1, 2 (resp. $\{(U^{3,k}, V^{3,k})\}$) converge (resp. converges) to a point $V^{*}$ (resp. $(U^{*}, V^{*})$) which is a critical point of problems (3) and (4), respectively (resp. of problem (6)).

4 Numerical Experiments

We have implemented the algorithms in the Visual C++ 2010 environment, with OpenMP enabled for parallel programming. Numerical experiments are conducted on an Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60 GHz with 32 GB of RAM.

4.1 Experiment setting and testing datasets

All the test datasets are described in Table 1.

Table 1
The description of the datasets

Dataset	#Element	#Attribute	#Class
PIMA	768	8	2
YEAST	1484	8	10
AND	3186	60	3
VOTE	435	16	2
LYMPHO	148	18	4
WAVE	5000	40	3
PAPILLON	23	4	4
IRIS	150	4	3
INOS	315	34	2
WINE	178	13	3
BREAST	683	9	2
STATLOG	4435	36	6
GLASS	214	10	6
COMP	3891	10	3
DIM32	1024	32	16
DIM64	1024	64	16
DIM128	1024	128	16

The parameters used in our experiments are set as follows: the penalty parameter τ in IP-DCA is set to 0.06, σ in GK-DCA is set to 5, and ɛ is equal to $10^{-5}$ in all algorithms. In CoDCA, the number $t_{i}^{k}$ of iterations of DCA_i in the k-th cycle is set to $t_{i}^{k} = 3$ for k ≤ 3 and $t_{i}^{k} = 1$ otherwise. The initial point of each DCA component is the same whether it runs independently or in parallel within CoDCA.

4.2 Experiment result and comments

Table 2 presents the clustering accuracy and the CPU time (measured in seconds) of the five algorithms compared: the three DCA components (B-DCA, GK-DCA and IP-DCA) when they run independently, and the two collaborative algorithms (CoDCA1 and CoDCA2).

Table 2
Accuracy (in percent) and CPU time of the comparative algorithms. Bold characters indicate the best results.

Name	Accuracy (%)					Time (seconds)
	B-DCA	GK-DCA	IP-DCA	CoDCA1	CoDCA2	B-DCA	GK-DCA	IP-DCA	CoDCA1	CoDCA2
PIMA	67.06	67.45	73.05	73.05	73.31	0.002	0.003	0.011	0.009	0.010
YEAST	42.18	42.18	40.03	49.93	50.07	0.174	0.030	0.513	0.634	0.662
AND	52.86	59.86	68.83	69.55	69.55	0.306	0.870	10.890	6.652	1.100
VOTE	86.67	86.67	89.42	89.66	89.66	0.002	0.003	0.007	0.007	0.007
LYMPHO	64.19	56.76	64.86	72.97	72.97	0.007	0.002	0.011	0.030	0.020
WAVE	77.12	67.9	56.48	77.14	77.82	0.079	0.077	1.620	0.352	0.437
PAPILLON	100	100	100	100	100	0.000	0.000	0.000	0.001	0.172
IRIS	96	96	95.33	96	96	0.000	0.001	0.003	0.001	0.003
INOS	70.94	71.23	73.79	74.93	74.93	0.002	0.006	0.061	0.061	0.093
WINE	92.7	91.01	91.57	93.82	93.82	0.001	0.001	0.002	0.005	0.083
BREAST	95.9	97.22	94.73	97.36	97.36	0.006	0.006	0.234	0.078	0.094
STATLOG	67.13	70.1	75.36	76.51	75.36	0.568	0.139	0.461	3.514	3.001
GLASS	72.04	74.77	74.77	76.64	77.57	0.005	0.003	0.014	0.015	0.028
COMP	97.61	96.47	89.95	97.61	97.61	0.014	0.024	0.088	0.080	0.171
DIM32	87.21	82.91	86.82	87.21	100.00	0.689	0.034	18.603	10.904	28.06
DIM64	87.30	81.05	85.94	93.65	100.00	0.302	0.015	20.765	3.032	8.42
DIM128	93.65	93.26	87.30	93.65	100.00	0.578	0.026	14.574	36.278	16.112
Average	79.44	78.52	79.31	83.51	85.06	0.16	0.07	3.99	3.63	3.44

From the numerical experiments we observe the following. Not surprisingly, the two collaborative DCAs improve the clustering accuracy of the three algorithms B-DCA, GK-DCA and IP-DCA: on every dataset each CoDCA is at least as accurate as the best DCA component, and on most datasets both are strictly more accurate. On a few datasets, e.g. IRIS (resp. COMP), the two CoDCAs only match the accuracy of B-DCA and GK-DCA (resp. B-DCA). In terms of speed, B-DCA and GK-DCA are, on average, the fastest algorithms, followed by the two CoDCAs; IP-DCA is the slowest. Clearly, CoDCA1 and CoDCA2 need more time than B-DCA and GK-DCA because of IP-DCA. Overall, the two CoDCAs achieve a good trade-off between accuracy and speed.

Regarding the two versions of CoDCA: in terms of accuracy, CoDCA2 outperforms CoDCA1 on 7/17 datasets and they give the same result on 9/17 datasets. Regarding speed, CoDCA2 is much faster than CoDCA1 on 5 datasets while the latter is slightly faster than the former on 9 datasets. On average, CoDCA2 is faster. Finally, we note that DCA-based algorithms give the best results in the case of spherical clusters, since the MSSC model is suitable for this type of data.

5 Conclusions

We have proposed an efficient collective optimization approach based on DCA for solving DC programs. The idea is to make several DCA versions, applied to different DC formulations of the considered problem, cooperate in order to exploit the power of each of them. Two collaborative DCA schemes were developed and applied to clustering, an important and hard topic of data mining. Numerical experiments were performed on several datasets. The results indicate the effectiveness of the cooperative procedures and their superiority over the component algorithms. This work opens the door to an attractive research direction in intelligent collective optimization: the effect of DC decomposition in the design of DCA should be exploited to develop collaborative DCA for several nonconvex optimization problems. Work in this direction is in progress.

Acknowledgment

This research is funded by Foundation for Science and Technology Development of Ton Duc Thang University (FOSTECT), website: http://fostect.tdtu.edu.vn, under Grant FOSTECT.2017.BR.10.

References

1. Arora and Kannan, Learning Mixtures of Arbitrary Gaussians, in: Proceedings of the Thirty-third Annual ACM Symposium on Theory of Computing, New York, NY, USA, (2001), pp. 247–257.

2. A.M. Bagirov, Modified global k-means algorithm for minimum sum-of-squares clustering problems, Pattern Recognition 41(10) (2008), 3192–3199.

3. M.J. Brusco, A repetitive branch-and-bound procedure for minimum within-cluster sum of squares partitioning, Psychometrika 71(2) (2006), 347–363.

4. Dhillon, Kogan and Nicholas, Feature Selection and Document Clustering, in: Survey of Text Mining, M.W. Berry, ed., Springer New York, 2004, pp. 73–100.

5. Filippone, Camastra, Masulli and Rovetta, A survey of kernel and spectral methods for clustering, Pattern Recognition 41(1) (2008), 176–190.

6. A.K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognition Letters 31 (2010), 651–666.

7. A.K. Jain, M.N. Murty and P.J. Flynn, Data Clustering: A Review, ACM Comput Surv 31(3) (1999), 264–323.

8. J.J. Júdice, Raydan, S.S. Rosa and S.A. Santos, On the solution of the symmetric eigenvalue complementarity problem by the spectral projected gradient algorithm, Numerical Algorithms 47(4) (2008), 391–407.

9. H.M. Le, H.A. Le Thi, Pham Dinh and V.N. Huynh, Block Clustering Based on Difference of Convex Functions (DC) Programming and DC Algorithms, Neural Comput 25(10) (2013), 2776–2807.

10. H.A. Le Thi, M.T. Belghiti and Pham Dinh, A new efficient algorithm based on DC programming and DCA for clustering, Journal of Global Optimization 37(4) (2007), 593–608.

11. H.A. Le Thi, H.M. Le and Pham Dinh, Fuzzy clustering based on nonconvex optimisation approaches using difference of convex (DC) functions algorithms, Advances in Data Analysis and Classification 1(2) (2007), 85–104.

12. H.A. Le Thi, H.M. Le and Pham Dinh, New and efficient DCA based algorithms for minimum sum-of-squares clustering, Pattern Recognition 47(1) (2014), 388–401.

13. H.A. Le Thi and Pham Dinh, The DC (Difference of Convex functions) Programming and DCA revisited with DC models of real world nonconvex optimization problems, Ann Oper Res 133(1–4) (2005), 23–48.

14. H.A. Le Thi and Pham Dinh, DC programming and DCA: Thirty years of developments, Mathematical Programming, Special Issue: DC Programming - Theory, Algorithms and Applications 169(1) (2018), 5–68.

15. Pham Dinh and Castellani, The Bees Algorithm: Modelling foraging behaviour to solve continuous optimization problems, Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science 223(12) (2009), 2919–2938.

16. Pham Dinh and H.A. Le Thi, Convex analysis approach to d.c. programming: Theory, Algorithms and Applications, Acta Mathematica Vietnamica 22(1) (1997), 289–355, dedicated to Professor Hoang Tuy on the occasion of his 70th birthday.

17. Pham Dinh and H.A. Le Thi, A D.C. Optimization Algorithm for Solving the Trust-Region Subproblem, SIAM Journal on Optimization 8(2) (1998), 476–505.

18. Pham Dinh and H.A. Le Thi, Recent Advances in DC Programming and DCA, in: Transactions on Computational Intelligence XIII, N.-T. Nguyen and H.A. Le Thi, eds, Lecture Notes in Computer Science, Vol. 8342, Springer Berlin Heidelberg, 2014, pp. 1–37.

19. Sengupta, Basak and R.A. Peters, Particle Swarm Optimization: A Survey of Historical and Recent Developments with Hybridization Perspectives, Machine Learning and Knowledge Extraction 1(1) (2018), 157–191.

20. B.K. Sriperumbudur, D.A. Torres and G.R.G. Lanckriet, Sparse Eigen Methods by D.C. Programming, in: Proceedings of the 24th International Conference on Machine Learning, ICML ’07, ACM, New York, NY, USA, 2007, pp. 831–838.

21. H.D. Vinod, Integer Programming and the Theory of Grouping, Journal of the American Statistical Association 64(326) (1969), 506–519.

22. Yagouni and H.A. Le Thi, A Collaborative Metaheuristic Optimization Scheme: Methodological Issues, in: Advanced Computational Methods for Knowledge Engineering, van Do, H.A. Le Thi and N.T. Nguyen, eds, Springer International Publishing, 2014, pp. 3–14.