Learning speed of supervised neural networks as similarity measurement in unsupervised cluster analysis

Abstract

Cluster analysis, or clustering, is one of the most important and widely used techniques for data exploration and knowledge discovery. It is concerned with partitioning a set of objects so that objects in the same group, called a cluster, are more similar to each other than to those in other clusters. However, obtaining clusters that exhibit high within-cluster similarity (homogeneity) and high between-cluster dissimilarity (heterogeneity) depends critically on the notion of similarity, which has not yet been clearly defined for clustering purposes. Distance and correlation are, respectively, the most important and commonly used mathematics-based and statistics-based similarity measurements in the clustering literature. In this paper, the learning speed of supervised neural networks is proposed as a novel intelligent similarity measurement for unsupervised clustering problems. In other words, the main aim of this paper is to answer the question of whether the convergence speed of different objects to a given target can be used for measuring similarity. Empirical results on simulated data sets indicate that the proposed measurement not only can be used as a similarity measurement in clustering tasks, but also can produce accurate results. In this way, for the first time and in contrast to the literature, it is demonstrated that a supervised model can be used for handling unsupervised tasks.

Keywords

Classification & clustering, similarity measurement, Artificial Neural Networks (ANNs), supervised and unsupervised processes, pattern recognition, data mining

1. Introduction

Cluster analysis, or clustering, is one of the basic data mining techniques for exploring the underlying structure of a given data set, and it is applied in a broad range of scientific disciplines to discover hidden knowledge and information. There are generally two main purposes for using cluster analysis: understanding and utility. For understanding, clustering has been extensively studied as a way to find conceptually meaningful groups of objects that share common characteristics, so that they can be better analyzed and described, in areas such as medicine, psychology, biology, society, climate, and business. Utility purposes of cluster analysis, such as summarization, comparison, and efficiently finding nearest neighbors, have also attracted a lot of research attention in the field of data mining and knowledge discovery. Clustering provides an abstraction from individual data objects to the clusters in which they reside. In addition, some clustering techniques characterize each cluster in terms of a cluster prototype, i.e., a data object that is representative of the other objects in the cluster. These prototypes can be applied as the basis for a number of data analysis or data preprocessing purposes. Therefore, in the context of utility, clustering is the study of techniques for finding the most representative cluster prototypes [1].

While it is easy to consider the idea of a data cluster on a rather informal basis, it is very difficult to give a formal and universal definition of a cluster. The definition of what constitutes a cluster is not well defined; and hence, not surprisingly, there are several different definitions for clustering in the literature [2]. However, several working definitions of a cluster that are commonly used are as follows:

1) Well-separated Cluster Definition: A cluster is a set of objects such that any object in a cluster is more similar to every other object in the cluster than to any object not in the cluster. Sometimes a threshold level is used to specify that all objects in a cluster must be sufficiently similar to one another.
2) Center-based Cluster Definition: A cluster is a set of objects such that an object in a cluster is more similar to the center (e.g. mean, median, etc.) of its cluster than to the center of any other cluster.
3) Contiguous Cluster Definition (Nearest neighbor or Transitive Clustering): A cluster is a set of objects such that an object in a cluster is more similar to one or more other objects in the cluster than to any object that is not in the cluster.
4) Density-based Definition: A cluster is a dense region of objects that is separated by low-density regions from other regions of high density. This definition is more often used when the clusters are irregular or intertwined, and when noise and outliers are present.
5) Similarity-based Cluster Definition: A cluster is a set of objects that are similar, while objects in other clusters are not similar to them. A variation on this is to define a cluster as a set of objects that together create a region with a uniform local property, e.g., density or shape.

Corresponding to the different notions of a cluster, different definitions of clustering have also been presented in the literature [3]. Commonly, a clustering algorithm aims to find the natural structures or relationships in an unlabeled data set. Clustering is a process that identifies and classifies objects (observations or variables) into one of an unspecified number of groups, called clusters, so that the clusters exhibit high within-cluster similarity (homogeneity) and high between-cluster dissimilarity (heterogeneity). Clustering is the technique of grouping a set of physical or abstract objects into different clusters, such that objects within a cluster are similar to one another and dissimilar to the objects in other clusters. A good clustering algorithm generates high-quality clusters with low inter-cluster similarity and high intra-cluster similarity. Cluster analysis is a process of partitioning a given set of inputs into natural groups (clusters) in such a way that each input can be assigned to each cluster with a certain degree of belongingness. Clustering groups objects based on the information found in the data describing the objects or their relationships. The goal is that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Similarity is the intersection point of all clustering definitions and is doubtlessly the most critical notion in the cluster analysis area. In addition, measuring the similarity is one of the most challenging problems in clustering and knowledge discovery tasks. Theoretically, if a measurement satisfies a number of requirements, it can be generally applied as similarity or proximity measurement. These requirements are as follows [4]:

1a) For dissimilarity: $p_{i,i}=0$ for all $i$. (Points are not different from themselves)
1b) For similarity: $p_{i,i}\geqslant\max_{j}p_{i,j}$ for all $i$. (Points are most similar to themselves)
2) $p_{i,j}=p_{j,i}$ for all $i, j$. (Symmetry)
3) $p_{i,j}\geqslant 0$ for all $i, j$. (Positivity)

If the proximity measure is real-valued and is a true metric in the mathematical sense, then the following two conditions also hold in addition to conditions 2 and 3:

4) $p_{i,j}=0$ only if $i=j$.
5) $p_{i,k}\leqslant p_{i,j}+p_{j,k}$ for all $i, j, k$. (Triangle inequality)

Dissimilarities that satisfy conditions 1–5 are called distances, while dissimilarity is a term commonly reserved for those that satisfy only conditions 1–3. Distance is the most important and widely used similarity measurement in clustering tasks. Some well-known distance functions include the Minkowski distances (Manhattan ($L_{1}$), Euclidean ($L_{2}$), and supremum ($L_{\infty}$)), the Canberra distance, the Czekanowski distance, and the cosine distance. Despite the advantages of distance as a similarity measurement, some caution is needed when using it in clustering problems. Many metrics are particularly sensitive to outliers, so outliers should be eliminated before clustering. Different metrics may also lead to different cluster solutions. Hence, it is advisable to use several measurements and compare the results with theoretical or known patterns.
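As a small illustration of conditions 1a and 2–5, the following sketch builds a proximity matrix from the Minkowski family and checks each condition numerically. The helper names (`minkowski`, `proximity_matrix`, `check_conditions`), the three sample points, and the tolerance are assumptions made for this example only; they do not come from the paper.

```python
import math
from itertools import product

def minkowski(x, y, r):
    """Minkowski distance of order r (r=1 Manhattan, r=2 Euclidean, r=inf supremum)."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)

def proximity_matrix(points, dist):
    """Full matrix p[i][j] of pairwise proximities."""
    return [[dist(a, b) for b in points] for a in points]

def check_conditions(p, tol=1e-12):
    """Report which of the dissimilarity conditions 1a, 2, 3, 4, 5 hold for matrix p."""
    n = range(len(p))
    return {
        "1a_zero_diagonal": all(abs(p[i][i]) <= tol for i in n),
        "2_symmetry": all(abs(p[i][j] - p[j][i]) <= tol for i, j in product(n, n)),
        "3_positivity": all(p[i][j] >= -tol for i, j in product(n, n)),
        "4_zero_only_on_diagonal": all(p[i][j] > tol
                                       for i, j in product(n, n) if i != j),
        "5_triangle": all(p[i][k] <= p[i][j] + p[j][k] + tol
                          for i, j, k in product(n, n, n)),
    }

# Hypothetical sample: three collinear points.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
euclid = proximity_matrix(points, lambda a, b: minkowski(a, b, 2))
# Squared Euclidean distance is a dissimilarity but not a true distance.
squared = [[d * d for d in row] for row in euclid]

print(check_conditions(euclid))
print(check_conditions(squared))
```

In this example the Euclidean matrix passes all five checks, whereas the squared Euclidean matrix fails the triangle inequality, so it is a dissimilarity in the sense of conditions 1–3 but not a distance.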

Correlation is another well-known and widely used similarity measurement in the clustering literature. The need for correlation as a similarity measurement arises because distance functions may be totally inadequate for capturing correlations among the objects. In other words, strong correlation structures may still exist between a set of objects even if they are far apart from each other as measured by the distance functions. The Mahalanobis distance is a well-known similarity measurement that combines distance and correlation, developed in order to overcome the problem of potential intercorrelation among the objects. This measurement uses the sample variance-covariance matrix, whose principal diagonal elements are the sample variances and whose off-diagonal elements are the sample covariances, to adjust for the intercorrelation among the objects.
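The effect of the variance-covariance adjustment can be seen in a minimal two-dimensional sketch. The data values and the function names `covariance_2d` and `mahalanobis_2d` are illustrative assumptions, and the 2x2 inverse is written out by hand to keep the example free of external libraries.

```python
import math

def covariance_2d(data):
    """Sample variance-covariance matrix of a list of (x, y) pairs."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def mahalanobis_2d(a, b, cov):
    """sqrt(d^T S^{-1} d) for d = a - b, using the explicit 2x2 inverse of S."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    dx, dy = a[0] - b[0], a[1] - b[1]
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)

# Hypothetical, strongly correlated sample lying close to the line y = x.
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9), (4.0, 4.1)]
cov = covariance_2d(data)

along = ((0.0, 0.0), (4.0, 4.0))    # far apart, but along the correlation direction
across = ((2.0, 2.0), (3.0, 1.0))   # close together, but against the correlation

for name, (a, b) in (("along", along), ("across", across)):
    print(name, "Euclidean:", math.dist(a, b),
          "Mahalanobis:", mahalanobis_2d(a, b, cov))
```

Under these assumed data, the pair that is farther apart in Euclidean terms is the closer one in Mahalanobis terms, because its separation follows the correlation structure of the sample.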

In addition to a wide range of similarity measurements, a vast number of procedures have also been proposed for constructing clustering algorithms that address different aspects of clustering problems. Generally, clustering procedures fall into two main categories: hierarchical (nested) and nonhierarchical (unnested) procedures [5]. Hierarchical procedures produce a nested sequence of partitions, with a single, all-inclusive cluster at the top (bottom) and singleton clusters of individual points at the bottom (top). Accordingly, there are two general types of hierarchical clustering procedures: agglomerative and divisive. In an agglomerative procedure, each object starts out as its own cluster. In the subsequent steps, the two closest clusters or objects are combined into a new aggregate cluster, thus reducing the number of clusters by one in each step. Eventually, all objects are grouped into one large cluster. Five popular agglomerative procedures used to form clusters are single linkage (MIN), complete linkage (MAX or CLIQUE), average linkage, the centroid method, and Ward's method. In a divisive clustering procedure, such as Diana, the process proceeds in the opposite direction. It starts out with one large cluster containing all objects; in the subsequent steps, the objects that are most dissimilar are split off and turned into smaller clusters; this continues until each object forms a cluster by itself. The result of hierarchical clustering is often displayed graphically using a tree-like diagram, called a dendrogram, which displays both the cluster-subcluster relationships and the order in which the clusters were merged or split (agglomerative or divisive view).
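The agglomerative procedure described above can be sketched in a few lines. This is a naive single-linkage version with Euclidean distance; the function name `single_linkage`, the stopping rule based on a requested number of clusters, and the sample points are assumptions of this illustration, not part of the original text.

```python
import math

def single_linkage(points, n_clusters):
    """Agglomerative clustering: start from singletons and repeatedly merge the
    two clusters whose closest members are nearest (single linkage, MIN)."""
    clusters = [[i] for i in range(len(points))]
    merges = []  # merge history, i.e. the information a dendrogram would display

    def gap(a, b):
        # Single-linkage distance: smallest pairwise distance between clusters.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = [(s, t) for s in range(len(clusters))
                 for t in range(s + 1, len(clusters))]
        a, b = min(pairs, key=lambda st: gap(clusters[st[0]], clusters[st[1]]))
        merges.append((clusters[a], clusters[b], gap(clusters[a], clusters[b])))
        clusters[a] = clusters[a] + clusters[b]  # one fewer cluster per step
        del clusters[b]
    return clusters, merges

# Hypothetical sample: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = single_linkage(points, 3)
print(clusters)
for left, right, d in merges:
    print("merged", left, "and", right, "at distance", d)
```

Replacing `min` with `max` inside `gap` would give complete linkage; running the loop down to one cluster yields the full merge history.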

Unlike hierarchical clustering procedures, nonhierarchical or partitional clustering procedures do not involve a tree-like construction process. Instead, their first step is to select a cluster center (or seed), and all objects within a prespecified threshold similarity are included in the resulting cluster. Nonhierarchical clustering procedures are frequently referred to as $K$-means clustering [6]. There are three different partitional procedures: the sequential threshold, parallel threshold, and optimizing procedures. The sequential threshold procedure starts by selecting one cluster seed and includes all objects within a prespecified similarity. Then a second cluster seed is selected, outside the first cluster, and the process continues as before. An object is no longer considered for subsequent seeds once it has been included in a cluster.
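A minimal sketch of the sequential threshold procedure follows. The text does not say how each seed is chosen, so taking the first still-unassigned object as the next seed is an assumption of this example, as are the function name `sequential_threshold`, the Euclidean distance, and the sample data.

```python
import math

def sequential_threshold(points, threshold):
    """Sequential threshold clustering: pick a seed, absorb every unassigned
    object within `threshold` of it, then move on to the next seed."""
    unassigned = list(range(len(points)))
    clusters = []
    while unassigned:
        seed = unassigned[0]  # assumption: the first unassigned object is the seed
        members = [i for i in unassigned
                   if math.dist(points[i], points[seed]) <= threshold]
        clusters.append(members)
        # Objects already in a cluster are not considered for later seeds.
        unassigned = [i for i in unassigned if i not in members]
    return clusters

# Hypothetical sample data.
points = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.2, 5.1), (0.2, 0.3), (9.0, 9.0)]
print(sequential_threshold(points, threshold=1.0))
```

Because each seed is processed in turn, the result depends on the order of the objects, which is one reason the parallel and optimizing variants exist.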

The parallel threshold procedure selects several cluster seeds simultaneously at the beginning, and objects within the threshold distance are assigned to the nearest seed. As the process evolves, threshold distances can be adjusted to include fewer or more objects in the clusters. Also, in some methods, objects remain unclustered if they are beyond the prespecified similarity from every cluster seed. The optimizing procedure is similar to the other two, except that it allows objects to be reassigned from their original cluster to another on the basis of some overall optimizing criterion. The number of clusters, $K$, may either be specified in advance or determined as part of the procedure. Because a matrix of similarities does not have to be determined and the basic data do not have to be stored during the computational run, nonhierarchical procedures can be applied to much larger data sets than hierarchical procedures can [4].

In the literature, several different techniques have been proposed, based on the different notions of cluster, different definitions of clustering, different measurements of similarity, and different procedures of clustering. There is no objectively correct clustering technique, and many researchers believe that clustering is in the eye of the beholder [1]. The most appropriate clustering techniques for a particular problem often need to be chosen experimentally, unless there is a mathematical reason to prefer one cluster model over another [2]. In addition, there are no definite categorization standards for different clustering techniques [7]. Some of the most prominent clustering techniques in the literature are listed and categorized below:

1) Connectivity-based techniques, such as the CURE [8] model, which use agglomerative hierarchical procedures to build clusters based on distance connectivity.
2) Centroid-based techniques, such as the K-means [9] and K-medoid [10] models, which are based on the idea that each cluster can be represented by a single center vector (e.g. centroid, mean, median, medoid, etc.).
3) Distribution-based techniques, such as models that use statistical distributions, e.g. multivariate normal distributions, to model clusters using the expectation-maximization algorithm.
4) Density-based techniques, such as DBSCAN [11], DENCLUE [12], and OPTICS [13], which define clusters as connected dense regions in the data space.
5) Subspace-based techniques, such as biclustering [14] (co-clustering or two-mode clustering), in which clusters are modeled with both cluster members and relevant attributes.
6) Graph-based techniques, such as models in which a clique, i.e., a subset of nodes in a graph such that every two nodes in the subset are connected by an edge, can be considered a prototypical form of cluster.

In this paper, a new notion of cluster and similarity is presented, and consequently, a new clustering technique is proposed. The notion is based on the idea that a group of objects that exhibit the same learning behavior can be considered a cluster. In other words, objects can be partitioned based on their learning and convergence speed to a given target. Therefore, a cluster can be defined as follows:

“A cluster is a set of objects such that any object in a cluster has close and similar learning speed”

In this learning-based definition of a cluster, the learning speed plays the role that the similarity measurement plays in traditional clustering techniques. To be used as a similarity measurement in clustering tasks, the learning speed must satisfy at least the first three aforementioned requirements. It is easy to show that the learning speed satisfies these requirements. Therefore, it meets the necessary conditions and can theoretically be considered a clustering similarity measurement.

However, in order to measure similarity based on the learning speed, a learning-based technique is first needed to model the process that generates the objects, as well as the relationships and correlation structures between them. The model is then used to calculate the learning and convergence speed of each object. In the literature, there are several different learning-based and machine learning techniques that can theoretically be applied for this purpose [15]. Artificial neural networks (ANNs) and fuzzy inference systems (FISs) are the most important and commonly used hard/crisp and soft/fuzzy learning-based techniques, respectively. Hard/crisp clustering techniques assign each data point to exactly one cluster, while soft/fuzzy clustering techniques may assign each data point to several clusters with varying degrees of membership. In this paper, multi-layer perceptrons (MLPs), which are the most well-known neural networks, are used as an example for modeling and measuring the learning speed in hard and crisp clustering tasks. Soft and fuzzy clustering problems can also be solved by replacing the hard and crisp models with soft and fuzzy learning-based models. The rest of the paper is organized as follows. In the next section, the basic concepts and modeling approaches of multi-layer perceptrons (MLPs) are briefly reviewed. In Section 3, the formulation and design of the proposed clustering model are introduced. In Section 4, the feasibility and performance of the proposed model are evaluated in clustering problems using generated data sets. In Section 5, the desired features of the proposed clustering model are reviewed. Our concluding remarks are presented in Section 6.

2. Multi-layer perceptrons (MLPs)

Multi-layer perceptrons (MLPs) are computer systems developed to mimic the operations of the human brain by mathematically modeling its neuro-physiological structure. In MLPs, computational units called neurons replace the nerve cells, and the strengths of the interconnections are represented by weights, in which the learned information is stored. This arrangement can acquire some of the neurological processing ability of the biological brain, such as learning and drawing conclusions from experience. Multi-layer perceptrons have been shown to be effective at approximating complex nonlinear functions, which is very useful in a wide range of scientific tasks, especially forecasting and classification [16]. The single-hidden-layer feedforward neural network is the most widely used form of artificial neural network for modeling, forecasting, and classification. The model is characterized by a network of three layers of simple processing units connected by acyclic links (Fig. 1). The relationship between the output ($y$) and the inputs ($x_{1},x_{2},\ldots,x_{p}$) has the following mathematical representation:

$\displaystyle y_{t}=f\left(w_{0}+\sum\limits_{j=1}^{q}w_{j}\cdot g\left(w_{0,j}+\sum\limits_{i=1}^{p}w_{i,j}\cdot x_{t,i}\right)\right)+\varepsilon_{t},$ (1)

where $w_{i,j}\ (i=0,1,2,\ldots,p;\ j=1,2,\ldots,q)$ and $w_{j}\ (j=0,1,2,\ldots,q)$ are model parameters, often called connection weights; $g$ and $f$ are the activation functions of the hidden and output layers, respectively; $p$ is the number of input nodes; and $q$ is the number of hidden nodes. Data enter the network through the input layer, move through the hidden layer, and exit through the output layer. Each hidden-layer and output-layer node collects data from the nodes in the preceding layer (the input layer or the hidden layer, respectively) and applies an activation function. Activation functions can have several forms, and the type of activation function is determined by the position of the neuron within the network. In the majority of cases, input-layer neurons do not have an activation function, as their role is to transfer the inputs to the hidden layer. The logistic and hyperbolic tangent functions (Eqs (2) and (3)) are often used for the hidden and output neurons in classification problems. Other transfer functions, such as linear and quadratic functions, can also be used, each with a variety of modeling applications.

$\displaystyle\textit{Sig}\left(x\right)=\frac{1}{1+\exp(-x)}.$ (2)

$\displaystyle\textit{Tanh}\left(x\right)=\frac{1-\exp(-2x)}{1+\exp(-2x)}.$ (3)
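Equations (1)–(3) translate directly into a forward pass. The sketch below omits the error term $\varepsilon_{t}$ and assumes a particular weight layout (`w_hidden[j] = [w_0j, w_1j, ..., w_pj]` and `w_out = [w_0, w_1, ..., w_q]`); the function names, the identity default for $f$, and the example weights are illustrative choices rather than anything prescribed by the text.

```python
import math

def sig(x):
    """Logistic activation, Eq. (2)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_eq3(x):
    """Hyperbolic tangent activation in the form of Eq. (3)."""
    return (1.0 - math.exp(-2.0 * x)) / (1.0 + math.exp(-2.0 * x))

def mlp_output(x, w_hidden, w_out, g=sig, f=lambda z: z):
    """Deterministic part of Eq. (1) for one input vector x of length p.

    w_hidden[j] = [w_0j, w_1j, ..., w_pj]  (bias first), one list per hidden node
    w_out       = [w_0, w_1, ..., w_q]     (bias first)
    """
    hidden = [g(wj[0] + sum(w * xi for w, xi in zip(wj[1:], x)))
              for wj in w_hidden]
    return f(w_out[0] + sum(w * h for w, h in zip(w_out[1:], hidden)))

# Hypothetical N(2-2-1) network: p = 2 inputs, q = 2 hidden nodes, 1 output.
w_hidden = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
w_out = [0.1, 1.0, 1.0]
print(mlp_output([0.3, -0.7], w_hidden, w_out))
```

With all hidden weights at zero, each hidden node outputs $\textit{Sig}(0)=0.5$, so the printed value is simply $w_{0}+0.5\,w_{1}+0.5\,w_{2}$, which makes the layout easy to check by hand.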

The simple network given by Eq. (1) is surprisingly powerful in that it is able to approximate arbitrary functions when the number of hidden nodes $q$ is sufficiently large. In other words, such networks are universal function approximators that can approximate any continuous function to any desired accuracy [17, 18, 19]. In practice, a simple network structure with a small number of hidden nodes often works well in out-of-sample forecasting. This may be due to the overfitting effect typically found in the neural network modeling process: an overfitted model fits the sample used for model building well but generalizes poorly to data outside the sample.

Figure 1. Multi-layer perceptron structure ($N^{(p-q-1)}$).

Many different approaches exist for finding the optimal architecture of an artificial neural network, such as the pruning algorithm, the polynomial time algorithm, the canonical decomposition technique, and the network information criterion. These approaches can generally be categorized as follows [20]: (i) Empirical or statistical methods, which are used to study the effect of internal parameters and to choose appropriate values for them based on the performance of the model. The most systematic and general of these methods utilizes the principles of Taguchi's design of experiments. (ii) Hybrid methods such as fuzzy inference, where the artificial neural network can be interpreted as an adaptive fuzzy system or can operate on fuzzy instead of real numbers. (iii) Constructive and/or pruning algorithms that, respectively, add neurons to and/or remove neurons from an initial architecture, using a previously specified criterion to indicate how the performance of the network is affected by the changes. The basic rules are that neurons are added when training is slow or when the mean squared error is larger than a specified value. Conversely, neurons are removed when a change in a neuron's value does not correspond to a change in the network's response, or when the weight values associated with the neuron remain constant for a large number of training epochs. (iv) Evolutionary strategies that search over the topology space by varying the number of hidden layers and hidden neurons through the application of genetic operators and the evaluation of the different architectures according to an objective function.

Although many different approaches exist for finding the optimal architecture of an artificial neural network, these methods are usually quite complex in nature and are difficult to implement. Furthermore, none of these methods can guarantee the optimal solution for all real forecasting problems. To date, there is no simple clear-cut method for determining these parameters, and the usual procedure is to test numerous networks with varying numbers of hidden units, estimate the generalization error of each, and select the network with the lowest generalization error. Once a network structure is specified, the network is ready for training, which is a process of parameter estimation. Neural network training is an unconstrained nonlinear minimization problem in which the weights and biases of the network are iteratively modified to minimize a cost function. The cost function is an overall accuracy criterion such as the following mean squared error:

$\displaystyle E=\frac{1}{N}\sum_{t=1}^{N}(e_{t})^{2}=\frac{1}{N}\sum_{t=1}^{N}\left(y_{t}-\left(w_{0}+\sum_{j=1}^{q}w_{j}\ g\left(w_{0,j}+\sum_{i=1}^{p}w_{i,j}\ x_{t,i}\right)\right)\right)^{2},$ (4)

where $N$ is the number of error terms. There are many different optimization methods in the literature, which consequently provide various choices for training neural networks. However, no algorithm is currently available that guarantees the globally optimal solution of a general nonlinear optimization problem in a reasonable amount of time. The most popular training method is the backpropagation algorithm, which is essentially a steepest (gradient) descent method [21] in which the parameters of the neural network, $w_{i,j}$, are changed by an amount $\Delta w_{i,j}$ according to the following formula:

$\displaystyle\Delta w_{i,j}=-\eta\frac{\partial E}{\partial w_{i,j}},$ (5)

where the parameter $\eta$ is the learning rate and $\partial E/\partial w_{i,j}$ is the partial derivative of the function $E$ with respect to the weight $w_{i,j}$. This derivative is commonly computed in two passes. In the forward pass, an input vector from the training set is applied to the input units of the network and is propagated through the network, layer by layer, producing the final output. During the backward pass, the output of the network is compared with the desired output, and the resulting error is then propagated backward through the network, adjusting the weights accordingly. To speed up the learning process while avoiding instability of the algorithm, Rumelhart and McClelland [21] introduced a momentum term $\delta$ in Eq. (5), thus obtaining the following learning rule:

$\displaystyle\Delta w_{i,j}(t+1)=-\eta\frac{\partial E}{\partial w_{i,j}}+\delta\ \Delta w_{i,j}(t)$ (6)

The momentum term may also help to prevent the learning process from being trapped in poor local minima, and it is usually chosen in the interval [0, 1]. Finally, the estimated model is evaluated using a separate hold-out sample that is not exposed to the training process.
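The training rule in Eqs (4)–(6) can be sketched as batch gradient descent with momentum for a single-hidden-layer network. The sketch assumes logistic hidden units and a linear output, as in Eq. (4); the function name `train_mlp`, the initial weight range, the hyperparameter values, and the toy data are assumptions made for illustration only.

```python
import math
import random

def sig(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_mlp(X, y, q=3, eta=0.1, delta=0.5, epochs=300, seed=0):
    """Batch backpropagation with momentum for an N(p-q-1) network.

    W[j] = [w_0j, w_1j, ..., w_pj] are hidden weights, v = [w_0, ..., w_q] are
    output weights. Returns the weights and the per-epoch error E of Eq. (4).
    """
    rng = random.Random(seed)
    p, N = len(X[0]), len(X)
    W = [[rng.uniform(-0.5, 0.5) for _ in range(p + 1)] for _ in range(q)]
    v = [rng.uniform(-0.5, 0.5) for _ in range(q + 1)]
    dW = [[0.0] * (p + 1) for _ in range(q)]  # previous updates, for momentum
    dv = [0.0] * (q + 1)
    history = []
    for _ in range(epochs):
        gW = [[0.0] * (p + 1) for _ in range(q)]
        gv = [0.0] * (q + 1)
        E = 0.0
        for x, target in zip(X, y):
            # Forward pass: propagate the input through the network.
            h = [sig(Wj[0] + sum(w * xi for w, xi in zip(Wj[1:], x))) for Wj in W]
            e = target - (v[0] + sum(vj * hj for vj, hj in zip(v[1:], h)))
            E += e * e / N
            # Backward pass: accumulate dE/dw for every weight.
            gv[0] += -2.0 * e / N
            for j in range(q):
                gv[j + 1] += -2.0 * e * h[j] / N
                back = -2.0 * e * v[j + 1] * h[j] * (1.0 - h[j]) / N
                gW[j][0] += back
                for i in range(p):
                    gW[j][i + 1] += back * x[i]
        history.append(E)
        # Eq. (6): new update = -eta * gradient + delta * previous update.
        for j in range(q):
            for i in range(p + 1):
                dW[j][i] = -eta * gW[j][i] + delta * dW[j][i]
                W[j][i] += dW[j][i]
        for j in range(q + 1):
            dv[j] = -eta * gv[j] + delta * dv[j]
            v[j] += dv[j]
    return W, v, history

# Hypothetical toy problem: four points with a simple linear target.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [0.5 * a - 0.3 * b for a, b in X]
W, v, history = train_mlp(X, y)
print("first epoch E:", history[0], "last epoch E:", history[-1])
```

Setting `delta=0.0` recovers the plain steepest descent rule of Eq. (5). A hold-out evaluation, as mentioned above, would apply the returned weights to data not used in training.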

3. Formulating and designing of the proposed learning-based clustering model

A survey of clustering algorithms shows that they differ significantly in their notion of what constitutes a cluster and in how clusters are found efficiently. The most popular traditional notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals, correlations, or particular statistical distributions [22]. In this paper, a new notion of cluster is presented, based on the idea that objects with the same learning behavior can be placed in the same group or cluster. Consequently, the learning and convergence speed of the objects to a given target is defined as a new similarity measurement, and a new clustering model is then proposed. In the first stage of the proposed model, a learning-based technique is designed to model the process that generates the objects, as well as the relationships and correlation structures between them. In the second stage, the results of the first stage and the behavior of the hidden neurons are analyzed, and the objects are then partitioned into the desired clusters.

Data-driven learning-based techniques are generally built from either labeled or unlabeled data sets. Labeled data sets require domain experts to label the data points, which is an expensive, time-consuming, and error-prone process; unlabeled data sets do not. However, if only unlabeled data sets are available, only unsupervised methods can be used to build the model, and these rely on unsupervised learning algorithms, which are themselves expensive, time-consuming, and error-prone. Therefore, both strategies (labeled data sets with supervised algorithms, or unlabeled data sets with unsupervised algorithms) are costly, inefficient, and inappropriate for measuring the learning speed in learning-based techniques for clustering purposes. Traditionally, clustering models follow the second strategy; the idea of this paper, however, is to use unlabeled data sets in supervised models.

Our idea is a special case of a more general idea, namely that "the modeling process is continuous". That is, there is no significant difference between supervised and unsupervised modeling processes, and there is a one-to-one correspondence between supervised and unsupervised models. In other words, there is an unsupervised model for each supervised model and vice versa. An unsupervised modeling process with $n$ explanatory variables is equivalent to a supervised modeling process with $n$ explanatory variables and $m$ target variables, where the target variables are pairwise independent and also independent of all explanatory variables. This rests on the fact that, when the target is independent of the explanatory variables, there is no relationship between them. Thus, only the inter-correlation structures between the explanatory variables are modeled by the supervised model. In addition, these inter-correlation structures are not biased by the target values. In other words, in such circumstances a supervised model captures only unbiased inter-correlation structures, which is exactly what unsupervised models do. Conversely, a supervised modeling process with $n$ explanatory and $m$ target variables is equivalent to an unsupervised modeling process with $n+m$ explanatory variables, where the target variables are pairwise independent and also independent of all explanatory variables.

In our learning-based clustering process, by assuming that similar objects behave similarly in different situations, it is concluded that the particular target values used in the modeling process make no significant difference. In other words, since the learning speed is independent of the target values, the aforementioned idea can be used to construct a learning-based clustering model by replacing the target values with logical, pre-specified, learnable values. In this way, the necessary theoretical background is available for the supervised discovery of patterns, relationships, and structures in unlabeled data sets for unsupervised clustering tasks. In this paper, a constant value is used as the target, because a constant is the simplest function that is independent of every explanatory variable. It is therefore sufficient to use a supervised learning-based technique for the modeling process. As mentioned previously, multi-layer perceptrons are selected in this paper for the practical modeling of the data, which is the last part of the modeling stage of the proposed clustering model.
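One possible way to make the notion of per-object convergence speed concrete is sketched below. It trains a small single-hidden-layer network on unlabeled objects with a constant target and records, for each object, the first epoch at which its squared error drops below a tolerance. This operationalization is an assumption of the example, not the paper's stated procedure, and the function name `convergence_epochs`, the target value 0.5, the tolerance, the other hyperparameters, and the data are likewise illustrative.

```python
import math
import random

def convergence_epochs(X, target=0.5, q=2, eta=0.2, epochs=500, tol=1e-4, seed=0):
    """For each unlabeled object in X, return the first epoch at which the
    network's squared error against the constant target falls below `tol`
    (None if that never happens). Uses batch gradient descent as in Eq. (5)."""
    rng = random.Random(seed)
    p, N = len(X[0]), len(X)
    W = [[rng.uniform(-1.0, 1.0) for _ in range(p + 1)] for _ in range(q)]
    v = [rng.uniform(-1.0, 1.0) for _ in range(q + 1)]
    first_hit = [None] * N
    for epoch in range(1, epochs + 1):
        gW = [[0.0] * (p + 1) for _ in range(q)]
        gv = [0.0] * (q + 1)
        for n, x in enumerate(X):
            # Forward pass with logistic hidden units and a linear output.
            h = [1.0 / (1.0 + math.exp(-(Wj[0] + sum(w * xi
                 for w, xi in zip(Wj[1:], x))))) for Wj in W]
            e = target - (v[0] + sum(vj * hj for vj, hj in zip(v[1:], h)))
            if first_hit[n] is None and e * e < tol:
                first_hit[n] = epoch  # assumed proxy for the learning speed
            # Accumulate the gradient of the mean squared error.
            gv[0] += -2.0 * e / N
            for j in range(q):
                gv[j + 1] += -2.0 * e * h[j] / N
                back = -2.0 * e * v[j + 1] * h[j] * (1.0 - h[j]) / N
                gW[j][0] += back
                for i in range(p):
                    gW[j][i + 1] += back * x[i]
        for j in range(q):
            for i in range(p + 1):
                W[j][i] -= eta * gW[j][i]
        for j in range(q + 1):
            v[j] -= eta * gv[j]
    return first_hit

# Hypothetical unlabeled objects: two near (0.1, 0.2) and two near (0.9, 0.8).
X = [(0.1, 0.2), (0.1, 0.2), (0.9, 0.8), (0.85, 0.8)]
speeds = convergence_epochs(X)
print(speeds)
```

Because the weights are updated once per epoch, identical objects always receive identical values, which is consistent with the requirement that an object be most similar to itself. Whether nearby objects receive close values depends on the data and the network, which is the empirical question the paper sets out to examine.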

Multi-layer perceptrons are the most important and widely used type of learning-based technique, especially for modeling purposes, because of their inherent capability for arbitrary input–output mapping. Despite the many satisfactory characteristics of multi-layer perceptrons, building an MLP for a particular modeling problem is a nontrivial task [23]. Modeling issues that affect the performance of a multi-layer perceptron must be considered carefully. One of the most critical decisions is to determine the appropriate architecture, that is, the number of layers, the number of neurons in each layer, and the number of arcs that interconnect the neurons. Other network design decisions include the selection of the activation functions of the hidden and output neurons, the training algorithm, data preprocessing, the training and test sets, and the performance measures. Fortunately, the design process of multi-layer perceptrons for unsupervised clustering purposes is very similar to that of traditional supervised MLPs.

The selection of these parameters is basically problem-dependent. As in the supervised version, the number of input and output neurons in the unsupervised version is relatively easy to specify, since it is directly related to the problem under study and corresponds to the number of explanatory and dependent variables, respectively. The hidden neurons play the most important role in the supervised modeling process of multi-layer perceptrons. It is the hidden neurons that allow the perceptrons to detect the structures, capture the patterns in the data, and perform complicated nonlinear mappings between explanatory and dependent variables. Determining the optimal number of hidden neurons is a crucial yet complicated issue. In general, multi-layer perceptrons with fewer hidden neurons are preferable, as they usually generalize better and overfit less. However, perceptrons with too few hidden neurons may not have enough power to model and learn the data. There is no theoretical basis for selecting this parameter, although a few systematic approaches have been reported. For example, algorithms both for pruning unnecessary hidden nodes and for adding hidden nodes to improve network performance have been suggested. The most common way of determining the number of hidden nodes is via experiments or by trial and error. Several rules of thumb have also been proposed [24].

The hidden neurons, as in the supervised version, play the most important role in the unsupervised clustering process. In the supervised version, the number of hidden neurons depends simultaneously on the complexity of the input–output function and on the number of structures and patterns existing in the data. In the unsupervised version, however, because there is no complex input–output function, the number of hidden neurons depends only on the structures and patterns existing in the data. Therefore, if the number of hidden neurons is chosen correctly in the design process, each neuron will behave similarly for similar structures and patterns, and differently from the other neurons. In other words, structures and patterns that produce a certain behavior in a given neuron are similar to each other and dissimilar to the remaining structures and patterns. Therefore, each hidden neuron in the unsupervised version represents the notion of a cluster and hence can be considered as a cluster. In other words, the similarity of each object to each cluster is calculated using the weights of the corresponding hidden neuron. Thus, the weights of the hidden neurons determine the cluster to which each object must be assigned.

The proposed model, in contrast to traditional clustering models, does not assume a particular relationship for measuring the similarity and dissimilarity between different objects; these are detected intelligently by the designed network. Therefore, there is no need to predetermine a relationship such as distance or correlation as the similarity measurement, to predetermine a type of center (such as the mean or median) and a threshold level for assigning the objects, or to compare the resulting clusters for consistency of results; instead, the clusters are determined automatically in the learning process. In addition, the proposed model, in contrast to traditional iterative clustering models, does not require a vast number of iterations, nor does it need to store huge amounts of information in order to yield the final results.

It must be noted that this claim does not contradict the fact that the proposed model uses an iterative supervised learning algorithm, such as backpropagation, in its clustering process. As mentioned previously, the proposed model needs only the speed of learning and convergence in order to cluster the objects; hence, the learning process does not have to be completed. In other words, the first iteration is practically sufficient for measuring the learning speed and further iterations are not needed, provided, of course, that the learning process converges to the target. There is no theoretical basis for determining the iteration after which the learning process will converge. However, the author’s experience indicates that in the proposed model, perhaps because the target is so simple, the learning process often converges to the target from the first stage. Therefore, the proposed model is faster and easier to use than traditional clustering models, especially for large-scale problems. From another point of view, increasing the amount of data, which is the main reason for increased processing time and required space in traditional clustering models, will increase the accuracy of the proposed model. This is because it is well known from the literature that in multilayer perceptrons, as in any statistical approach, the accuracy is closely related to the training sample size and generally increases as the training sample size increases.

The activation functions, also called transfer functions, determine the relationship between the inputs and outputs of the neurons. The activation functions introduce a degree of nonlinearity and complexity that is valuable for most applications of supervised networks. In theory, any differentiable function can qualify as an activation function. In practice, however, only a small number of well-behaved (bounded, monotonically increasing, and differentiable) activation functions, such as the sigmoid (logistic) and hyperbolic tangent (tanh) functions, together with sine or cosine and linear functions, are used. There are some heuristic rules for the selection of the activation functions: for example, using the logistic activation function for classification problems, which involve learning about average behavior, or using the hyperbolic tangent function if the problem involves learning about deviations from the average, as in forecasting problems [25]. Although in the unsupervised version, as in the supervised one, it is not clear whether different activation functions have major effects on the performance, the author’s experience indicates that for clustering problems the logistic or tanh activation functions seem well suited to the hidden neurons when the number of clusters is even, and the linear activation function when it is odd.
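As a minimal illustration of the activation functions named above, the following Python sketch implements the logistic, tanh, and linear functions and applies them to a single hidden neuron. The helper names (`logistic`, `linear`, `hidden_output`) and the weights in the tests are hypothetical and are not taken from the paper.

```python
import math


def logistic(z: float) -> float:
    """Sigmoid (logistic) function, bounded in (0, 1) and monotonically increasing."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def linear(z: float) -> float:
    """Identity (linear) activation; unbounded."""
    return z


# math.tanh is bounded in (-1, 1) and monotonically increasing.
ACTIVATIONS = {"logistic": logistic, "tanh": math.tanh, "linear": linear}


def hidden_output(weights, bias, x, activation="logistic"):
    """Output g(w_0 + sum_k w_k * x_k) of one hidden neuron."""
    g = ACTIVATIONS[activation]
    return g(bias + sum(w * xk for w, xk in zip(weights, x)))
```

Because the logistic and tanh functions are both monotonically increasing, they preserve the ordering of the weighted sums, which is why the choice mainly affects the scale of the outputs rather than which neuron responds most strongly.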

The process of selecting the other parameters of the unsupervised version is exactly the same as in the supervised one; hence, further description of them is omitted from this paper. Once the design process is finished and the optimal architecture of the unsupervised MLP is specified, the network is ready for training and for estimating the suboptimal values of its parameters. The structure of the proposed unsupervised multilayer perceptron model is shown in Fig. 2.

Figure 2.

Unsupervised multilayer perceptrons structure ( $N^{(p-q-1)}$ ).

After measuring the similarity of objects to clusters ( $S_{j}$ ), in the second stage of the proposed model each object is assigned to the cluster with the highest similarity. In other words,

$\displaystyle\textit{If}\ S_{k}=\max_{j}\ \{S_{j}\}\ \textit{Then}\ \textit{Object}\in C(k),$ (7)

where $S_{k}$ is the similarity of a given object to the $k$ th hidden neuron, and $C(k)$ is the $k$ th cluster. A transfer function can also be applied to the similarity values when necessary. This transfer function is basically used for selecting the type (e.g. linear or nonlinear) of the discriminant function between clusters. Thus, the general form of Eq. (7) can be represented as follows:

$\displaystyle\textit{If}\ F(S_{k})=\max_{j}\ \{F(S_{j})\}\ \textit{Then}\ \textit{Object}\in C(k),$ (8)

where $F$ is the similarity transfer function. According to Eq. (8), the discriminant function between each pair of clusters can be calculated by equating their similarity values. For example, assuming $p$ explanatory variables and a linear similarity transfer function, the hyper-plane that discriminates cluster $i$ from cluster $j$ can be written in the general form $AX+B=0$ as follows:

$\displaystyle\begin{aligned}F(S_{i})=F(S_{j})&\ \xrightarrow{F=\textit{linear}}\ S_{i}=S_{j}\\ &\Rightarrow g\left(w_{0,i}+\sum_{k=1}^{p}w_{k,i}\,x_{k}\right)=g\left(w_{0,j}+\sum_{k=1}^{p}w_{k,j}\,x_{k}\right)\\ &\Rightarrow w_{0,i}+\sum_{k=1}^{p}w_{k,i}\,x_{k}=w_{0,j}+\sum_{k=1}^{p}w_{k,j}\,x_{k}\\ &\Rightarrow\boxed{\sum_{k=1}^{p}(w_{k,i}-w_{k,j})\,x_{k}+(w_{0,i}-w_{0,j})=0}\end{aligned}$ (9)
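The assignment rule of Eq. (8) and the linear discriminant derived from it can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names (`similarities`, `assign`, `hyperplane`) and the example weights are hypothetical, the weights are taken as already estimated, and $S_{j}$ is computed as $g(w_{0,j}+\sum_{k}w_{k,j}x_{k})$ as in the derivation above.

```python
import math


def similarities(x, W, b, g=lambda z: z):
    """S_j = g(w_{0,j} + sum_k w_{k,j} * x_k) for every hidden neuron j."""
    return [g(b[j] + sum(wk * xk for wk, xk in zip(W[j], x)))
            for j in range(len(W))]


def assign(x, W, b, g=lambda z: z, F=lambda s: s):
    """Eq. (8): index k of the cluster whose transformed similarity F(S_k) is largest.

    Ties are broken in favor of the lowest index (an assumption of this sketch).
    """
    scores = [F(s) for s in similarities(x, W, b, g)]
    return max(range(len(scores)), key=scores.__getitem__)


def hyperplane(W, b, i, j):
    """Coefficients (A, B) of A.x + B = 0 separating clusters i and j for linear F."""
    A = [wi - wj for wi, wj in zip(W[i], W[j])]
    B = b[i] - b[j]
    return A, B


# Hypothetical weights for two hidden neurons and two attributes.
W = [[-1.0, 0.0], [1.0, 0.0]]
b = [5.0, -5.0]
print(assign([3.0, 1.0], W, b), assign([7.0, 1.0], W, b))
print(hyperplane(W, b, 0, 1))
```

With a monotonically increasing activation such as tanh, the arguments of $g$ can be compared directly, so the boundary returned by `hyperplane` does not depend on which such $g$ is used.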

The structure of an unsupervised clustering multilayer perceptron in the general form is shown in Fig. 3. The discriminant function between clusters can also be given in hyper-ellipsoidal form (in the general form $AX^{2}+BX+C=0$ ) by considering a quadratic function as the similarity transfer function, as follows:

$\displaystyle\begin{aligned}F(S_{i})=F(S_{j})&\ \xrightarrow{F=\textit{quadratic}}\ S_{i}^{2}=S_{j}^{2}\\ &\Rightarrow\left(g\left(w_{0,i}+\sum_{k=1}^{p}w_{k,i}\,x_{k}\right)\right)^{2}=\left(g\left(w_{0,j}+\sum_{k=1}^{p}w_{k,j}\,x_{k}\right)\right)^{2}\\ &\Rightarrow w_{0,i}^{2}+2w_{0,i}\sum_{k=1}^{p}w_{k,i}\,x_{k}+\left(\sum_{k=1}^{p}w_{k,i}\,x_{k}\right)^{2}=w_{0,j}^{2}+2w_{0,j}\sum_{k=1}^{p}w_{k,j}\,x_{k}+\left(\sum_{k=1}^{p}w_{k,j}\,x_{k}\right)^{2}\\ &\Rightarrow\boxed{\sum_{k=1}^{p}\sum_{k^{\prime}=1}^{p}\left(w_{k,i}w_{k^{\prime},i}-w_{k,j}w_{k^{\prime},j}\right)x_{k}x_{k^{\prime}}+2\sum_{k=1}^{p}\left(w_{0,i}w_{k,i}-w_{0,j}w_{k,j}\right)x_{k}+\left(w_{0,i}^{2}-w_{0,j}^{2}\right)=0}\end{aligned}$ (10)
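The quadratic discriminant can be checked numerically. The sketch below assumes a linear (identity) activation $g$, which the expansion of the squares appears to presuppose, and builds the coefficients obtained by expanding $S_{i}^{2}-S_{j}^{2}$ term by term. The function names and the weights used in the tests are hypothetical.

```python
def quadratic_boundary(wi, w0i, wj, w0j):
    """Coefficients (Q, L, C) such that

    sum_{k,k'} Q[k][k'] x_k x_k' + sum_k L[k] x_k + C = S_i^2 - S_j^2,

    where S = w_0 + sum_k w_k x_k (identity activation assumed).
    """
    p = len(wi)
    Q = [[wi[k] * wi[m] - wj[k] * wj[m] for m in range(p)] for k in range(p)]
    L = [2.0 * (w0i * wi[k] - w0j * wj[k]) for k in range(p)]
    C = w0i ** 2 - w0j ** 2
    return Q, L, C


def evaluate(Q, L, C, x):
    """Value of the quadratic form at x; zero on the discriminant surface."""
    p = len(x)
    quad = sum(Q[k][m] * x[k] * x[m] for k in range(p) for m in range(p))
    return quad + sum(L[k] * x[k] for k in range(p)) + C


def score(w, w0, x):
    """Linear score S = w_0 + sum_k w_k x_k."""
    return w0 + sum(wk * xk for wk, xk in zip(w, x))
```

Evaluating the quadratic form at an arbitrary point and comparing it with the directly computed difference of squared scores is a quick way to confirm that the cross terms $x_{k}x_{k^{\prime}}$ and the mixed terms $w_{0}w_{k}$ have been collected correctly.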

Figure 3.

Unsupervised clustering multilayer perceptrons structure ( $N^{(p-q-1)}$ ).

4. Evaluation of the proposed clustering model

In this section, in order to demonstrate the feasibility and suitability of the proposed clustering model in practice, the model is evaluated using simulated data sets. In the first subsection, the evaluation measures are briefly reviewed in order to choose the one most consistent with the proposed clustering model. In the second subsection, the process of generating the problems used for evaluating the proposed model is introduced, and finally the performance of the proposed clustering model on the simulated problems is evaluated and analyzed.

4.1 Evaluation measures for clustering models

In supervised tasks such as classification, the evaluation of the results is an integral part of the process of developing the models. In these cases, there are well-accepted evaluation measures and procedures, e.g. accuracy and cross-validation, respectively. However, because of its very nature, cluster evaluation is not a well-developed or commonly used part of cluster analysis. Nonetheless, cluster evaluation, or cluster validation as it is more traditionally called, is very important and should be a part of any cluster analysis. The evaluation measures, or indices, that are applied in the literature in order to judge various aspects of cluster validity are traditionally divided into three main categories, as follows [1]:

1.
Unsupervised indices: The unsupervised indices, also called internal indices, measure the goodness of a clustering structure without respect to external information and use only information present in the data set. Unsupervised measures of cluster validity are often further divided into two subcategories, as follows:

1.1.
The cohesion and separation-based indices: Many internal measurements for partitional clustering schemes are based on the notions of cohesion and separation. These include measures of cluster cohesion (compactness, tightness), which determine how closely related the objects in a cluster are, and measures of cluster separation (isolation), which determine how distinct or well-separated a cluster is from other clusters.
1.2.
The similarity matrix-based indices: These indices for evaluating the goodness of clustering measure the correlation between the similarity matrix and an ideal version of similarity matrix based on the cluster labels. An ideal cluster is one whose points have a similarity of 1 to all points in the cluster, and similarity of 0 to all points in the other clusters.

2.
Supervised indices: When external information about the data is available, typically in the form of externally derived class labels for the objects, the usual procedures for evaluating the goodness of a clustering are the supervised indices, which measure the degree of correspondence between the cluster labels and the class labels. In other words, the supervised indices measure the extent to which the clustering structure discovered by a clustering algorithm matches some external structure; an example is entropy, which measures how well cluster labels match externally supplied class labels. Motivations for such an analysis are the comparison of a clustering with the “ground truth” or the evaluation of the extent to which a manual classification process can be automatically reproduced by cluster analysis. Supervised measures are often called external indices, because they use information not present in the data set. Supervised measures of cluster validity are often further divided into two subcategories, as follows:

2.1.
Class-oriented indices: There are a number of measures, such as entropy, purity, precision, recall, and the F-measure [26], which are commonly used to evaluate the performance of classification models. In the case of classification, the degree to which predicted class labels correspond to the actual class labels is measured. The class-oriented indices apply these classification measures with cluster labels in place of predicted class labels. These indices evaluate the extent to which a cluster contains objects of a single class.
2.2.
Similarity-oriented indices: The similarity-oriented indices are related to the similarity measures for binary data, such as the Jaccard and Rand measures [23]. All these indices are based on the premise that any two objects that are in the same cluster should be in the same class and vice versa. The comparison is generally done between two matrices: 1) the ideal cluster similarity matrix, which has a 1 in the $ij$ th entry if objects $i$ and $j$ are in the same cluster and a 0 otherwise, and 2) the ideal class similarity matrix defined with respect to the class labels, which has a 1 in the $ij$ th entry if objects $i$ and $j$ belong to the same class and a 0 otherwise.

3.
Relative: The relative indices compare the different clustering models and different clusters. A relative clustering index is a supervised or unsupervised evaluation measure that is used for the purpose of comparison. Thus relative measures are not actually a separate type of cluster evaluation measure, but are instead a specific use of such measures.
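The similarity-oriented indices described in the list above can be illustrated with a short Python sketch. It counts object pairs by whether they share a cluster and whether they share a class, which is equivalent to comparing the off-diagonal entries of the ideal cluster and ideal class similarity matrices, and then computes the standard Rand and Jaccard measures from those counts. The function names are hypothetical.

```python
from itertools import combinations


def pair_counts(cluster_labels, class_labels):
    """Count object pairs by agreement: (f11, f10, f01, f00).

    f11: same cluster, same class      f10: same cluster, different class
    f01: different cluster, same class f00: different cluster, different class
    """
    f11 = f10 = f01 = f00 = 0
    for a, b in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[a] == cluster_labels[b]
        same_class = class_labels[a] == class_labels[b]
        if same_cluster and same_class:
            f11 += 1
        elif same_cluster:
            f10 += 1
        elif same_class:
            f01 += 1
        else:
            f00 += 1
    return f11, f10, f01, f00


def rand_index(cluster_labels, class_labels):
    """Fraction of pairs on which the clustering and the classes agree."""
    f11, f10, f01, f00 = pair_counts(cluster_labels, class_labels)
    total = f11 + f10 + f01 + f00
    return (f11 + f00) / total if total else float("nan")


def jaccard_index(cluster_labels, class_labels):
    """Like the Rand measure, but ignoring pairs separated in both partitions."""
    f11, f10, f01, _ = pair_counts(cluster_labels, class_labels)
    denom = f11 + f10 + f01
    return f11 / denom if denom else float("nan")
```

Both measures equal 1 when the clustering reproduces the classes exactly, regardless of how the cluster indices are numbered.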

As mentioned previously, in this paper the simulated data sets will be used in order to practically demonstrate the feasibility of the proposed clustering model; therefore, the data generation process and class labels for each object are available. Hence, the supervised indices will be the most consistent and the most appropriate measures for evaluating the goodness of the proposed clustering model. In this paper, the overall precision (a supervised class-oriented index) is used for evaluating the obtained results of the proposed model, which is calculated as follows:

$\displaystyle\textit{Overall precision}=\frac{\textit{No. of correct clustering}}{\textit{No. of evaluation sample}}$ (11)
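Eq. (11) can be sketched in Python as follows. Because cluster indices are arbitrary, a correspondence between clusters and classes is needed before counting correct assignments; the paper does not state how this matching is done, so the sketch assumes the one-to-one matching that maximizes the number of correct assignments, found by exhaustive search (practical only for a small number of clusters). The function name is hypothetical.

```python
from itertools import permutations


def overall_precision(cluster_labels, class_labels):
    """Eq. (11): correctly clustered objects / evaluated objects.

    Assumption of this sketch: clusters are matched one-to-one to classes by
    the mapping that maximizes the number of correct assignments.
    """
    if len(cluster_labels) != len(class_labels) or not class_labels:
        raise ValueError("label sequences must be non-empty and of equal length")
    clusters = sorted(set(cluster_labels))
    classes = sorted(set(class_labels))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(mapping[c] == y
                      for c, y in zip(cluster_labels, class_labels))
        best = max(best, correct)
    return best / len(class_labels)
```

For the two-cluster problems considered in this paper, the search reduces to comparing the two possible labelings.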

4.2 Problems generation process for clustering evaluation

In general, the performance of a clustering model depends on the particular problem and data under consideration. In this paper, in order to demonstrate the feasibility of the proposed model in practice, the different characteristics of data sets and clusters are first briefly described, and then a clustering problem with the simplest necessary conditions is generated in order to show whether the proposed procedure and model can be considered a clustering model at all. If the answer is positive, the restrictive conditions are then lifted one at a time in order to show the appropriateness and effectiveness of the proposed model in different situations.

4.2.1 Describing the different underlying data sets and cluster characteristics

In the clustering literature, several different characteristics of data and clusters have been reported that can affect the performance of clustering models [28]. Therefore, robust clustering models must take all of these data and cluster characteristics into account. The most important characteristics (top 10) of data sets and clusters in the literature can be summarized as follows:

1.
Dimensionality: The dimension of the underlying data sets and clusters is one of the most important factors in the performance, speed, and required space of clustering models, especially distance-based models. It is shown in the literature that the distances between the closest and farthest neighbors of a point may be very similar in high dimensional spaces. Perhaps an intuitive way to see this is to realize that the volume of a hyper-sphere with radius $r$ and dimension $d$ is proportional to $r^{d}$ , and thus, for high dimensions, a small change in radius means a large change in volume. Distance-based clustering approaches may not work well in such cases. Yet another set of problems has to do with how to weight the different dimensions. If different aspects of the data are measured on different scales, a number of difficult issues arise. Most distance functions will weight dimensions with greater ranges of data more highly. Also, clusters that are determined by using only certain dimensions may be quite different from the clusters determined by using different dimensions. Therefore, the performance of some clustering models depends on the dimension of the underlying data set.
2.
Noise and Outliers: A point that is noise or is simply atypical (an outlier) can often affect the performance of clustering models and distort their results. Some clustering models can detect outliers and delete them, or otherwise eliminate their negative effects, by applying tests that determine whether a particular point really belongs to a given cluster. This processing can occur either while the clustering process is taking place or as a post-processing step. However, in some instances, points cannot be discarded and must be clustered as well as possible. In such cases, it is important to make sure that these points do not distort the clustering process for the majority of the points. Therefore, the performance of some clustering models depends on whether the underlying data set contains noise and outliers.
3.
Statistical Distribution: The data generation processes of some data sets follow a given statistical distribution, such as the Gaussian (normal) or uniform distribution, but this is frequently not the case. Some clustering models, such as distribution-based models that use the Expectation-Maximization algorithm to model clusters, need to know the statistical distribution of the underlying data sets. Therefore, the performance of some clustering models depends on whether the underlying data set follows a given statistical distribution.
4.
Cluster Number: The number of clusters is another effective factor in performance, speed, and required space of the clustering models. The performance of the clustering models will be generally decreased by increasing the number of clusters. Moreover, the processing time and required space of the clustering models have a positive relationship with the number of clusters. Therefore, the performance of the clustering models depends on the number of the clusters in the underlying problem.
5.
Cluster Shape: Some clusters have a regular shape, such as rectangular or globular, and are convex. Some clustering models, such as $K$ -means, require clusters to be regularly shaped or convex, but irregularly shaped, non-convex clusters are common. Therefore, the performance of some clustering models depends on whether the clusters have a given shape or are convex.
6.
Cluster Size: Some clusters have similar sizes, but this is frequently not the case. Some clustering models, such as $K$ -means, do not work well in the presence of clusters of different sizes. Therefore, the performance of some clustering models depends on whether the clusters have equal size.
7.
Cluster Density: Some clusters have similar densities, but this is frequently not the case. Some clustering models, such as $K$ -means, do not work well in the presence of clusters of different densities. Therefore, the performance of some clustering models depends on whether the clusters have equal density.
8.
Cluster Separation: Some clusters are well-separated, but in many other cases the clusters may touch or overlap. Many clustering models have difficulty in the presence of overlapping clusters. Therefore, the performance of some clustering models depends on whether the clusters are well-separated.
9.
Many and Mixed Attribute Types: Some data sets have many attributes or include both continuous and categorical attributes (mixed attributes). A mix of attribute types is usually handled by having a proximity function that can combine all the different attributes in an “intelligent” way. Many clustering models make assumptions about the data and cluster characteristics and do not work well if those assumptions are violated. In such cases the clustering model may miss clusters, split clusters, merge clusters, or just produce poor clusters. Therefore, the performance of some clustering models depends on whether the data sets have many or mixed attributes.
10.
Type of data space, e.g., Euclidean or non-Euclidean: Some clustering techniques calculate means or use other vector operations that often only make sense in Euclidean space.
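The dimensionality effect described in the first item of the list above can be illustrated with a small simulation. The sketch below is not part of the paper's experiments: it draws uniform random points in the unit hypercube (an assumption made only for illustration), measures how much the farthest distance from a query point exceeds the nearest one, and computes the fraction of a hyper-sphere's volume lying in a thin outer shell, which follows from the volume being proportional to $r^{d}$ . The function names are hypothetical.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(d_max - d_min) / d_min for distances from a random query point to
    n_points uniform random points in the unit hypercube of dimension dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance (Python 3.8+)
    return (max(dists) - min(dists)) / min(dists)


def shell_fraction(dim, eps=0.01):
    """Fraction of a hyper-sphere's volume within relative distance eps of its
    surface: 1 - (1 - eps)^dim, since volume is proportional to r^dim."""
    return 1.0 - (1.0 - eps) ** dim


for d in (2, 10, 100, 500):
    print(d, round(relative_contrast(d), 3), round(shell_fraction(d), 3))
```

As the dimension grows, the relative contrast shrinks and almost all of the volume concentrates near the surface, which is the behavior that makes nearest and farthest neighbors hard to distinguish for distance-based models.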

4.2.2 Generating data sets for feasibility evaluating of the proposed clustering process

In the literature, there is no reasonable measure for evaluating the feasibility of a given clustering model. In other words, the minimum logically necessary conditions for a given clustering model have not yet been determined. In this section, in order to show the feasibility of the proposed clustering model, data sets with the simplest characteristics are generated. If the performance of the proposed model on the generated data sets is equal to or greater than that of randomly assigning objects to clusters, then the proposed model can be considered a clustering model. Therefore, based on the aforementioned characteristics, the ideal problems and data sets for feasibility evaluation have the characteristics listed below. These characteristics specify, respectively: the dimension of the data set under study; whether there are noise and/or outliers in the data; whether the data follow a particular statistical distribution; the number of clusters in the data; the shape of the clusters; the number of objects in each cluster; the density of the clusters; the type of separation between clusters; the number of attributes (variables) in the discriminant function; and the type of data space.

1)
Dimension of the space: One
2)
Noise and outliers: No
3)
Statistical distribution: Yes
4)
Number of clusters: Two
5)
Shape of clusters: Regular and convex
6)
Size of clusters: Equal
7)
Density of clusters: Equal
8)
Separation of clusters: Well-separated
9)
Number/type of attributes: One/Uniform
10)
Type of data space: Euclidean

Therefore, in the first phase, a single-attribute binary clustering problem with a uniform distribution (with given, constant mean and variance) and without noise or outliers is constructed as the framework problem, in which the clusters are convex and well-separated and have equal size, regular shape, and equal density. In the second phase, 100 different problems are produced by changing the values of the mean and variance, with sizes of 50, 100, 150, 200, and 250. Finally, in order to eliminate the effects of random processes in the data generation procedure, each problem is generated 100 times, and the average performance of the proposed model over these 100 replications, in both the training and test samples, is calculated in each situation. In this paper, a random 80–20% split of each cluster is used, with the first 80% in the training sample and the remaining 20% in the test sample.
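A generation procedure of this kind might be sketched as follows. Several details are assumptions rather than the paper's specification: the interval of each uniform cluster is derived from its mean and variance through the textbook relation $\mathrm{Var}[U(a,b)]=(b-a)^{2}/12$ (the paper does not state its parameterization), the default variance of 0.5 is chosen only so that the two example clusters do not overlap, and the function names are hypothetical.

```python
import math
import random


def uniform_cluster(mean, variance, size, rng):
    """Draw `size` values from a uniform distribution with the given mean and
    variance, using Var[U(a, b)] = (b - a)^2 / 12 (assumed parameterization)."""
    half_width = math.sqrt(3.0 * variance)
    return [rng.uniform(mean - half_width, mean + half_width)
            for _ in range(size)]


def make_problem(means=(3.0, 7.0), variance=0.5, size=50,
                 train_fraction=0.8, seed=0):
    """One-attribute problem with one equal-sized cluster per mean.

    Each cluster is shuffled and split: the first 80% of its objects go to the
    training sample and the remaining 20% to the test sample.
    """
    rng = random.Random(seed)
    n_train = int(round(train_fraction * size))
    train, test = [], []
    for label, mean in enumerate(means):
        values = uniform_cluster(mean, variance, size, rng)
        rng.shuffle(values)
        train += [(v, label) for v in values[:n_train]]
        test += [(v, label) for v in values[n_train:]]
    return train, test


train, test = make_problem()
print(len(train), len(test))
```

Repeating the call with different seeds and averaging the resulting precision values would correspond to the replication step described above.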

The obtained results indicate that the proposed model not only can cluster the objects appropriately, but also achieves good performance in both the test and training samples (100% in both). Of course, this is not very surprising, because the generated problems were very simple and were designed only for evaluating the feasibility of the proposed model. However, feasibility checking is logically necessary for the proposed model because it differs significantly from traditional clustering models in the notion of a cluster, in the similarity measurement, and in the clustering process.

4.2.3 Generating data sets for efficiency evaluating of the proposed clustering model

After the feasibility of the proposed model has been checked, the most important restrictive conditions imposed in generating the feasibility problem are lifted one at a time in order to determine the efficiency of the proposed model. The performance of the proposed model in each case is then evaluated theoretically and practically and briefly discussed. In other words, a sensitivity analysis is carried out on the effect of the most important restrictive conditions on the performance of the proposed model, in order to determine its efficiency in different situations and conditions. Further and more advanced analyses of the specific effects of these conditions on the performance of the proposed model are omitted from this preliminary version of the paper. In this paper, the effects of the dimension of the data sets, noise and outliers, the statistical distribution, the number of clusters, and cluster separation are analyzed theoretically and practically; analyzing the effects of the other characteristics of data and clusters is left to future work.

4.2.3.1. Efficiency evaluating in higher dimensional data sets

In this section, the effect of higher data dimensions on the performance of the proposed model is evaluated. A key feature of high dimensional data is that two objects may be highly similar even though commonly applied similarity measures indicate that they are dissimilar or perhaps only moderately similar. Conversely, and perhaps more surprisingly, it is also possible that an object’s nearest or most similar neighbors may not be as highly related to the object as other objects which are less similar [29]. Therefore, the performance of most clustering models, especially distance-based and density-based models, is critically sensitive to the curse of dimensionality. This is the main reason why these clustering models usually use a transformation of the original data with a reduced number of dimensions.

Although the proposed model, thanks to the unique advantages of multilayer perceptrons as universal approximators, can theoretically cluster high dimensional problems appropriately, in practice it may not be able to handle them efficiently. The author’s experience indicates that in high dimensional problems (more than 30 dimensions) it is better to use preprocessing algorithms in order to reduce the dimension of the input data sets. In lower dimensional cases, however, the proposed model can be applied effectively. For instance, the average performance of the proposed model (over 100 runs) on problems with the following characteristics (Dimension: 3; Size: 50; No. of attributes: 1 to 3) is given in Table 1. The established discriminant functions and clusters for Dimension: 3; Size: 50; No. of attributes: 1 and 2 are shown in Figs 4 and 5, respectively. Circles and rectangles represent the training and test data in each cluster.

1)
Dimension of the space: Two, Three, Four, Five
2)
Noise and outliers: No
3)
Statistical distribution: Uniform (Mean: 3, 7, and Variance: 5)
4)
Number of clusters: Two
5)
Shape of clusters: Regular and convex
6)
Size of clusters: Equal (50)
7)
Density of clusters: Equal
8)
Separation of clusters: Well-separated
9)
Number/type of attributes: One, Two, Three/Uniform
10)
Type of data space: Euclidean

Table 1
The average performance of the proposed model in three-dimensional problem with different number of attributes ${}^{\ast}$

| No. | Dimension | Size | No. of attributes | Training (overall precision) | Test (overall precision) |
|---|---|---|---|---|---|
| 1 | 3 | 50 | 1 | 89% | 94% |
| 2 | 3 | 50 | 2 | 99% | 100% |
| 3 | 3 | 50 | 3 | 100% | 100% |

${}^{\ast}$ All values are rounded.

Figure 4.
The established discriminant function and clusters (Dimension: 3; Size: 50, No. of Attributes: 1).

Figure 5.
The established discriminant function and clusters (Dimension: 3; Size: 50, No. of Attributes: 2).

4.2.3.2. Efficiency evaluating in data sets with noise and outliers

In this section, the effect of outliers in the underlying data sets on the performance of the proposed model is evaluated. Most clustering models, such as agglomerative hierarchical clustering and $K$ -means, are sensitive to outliers because sources of error and variation are not formally considered in these models. For example, when squared error criteria are used, outliers can unduly influence the clusters that are found. Thus, it is common to try to find outliers and eliminate them when preprocessing the raw data. The proposed model, like multilayer perceptrons and most traditional clustering models, is sensitive to noise and outliers. However, it has been demonstrated in the literature that multilayer perceptrons are less sensitive to noise and outliers than statistical techniques [30]. Therefore, it can be expected that neural-based models generally perform better than statistical techniques in clustering problems that include noise and outliers. In addition, in the proposed model, outliers can be detected effectively by comparing the similarity of each object to the different clusters ( $S_{j}$ ); thus they can be eliminated from the data in a post-processing step before the obtained results are analyzed. As an example, consider the problem with the following characteristics (Fig. 6).

1)
Dimension of the space: Two
2)
Noise and outliers: Yes (One)
3)
Statistical distribution: Uniform (Mean: 3, 7, and Variance: 4)
4)
Number of clusters: Two
5)
Shape of clusters: Regular and convex
6)
Size of clusters: Equal (50)
7)
Density of clusters: Equal
8)
Separation of clusters: Well-separated
9)
Number/type of attributes: Two/Uniform
10)
Type of data space: Euclidean

Figure 6.
Binary clustering problem with an outlier data.

Now, by comparing the similarity of each object to cluster 1 ( $S_{1}$ ) and cluster 2 ( $S_{2}$ ) based on attributes 1 and 2 ( $X_{1}$ and $X_{2}$ ), it can be seen that the outlier belongs to neither cluster. The plots of the similarity values ( $S_{1}$ and $S_{2}$ ) against $X_{1}$ and $X_{2}$ are shown in Figs 7 and 8, respectively.
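One possible way to turn this comparison into a rule is sketched below. The paper does not specify the decision rule, so the threshold-based test, the `threshold` parameter, and the function name are all assumptions made for illustration: an object is left unassigned when its similarity to every cluster stays below the threshold.

```python
def assign_or_flag(similarities, threshold):
    """Return the index of the most similar cluster, or None when no S_j
    reaches `threshold` (hypothetical rule: the object is treated as an outlier)."""
    best = max(range(len(similarities)), key=similarities.__getitem__)
    return best if similarities[best] >= threshold else None


# Hypothetical similarity values (S_1, S_2) for three objects.
for s in ([0.9, 0.1], [0.2, 0.8], [0.2, 0.3]):
    print(s, assign_or_flag(s, threshold=0.5))
```

In practice the threshold would have to be chosen by inspecting plots of the similarity values, such as those in Figs 7 and 8.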

Figure 7.
The plot of $S_{1}$ and $S_{2}$ based on the $X_{1}$ .

Figure 8.
The plot of $S_{1}$ and $S_{2}$ based on the $X_{2}$ .

4.2.3.3. Efficiency evaluating in data sets with/without a given statistical distribution

In this section, the performance of the proposed model is evaluated in situations where the underlying data do or do not follow a given statistical distribution. In some clustering models, such as distribution-based techniques, it is assumed that the data under study have a particular distribution; consequently, their performance is reduced if this assumption is violated. The proposed model, however, is theoretically insensitive to the distribution of the input data. This property comes from the ability of multilayer perceptrons to model any type of data. Nevertheless, it is reported in the literature on artificial neural networks that multilayer perceptrons can, in practice, yield more accurate results when their inputs have the same distribution [31]. Therefore, it is strongly recommended that this condition be satisfied in a preprocessing step.
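A common preprocessing step of this kind is z-score standardization, sketched below. This is an assumption about what such a step could look like rather than the procedure used in the paper, and it only equalizes the mean and variance of the attributes, not the shape of their distributions. The function name is hypothetical.

```python
import math


def standardize_columns(rows):
    """Rescale every attribute (column) to zero mean and unit variance.

    Constant columns are left at zero to avoid division by zero.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(p)]
    stds = []
    for k in range(p):
        var = sum((r[k] - means[k]) ** 2 for r in rows) / n  # population variance
        stds.append(math.sqrt(var) or 1.0)
    return [[(r[k] - means[k]) / stds[k] for k in range(p)] for r in rows]


rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
print(standardize_columns(rows))
```

After this step, attributes measured on very different scales contribute comparably to the weighted sums computed by the hidden neurons.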

4.2.3.4. Efficiency evaluating in multiple cluster data sets

As mentioned previously, the number of clusters in the underlying data sets can be determined by trial and error in the design process of the proposed model. The author’s experience indicates that determining the number of hidden nodes is the most important and most critical step in the design process, and that it has a great effect on the performance of the proposed model. After the number of clusters has been determined, the input data can be clustered appropriately and simultaneously by the proposed model.

4.2.3.5. Efficiency evaluating in clusters separation

In this section, the effect of different types of cluster separation on the performance of the proposed model is evaluated. In some cases the clusters are well-separated, but in many other cases they may touch or overlap. The authors’ experience indicates that although the proposed model can appropriately cluster problems with well-separated clusters, its behavior in partitioning overlapping objects depends strongly on the other characteristics of the data and clusters. Therefore, a sensitivity analysis of the proposed model based only on the separation type of the clusters may not be very informative, and a scenario analysis is needed in order to evaluate the effects of changing several characteristics simultaneously, which is beyond the scope of this preliminary paper. Nevertheless, the average performance of the proposed model for three different types of separation (well-separated, touching, and overlapping) with the following characteristics is given in Table 2. The established discriminant functions and clusters for these situations are shown in Figs 9, 10, and 11, respectively.

1) Dimension of the space: Two
2) Noise and outliers: No
3) Statistical distribution: Uniform (Mean: 3, 7, and Variance: 4, 5, 5.5)
4) Number of clusters: Two
5) Shape of clusters: Regular and convex
6) Size of clusters: Equal (50)
7) Density of clusters: Equal
8) Separation of clusters: Well-separated, touch, and overlap
9) Number/type of attributes: Two/Uniform
10) Type of data space: Euclidean

Table 2
The average performance of the proposed model for different types of cluster separation ${}^{\ast}$

| | No. of clusters | Mean | Variance | Separation of clusters | Overall precision, training | Overall precision, test |
|---|---|---|---|---|---|---|
| 1- | 2 | 3, 7 | 4 | Well-separated | 100% | 100% |
| 2- | 2 | 3, 7 | 5 | Touch | 99% | 100% |
| 3- | 2 | 3, 7 | 5.5 | Overlap | 97% | 95% |

${}^{\ast}$ All values are rounded.

Figure 9.
The established discriminant function and clusters for well-separated clusters (Mean: 3, 7 and Var.: 4).

Figure 10.
The established discriminant function and clusters for touching clusters (Mean: 3, 7 and Var.: 5).

Figure 11.
The established discriminant function and clusters for overlapping clusters (Mean: 3, 7 and Var.: 5.5).

5. The desired features of the proposed clustering model

Generally, the desired features of clustering models depend on the particular problem under consideration. Several desirable characteristics of clustering models have been reported in the literature [24]. In this section, the ten most important desirable features of the proposed model are briefly introduced.

5.1 The proposed model is scalable

One of the most important features of clustering models is scalability in terms of both speed and space, especially for large data sets. In other words, the memory and processing time that a clustering model requires must remain manageable as the data set grows. Clustering models used in these cases should usually have linear or near-linear time complexity. Even models with a complexity of $O(m^{2})$ are not practical for large data sets. Some clustering models use statistical sampling to overcome this problem. Nonetheless, there are cases in which sampling is insufficient, e.g., situations where relatively rare points have a dramatic effect on the final clustering. Furthermore, it cannot generally be assumed that all of the data in a database will fit in main memory or that data elements can be accessed randomly. Models that rely on these assumptions are likewise infeasible for large data sets. Accessing data points sequentially and not depending on having all the data in main memory at once are important characteristics for scalability.
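To give a rough sense of why quadratic complexity is impractical, consider the following worked arithmetic. It assumes that $m$ denotes the number of data objects, takes a hypothetical data set of $m = 10^{6}$ objects, and assumes 8 bytes per stored distance; none of these figures come from the experiments in this paper.

$$
\frac{m(m-1)}{2} = \frac{10^{6}\,(10^{6}-1)}{2} \approx 5 \times 10^{11} \ \text{pairwise distances}, \qquad 5 \times 10^{11} \times 8\ \text{bytes} = 4 \times 10^{12}\ \text{bytes} \approx 4\ \text{TB}.
$$

A single sequential pass over the same data touches only $10^{6}$ objects, which is why linear or near-linear models remain feasible where a full distance matrix does not.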

5.2 The proposed model can generalize

Another important feature of clustering models is the ability to correctly assign unseen points to their appropriate clusters after the clustering of the sample data has finished, even if these points contain noisy information. In other words, clustering models must be able to generalize their past results to future data. However, it is common for clustering models to produce clusters that turn out to be poor when evaluated later. In the proposed model, this feature comes from the ability of multilayer perceptrons to generalize their results.

5.3 The results of the proposed model are easily interpretable

Another important feature of clustering models is the easy interpretability of their results. Many clustering models produce cluster descriptions that are just lists of the points belonging to each cluster. Such results are often hard to interpret. A description of a cluster as a region may be much more understandable than a list of points. Such a region may take the form of a hyper-rectangle, a center point with a radius, etc.
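The two region descriptions mentioned above can be derived from a list of points in a few lines. The following is only an illustrative sketch: the helper names `bounding_box` and `center_and_radius` and the sample cluster are hypothetical and are not part of the proposed model.

```python
import math

def bounding_box(points):
    """Axis-aligned hyper-rectangle: one (min, max) pair per dimension."""
    dims = range(len(points[0]))
    return [(min(p[d] for p in points), max(p[d] for p in points)) for d in dims]

def center_and_radius(points):
    """Centroid of the cluster plus the distance to its farthest member."""
    n = len(points)
    center = [sum(p[d] for p in points) / n for d in range(len(points[0]))]
    radius = max(math.dist(p, center) for p in points)
    return center, radius

# A hypothetical two-dimensional cluster.
cluster = [(1.0, 2.0), (3.0, 2.0), (1.0, 4.0), (3.0, 4.0)]
box = bounding_box(cluster)                  # [(1.0, 3.0), (2.0, 4.0)]
center, radius = center_and_radius(cluster)  # [2.0, 3.0] and sqrt(2)
```

Either summary replaces a list of member points with a handful of numbers that can be read and compared directly.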

5.4 The proposed model is insensitive to order of the input data

Another important feature of clustering models is that they must be independent of the order of the input data. Some clustering models depend on the order of the input, i.e., if the order in which the data points are processed changes, then the resulting clusters may change. This is unappealing because it calls into question the validity of the clusters that have been discovered. They may merely represent local minima or artifacts of the model.
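Order dependence is easy to reproduce with a simple single-pass scheme. The sketch below uses the classical "leader" heuristic purely as a contrasting example; it is not the proposed model, and the function name, threshold, and values are hypothetical.

```python
def leader_clustering(values, threshold):
    """Single-pass 'leader' clustering of one-dimensional values.

    Each value joins the first cluster whose leader lies within `threshold`;
    otherwise it starts a new cluster and becomes that cluster's leader.
    """
    leaders, clusters = [], []
    for v in values:
        for i, leader in enumerate(leaders):
            if abs(v - leader) <= threshold:
                clusters[i].append(v)
                break
        else:
            # No existing leader is close enough: open a new cluster.
            leaders.append(v)
            clusters.append([v])
    return clusters

# The same three values, presented in two different orders.
first = leader_clustering([0.0, 1.5, 3.0], threshold=2.0)
second = leader_clustering([1.5, 0.0, 3.0], threshold=2.0)
```

With the first ordering the values split into two clusters, `[[0.0, 1.5], [3.0]]`, whereas the second ordering yields a single cluster, `[[1.5, 0.0, 3.0]]`, although the data are identical.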

5.5 The proposed model can detect and deal with noise or outlying data

Another important feature of clustering models is the ability to detect noise and outliers. Some clustering models detect noise and outliers by applying tests that determine whether a particular point really belongs to a given cluster, and then either delete such points or otherwise eliminate their negative effects.
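One simple membership test of the kind described above flags points that lie unusually far from the cluster mean. The following sketch assumes one-dimensional data and a cutoff of two standard deviations; both the cutoff and the helper name `split_outliers` are illustrative assumptions, not the test used by the proposed model.

```python
from statistics import mean, pstdev

def split_outliers(values, k=2.0):
    """Separate values lying more than k standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    inliers = [v for v in values if abs(v - mu) <= k * sigma]
    outliers = [v for v in values if abs(v - mu) > k * sigma]
    return inliers, outliers

# A hypothetical cluster around 5 with one stray observation.
cluster = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3, 12.0]
inliers, outliers = split_outliers(cluster)
```

Here only the value 12.0 is flagged. Because the mean and standard deviation are themselves inflated by outliers, more robust statistics such as the median may be preferable in practice.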

5.6 The proposed model can find clusters in subspaces

Another important feature of clustering models is the ability to find clusters in subspaces of the original space, because in high-dimensional spaces clusters often occupy only a subspace of the full data space. Many clustering models have difficulty finding such clusters, for example, a five-dimensional cluster in a nine-dimensional space.

5.7 The proposed model can handle high dimensional problems

Another important feature of clustering models is the ability to cluster efficiently in high-dimensional spaces because, as discussed previously, clustering in high-dimensional spaces is quite different from clustering in low-dimensional spaces. Clustering in high-dimensional spaces is often problematic, since theoretical results have questioned the meaning of the closest match in such spaces. Many clustering models, especially those that use distance or density as the similarity measurement, have difficulty clustering in such situations.
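The loss of meaning of the closest match can be observed numerically: as the dimension grows, the nearest and the farthest neighbor of a query point become almost equally distant. The sketch below illustrates this effect under the assumption of independent, uniformly distributed attributes; the function name `relative_contrast`, the sample size, and the seed are hypothetical choices.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for distances from one query point
    to n_points random points drawn uniformly from the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)

low = relative_contrast(dim=2)      # nearest and farthest points differ greatly
high = relative_contrast(dim=1000)  # all points lie at nearly the same distance
```

The contrast in two dimensions is far larger than in one thousand dimensions, which is why distance-based and density-based similarity measurements lose discriminating power in high-dimensional spaces.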

5.8 The proposed model is robust in the presence of different underlying data and cluster characteristics

Another important feature of clustering models is the ability to appropriately cluster problems whose data and clusters have different characteristics. As mentioned previously, the data and clusters under study can differ in several characteristics, such as the type of data space, the dimension of the data, noise and outliers, the statistical distribution of the data, the number, shape, size, and density of clusters, the separation type of clusters, and the number and type of attributes. A robust clustering model must be able to handle all of them properly.

5.9 The proposed model can estimate any clustering parameters

Another important feature of clustering models is the ability to estimate clustering parameters such as the number, size, or density of clusters. Many clustering models take the number of clusters as a parameter. This can be useful in certain instances, e.g., when a clustering model is used to create a balanced tree for nearest-neighbor lookup, or when clustering is used for compression. In general, however, it is undesirable, since the specified number of clusters may not match the real number of clusters. A model that requires the number of clusters up front can always be run multiple times. Assuming that there is some way to compare the quality of the clusters produced, it is then possible to determine the best number of clusters empirically. Of course, this increases the amount of computation required. Likewise, it is often difficult to estimate proper values for the other parameters of clustering models. In general, the parameters of a clustering model may identify areas of weakness. In the best case, the results produced by a clustering model will be relatively insensitive to modest changes in the parameter values.
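The "run multiple times and compare" strategy can be sketched as follows. This example swaps in plain K-means with a deterministic initialization as the clustering model and the mean silhouette coefficient as the quality measure; both are illustrative choices rather than components of the proposed model, and all function names and data are hypothetical.

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with a deterministic start (evenly spaced picks)."""
    ordered = sorted(points)
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; higher means tighter, better separated clusters."""
    scores = []
    for i, p in enumerate(points):
        dist_by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                dist_by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = dist_by_cluster.pop(labels[i], None)
        if not own or not dist_by_cluster:
            scores.append(0.0)  # singleton cluster, or only one cluster overall
            continue
        a = sum(own) / len(own)                                  # cohesion
        b = min(sum(d) / len(d) for d in dist_by_cluster.values())  # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

def best_k(points, candidates):
    """Run the clustering once per candidate k and keep the best-scoring k."""
    return max(candidates, key=lambda k: silhouette(points, kmeans(points, k)))

# Three tight, well-separated groups of hypothetical two-dimensional points.
data = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
        (5.0, 5.0), (5.1, 5.2), (4.9, 5.1),
        (12.0, 0.0), (12.2, 0.1), (11.9, 0.2)]
chosen = best_k(data, candidates=[2, 3, 4])
```

For this data the procedure selects three clusters. The price is one full clustering run per candidate value, which is the additional computation mentioned above.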

5.10 The proposed model can function in an incremental manner

Another important feature of clustering models is the ability to cluster the underlying data sets incrementally. Many clustering models, such as K-means, must be run again if new data are added to the underlying data set. In certain cases, e.g., data warehouses, the underlying data used for the original clustering can change over time. If the clustering model can incrementally handle the addition of new data or the deletion of old data, this is usually much more efficient than re-running the model on the new data set.
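A minimal sketch of incremental updating is given below. It assumes a centroid-based description of the clusters and uses the standard running-mean update, so that each new point adjusts one center without revisiting earlier data. The class name `IncrementalCentroids` and the sample points are hypothetical, and the sketch covers additions only, not deletions; it is not the update rule of the proposed model.

```python
import math

class IncrementalCentroids:
    """Keeps one running mean per cluster and updates it point by point."""

    def __init__(self, centers):
        self.centers = [list(c) for c in centers]
        self.counts = [1] * len(centers)  # each seed counts as one observation

    def add(self, point):
        """Assign `point` to its nearest center and shift that center toward it."""
        j = min(range(len(self.centers)),
                key=lambda i: math.dist(point, self.centers[i]))
        self.counts[j] += 1
        step = 1.0 / self.counts[j]
        # Running-mean update: new_mean = old_mean + (x - old_mean) / n
        self.centers[j] = [c + step * (x - c)
                           for c, x in zip(self.centers[j], point)]
        return j

# Two seed centers, followed by three newly arriving points.
model = IncrementalCentroids([(0.0, 0.0), (10.0, 10.0)])
labels = [model.add(p) for p in [(1.0, 1.0), (9.0, 11.0), (2.0, 0.0)]]
```

After the three additions, the first center equals the mean of (0, 0), (1, 1), and (2, 0), exactly as if it had been recomputed from scratch, although each update used only the new point.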

6. Conclusions

In virtually every scientific field dealing with empirical data, scientists attempt to get a first impression of their data by trying to identify groups of similar data. The primary objective of cluster analysis is to partition a given data set into homogeneous groups, called clusters, such that objects within a cluster are more similar to each other than to objects belonging to different clusters. However, clustering is naturally an ill-posed problem, in which the act of grouping similar data objects is subjective and highly dependent on the cluster notion and clustering criterion used, especially the similarity measurement. Several different notions of cluster and similarity measurements, such as distance and correlation, have been presented in the clustering literature. For this reason, a vast number of algorithms have been developed, each aiming to address different aspects of clustering.

In this paper, a new notion of cluster and similarity is first presented, and a novel clustering model is then proposed based on these concepts. In the proposed model, data objects are partitioned based on their behavior in the learning process. This comes from the idea that a group of objects exhibiting the same learning behavior can be considered a cluster. In other words, the claim of the proposed model is that the learning and convergence speed of data objects can be used as a similarity measurement for clustering purposes. It must be noted that the learning speed of objects is also a function of the distances and correlations between objects, which the proposed model approximates intelligently during its learning process. This is the main difference between the proposed model and traditional clustering models, which assume that the similarity of objects can be measured by a given, predetermined function of distance or correlation.

Therefore, the proposed model needs a universal learning-based approximator in order to estimate the relationships and structures between objects and to calculate the learning speed of the objects as a similarity measurement. It must be noted that traditional clustering models are also founded on the similar behavior of data objects. However, all traditional clustering models use unsupervised learning algorithms, which are iterative processes of knowledge discovery or interactive multi-objective optimization processes that involve trial and failure; hence they are often expensive, time-consuming, and error-prone. The proposed model, in contrast to traditional clustering models, works in a supervised manner. Therefore, it is theoretically expected that the proposed model is generally faster, more accurate, and more efficient than unsupervised clustering models.

The feasibility and efficiency of the proposed model in practice are then evaluated on several simulated data sets. The empirical results indicate that the proposed model not only can be used as a clustering model, but also can achieve accurate results. These results indicate numerically that, in contrast to the literature, supervised models can also be used for solving unsupervised problems. Finally, some desirable characteristics of the proposed model are presented. Further discussion of specific features of the proposed model, advanced versions of the proposed model (such as nonlinear, soft and fuzzy, support vector, hierarchical, and K-means versions), evaluation of the proposed model in real-world situations, and comparison of its performance with other clustering models are left for future work. In this preliminary paper, only the main notions and the main idea of the proposed model are introduced, and only a preliminary evaluation of its performance in clustering problems is presented.

Acknowledgments

The authors wish to express their gratitude to Dr. Mehdi Bijari and Ali Zeinal Hamadani, Department of Industrial Engineering, Isfahan University of Technology (IUT), and two anonymous referees for their helpful comments, which greatly helped us to improve our paper.

References

1. Tan, Steinbach and Kumar, Introduction to Data Mining, Addison-Wesley, 2005.

2. Estivill-Castro, Why so many clustering algorithms, ACM SIGKDD Explorations Newsletter 4(1), 65–75.

3. Halkidi, Batistakis and Vazirgiannis, On Clustering Validation Techniques, Journal of Intelligent Information Systems 17 (2001), 107–145.

4. Jain A.K. and Dubes R.C., Algorithms for Clustering Data, Prentice Hall, 1988.

5. Kaufman and Rousseeuw P.J., Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, 1990.

6. Kent R.A., Analysing Quantitative Data: Variable-based and Case-based Approaches to Non-experimental Datasets, SAGE Publications, 2015.

7. Jain A.K., Murty M.N. and Flynn P.J., Data Clustering: A Review, ACM Computing Surveys 31 (1999), 264–323.

8. Guha, Rastogi and Shim, CURE: An Efficient Clustering Algorithm for Large Databases, In Proceedings of the ACM SIGMOD Conference, 1998.

9. MacQueen J.B., Some Methods for Classification and Analysis of Multivariate Observations, In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.

10. Kaufman and Rousseeuw P.J., Clustering by means of medoids, Statistical Data Analysis based on the L1-Norm and Related Methods, North-Holland 1987, pp. 405–416.

11. Ester, Kriegel, Sander and Xu, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, 1996, 226–231.

12. Hinneburg and Keim, An Efficient Approach to Clustering in Large Multimedia Databases with Noise, Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, AAAI Press.

13. Ankerst, Breunig M.M., Kriegel H.P. and Sander, OPTICS: Ordering Points to Identify the Clustering Structure, ACM SIGMOD international conference on Management of data, ACM Press, 1999, 49–60.

14. Mechelen, Bock H.H. and De Boeck, Two-mode clustering methods: a structured overview, Statistical Methods in Medical Research 13 (2004), 363–94.

15. Khashei, Hejazi S.R. and Bijari, A new hybrid artificial neural networks and fuzzy regression model for time series forecasting, Fuzzy Sets and Systems 159 (2008), 769–786.

16. Khashei, Zeinal Hamadani and Bijari, A fuzzy intelligent approach to the classification problem in gene expression data analysis, Knowledge-Based Systems 27 (2012), 465–474.

17. Cybenko, Continuous Valued Neural Networks with Two Hidden Layers are Sufficient, Technical Report, Tufts University, 1988.

18. Hornik, Stinchcombe and White, Multilayer feed-forward networks are universal approximators, Neural Networks 2 (1989), 359–366.

19. Hornik, Approximation capabilities of multilayer feed forward networks, Neural Networks 4 (1991), 251–257.

20. Khashei and Bijari, An artificial neural network (p, d, q) model for time series forecasting, Expert Systems with Applications 37 (2010), 479–489.

21. Rumelhart and McClelland, Parallel distributed processing, Cambridge, MA: MIT Press, 1986.

22. Khashei, Soft Intelligent Decision Making, Ph.D. Thesis, Industrial and Systems Engineering Department, Isfahan University of Technology, 2012.

23. Zhang, Patuwo B.E. and Hu M.Y., Forecasting with artificial neural networks: the state of the art, International Journal of Forecasting 14 (1998), 35–62.

24. Khashei, Bijari and Ardali G.A.R., Improvement of Auto-Regressive Integrated Moving Average Models Using Fuzzy Logic and Artificial Neural Networks (ANNs), Neurocomputing 72 (2009), 956–967.

25. Khashei, Zeinal Hamadani and Bijari, A novel hybrid classification model of artificial neural networks and multiple linear regression models, Expert Systems with Applications 39 (2012), 2606–2620.

26. Manning C.D., Raghavan and Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008.

27. Rand W.M., Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association 66 (1971), 846–850.

28. Chen M.S., Han and Yu P.S., Data Mining: An Overview from Database Perspective, IEEE Transactions on Knowledge and Data Engineering 8 (1996), 866–883.

29. Beyer, Goldstein, Ramakrishnan and Shaft, When is Nearest Neighbor Meaningful? International Conference on Database Theory (1999), 217–235.

30. Khashei and Bijari, A novel hybridization of artificial neural networks and ARIMA models for time series forecasting, Applied Soft Computing 11 (2011), 2664–2675.

31. Khashei, Forecasting the Isfahan Steel Company production price in Tehran Metals Exchange using artificial neural networks (ANNs), Master of Science Thesis, Isfahan University of Technology, 2005.
