Abstract
Cluster analysis or clustering is one of the most important and widely used techniques for data exploration and knowledge discovery that concerned with partitioning a set of objects in such a way that objects in the same groups, called clusters, are more similar to each other than to those in other clusters. However, obtaining the clusters that exhibit high within-cluster similarity or homogeneity and high between-cluster dissimilarity or heterogeneity is critically depended on the similarity notion, which has not been yet clearly defined for clustering purposes. Distance and correlation are the most important and commonly used mathematics and statistics-based similarity measurements in the literature of the clustering, respectively. In this paper, the learning speed of the supervised neural networks is proposed as novel intelligent similarity measurement for unsupervised clustering problems. On the other hand, the main aim of this paper is to answer this question that can convergence speed of the different objects to the given target be used for measuring the similarity. Empirical results of the simulated data sets indicate that the proposed measurement not only can be used as similarity measurement in clustering tasks, but also can produce accurate results. In this way, for first time and in contrast of the literature, it is demonstrated that a supervised model can be used for handling the unsupervised tasks.
Keywords
Introduction
Cluster analysis or clustering is one of the basic data mining techniques for exploring the underlying structure of a given data set and is being applied in a wide variety and broad range of scientific disciplines to discover hidden knowledge and information. There are generally two main purposes for using cluster analysis, including understanding and utility. Clustering has been extensively studied for understanding the conceptually meaningful groups of object that share common characteristics to better analyze and describe in many areas such as medicine, psychology, biology, society, climate, business, etc. Utility purposes of cluster analysis such as summarization, comparison, and efficiently finding nearest neighbors has also attracted a lot of research attention in the field of data mining and knowledge discovery. Clustering provides an abstraction from individual data objects to the clusters in which they reside. In addition, some clustering techniques characterize each cluster in terms of cluster prototype; i.e. a data object that is representative of other objects in the cluster. These prototypes can be applied as the basis for a number of data analysis or data preprocessing purposes. Therefore, in the context of utility, clustering is the study of techniques for finding the most representative cluster prototypes [1].
While it is easy to consider the idea of a data cluster on a rather informal basis, it is very difficult to give a formal and universal definition of a cluster. The definition of what constitutes a cluster is not well defined; and hence, not surprisingly, there are several different definitions for clustering in the literature [2]. However, several working definitions of a cluster that are commonly used are as follows:
Well-separated Cluster Definition: A cluster is a set of objects such that any object in a cluster is more similar to every other object in the cluster than to any object not in the cluster. Sometimes a threshold level is used to specify that all objects in a cluster must be sufficiently similar to one another. Center-based Cluster Definition: A cluster is a set of objects such that an object in a cluster is more similar to the center (e.g. mean, median, etc.) of a cluster, than to the center of any other cluster. Contiguous Cluster Definition (Nearest neighbor or Transitive Clustering): A cluster is a set of objects such that an object in a cluster is more similar to one or more other objects in the cluster than to any object that are not in the cluster. Density-based Definition: A cluster is a dense region of objects, which is separated by low-density regions, from other regions of high density. This definition is more often used when the clusters are irregular or intertwined, and when noise and outliers are present. Similarity-based Cluster Definition: A cluster is a set of objects that are similar, and objects in other clusters are not similar. A variation on this is to define a cluster as a set of objects that together create a region with a uniform local property, e.g., density or shape.
According to the different notions of cluster, different definitions for clustering have been also presented in the literature [3]. Commonly, a clustering algorithm aims to find the natural structures or relationships in an unlabeled data set. Clustering is a process to identify and classify objects (observations or variables) into one of a (unspecified) number of groups, called clusters, so that clusters exhibit high within-cluster similarity (homogeneity) and high between-cluster dissimilarity (heterogeneity). Clustering is the technique of grouping a set of physical or abstract objects into different clusters, such that objects with in a cluster are more similar to one another and are dissimilar to the objects in other clusters. A good clustering algorithm generates high quality clusters to yield low inter cluster similarity and high intra cluster similarity. Cluster analysis is a process of partitioning a given set of inputs into natural groups (clusters) such a way that each input can be assigned to each cluster with a certain degree of belongingness. Clustering groups objects based on the information found in the data describing the objects or their relationships. The goal is that the objects in a group will be similar (or related) to one other and different from (or unrelated to) the objects in other groups.
Similarity is the intersection point of all clustering definitions and is doubtlessly the most critical notion in the cluster analysis area. In addition, measuring the similarity is one of the most challenging problems in clustering and knowledge discovery tasks. Theoretically, if a measurement satisfies a number of requirements, it can be generally applied as similarity or proximity measurement. These requirements are as follows [4]:
For dissimilarity: For similarity:
Additionally, if the proximity measure is real-valued and is a true metric in a mathematical sense, then the following two conditions also hold in addition to conditions 2 and 3.
Dissimilarities that satisfy conditions 1–5 are called distances, while dissimilarity is a term commonly reserved for those dissimilarities that satisfy only conditions 1–3. Distance is the most important and widely used similarity measurement in clustering tasks. Some well-known distance functions include Minkowski distance (Manhattan (L
Correlation is another well-known and widely used similarity measurement in literature of clustering problems. The need for correlation as similarity measurement is based on this fact that distance functions may be totally inadequate for capturing correlations among the objects. On the other hand, strong correlation structures may still exist between a set of objects even if they are far apart from each other as measured by the distance functions. Mahalanobis distance is a well-known mixture similarity measurement of distance and correlation, developed in order to overcome the problem of potential intercorrelation among the objects. In this measurement, the sample variance-covariance matrix in which the principal diagonal elements are sample variances and the off principal diagonal elements are sample covariances is used in order to adjust the intercorrelation among the objects.
In addition to a wide range of similarity measurement, a vast number of procedures have been also proposed, in order to construct different clustering algorithms to address different aspects of the clustering problems. Generally, clustering procedures can be categorized in two main categories including hierarchical (nested) and nonhierarchical (unnested) clustering procedures [5]. Hierarchical procedures produce a nested sequence of partitions, with a single, all inclusive cluster at the top (bottom) and singleton clusters of individual points at the bottom (top). Therefore, there are two general types of hierarchical clustering procedures – agglomerative and divisive. In an agglomerative procedure, each object starts out as its own cluster. In the subsequent steps, the two closest clusters/objects are combined into a new aggregate cluster, thus reducing the number of clusters by one in each step. Eventually, all objects are grouped into one large cluster. Five popular agglomerative procedures used to form clusters are single linkage (MIN), complete linkage (MAX or CLIQUE), average linkage, centroid, Ward’s method. In a divisive clustering procedure, named Diana, the process proceeds in the opposite direction, it start out with one large cluster containing all objects; in the subsequent steps, the objects that are most dissimilar are split off and turned into smaller clusters; this process continues until each object forms a cluster of itself. The hierarchical clustering result is often displayed graphically using a tree-like diagram, called dendrogram, which displays both the cluster-subcluster relationships and the order in which the clusters were merged or split (agglomerative or divisive view).
Unlike to hierarchical clustering procedures, nonhierarchical or partitional clustering procedures do not involve the treelike construction process. Instead, its first step is to select a cluster center (or seed), and all objects within a prespecified threshold similarity are included in the resulting cluster. Nonhierarchical clustering procedures are frequently referred to as
The parallel threshold procedure selects several cluster seeds simultaneously in the beginning, and objects within the threshold distance are assigned to the nearest seed. As the process evolves, threshold distances can be adjusted to include fewer or more objects in the clusters. Also, in some methods, objects remain unclustered if they are beyond the prespecified similarity from any cluster seed. The optimizing procedure is similar to the other two except that it allows for reassignment of objects to another cluster from the original on the basis of some overall optimizing criterion. The number of clusters,
In the literature, several different techniques have been proposed, based on the different notions of cluster, different definitions of clustering, different measurements of similarity, and different procedures of clustering. There is no objectively correct clustering technique, and many researchers believe that clustering is in the eye of the beholder [1]. The most appropriate clustering techniques for a particular problem often need to be chosen experimentally, unless there is a mathematical reason to prefer one cluster model over another [2]. In addition, there are no definite categorization standards for different clustering techniques [7]. Some of the most prominent clustering techniques in the literature are listed and categorized below:
Connectivity-based techniques, such as CURE [8] model that use agglomerative hierarchical to build clusters based on distance connectivity. Centroid-based techniques, such as K-means [9] and K-medoid [10] models that are based on this idea that each cluster can be represented by a single center (e.g. centroid, mean, median, medoid, etc.) vector. Distribution-based techniques, such as models that use statistical distributions, e.g. multivariate normal distributions to model clusters using the Expectation-maximization algorithm. Density-based techniques, such as DBSCAN [11], DENCLUE [12], and OPTICS [13] that define clusters as connected dense regions in the data space. Subspace-based techniques, such as Biclustering [14] (co-clustering or two-mode-clustering) that clusters are modeled with both cluster members and relevant attributes. Graph-based techniques, such as models in which a clique, i.e., a subset of nodes in a graph such that every two nodes in the subset are connected by an edge can be considered as a prototypical form of cluster.
In this paper, a new notion of cluster and similarity is presented; and consequently, a new clustering technique is proposed. The notion is based on the idea that a group of objects which represent the same learning behavior can be considered as a cluster. In other words, objects can be partitioned base on their learning and convergence speed to the given target. Therefore, a cluster can be defined as follows:
“A cluster is a set of objects such that any object in a cluster has close and similar learning speed”
In fact, in this learning-based definition of the cluster, the learning speed plays role of the similarity measurement in the traditional clustering techniques. However, it is clearly known that the learning rate for using as similarity measurement in the clustering tasks must at least satisfy the first three aforementioned requirements. However, it is easy to show that the learning speed satisfies these requirements. Therefore, it has the necessary conditions and can be theoretically considered as a clustering similarity measurement.
However, in order to measure the similarity, based on the learning speed, a learning-based technique is first needed in order to model the objects generation process and also relationships and correlation structures between them. Then the model is used in order to calculate the learning and convergence speed of each object. In the literature, there are several different learning-based and machine learning techniques which theoretically can be applied for this purpose [15]. Artificial neural networks (ANNs) and fuzzy inference systems (FISs) are the most important and commonly used hard/crisp and soft/fuzzy learning-based techniques, respectively. Hard/crisp clustering techniques assign each data point to exactly one cluster while soft/fuzzy clustering techniques may assign each data point to several clusters with varying degrees of memberships. In this paper, multi-layer perceptrons (MLPs), which are the most well-known neural networks, are used, as example, for modeling and measuring the learning speed in hard and crisp clustering tasks. Soft and fuzzy clustering problems can be also solved by replacing the hard and crisp models with soft and fuzzy learning-based models. The rest of the paper is organized as follows. In the next section, the basic concepts and modeling approaches of the multi-layer perceptrons (MLPs) are briefly reviewed. In section 3, the formulating and designing of the proposed clustering model is introduced. In Section 4, the feasibility and performance of the proposed model is evaluated in clustering problems using generated data sets. In Section 5, the desired features of the proposed clustering model are reviewed. Our concluding remarks are presented in Section 6.
Multi-layer perceptrons (MLPs) are computer systems developed to mimic the operations of the human brain by mathematically modeling its neuro-physiological structure. In MLPs, computational units called neurons replace the nerve cells and the strengths of the interconnections are represented by weights, in which the learned information is stored. This unique arrangement can acquire some of the neurological processing ability of the biological brain such as learning and drawing conclusions from experience. Multi-layer perceptrons have been shown to be effective at approximating complex nonlinear functions, which is very useful in a wide rang of scientific tasks, especially forecasting and classification [16]. Single hidden layer feed forward neural network is the most widely used form of the artificial neural network for modeling, forecasting, and classification. The model is characterized by a network of three layers of simple processing units connected by acyclic links (Fig. 1). The relationship between the output (
where,
The simple network given by Eq. (1) is surprisingly powerful in that it is able to approximate the arbitrary function as the number of hidden nodes when
Multi-Layer Perceptrons structure (
There exist many different approaches such as the pruning algorithm, the polynomial time algorithm, the canonical decomposition technique, and the network information criterion for finding the optimal architecture of an artificial neural network. These approaches can be generally categorized as follows [20]: (i) Empirical or statistical methods that are used to study the effect of internal parameters and choose appropriate values for them based on the performance of model. The most systematic and general of these methods utilizes the principles from Taguchi’s design of experiments. (ii) Hybrid methods such as fuzzy inference where the artificial neural network can be interpreted as an adaptive fuzzy system or it can operate on fuzzy instead of real numbers. (iii) Constructive and/or pruning algorithms that, respectively, add and/or remove neurons from an initial architecture using a previously specified criterion to indicate how artificial neural network performance is affected by the changes. The basic rules are that neurons are added when training is slow or when the mean squared error is larger than a specified value. In opposite, neurons are removed when a change in a neuron’s value does not correspond to a change in the network’s response or when the weight values that are associated with this neuron remain constant for a large number of training epochs. (iv) Evolutionary strategies that search over topology space by varying the number of hidden layers and hidden neurons through application of genetic operators and evaluation of the different architectures according to an objective function.
Although many different approaches exist in order to find the optimal architecture of an artificial neural network, these methods are usually quite complex in nature and are difficult to implement. Furthermore, none of these methods can guarantee the optimal solution for all real forecasting problems. To date, there is no simple clear-cut method for determination of these parameters and the usual procedure is to test numerous networks with varying numbers of hidden units, estimate generalization error for each and select the network with the lowest generalization error. Once a network structure is specified, the network is ready for training a process of parameter estimation. The neural network training is an unconstrained nonlinear minimization problem in which weights and biases of a network are iteratively modified to minimize the cost function. Cost function is an overall accuracy criterion such as the following mean squared error:
where,
where, the parameter
The momentum term may also be helpful to prevent the learning process from being trapped into poor local minima, and is usually chosen in the interval [0; 1]. Finally, the estimated model is evaluated using a separate hold-out sample that is not exposed to the training process.
It can be achieved from various clustering algorithms that there are significantly differences between the notion of what constitutes a cluster and how to efficiently find them. The most popular traditional notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals, correlations or particular statistical distributions [22]. In this paper, a new notion of cluster is presented, based on this idea that objects with same learning behavior can be partitioned in the same groups or clusters. Consequently, the learning and convergence speed of the objects to the given target is defined as new similarity measurement, and then, a new clustering model is proposed. Therefore, in the first stage of the proposed model, a learning-based technique is designed in order to model objects generation process and also relationships and correlation structures between them. In the second stage of the proposed model, the obtained results of the first stage and behavior of hidden neurons are analyzed and then objects are partitioned in desired clusters.
Data-driven learning-based techniques are generally built from either labeled or unlabeled data sets. Unlabeled data sets do not require domain experts in order to label data points, which is an expensive, time-consuming, and error-prone process. However, if only the unlabeled data sets are available, only the unsupervised methods can be used to build model. However, the unsupervised methods have to use the unsupervised learning algorithms, which are expensive, time-consuming, and error-prone processes. Therefore, both these processes – labeled data sets and supervised algorithms or unlabeled data sets and unsupervised algorithms- are costly, inefficient, and inappropriate for measuring the learning speed in the learning-based techniques for clustering purposes. Traditionally, clustering models follow the second strategy; however, the idea of this paper, is to use unlabeled data sets in supervised models.
Our idea is a subset of a general idea that believes “Modeling Process Is Continuous”. On the other hand, there is no significant difference between supervised and unsupervised modeling processes and there is a one-to-one correspondence between supervised and unsupervised models. In other words, there is an unsupervised model for each supervised model and vice versa. An unsupervised modeling process with
In our learning-based clustering process, by assuming that similar objects have the similar behavior in the different situations, it is concluded that there is no significant difference between target values in the modeling processes. On the other hand, since learning speed is independent from target values, aforementioned idea can be used in order to construct a learning-based clustering model by replacing target values with a logical pre-specified learnable values. In this way, the necessary theoretical background for supervised discovering patterns, relationships and structures in unlabeled data sets for unsupervised clustering tasks are available. In this paper, a constant value is considered as target. The reason is that it is the simplest function that is independent from each explanatory variable. Therefore, it is enough that a supervised learning-based technique is used for modeling process. As mentioned previously, the multi-layer perceptrons are selected in this paper for practical modeling of data as last section of the modeling stage of the proposed clustering model.
The multi-layer perceptrons are the most important and widely used types of learning-based techniques especially for modeling purposes, because of their inherent capability of arbitrary input–output mapping. Despite of the many satisfactory characteristics of the multi-layer perceptrons, building a MLP for a particular modeling problem is a nontrivial task [23]. Modeling issues that affect the performance of a multi-layer perceptron must be considered carefully. One of the most critical decisions is to determine the appropriate architecture, that is, the number of layers, the number of neurons in each layer, and the numbers of arcs which inter connect with the neurons. Other network design decisions include the selection of activation functions of the hidden and output neurons, training algorithm, data preprocessing, training and test sets, and performance measures. Fortunately, the design process of the multi-layer perceptrons for the unsupervised clustering purposes is very similar to the traditional supervised MLPs.
The selection of these parameters is basically problem-dependent. The number of input and output neurons in unsupervised version similar to the supervised one is relatively easy to specify as it is directly related to the problem under study and correspond to the number of the explanatory and dependent variables, respectively. The hidden neurons play the most important role in the supervised modeling process of multi-layer perceptrons. It is the hidden neurons that allow the perceptrons to detect the structures, capture the patterns in the data, and to perform complicated nonlinear mapping between explanatory and depended variables. The issue of determining the optimal number of hidden neurons is a crucial yet complicated one. In general, multi-layer perceptrons with fewer hidden neurons are preferable as they usually have better generalization ability and less overfitting problem. However, perceptrons with too few hidden neurons may not have enough power to model and learn the data. There is no theoretical basis for selecting this parameter although a few systematic approaches are reported. For examples, both algorithms for pruning out unnecessary hidden nodes and adding hidden nodes to improve network performance have been suggested. The most common way in determining the number of hidden nodes is via experiments or by trial-and error. Several rules of thumb have been also proposed [24].
The hidden neurons similar to the supervised version play the most important role in the unsupervised clustering process. Basically, the number of the hidden neurons in the supervised version is simultaneously depended on the complexity of the input-output function and the number of existing structures and patterns in the data. However, in the unsupervised version, based on this fact that there is no complex input-output function, the number of hidden neurons is just depended on existing structures and patterns in the data. Therefore, if the number of hidden neurons is correctly chosen in the design process, then the behavior of each neuron for similar structures and patterns will be similar and differed by other neurons. On the other hand, structures and patterns with certain behavior in a given neuron are similar together and dissimilar with other those structures and patterns. Therefore, each hidden neuron in the unsupervised version exactly represents the notion of a cluster; and hence, can be considered as a cluster. On the other hands, the similarity of each object is calculated using the weight of each hidden neuron. Thus, weight of each hidden neuron determines the cluster that data must be assigned to it.
The proposed model, in contract of traditional clustering models, does not assume the certain relationship for measuring the similarity and dissimilarity between different objects and they are intelligently detected by designed network. Therefore, there is no requirement to define and predetermine a given relationship such as distance or correlation as similarity measurement; or predetermine a given type of the center such as mean or median and threshold level for assigning the objects; or compare the yielded clusters for consistency of results; or others; but, clusters are automatically determined in the learning process. In addition, the proposed model, in contract of traditional iterative clustering models, does not require a vast number of iterations and also store the huge amounts of information in order to yield the final results.
It is must be noted that this claim is not in contrary to this fact that the proposed model is used an iterative supervised learning algorithm such as backpropagation in its clustering process. Because, as mentioned previously, the proposed model in order to cluster the objects only need the speed of learning and convergence; hence, it is not necessary that the learning process is completed. On the other hand, the first iteration is practically sufficient for measuring the learning speed and further iterations are not needed; of course, when the learning process is converged to the target. There is no theoretical evidence for determining the iteration that learning process will be converged after it. However; the author’s experiences indicate that in the proposed model, maybe because of having too simple target, learning process will often converge to the target from the first stage. Therefore, the proposed model is faster and easier to use than other those traditional clustering models, especially for large scale problem. From another point of view, increasing the number of used data that are the main reason of increasing the processing time and required space in the traditional clustering models, in the proposed model will cause to increase the accuracy. Because, it is clearly known from literature that in multilayer perceptrons, as in any statistical approach, the accuracy is closely related to the training sample size and will be generally increased if the training sample size increases.
The activation functions, also called the transfer functions, determine the relationship between inputs and outputs of the neurons. The activation functions introduce a degree of nonlinearity and complexity that is valuable for most applications of supervised networks. In general, any differentiable function can qualify as an activation function in theory. However, only a small number of well-behaved (bounded, monotonically increasing, and differentiable) activation functions, such as sigmoid (logistic), hyperbolic tangent (tanh), sine or cosine, and linear functions are practically used. There are some heuristic rules for the selection of the activation functions. For example, using the logistic activation functions for classification problems which involve learning about average behavior, or using the hyperbolic tangent functions if the problem involves learning about deviations from the average such as the forecasting problem [25]. Although in the unsupervised version, similar to the supervised, it is not clear whether different activation functions have major effects on the performance, the author’s experiences indicate that the logistic or tanh and linear activation functions seem well suited for the hidden neuron for the clustering problems; where the number of clusters is even and odd, respectively.
The process of selection other parameters of unsupervised version is exactly similar to the supervised one; hence, more descriptions about them has been omitted from this paper. Once the design process is finished and optimal architecture of the unsupervised MLP is specified, the network is ready for training and estimating the suboptimal values of parameters. The structure of the proposed unsupervised multilayer perceptrons model is shown in Fig. 2.
Unsupervised multilayer perceptrons structure (
After measuring the similarity of objects to clusters (
where
where
This discriminant function between clusters can be also given in the form of a hyper-ellipsoidal (in the general form of
Unsupervised clustering multilayer perceptrons structure (
In this section, in order to practically demonstrate the feasibility and suitability of the proposed clustering model, it is evaluated using simulated data sets. Therefore, in the first subsection, the evaluation measures are briefly reviewed in order to choose the most consistent one for the proposed clustering model. In the second subsection, the process of generating the fitting problems for evaluating the proposed model is introduced; and finally, the performance of the proposed clustering model in simulated problems is evaluated and analyzed.
Evaluation measures for clustering models
In the supervised tasks, such as classification, the evaluation of the results is an integral part of the developing process of the classification models. In these cases, there are the well-accepted evaluation measures and procedures, e.g. accuracy and cross-validation, respectively. However, because of its very nature, cluster evaluation is not a well-developed or commonly used part of cluster analysis. Nonetheless, cluster analysis, or cluster validation as it is more traditionally called, is very important and should be a part of any cluster analysis. The evaluation measures, or indices, that are applied in the literature in order to judge various aspects of cluster validity are traditionally categorized into three main categories as follows [1]:
Unsupervised indices: The unsupervised, called also internal, indices measure the goodness of a clustering structure without respect to external information and use only information present in the data set. Unsupervised measures of cluster validity are often further divided into two subcategories, as follows:
The cohesion and separation-based indices: Many internal measurements for partitional clustering schemes are based on the notion of cohesion and separation. These measures of cluster cohesion (compactness, tightness), which determine how closely related the objects in the cluster are, and measures of cluster separation (isolation), which determine how distinct or well-separated a cluster is from other clusters. The similarity matrix-based indices: These indices for evaluating the goodness of clustering measure the correlation between the similarity matrix and an ideal version of similarity matrix based on the cluster labels. An ideal cluster is one whose points have a similarity of 1 to all points in the cluster, and similarity of 0 to all points in the other clusters. Supervised indices: When there is the external information about data, which is typically in the form of externally derived class labels for the objects, the usual procedures for evaluating the goodness of clustering are the supervised indices that measure the degree of correspondence between the cluster labels and the class labels. On the other hand, the supervised indices measure the extent to which the clustering structure discovered by a cluster algorithm matches some external structure such as entropy which measures how well cluster labels match externally supplied class labels. Motivations for such an analysis are the comparison of clustering with the “ground truth” or the evaluation of the extent to which a manual classification process can be automatically produced by cluster analysis. Supervised measures are often called external indices, because they use information not present in the data set. Supervised measures of cluster validity are often further divided into two subcategories, as follows:
Class-oriented indices: There are a number of measures, such as entropy, purity, precision, Recall, and F-measure [26], which are commonly used to evaluate the performance of the classification models. In the case of classification, the degree to which predicted class labels correspond to the actual class labels is measured. The class-oriented indices use these classification measures by using cluster labels instead of predicted class labels. These indices evaluate the extent to which a cluster contains objects of a single class. Similarity-oriented indices: The Similarity-oriented indices are related to the similarity measures for binary data, such as the Jaccard and RAND measures [23]. All these indices are based on the premise that any two objects that are in the same cluster should be in the same class and vice versa. This work is generally done by comparing two matrices: 1) the ideal cluster similarity matrix, which has a 1 in the ijth entry if both objects Relative: The relative indices compare the different clustering models and different clusters. A relative clustering index is a supervised or unsupervised evaluation measure that is used for the purpose of comparison. Thus relative measures are not actually a separate type of cluster evaluation measure, but are instead a specific use of such measures.
As mentioned previously, in this paper the simulated data sets will be used in order to practically demonstrate the feasibility of the proposed clustering model; therefore, the data generation process and class labels for each object are available. Hence, the supervised indices will be the most consistent and the most appropriate measures for evaluating the goodness of the proposed clustering model. In this paper, the overall precision (a supervised class-oriented index) is used for evaluating the obtained results of the proposed model, which is calculated as follows:
In the general, the performance of a clustering model depends on the particular problem and data under consideration. In this paper, in order to demonstrate the feasibility of the proposed model in practice, different characteristics of data sets and clusters are first briefly described and then, a clustering problem with the simplest necessary conditions is generated in order to show that the proposed procedure and the proposed model can be basically considered as a clustering model or not. Then, if the answer was positive, the restricted conditions would be lifted one at a time in order to show the appropriateness and effectiveness of the proposed model in different situations.
Describing the different underlying data sets and cluster characteristics
In the clustering literature, several different characteristics have been reported for data and the clusters, which can be effective on the performance of the clustering models [28]. Therefore, the robust clustering models must be taken into account all of these data and clusters characteristics. Some of the most important characteristics (10 top) of data sets and clusters, in the literature can be generally summarized as follows:
Dimensionality: Dimension of the underlying data sets and clusters is one of the most important factors in performance, speed, and required space of the clustering models, especially for distance-based models. It is shown in the literature that the distances between the closest and farthest neighbors of a point may be very similar in high dimensional spaces. Perhaps an intuitive way to see this is to realize that the volume of a hyper-sphere with radius, Noise and Outliers: A point which is noise or is simply an atypical point (outlier) can often be effective on the performance of the clustering models and can distort them. Some clustering models can detect outliers and delete them or otherwise eliminate their negative effects, by applying tests that determine if a particular point really belongs to a given cluster. This processing can occur either while the clustering process is taking place or as a post-processing step. However, in some instances, points cannot be discarded and must be clustered as well as possible. In such cases, it is important to make sure that these points do not distort the clustering process for the majority of the points. Therefore, the performance of some clustering models depends on that the underlying data set has noise and outliers. Statistical Distribution: The data generation processes of some data sets follow a given statistical distribution such as Gaussian (normal) or uniform, but this is frequently not the case. Some clustering models such as distribution-based models that use the Expectation-Maximization algorithm to model clusters need to know the statistical distribution of the underlying data sets. Therefore, the performance of some clustering models depends on that the underlying data set has a given statistical distribution. Cluster Number: The number of clusters is another effective factor in performance, speed, and required space of the clustering models. The performance of the clustering models will be generally decreased by increasing the number of clusters. Moreover, the processing time and required space of the clustering models have a positive relationship with the number of clusters. Therefore, the performance of the clustering models depends on the number of the clusters in the underlying problem. Cluster Shape: Some clusters have the regular shape such as rectangular or globular; and are convex. Some clustering models such as Cluster Size: Some clusters have the similar size, but this is frequently not the case. Some clustering models such as Cluster Density: Some clusters have the similar density, but this is frequently not the case. Some clustering models such as Cluster Separation: Some clusters are well-separated, but in many other cases the clusters may touch or overlap. Many clustering models have the problem in the presence of the overlapping clusters. Therefore, the performance of some clustering models depends on that clusters are well-separated. Many and Mixed Attribute Types: Some data sets have many attributes or include both continuous and categorical attributes (mixed attributes). A mix of attribute types is usually handled by having a proximity function that can combine all the different attributes in an “intelligent” way. Many clustering models have assumption about the data and cluster characteristics and do not work well if those assumptions are violated. In such cases the clustering model may miss clusters, split clusters, merge clusters, or just produce poor clusters. Therefore, the performance of some clustering models depends on that data sets have many or mixed attributes. Type of data space, e.g., Euclidean or non-Euclidean: Some clustering techniques calculate means or use other vector operations that often only make sense in Euclidean space.
In the literature, there is no reasonable measure for feasibility evaluation of a given clustering model. On the other hand, the minimum logical necessary conditions for a given clustering model have not been yet determined. In this section, in order to show the feasibility of the proposed clustering model, data sets with the simplest characteristics are generated. Now, if the performance of the proposed model on the generated data sets is equal or greater than the performance of the randomly assigning of objects into clusters, then the proposed model can be considered as clustering model. Therefore, based on the aforementioned characteristics, the ideal problems and data sets for feasibility evaluation can be considered with the following characteristics. These characteristics indicate that the dimension of the under-study data set, there is noise and/or outliers in the data or not, the data follow a particular statistical distribution or not, the number of clusters that exist in the data, the shape of clusters, the number of data in each cluster, density of clusters, type of cluster separation from each other, the number of attributes (variables) in discriminant function, and type of data space; respectively.
Dimension of the space: Noise and outliers: Statistical distribution: Number of clusters: Shape of clusters: Size of clusters: Density of clusters: Separation of clusters: Number/type of attributes: Type of data space:
Therefore, in the first phase, a mono-attribute binary clustering problem with uniform distribution (by given and constant mean and variance), without noise and outliers, is constructed as framework problem in which clusters are convex and well-separated and also have the equal size, regular shape, and equal density. In the second phase, 100 different problems are produced by changing the values of mean and variance in the size of 50, 100, 150, 200, and 250. Finally, in order to eliminate the effects of random processes in the data generation procedure, each problem is generated 100 times and the average performance of the proposed model on these 100 problems in training and test samples are calculated in each situation. In this paper, the 80–20% randomly split of clusters with first 80% in the training and second 20% in the test is used.
Obtained results indicate that the proposed model not only can cluster the objects, appropriately; but also can achieve the good performance in both test and training samples (100% in both). Of course, it is not very surprising, because the generated problems were too simple and only designed for feasibility evaluation of the proposed model. However, feasibility checking is logically necessary for the proposed model due to significantly differences of the proposed model with traditional clustering models in the notion of cluster, similarity measurement, and also process of clustering.
After feasibility checking of the proposed model, the most important restricted conditions considered in generating the feasibility problem are lifted one at a time in order to determine the efficiency of the proposed model. Then the performance of the proposed model in each case is theoretically and practically evaluated and briefly discussed. In other words, the effect of the most important restricted conditions on the performance of proposed model is sensitivity analyzed to determine the efficiency of it in different situations and conditions. Further and more advanced analyses of the specific effects of these restricted conditions on the performance of proposed model have been omitted from this preliminary version of the paper. In this paper, the effects of dimension of data sets, noise and outliers, statistical distribution, number of clusters, and cluster separation are theoretically and practically analyzed, and analyzing the effects of other characteristics of data and clusters is postponed to the future.
4.2.3.1. Efficiency evaluating in higher dimensional data sets
In this section, the effect of higher dimensions of data sets is evaluated on the performance of the proposed model. A key feature of high dimensional data is that two objects may be highly similar even though commonly applied similarity measures indicate that they are dissimilar or perhaps only moderately similar. Conversely, and perhaps more surprisingly, it is also possible that an object’s nearest or most similar neighbors may not be as highly related to the object as other objects which are less similar [29]. Therefore, the performance of most clustering models, especially distance and density-based models is critically sensitive to the curse of dimensionality. It is the main reason that these clustering models usually use a transformation of the original data with a reduced number of dimensions.
Although, the proposed model due to benefit the unique advantages of multilayer perceptrons as universal approximators, can theoretically cluster the high dimensional problems; appropriately, in practice, the proposed model may not be able to efficiently handle them. The author’s experiences indicate that in high dimensional problems (higher than 30) is better to use the preprocessing algorithms in order to reduce the dimension of the input data sets. However, in lower dimensional cases, the proposed model can be effectively applied. For instance, the average performance of the proposed model (on 100 times running) in the some problems with the following characteristics (Dimension: 3; Size: 50, No. of Attributes: 1 to 3) is given in Table 1. The established discriminant function and clusters for Dimension: 3; Size: 50, No. of attributes: 1 and 2 are also shown in Figs 4 and 5, respectively. Circles and rectangles are represent the training and test data in each cluster.
Dimension of the space: Noise and outliers: Statistical distribution: Number of clusters: Shape of clusters: Size of clusters: Density of clusters: Separation of clusters: Number/type of attributes: Type of data space:
The average performance of the proposed model in three-dimensional problem with different number of attributes
The established discriminant function and clusters (Dimension: 3; Size: 50, No. of Attributes: 1).
The established discriminant function and clusters (Dimension: 3; Size: 50, No. of Attributes: 2).
4.2.3.2. Efficiency evaluating in data sets with noise and outliers
In this section, the effect of existing outliers in underlying data sets is evaluated on the performance of the proposed model. Most clustering models, such as agglomerative hierarchical and
Dimension of the space: Noise and outliers: Statistical distribution: Number of clusters: Shape of clusters: Size of clusters: Density of clusters: Separation of clusters: Number/type of attributes: Type of data space:
Binary clustering problem with an outlier data.
Now, by comparing the similarity of each data to cluster 1 (
The plot of 
The plot of 
4.2.3.3. Efficiency evaluating in data sets with
without a given statistical distribution
In this section, the performance of the proposed model is evaluated in situations that underlying data follow a given statistical distribution or not. In some clustering models, such as distribution-based techniques, it is assumed that the under-study data have a particular distribution. Subsequently, their performance will be reduced if the opposite is occurred. However, the proposed model is theoretically insensitive on distribution of the input data. This property of the proposed model comes from the ability of multilayer perceptrons in modeling of any type of data. However, it is reported in the literature of artificial neural networks that multilayer perceptrons can practically yield a more accurate results when their inputs have the same distribution [31]. Therefore, it is strongly recommended that it is satisfied in a preprocessing step.
4.2.3.4. Efficiency evaluating in multiple cluster data sets
As mentioned previously, the number of existing clusters in underlying data sets can be detected by trial-and-error in the designing process of the proposed model. The author’s experiences indicate that the process of determining the number of hidden nodes is the most important and the most critical step in the designing process of the proposed model, which has rationally the great effect on the performance of the proposed model. After determining the number of clusters, the input data can be simultaneously clustered by the proposed model; appropriately.
4.2.3.5. Efficiency evaluating in clusters separation
In this section, the effect of different separation types of clusters is evaluated on the performance of the proposed model. In some cases the clusters are well-separated, but in many other cases, clusters may touch or overlap. The authors’ experiences indicate that although the proposed model can appropriately cluster problems with well-separated clusters, its behavior in partitioning the overlapping objects is strongly dependent on the other characteristics of data and clusters. Therefore, the sensitivity analysis of the proposed model only based on the separation type of clusters may not be very positive, and a scenario analysis is needed in order to evaluate the effects of simultaneously changing of characteristics, which is out of the level of this primary paper. However, in this section, the average performance of the proposed model in three different types of separation (well-separated, touch, and overlap) with following characteristic is given in Table 2. The established discriminant function and clusters for these situations are also shown in Figs 9, 10, and 11, respectively.
Dimension of the space: Noise and outliers: Statistical distribution: Number of clusters: Shape of clusters: Size of clusters: Density of clusters: Separation of clusters: Number/type of attributes: Type of data space:
The average performance of the proposed model in three-dimensional problem with different number of attributes
The established discriminant function and clusters for well-separated clusters (Mean: 3, 7 and Var.: 4).
The established discriminant function and clusters for touch clusters (Mean: 3, 7 and Var.: 5).
The established discriminant function and clusters for overlap clusters (Mean: 3, 7 and Var.: 5.5).
Generally, the desired features of the clustering models depend on the particular problem under consideration. Several different desired characteristics have been reported in the literature for clustering models [24]. In this section, the most important desired features (top 10) of the proposed model are briefly introduced.
The proposed model is scalable
One of the most important features of the clustering models is the scalability both in terms of speed and space, especially for large data sets. In other words, the required space and the processing time of the clustering models for large sets of data must be scalable. The clustering models that are used in these cases should usually have linear or near linear time complexity to handle such large data sets. Clustering models that even have complexity of
The proposed model can generalize
Another important feature of the clustering models is that they can correctly infer and organize the unseen points in their appropriate clusters after finishing the clustering process of the sample data presented to them even if these points contain noisy information. On the other hand, the clustering models must be able to generalize their past obtained results to the future. However, it is common for clustering models to produce clusters that are not good clusters when evaluated later. This feature of the proposed model comes from the property of multi-layer perceptrons to generalize their results.
The results of the proposed model are easily interpretable
Another important feature of the clustering models is easily interpretability of their obtained results. Many clustering models produce cluster descriptions that are just lists of the points belonging to each cluster. Such results are often hard to interpret. A description of a cluster as a region may be much more understandable than a list of points. This may take the form of a hyper-rectangle, a center point with a radius, etc.
The proposed model is insensitive to order of the input data
Another important feature of the clustering models is that they must be independence of the order of input data. Some clustering models are dependent on the order of the input, i.e., if the order in which the data points are processed changes, then the resulting clusters may change. This is unappealing since it calls into question the validity of the clusters that have been discovered. They may just represent local minimums or artifacts of the model.
The proposed model can detect and deal with noise or outlying data
Another important feature of the clustering models is that they can detect noise and outliers. Some clustering models can detect noise and outliers, by applying tests that determine if a particular point really belongs to a given cluster and delete them or otherwise eliminate their negative effects.
The proposed model can find clusters in subspaces
Another important feature of the clustering models is that they can find clusters in subspaces of the original space; for the reason that in high dimensional spaces, clusters regularly only occupy a subspace of the full data space; and hence lie in a subspace. Many clustering models have difficulty finding such that clusters, for example, a five-dimensional cluster in a nine-dimensional space.
The proposed model can handle high dimensional problems
Another important feature of the clustering models is that they can efficiently cluster in high-dimensional spaces; because, as discussed previously, clustering in high-dimensional spaces is quite different from clustering in low dimensional spaces. Clustering in high dimensional spaces is often problematic as theoretical results questioned the meaning of closest matching in high dimensional spaces. Many clustering models, especially models that use the distance or density as similarity measurement, have difficulty clustering in such that situations.
The proposed model is robustness in the presence of different underlying data and cluster characteristics
Another important feature of the clustering models is that they can appropriately cluster problems that have different characteristics in both sense of data and clusters. As mentioned previously, there are several different characteristic for under-study data and clusters such as type of data space, dimension of data, noise and outlier data, statistical distribution of data, number of clusters, shape of clusters, size of clusters, density of clusters, separation type of clusters, and number/type of attributes, that a robust clustering model must be able to properly handle them.
The proposed model can estimate any clustering parameters
Another important feature of the clustering models is that they can estimate clustering parameters such as the number of clusters, the size of clusters, or the density of clusters. Many clustering models take the number of clusters as a parameter. This can be a useful feature in certain instances, e.g., when using a clustering model to create a balanced tree for nearest neighbor lookup, or when using clustering for compression. However, this is not generally good, since the number of clusters parameterized may not match the real number of clusters. A model that requires the number of clusters up front can always be run multiple times. Assuming that there is some way to compare the quality of the clusters produced, it is then possible to empirically determine the best number of clusters. Of course, this increases the amount of computation required. Likewise, it is often difficult to estimate the proper values for other parameters of clustering models. In general, the parameters of a clustering model may identify areas of weakness. In the best case, the results produced by a clustering model will be relatively insensitive to modest changes in the parameter values.
The proposed model can function in an incremental manner
Another important feature of the clustering models is that they can incrementally cluster the underlying data sets. Many clustering models, such as K-means, must be run again, if a new data is added to the underlying data set. In certain cases, e.g., data warehouses, the underlying data used for the original clustering can change over the time. If the clustering model can incrementally handle the addition of new data or the deletion of old data, then this is usually much more efficient than re-running the model on the new data set.
Conclusions
In virtually every scientific field dealing with empirical data, scientism attempt to get a first impression on their data by trying to identify groups of similar data. The primary objective of the cluster analysis is to partition a given data set into the homogeneous groups, called clusters, such that objects within a cluster are more similar to each other than objects belonging to deferent clusters. However, clustering is naturally an ill-posed problem, where the act of grouping similar data objects is a subjective notion and highly dependent on the cluster notion and clustering criterion used, especially similarity measurement. Several different notions of cluster and also similarity measurements such as distance and correlation have been presented in the literature of clustering. For this reason, a vast number of algorithms have been developed, each aiming to address different aspects of the clustering.
In this paper, a new notion of cluster and similarity is first presented, and consequently, a novel clustering model is proposed based on these concepts. In the proposed model, data objects are partitioned based on their behavior in the learning process. It is comes from the idea that a group of objects representing the same learning behavior can be considered as a cluster. On the other hand, the claim of the proposed model is that the learning and convergence speed of data objects can be used as similarity measurement for clustering purposes. Of course, it must be noted that the learning speed of objects is also a function of distances and correlations between objects that is intelligently approximated by the proposed model in its learning process of the data objects. It is the main difference of the proposed model by traditional clustering models in which are assumed that the similarity of objects can be generally measured by a given and predetermined function of distance and correlation.
Therefore, the proposed model need to a universal learning-based approximator in order to estimate the relationships and structures between objects, and to calculate the learning speed of the objects as similarity measurement. Of course, it must be noted that the foundation of the traditional clustering models is also based on the similar behavior of the data objects. However, all traditional clustering models use the unsupervised learning algorithms that are iterative processes of knowledge discovery or interactive multi-objective optimization processes that involve trial and failure, and hence they are often expensive, time-consuming, and error-prone models. However, the proposed model, in contrast of the traditional clustering models, works in a supervised manner. Therefore, it is theoretically expected that the proposed model is generally faster, more accurate, and more efficient than other unsupervised clustering models.
Then, the feasibility and efficiency of the proposed model in practice are evaluated by some simulated data sets. Empirical results of the feasibility and efficiency evaluating indicate that the proposed model not only can be used as a clustering model, but also can achieve accurate results. These results numerically indicate that in contrast of the literature, supervised models can be also used for solving the unsupervised problems. Finally, some desired characteristics of the proposed model are presented. Further and advanced discussion about specific features of the proposed model, advanced versions of the proposed model such as nonlinear version, soft and fuzzy version, support vector version, hierarchical version, K-means, etc., advanced evaluating of the proposed model in the real-world situations, and performance evaluating of the proposed model in comparison with other clustering models are postponed to the future. In this primary paper, the main notions and main idea of the proposed model is only introduced and primary evaluating of its performance in clustering problems is only presented.
Footnotes
Acknowledgments
The authors wish to express their gratitude to Dr. Mehdi Bijari and Ali Zeinal Hamadani, Department of Industrial Engineering, Isfahan University of Technology (IUT), and two anonymous referees for their helpful comments, which greatly helped us to improve our paper.
