Abstract
In this paper the Unsupervised Active Learning Method (UALM), a novel clustering method based on the Active Learning Method (ALM) is introduced. ALM is an adaptive recursive fuzzy learning algorithm inspired by some behavioral features of human brain functionality. UALM is a density-based clustering algorithm that relies on discovering densely connected components of data, where it can find clusters of arbitrary shapes. This approach is a noise-robust clustering method. The algorithm first blurs the data points as ink drop patterns, then summarizes the effects of all data points, and finally puts a threshold on the resulting pattern. It uses the connected-component algorithm for finding clusters. Then determines cluster centers by intersecting the narrow-paths. Experimental results confirmed the superiority of our proposed method compared to the two most well-known density-based clustering algorithms, DBSCAN and DENCLUE.
Keywords
Introduction
Data mining methods are part of analyzing and managing data in numerous areas such as social sciences, medical applications and industry where produce a tremendous number of data [1]. Clustering is a primary data mining tool which helps users to discover and understand the natural structure of a dataset [2–5]. A large number of clustering algorithms can be found in the literature [6–8]. These clustering algorithms may be classified into partitioning, hierarchical, model-based, grid-based and density-based methods [9].
Density-based clustering is a category of clustering algorithms which can discover densely connected components with a less sensitivity to outliers. Therefore, they are popular among the different clustering methods. DBSCAN [10, 11], OPTICS [12] and DENCLUE [13, 14] are some examples of this category. These clustering methods also have some disadvantages. First, determining the initial parameters of the algorithms is not an easy task in DBSCAN and OPTICS algorithms and the clustering result is very sensitive to the values of the parameters. Second, the computation time grows drastically as the size of database increases. In order to offer a better time complexity, the DENCLUE algorithm uses grid-based methodology as an intermediate step when confronted with large datasets. However, the hill climbing procedure, which also has a computational burden, is still a time consuming task. These drawbacks of traditional clustering algorithms motivated us to propose a new density-based clustering algorithm. The proposed algorithm is capable of filtering outliers as well as finding arbitrary shaped clusters while maintaining an acceptable run-time. In addition, our proposed algorithm can be applied as a weak-learn in the ensemble clustering problem [1] with higher efficiency and less time complexity than two other well-known density-based clustering algorithms, DBSCAN and DENCLUE. The Unsupervised Active Learning Method (UALM) utilizes the main concepts of the Active Learning Method (ALM) whose basic concepts should be very suitable and beneficial for clustering tasks.
ALM is an adaptive fuzzy learning algorithm, which is inspired by certain behavioral features of human brain functionality when confronting complex problems [15]. Reportedly, the ALM method has been successfully used with good results in fields such as modeling [15–20], control [21–24] and classification [25, 26]. This method inserts uncertainty into the computations by considering each data point as a fuzzy pattern called an ink drop. The idea of spreading ink drops and accumulating their results, previously used in the ALM algorithm, can efficiently simulate data density in space.
Employing the core concepts of ALM, we propose a novel density-based clustering algorithm that will work with noisy datasets. The overlap of ink drops forms continuous spaces that defines the clusters and can take arbitrary shapes. In other words, using this idea fills the empty spaces among the data points of a cluster, hence achieving fully scalable method. In this way, the empty spaces between the collected data points in a small database can be covered by increasing the radius of the ink drops. This means the inclusion of more data which have a lesser degree of belonging in the vicinity of the collected data. As the ink drop patterns of a data point overlaps with that of its adjacent data points, the neighboring data points connect. Hence, in the areas where the density of data is high the intensity of the created patterns is high. On the other hand, in the areas where the density of data is low the intensity of the pattern is low, which allows us to easily identify the outlier data points by applying a threshold. Therefore, using this idea of spreading ink drops and accumulating the results is the main concept of ALM that helps to effectively cluster data.
Our proposed clustering algorithm also uses grid-based methodology as an intermediate step in order to provide good time complexity when dealing with a large sample size datasets. Finally, the algorithm uses a narrow-path crossing method, which imposes little computational burden, in order to find cluster centers. Our experiments show that our proposed clustering algorithm works efficiently on different datasets. In addition, the UALM algorithm works on the whole dataset at once, whereas, DBSCAN and DENCLUE clustering algorithms do separate computations for every single data point. Therefore, our algorithm outperforms DBSCAN and DENCLUE clustering algorithms in terms of execution time.
The remainder of this paper is organized as follows. In Section 2, related works including different clustering methods are provided. The Active Learning Method is described in Section 3. Section 4 describes the UALM algorithm. Section 5 shows the experimental results of the quality of our clustering algorithm in comparison with k-means, DBSCAN and DENCLUE. In order to show the advantage of our approach in noisy environments, synthetic datasets with a variable amount of noise have been employed in simulations. Finally, Section 6 concludes this paper.
Related work
This section reviews different clustering methods emphasizing density-based algorithms in particular.
A large number of clustering algorithms exist in the literature. These algorithms may be classified into partitioning, hierarchical, model-based, grid-based and density-based methods [9, 27–29]. Partitioning algorithms aim to find meaningful partitions such as k-means and k-medoid [6–9]. Hierarchical algorithms, such as BIRCH [30] and CURE [31], attempt to find a hierarchy of clusters based on the distances between objects and clusters. Model-based clustering algorithms try to optimize the fit between the given data and some mathematical models [32, 33], such as decision trees and neural networks. Grid-based algorithms indirectly work with data by constructing summaries of data over the attribute space subsets. They segment the space and then aggregate the appropriate segments [6], e.g., BANG [34], STING [35] and WaveCluster [36]. The main advantage of this approach is its fast processing time which is typically independent of the number of data objects. It depends only on the number of cells in each dimension in the quantized space [8]. Therefore, they are fast and handle outliers well. Grid-based methodology is also used as an intermediate step in many other algorithms, e.g., DENCLUE, CLIQUE [37], and MAFIA [38]. Density-Based Clustering algorithms, such as DBSCAN, DENCLUE and OPTIC, group objects based on the crowdedness of points. Our proposed algorithm is a density-based algorithm, therefore we will explain this type of clustering algorithms in more details.
DBSCAN and OPTICS are the most well-known density-based clustering algorithms [39]. The idea behind DBSCAN is that if a particular data point belongs to a cluster, it should also be close to many other data points in that cluster. If there are more than MinPts data points within a distance of Eps from that data point then all the adjacent data points fall into the same cluster as well; otherwise, that data point is labeled as an outlier. A modified version of DBSCAN was proposed in OPTICS which applies a linear ordering function on all data points in such a way that spatially close data points are placed in close positions in the ordering. OPTICS has the same parameters as DBSCAN. However, determining the initial parameters of the algorithms is not an easy task [40–42]. These algorithms are also very time-consuming particularly in large datasets [43, 44].
DENCLUE is another density-based clustering algorithm which uses a density function for any given data point and then aggregates the effects of all data points. Since the first part of the DENCLUE algorithm that considers a kernel for each data point is similar to that of UALM, it is necessary to explain DENCLUE in more detail. DENCLUE considers a kernel for each point called an influence function. The algorithm computes the density function by adding the influence functions of all data points. This computed density function has high values in the dense regions of the space, and small values in low density regions of the space. A data point is considered as a density attractor if the density function has a local maximum in its location. These density attractors are used to create clusters. For this purpose each data moves in the opposite direction of the gradient of the density function until it reaches the ɛ radius of one of the density attractors as depicted in Fig. 1. Therefore, it is necessary for the gradient of density function to be computed for each data point in the path to the density attractor.
The DENCLUE method has an intelligent idea for developing a density based clustering algorithm; however, like all methods it has its own unique advantages and disadvantages. The idea of considering a kernel for each data point is the most important part of the algorithm. However, movement in the opposite direction of the gradient, which depends on the moving step, requires the computation of the gradient as well as the value of the density function at each step. This may prove very time consuming. In order to reduce the computational complexity, the space is gridded and the data points are quantized in the initial step of the algorithm. However, this computation is time consuming because a large amount of computation is involved in each step of moving each data point toward its density attractor [45, 46]. After moving each data point to its density attractor, DENCLUE still needs to perform another process. This process is another kind of clustering in which the data points that have reached the ɛ radius of density attractors are considered as a cluster. Therefore, this algorithm has three adjustable parameters. These parameters are the radius of the influence function, the threshold for rejecting noise and outliers, and the ɛ radius. One of the most important problems of DENCLUE occurs when working with datasets that have two or more dense parts within a single cluster. As we show in section 5, this algorithm cannot appropriately cluster the Chain-link dataset because this dataset has several dense parts in each ring. Therefore, it shows a higher number of clusters than actually exist. Another problem with the DENCLUE algorithm is that its parameters are highly sensitive to each other. For example in order to determine the outliers, the threshold parameter highly depends on the density of data and the value of parameter related to the influence function.
Active learning method
In this section, we briefly review the basic concepts of the ALM used in our proposed clustering algorithm. For more information regarding ALM, the reader is referred to [15–17]. It’s worth mentioning that ALM is a complete different notion than Active Learning. Active Learning is a special scenario in semi-supervised learning where the learner is allowed to interactively query the user for the labels with only a few very informative examples [47].
ALM was inspired by the way humans behave when confronting a complex problem by breaking it down into simpler and more comprehensible problems [15]. Then, this fine-grained partial knowledge can be refined and integrated with an inference process to make a decision and recognition [48]. The basic idea of ALM is to approximate a Multiple-Inputs-Single-Output (MISO) system with several Single-Input-Single-Output (SISO) subsystems (Fig. 2). Each subsystem shows the overall behavior of the output with respect to one input; the behavior will be extracted from the pattern generated on a two-dimensional unit called the Ink Drop Spread (IDS) unit.
Although they proposed an n-dimensional IDS by defining the IDS group concept in [49], the original version of the IDS unit is a two-dimensional plane that shows the overall behavior of a variable to the output variable. In a two-dimensional IDS unit, a three-dimensional fuzzy membership function is considered for each data point in order to diffuse information in each IDS plan. Then the effects of all data points are aggregated and smooth patterns are created. Therefore, diffusion of information performs the principal role in this process. Common sense implies that each sample data not only has information at its exact point, but also is valid in its neighboring points with a lesser degree of confidence. As you go away from the point of sample data, the confidence degree will decrease [50]. Figure 3(a) depicts two such membership functions overlapped and aggregated. An IDS plane after applying the IDS operator to the five data samples is shown in Fig. 3(b).
For each SISO subsystem, the IDS operator works on IDS units to extract partial knowledge. Two important features in the partial knowledge space are the narrow-path, which describes the X_Y characteristic of the IDS units, and the spread value over a narrow-path, which shows the degree of importance and the effectiveness of partial knowledge in the overall system. Unlike instance-based methods, the IDS method does not store data in the process of learning [50]. However, it stores the narrow-path and spread which have short constant memory usage. Figure 4 shows the narrow-path and spread of an IDS plan.
The range of input variables should be quantized to n levels in order to gain faster simulation time and lesser hardware for implementation. Therefore, each IDS unit has n2 grids [51, 52].
There are various methods of extracting the narrow-path. The weighted-average method, maximum method, and median method are three prominent examples. Let p (x, y), x ∈ X, y ∈ Y be the point (x, y) in the x - y plane. Also, d (x, y) denotes the darkness at p (x, y). Where the increase in darkness of the neighboring area of the datapoint p (x, y) is defined as Equation (1):
In that R is the radius. [τ] denotes the greatest integer less than or equal to τ. Approximately 10% of the x–y plane is the size of the ink-drop radius [16].
Equation 2 describes the weighted-average formula for the narrow-path [12]:
Where . Otherwise, ψ (x) = y max /2.
When {y|d (x, y)> Th, y ∈ Y } ≠ ∅, the spread can be computed as Equation (3):
Otherwise, σ (x) = y max .
Where Th denotes as the threshold of the plan, which is selected by the user (usually Th = 0 for modeling purposes).
After mining narrow-path and spread values for the entire IDS plans, the inference procedure combines this information as depicted in Fig. 5. During the ALM inference procedure, a fuzzy inference unit is applied to narrow-path and spread values in order to generate a rule base. As a result, this unit integrates the partial knowledge and extracts knowledge expertness that exists in data samples [48].
Although ALM performs excellently in many applications, such as function modeling [15–17], control [21–23] and classification [25, 26], it has not been used for unsupervised learning or clustering applications. In this section, an Unsupervised Active Learning Method (UALM) algorithm is described.
The proposed algorithm is a density-based clustering algorithm, which was developed based on the active learning method. The algorithm blurs the data point as an ink drop pattern. This process is called “ink drop spread.” As individual ink drop patterns overlap, the overlapping portions become increasingly darker forming some clusters on the plane.
The proposed clustering algorithm is a graphical-based algorithm which differs from other density-based methods in certain aspects. First, the method of looking to each data point is different. In UALM algorithm, each data point is not only valid in that exact point, but is also valid in its neighboring points with a less degree of confidence. In other words, a fuzzy membership function is considered for each data point. The second difference is the method of finding clusters. The proposed algorithm uses a connected-component labeling algorithm for finding clusters. Another difference is the method of finding cluster centers. UALM finds cluster centers by intersecting the narrow-paths. The Narrow-path for each dimension of a cluster is a line or vector with a special characteristic. The description of narrow-path will come later in this section. As our experiments showed, the proposed algorithm is faster than the popular density-based algorithms DBSCAN and DENCLUE.
Algorithm
The UALM algorithm is based on the ALM algorithm, but with some differences. The first difference is the IDS unit dimensions in UALM. If the data space has n features, the UALM algorithm needs to use n-dimensional inks for data diffusion, and therefore has n-dimensional IDS units. The flowchart of the algorithm is shown in Fig. 6.
We illustrate the steps of the algorithm by a simple example in Fig. 7. A two-dimensional dataset with 10 samples and 2 clusters is used for this example. Figure 7(a) shows the first step of the algorithm where the data are gridded in the quantized space. An ink drop pattern is shown in Fig. 7(b). Figure 7(c) shows the aggregation of the ink drops of all data points on the IDS plan. After applying the threshold on the resulting plan a binary image is formed (Fig. 7(d)). Then in Fig. 7(e) the algorithm assigns labels to the separated patterns of the image by applying a labeling operator. Finally, the cluster centers are found by intersecting the narrow-paths as shown in Fig. 7(f). In this example, the narrow-paths are gained by applying the mean policy.
Each step of the algorithm is described in more detail as follows:
Quantization is the first step of the algorithm. In order to gain faster simulation time and less hardware for implementation, the range of n features should be quantized to l i levels where i = 1ton. Therefore, each IDS unit has grids, where each grid is represented by (g1, …, g n ) in the gridded-space, G-space.
Just like any other grid-based method, usually it is assumed that l1 = … = l n = l. Determining the value for the number of grids depends on the data distribution and the required speed and memory volume. The more grids the slower the algorithm and the more memory volume required. Fortunately, as our simulation shows the proposed algorithm is not very sensitive to the number of grids. For this reason, we have not mentioned it as an algorithm parameter.
If dataset D be a set with N samples where each sample has n features. We define the gridding function as F = (F1, …, F n ), where F i (xi,j) is defined as in Equation (4).
where xi,j is the ith feature of the jth sample, Min i is the minimum of the ith feature among all of the members of D set. In other word, . Similarly, Max i definition is .
After gridding the dataset, the jth member of D set, where xi,j ∈ X
i
, is quantized and represented by , where qi,j = F
i
(xi,j), qi,j ∈ G
i
, i = 1, …, n, and j = 1, …, N. Therefore, as Equation (5) shows, the gridding function maps the D set to the Q set.
The parameters which need to be determined in the initialization phase are the ink-radius (Ir) and threshold value (Th). These two parameters are essential for the algorithm to find the correct or desired number of clusters. The parameter Ir determines the influence of a point on its neighborhood, and Th can be used for noise elimination and specifying cluster boundaries. The size of the ink drop pattern is empirically determined not only from the size of the grids but also from the training set size and application requirements such as the clustering accuracy and learning speed. When the amount of data is small, the ink drop pattern size should be large. If the amount of data is large enough, the use of a small ink drop pattern enhances the learning speed and also improves the clustering accuracy. A small ink-drop pattern needs less summation operation. This results in less time complexity. Moreover, when the size of the ink-drop decreases, clusters will be more separated. Therefore, the accuracy increases according to the definition of the accuracy in Equation (30). The influence of these two parameters on clustering results is presented in section 5 along with some experiments.
In order to determine the parameters of our algorithm, if we have a preliminary estimate about the number of clusters, this can help us to find the appropriate value for the ink-radius and threshold based on trial and error. Otherwise, we can use an evaluation index to find the optimal parameter values. If an appropriate evaluation index is selected as a cost function, then heuristic optimization algorithms such as genetic algorithm can be used in order to determine the ink-radius and threshold parameters automatically. We propose a heuristic method to find the algorithm parameters without any need to know the clustering validation indices. This is because using the clustering validation indices is not an appropriate solution in order to determine the algorithm parameters. The majority of internal evaluation indices are based on a predefined distance measure and biased towards algorithms using the same cluster model. While external clustering indices are more suitable for evaluating density-based clustering algorithms, the drawback of these measures is that the clustering results are evaluated based on data that has known class labels. This means that the class labels should be known a priori. However, in real clustering problems the labels of the classes are not available.
Here, we present a heuristic strategy which will find the algorithm’s parameters without ever needing the clustering validation indices. First, we set the value of Ir to a value around 0.05 (0.05 ± 0.01) depends on the size of the dataset. Next, we execute the algorithm with Th equal to zero, and obtain the PDF (Probability Density Function) of the IDS unit resulting from the third step of the algorithm. Next, we calculate the weighted average value for PDF and set the threshold equal to this value. In Fig. 8 the mean point and its value for the aggregation dataset is shown. After selecting an appropriate value for the threshold, by holding this parameter constant and changing the Ir parameter in the range of 0.025 to 0.175, we execute the algorithm up to the end of step 4. Then two data charts are drawn.
Figure 9(a) shows the number of unlabeled data points with respect to ink radius (Ir), and Fig. 9(b) shows the number of clusters as a function of Ir. At the point where unlabeled data are decreasing and the number clusters is constant, the corresponding Ir value can be a good candidate for clustering of the test dataset. To see this more clearly the derivative of previously mentioned charts for the dataset are computed and drawn on a single chart (Fig. 9(c)). A good candidate for Ir value is the value at which the derivative of the number of clusters with respect to Ir is zero (number of clusters is constant), and at the same time the derivative of unlabeled data points with respect to Ir is negative (unlabeled data are decreasing).
The obvious reason for this is that by increasing the value of the radius of ink drops we expect more data will join the clusters. Hence, the rate of labeled data is an increasing trend and subsequently the rate of unlabeled data is a decreasing trend. At the same time if the number of clusters for several values of Ir remains constant, it shows that the main clusters have been discovered. As the value of Ir increases, adjacent clusters will joint and the number of clusters will decrease. The greater the value of Ir the more adjacent clusters join together, therefore, the total number of clusters decreases. The smallest interval of Ir in which the number of clusters remains constant is the best value set for Ir. For the aggregation dataset this value, from Fig. 9(c) is equal to 0.07.
The main idea of this step is similar to that of the ALM algorithm, which considers each point in the database with a degree of uncertainty and considers a membership function for each of data points. This process is called “Ink Drop Spread”. As the ink drop patterns overlap, the overlapping portions become increasingly darker. Finally, clusters are formed as connected patterns on the plan.
Mathematically, the process of spreading an ink drop in the coordinate (q1, …, q n ) can be represented as in Equation (6). All grids in the neighborhood radius of Ir of the ink drop are affected. This effect is the darkness intensity as a function of distance to the ink drop center.
where dist is a function, dist : G n × G n → R+, which is defined based on a distance function. For instance, if Euclidean distance is chosen, the dist function can be written as in Equation (7).
Function f, f : R+ → [0, 1], assigns a darkness value to each grid based on its distance to the center of the ink drop. Eventually the IDS operator aggregates the darkness values resulting from all ink drops. The aggregation function can be defined in various forms like sum, max, saturating sum operators, or a combination of these. In special cases, novel operators can also be proposed. Aggregation with sum operator is as in Equation (8).
where
After aggregation, the IDS plane is normalized and can be represented as in Equation (9).
In order to prepare the IDS plane for the next step, a thresholding operation is applied to all grids of the IDS plane such that each grid with a darkness less than the threshold, d (g1, …, g n ) < Th, will be erased, d (g1, …, g n ) = 0. This operation helps the algorithm to detect and eliminate outliers in this step of the algorithm. In addition, it is a controlling parameter to distinguish separate clusters. Therefore, after applying threshold the IDS plane can be represented as in Equation (10):
In order to specify the number of clusters we should define an operator that acts on the resulting IDS-units. The number of clusters can be determined by a connected-component labeling algorithm, which is an algorithmic application of graph theory. In this case, the subsets of connected components are uniquely labeled based on a given heuristic [53]. Connected-component labeling has also been widely used in image processing for estimating connected components in a 2D grayscale image, although color images and data with higher dimensionalities can be processed as well [54, 55].
Different labels are assigned to the pixels belonging to various regions or clusters in the algorithm. The algorithm passes through the graph, labeling the vertices based on the connectivity and relative values of their neighboring labels. The medium, which is a specific graph, is used for determining the connectivity. As an example, for 2D images the medium can be 4-connected or 8-connected [56] as shown in Fig. 10. Therefore, we introduced the connected-component labeling algorithm as our labeling-operator L which assigns a label to any grid whose darkness value is above the threshold based on the connectivity of the grids. Therefore, the clusters can be defined as in Equation (11).
We represent the total number of members of C k as T k and the jth member of cluster C k by .
If the projection of C k on G i , was defined by and contains grids, then we can say that the kth cluster is in the space S k where , and we have .
In the UALM clustering method, clusters can be presented by several if-then rules. This is useful for real applications that need to label new data without any computation [57]. In other words, instead of storing all the grids of a cluster, clusters are represented by some upper-bounds and lower-bounds. In addition, most density-based clustering algorithms do not have any representation method for their resulting clusters. The number of representation rules corresponding to each cluster depends on the grid resolution, the relevant portion of occupied grids of the cluster, and the number of dimensions. If we consider a dataset with n features (X1, …, X
n
) and K clusters, the total number of representation rules NR
i
with respect to feature X
i
is given in Equation (12).
Where is the length (in terms of the number of grids) of the projection of cluster C
k
on the ith feature axis. In order to represent clusters by if-then rules, we should first compute the lower-bound (LB) and upper-bound (UB) for each member of the subspace where . The set of grids related to space is introduced in Equation (14).
The set has the total number of members equal to , where the jth member of is shown by . The lower-bound set and upper-bound set related to are defined as Equations (15) and (16), respectively.
After finding the upper-bound and lower-bound of the clusters, we can represent the clusters by if-then rules as
Although not all non-convex cluster shapes (e.g. spiral dataset) can be presented by the UB and LB, they are still useful in representing clusters in many clustering problems. Moreover, clusters represented in the k-means clustering algorithm are nothing more than this type of representation. Additional consideration of cluster representation will take place in our future work.
In order to find clusters centers, narrow-paths should be computed for each labeled area (cluster) by a policy. Examples of different policies are the maximum policy, mean policy, and median interval policy. The center of each cluster is obtained by intersecting its narrow-paths. In this study, we proposed for n feature space, the narrow-path along with subspace in cluster C k , (), for i = 1, …, n and k = 1, …, K, as in Equation (17).
Likewise, for maximum, median and median-interval policies, is defined, respectively, as Equations (19–21).
where M is described by Equation (22).
[M (j)] denotes the greatest grid belonging to G i which is less than or equal to M. and are defined in Equations (15) and (16), respectively.
The points of the narrow-path, which are obtained from a median policy, obviously belong to the cluster points. On the other hand, the points of the narrow-path which are obtained from a median-interval policy, may no longer belong to the cluster points.
After finding the narrow-paths along with each axis of a selected policy, cluster centers are obtained by intersecting the narrow-paths. These center point(s) can be obtained for the kth cluster by Equation (23).
These type of sets are never empty for the maximum policy. If more than one crossover point is detected, the cluster center will be determined depending on the narrow-path finding policy.
For example, in a two-feature space with a mean policy, the narrow-path sets for the 1st cluster of two, along with subspaces and , are defined as in Equations (24) and (25), respectively.
where, , and is the projection of cluster C1 on the G1. Equations (26) and (27) represent and , respectively.
where is similar to the definition in Equation (2). Cluster centers are defined by Equations (28) and (29) for cluster one and cluster two, respectively.
Figure 11 illustrates the clustering result on a dataset with two clusters. It shows the narrow-paths obtained by a mean policy. As can be seen, the centers of the clusters are found by intersecting the narrow-paths.
After step 5 of the algorithm and determining the number of clusters and their associated data points, some data may be introduced as unlabeled data points. Therefore, a process must be performed to determine whether these unlabeled data points are outliers or they belong to a cluster. If after completing this process, no cluster is found for an unlabeled data, that data will be considered as an outlier. The number of unlabeled data resulting from step 5 of the algorithm primarily depends on the threshold parameter of the algorithm. Therefore, step 6 is added to the algorithm to insure the correctness of the clustering. The pseudocode of this step is presented in Fig. 12.
In this algorithm, the ink-drop with the ink-radius from the earlier steps of the algorithm is spread for each unlabeled data point on an IDS-plan. Then by summing or averaging the ink intensity on the position of each clustered data point existing in the radial range of the ink drop, the cluster that captures the maximum sum or average of the unlabeled data ink stain wins that data. Therefore, the unlabeled data will be labeled the same as the winning cluster label. If no clustered data exists in the neighborhood of an unlabeled data, that data will be reported as an outlier. This means that this unlabeled data is not sufficiently close to the any other clusters. Figure 13 illustrates how this part of the algorithm works.
As it can be seen in Fig. 13(a), after step 5, five data points remain unlabeled (shown by stars). The Fig. 13(b–d) show the process of selecting the clusters for three of these unlabeled data points. As Fig. 13(b) shows, the unlabeled data belongs to cluster 1. Figure 13(c) shows that the unlabeled data does not belong to any cluster. Therefore the algorithm reports it as an outlier. Figure 13(d) shows that the ink of this unlabeled data takes 3 data points of cluster 2 but only one data point of cluster 1. As it is shown in this figure, the sum of the density related to cluster 2 is more than that of cluster 1. Therefore, this data belongs to cluster 2. Finally, Fig. 13(e) shows the result of clustering on this dataset.
In this section, a sequence of experiments is considered to evaluate the proposed UALM clustering method. These experiments are conducted in terms of Sensitivity to its parameters, Cluster Quality, Computation time, and Noise and outlier immunity.
All experiments were run on a 32 bit Pentium Dual-Core 2.7 GHz CPU with 2 GB RAM memory.
First, the proposed algorithm is evaluated in different configurations, created by changing the value of the ink-radius and threshold, and different data conditions. Synthetic data are used for this part of the experiments. Second, the performance of UALM is compared with the efficient clustering algorithms DBSCAN, DENCLUE and KNN. Synthetic data and real databases are used for this part of the experiments. The existing characteristics of these datasets are summarized in Table 1. Third, the execution time of these three clustering algorithms is compared. Finally, the ability of the proposed algorithm to work in noisy environment and detecting outliers is investigated in the last part of this section.
Clustering evaluation methods are needed to evaluate a clustering algorithm. The two types of clustering quality evaluations are internal evaluation and external evaluation. When a clustering result is evaluated based on data that clustered itself it is called internal evaluation. In external evaluation, clustering results are evaluated based on data that has known class labels. Using internal criteria in cluster evaluation is biased toward algorithms that use the same cluster model. Therefore, the best approach for fair evaluations at this time is using external evaluation measures [58, 59]. Four measures of external clustering evaluation were used to evaluate the proposed clustering algorithm: accuracy, F-measure [60], Rand index [58] and Adjusted Mutual Information (AMI) [61].
In this paper, the accuracy is defined the same as the precision definition in [58], which measures the homogeneity of clusters with respect to primarily known classes as in Equation (30). Having cluster C
i
, let J
i
denote the partition that contains the maximum number of points from C
i
. This measures the portion of points in C
i
from the majority partition T
Ji
.
High accuracy is easy to achieve when the number of clusters is large; in particular, accuracy is unity if each document gets its own cluster. Thus, accuracy can not be used to trade off the quality of the clustering against the number of clusters. AMI, F-measure and Rand Index measures allow us to make this tradeoff. For example, the mutual information tries to quantify the amount of shared information between the clustering C and partitioning T. Adjusted Mutual Information normalizes the NMI. More details on the evaluation metrics can be found in the literature [7, 58–61, 7, 58–61].
Although our algorithm will work on n-dimensional datasets, if the number of dimensions exceeds 4 or 5, it suffers from a lack of memory. This is because the number of grids increases with the power of n. Therefore, in this work we examined the algorithm on two-dimensional and three-dimensional feature space datasets. We plan on implementing the algorithm with higher-dimensional datasets in future work.
The UALM clustering algorithm has two adjustable parameters as ink-radius (Ir), and threshold (Th). In this part, some experiments on different datasets are performed in order to show how these parameters affect the clustering result.
Figure 14(a) shows the result of UALM on a spiral dataset, where Ir = 0.05 and Th = 0.07. Many clustering algorithms fail to cluster this dataset correctly due to its complicated shape. Since the proposed clustering method is a density-based method, it is able to cluster datasets with arbitrary shapes.
Figure 14(b) shows AMI vs. threshold. It shows that the algorithm can reach AMI = 1 by tuning the parameters correctly. The maximum value of AMI is achieved for a perfect clustering when the clusters are the same as the original partitions. The results from this figure indicate that bigger Ir values can result in overlapping clusters when the threshold is small. Furthermore, it shows that when the threshold increases, the AMI decreases. This is due to the fact that when the threshold increases the number of clusters may also increase or some data may be detected as outliers. Figure 14(c) shows four clustering evaluation index values vs. Ir with Th = 0.07. This shows that the accuracy is one when Ir < 0.06. However, accuracy decreases for higher Ir values. This is because, when the Ir is very small, almost all individual data are considered as a cluster so accuracy is high. However, AMI, Rand Index and F-measure (which are affected by the number of clusters) are smaller than one when Ir < 0.025. On the other hand, if Ir > 0.06 some clusters overlap, resulting in all validation measures decreasing.
Figure 15(a) shows the result of UALM on an aggregation dataset, with Ir = 0.06 and Th = 0.36. This dataset consists of seven classes of data, in which some of its clusters are very close to each other. Therefore, parameters should be set with some limitations in order to separate these clusters. UALM considered some data as outliers, although the last part of the algorithm improved the result. Figure 15(b) indicates that with a constant Ir, the AMI measure increases by increasing the threshold from zero since the clusters which are close to each other will be separated as the threshold increases. However, the AMI barely reached unity due to the existence of the points considered as outliers. By increasing the threshold, AMI reached its peak, but after that additional data was eliminated from the cluster bounds. As a result, AMI began to decrease. In Fig. 15(c) it seems that the clustering accuracy fluctuates between unity and 0.8 while the threshold value is constant and equal to 0.36. This is due to the aggregation method that UALM uses. The algorithm adds the ink of each point to the ink of other points. After adding all points’ ink, the result is normalized. For some Ir values, clusters which are near to each other merge and consequently the accuracy decreases. When Ir increases the dense areas will gain a large value. Therefore, in the normalization and thresholding step, the merged clusters are split and accuracy reaches unity again.
Figure 16(a) shows the result of UALM on a chain-link dataset which is a 3-dimensional dataset with two classes of data. In this experiment, Ir = 0.05 and Th = 0.1. Figure 16(b) shows that AMI decreases faster for smaller values of Ir by increasing the threshold. This is because when the threshold increases, more than two clusters are formed and some data will not be clustered. Figure 16(c) depicts that the accuracy is unity until Ir becomes a large value (more than 0.125). The reason for this is that when Ir is small several clusters (more than the real number of partitions) are formed. Therefore, according to the definition, accuracy will be high. AMI, Rand Index, and F-measure, which are affected by the number of clusters are smaller than unity when Ir is very small. The clusters overlap when Ir has a very large value. Therefore, all validation indices will decrease.
In this part of the experiments, UALM is compared with k-means, DBSCAN and DENCLUE in terms of some external clustering evaluation measures. Four measures of external clustering evaluation, clustering accuracy, F-measure, Rand index, and Adjusted Mutual Information, are used in order to evaluate the proposed clustering algorithm and compare it with the three other clustering algorithms. Synthetic data and real databases are used for this part of the experiments. Table 1 summarizes the characteristics of these datasets.
The clustering quality measures were computed for four algorithms and summarized in Table 2. We ran each algorithm with different parameters and reported the measures when the best F-measure index was achieved. Table 2 shows the values of clustering measures for each datasets. UALM parameters are Ir and Th, where Ir varies from 0.005 to 0.1 with increments of 0.005 and Th varies from 0.01 to 0.4 with increments of 0.02. The number of clusters (K) in k-means changes from #Cl-10 to #Cl+10 for #Cl> 10, and from 1 to #Cl+10 for #Cl< =10. For DBSCAN Epsvaries in the range of 0.1 to 3 with 0.1 increments and MinPtsvaries from 1 to 10. Finally, for DENCLUE, hvaries from 1 to 20 with increments of 2, the ɛ varies from 1 to 25 with increments of 3.
According to Table 2, in general, the density-based algorithms offers better validation indices than k-means algorithm. In addition, compared to k-means, UALM shows absolutely better evaluation indices. Furthermore, Table 2 shows better evaluation characteristics of UALM compared to DBSCAN, while UALM and DENCLUE have almost the same validation indices for half of the datasets. However, DENCLUE algorithm is unable to cluster datasets that have multiple dense regions within each cluster (e.g. girl, Chainlink, Jv-set and butterfly). DENCLUE algorithm breaks this type of clusters into parts or merges the neighboring clusters. This will result in decreasing the validation indices which are sensitive to the true number of clusters. However, UALM and DBSCAN cluster this type of datasets with an acceptable validation indices.
In order to evaluate our proposed clustering algorithm in terms of running time, we set up two experiments and compared the run time of UALM, k-means, DBSCAN and DENCLUE algorithms. In all of the experiments, each algorithm was repeated ten times and then the average execution time was reported. In order to have a fair comparison, we used a dataset in which each grid contains only one datapoint. Therefore, in our simulation, the preprocessing of DENCLUE and UALM algorithms did not have any influence on the run time of these algorithms. In the first experiment, the number of clusters was constant and equal to 9, while the number of data in each cluster changed. Figure 17(a) shows the result of this experiment. As this figure shows, k-means needed the least run time while DENCLUE needed the most. The UALM running time was better than the two other density-based algorithms. In the second experiment, the total number of data was constant and equal to 10000, while the number of clusters varied. As Fig. 17(b) demonstrates, the DENCLUE and DBSCAN offer the worst running times. However, as the number of clusters increased UALM run time remained almost constant; whereas, that of K-means increased.
The DBSCAN and DENCLUE algorithms, which do a lot of computations for each data point, are slower than the proposed method; whereas, the UALM algorithm spreads the ink of data points with a simple summation and then finds the clusters. Therefore, UALM is faster than two other density-based algorithms as is depicted in Fig. 17.
It should be noted that as the number of datapoints increases, the gap in CPU time of the different algorithms increases as well. As an example, the CPU time for 3D64 dataset is 2 seconds for k-means, about 20 seconds for UALM, 750 seconds for DBSCAN, and more than 1200 seconds for DENCLUE.
It should be noted that the comparison is done on datasets which have one data point in each grid. This situation rarely happens in real datasets. Therefore, as reported in [13], DENCLUE is faster than DBSCAN. This is due to the gridding operation that exists in DENCLUE, although a variety of grid-based versions of DBSCAN are proposed in [62–65].
In order to show the insensitivity of our proposed approach to noise, a datasets with different levels of noise was used. The white noise was injected into the original data varying from 1 to 50 percent of the total number of data. Then the accuracy and the percentage of detected noise were recorded. K-means is not able to work in noisy environments and cannot detect noise. However, in this experiment k-means has been used as a classification method to classify noise data. Therefore, the value of k is set to a large value (about four times greater than the real number of clusters) and each cluster is labeled with the majority class instances. Lastly, our approach summarizes the number of noise instances in clusters labeled as noise clusters. Figure 18(a) shows the result of our proposed algorithm on an aggregation dataset with 50 percent white noise. Figure 18(b) shows the percentage of the detected noise vs. the percentage of injected noise. The UALM algorithm shows better performance compared with the other three algorithms. As Fig. 18(c) shows, the UALM algorithm also shows better accuracy compared with the three other algorithms in a noisy environment. However, since DENCLUE and UALM have the same mechanism to diminish noise and outliers their result are comparable as well.
In this experiment the value of parameters in the algorithms are selected based on the best AMI in all steps. The value of parameters in the algorithms varies the same as in section B, except for DENCLUE whose ξ parameter increments by 2 × 10-10 from 10-10 to 10-9. This example shows the superiority of UALM in distinguishing clusters in a noisy environment.
This paper introduces a novel density-based clustering algorithm based on Active Learning Method. The proposed clustering method called the Unsupervised Active Learning Method (UALM). The algorithm is described and experiments confirm the ability and efficiency of our algorithm compared to the k-means, DBSCAN and DENCLUE. Further experiments were run in order to show the effects of algorithm parameters on clustering results.
Some of the important advantages of UALM are its generality, noise-robustness, and an optional method for parameters setting. Another property of the UALM algorithm is its cluster representation technique, which uses several if-then rules. The advantage of this representation is that there is no need to save all grids belonging to the clusters, it is sufficient to keep only the boundaries of the clusters. Therefore UALM saves memory. In addition, it can detect clusters of arbitrary shapes since it is a density-based clustering method and relies on discovering densely-connected components of data. The algorithm is faster than the two most well-known density-based clustering algorithms, DBSCAN and DENCLUE. In addition, it shows better clustering efficiency. Based on our simulations, the result of our proposed clustering algorithm utilizing a mean policy is the same as the k-means clustering algorithm for easy clustering tasks where the clusters have convex shapes. Since outliers can shift the mean, k-means has a problem with the mean policy. In UALM, noisy data or outliers are eliminated by the threshold mechanism. Therefore, the result of the mean policy of the algorithm is better than that of k-means algorithm for noisy convex datasets. Furthermore, in complicated clustering tasks where the clusters are not convex, the proposed algorithm clusters in a good manner. The k-means algorithm fails to cluster such datasets. Another advantage of UALM compared with k-means is that k-means is sensitive to the initial cluster center points. Therefore, the result of k-means clustering is not unique for different runs of the algorithm. As a result, each time it runs it shows a different index parameter. The UALM algorithm shows unique results in different runs of its algorithms, while its parameters are fixed.
In addition, the UALM algorithm can act similar to DENCLUE which is a density-based clustering approach. DENCLUE has a solid mathematical foundation which uses density functions for each data and adds these functions. To finish it applies the threshold on the resulting function. Contrary to DENCLUE, UALM has a numerical foundation, using the ink drop spread for each data and applies the threshold. The DENCLUE algorithm uses a hill-climbing procedure based on the local density function and its gradient to determine the density-attractors for each point in the cubes. This requires a lot of computations and is therefore a time-consuming procedure. On the other hand, UALM uses the connected-component algorithm to discover clusters and finds cluster centers by intersecting the narrow-paths. As simulation confirms, the UALM algorithm is more than four times faster than DENCLUE and DBSCAN; moreover, it shows better clustering efficiency.
In the DENCLUE method it is not easy to cluster those datasets which have multiple dense regions within each cluster as the DENCLUE algorithm breaks such clusters into parts or mistakenly merges the neighboring clusters. This results in a decrease of validation indices that are sensitive to the true number of clusters. UALM correctly clusters these type of datasets; hence, it shows better validation indices than the DENCLUE method.
