Semi-supervised approach towards subspace clustering

Abstract

High-dimensional data analysis has become unavoidable as emerging technologies generate such data in domains like finance, healthcare, genomics and signal processing. Although data sets generated in these domains are high-dimensional, the intrinsic dimensions that provide meaningful information are often far fewer. Conventionally, unsupervised methods known as subspace clustering are used to find clusters in different subspaces of high-dimensional data by identifying relevant features, irrespective of the labels associated with each instance. Available label information, if incorporated into the clustering algorithm, can bias the algorithm towards solutions more consistent with our knowledge, leading to improved cluster quality. We therefore propose Information Gain based Semi-supervised Subspace Clustering (IGSC), which identifies a subset of important attributes based on the known label of each data instance. The label information is integrated into the search strategy for subspaces and leveraged in a model-based clustering approach. Experiments on 13 real-world labeled data sets demonstrate the feasibility of IGSC, and we validate the clusters obtained using a modified Davies-Bouldin Index (DBI) for semi-supervised clusters.

Keywords

Subspace clustering, semi-supervised, information gain, entropy

1 Introduction

The basic assumption of subspace clustering is that data often contain hidden structures that are revealed when the data are projected onto various subspaces [1, 2]. This is in contrast to traditional clustering algorithms, which use all dimensions to identify clusters in the full-dimensional space. Irrelevant features can degrade cluster quality by hiding clusters in noisy data, because points appear almost equidistant from each other as the number of dimensions increases (the curse of dimensionality [3]). Subspace clustering approaches are employed in such situations to discover hidden patterns in various subspaces of the data. One of the challenging phases of any subspace clustering method is identifying a subset of features that characterizes different groups [22]. Work related to feature extraction includes [5, 22]. Different subsets of features may reveal different hidden structures. If the data set is fully or partially labeled, the labels can aid in investigating different subsets of important features. Moreover, leveraging label information during the clustering process can also make validation more reliable. Subspace clustering in which label information, where available, is used for feature selection, clustering and cluster validation is called semi-supervised subspace clustering. Semi-supervised learning [16] can work only if the knowledge about the data distribution obtained from unlabeled data carries information about the class to which each data point belongs. This rests on the assumption that points in the same cluster are most likely to be of the same class. Thus, devising a semi-supervised subspace clustering algorithm has two advantages. The first is the discovery of labels for unlabeled data, and the second is the discovery of subspace clusters that may be more accurate and in accordance with our knowledge of the partially labeled data.

During the past decade, various subspace clustering techniques have been proposed. Based on the search strategy, subspace clustering algorithms can be divided into soft subspace clustering (SSC) and hard subspace clustering (HSC) [17]. HSC algorithms follow either a top-down or a bottom-up approach. While HSC [8, 10, 15] finds exact subspace clusters, SSC [4–6] attempts to find the contribution of each dimension to the clustering process.

The above mentioned algorithms have two obvious disadvantages.

The feature selection strategy used does not leverage the label information, if available.

The subspace clusters formed without considering the labels of the available data, D_l, may be inconsistent with D_l. Hence the clustering should be modified to achieve consistency with D_l.

In order to overcome the above two problems, a top down semi-supervised subspace clustering approach is proposed in this paper that takes into consideration the available labels of the dataset, to rank each feature so as to find important subsets of features pertaining to each cluster. Instead of learning feature weight through an iterative process which is computationally expensive, we use information gain of each attribute, corresponding to available labels, proposed by Quinlan [7]. Hence the name IGSC (Information Gain based Semi-supervised-Subspace clustering). Thus the semi supervised approach, using information gain for selecting relevant attributes, will help us in finding better subspaces and hence clusters in high dimensional data. We conduct experiments on thirteen real world labeled datasets and our approach is compared with two subspace clustering algorithms, PROCLUS and CLIQUE, and a traditional clustering algorithm, K-means. The validity of our proposed approach is demonstrated by experimental results. The contributions of our work include the following.

Formulation of a new strategy to find important features by explicitly incorporating the information about labels using the information theoretic approach of entropy and information gain.

Integrating the feature selection process into a top-down subspace clustering approach to form subspace clusters consistent with the knowledge about the labeled dataset.

The rest of the paper is organized as follows. Section 2 briefly describes related work in this area. Section 3 reviews the background on entropy, information gain and projected clustering. Section 4 introduces the proposed model IGSC and the devised algorithms. The experiments conducted to assess the feasibility and effectiveness of the proposed work are explained in Section 5. We conclude the paper in Section 6.

2 Related work

In this section we outline the trends in subspace clustering by reviewing some of the relevant work in this area. We bring out the pros and cons of existing algorithms and explain how our approach differs from them.

Based on the approach used to find subspaces, Subspace Clustering (SC) can be classified into Soft Subspace Clustering (SSC) and Hard Subspace Clustering (HSC). While HSC finds exact subspaces, SSC can be considered an extension of feature-weighting clustering. HSC can be further classified into bottom-up and top-down approaches.

In the bottom-up approach, clusters are discovered from lower- to higher-dimensional spaces. Bottom-up approaches include CLIQUE [15], SUBCLU [14] and MAFIA [18]. These algorithms make use of the downward closure property of density and hence reduce the search space. CLIQUE divides each dimension into non-overlapping intervals, forming a grid, and finds the dense regions according to a given threshold. SUBCLU, on the other hand, searches for objects that are density-connected and, in contrast to grid-based algorithms, detects arbitrarily shaped and positioned clusters in subspaces. Top-down subspace clustering begins with a full-dimensional search for clusters in the given data space. Subspaces are then identified in subsequent iterations by assigning weights to the dimensions in each cluster. PROCLUS [8], FIND-IT [10] and δ-CLUSTERS [11] are some of the top-down subspace clustering algorithms. PROCLUS finds initial clusters in full dimensions and then determines an appropriate subset of dimensions for each. ORCLUS [9], an extended version of PROCLUS, is another top-down algorithm that identifies non-axis-parallel subspaces. FIND-IT finds relevant dimensions using two ideas: a Dimension-Oriented Distance (DOD) measure and a dimension voting policy. In δ-CLUSTERS, a cluster model is proposed in which the members of a cluster exhibit some coherent tendency.

Soft subspace clustering identifies the contribution of each dimension to cluster generation. Various SSC algorithms assign weights to attributes, and the subspace clusters are formed according to these weights. Some algorithms identify the subspaces first and form the clusters later. Others assign weights to all dimensions according to their degree of relevance. Kohavi et al. [4] and Langley et al. [5] described the concept of attribute relevance and proposed the classical feature selection methods, the filter and wrapper models. Convex K-means [19] uses a separate feature weighting method that defines a set of viable weight groups before data clustering is performed. Fuzzy Subspace Clustering (FSC) [20] is another classical SSC method, in which clustering is performed by assigning a fuzzy weight to each dimension of each cluster. A method of attribute weighting was also described by Langley et al. [5]. Another feature weighting method, based on information gain, was described by N.F. Ayan [6].

None of these works takes available labels into consideration. If data labels are available, however, leveraging them for subspace clustering is useful not only for validating clusters but also for making the learning process consistent with the available knowledge. Hence our work uses entropy and information gain, computed with respect to the available labels, to determine the significance of the attributes that lead to subspace clusters. Influenced by soft subspace clustering, where the contribution of each individual feature is taken into consideration, we propose IGSC, an Information Gain based Semi-supervised Subspace Clustering algorithm that follows a top-down approach similar to [8].

3 Background

In this section, we briefly describe Entropy and Information Gain, an information theoretic measure, which we utilize in our proposed model for finding significant features for a subspace cluster. This is followed by a short review of PROCLUS [8], which is adapted for our model.

3.1 Entropy and information gain

Entropy [7] can be defined as a measure of uncertainty of a probability distribution. It is also defined as the impurity of an arbitrary collection of examples. Let E = {E₁, . . . , E_n} be a set of events each with probability of occurrence p_i such that $\sum_{i = 1}^{n} p_{i} = 1$ . Since the occurrence of events with smaller probability yields more information as they are least expected, a measure of information h should be a decreasing function of p_i. Claude Shannon proposed a log function h (p_i) to express information. This function is given as $h (p_{i}) = \log_{2} \frac{1}{p_{i}}$ which decreases from infinity to 0. This function reflects the idea that lower the probability of an event to occur, higher the amount of information in the message stating that the event occurred. From these n information values h (p_i), the expected information content H called entropy is derived by weighting the information values by their respective probabilities. $H = - \sum_{i = 1}^{n} p_{i} * \log_{2} p_{i}$
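The entropy formula can be checked numerically with a few lines of Python. This is only an illustrative sketch: the function name `shannon_entropy` and the zero-probability convention (terms with p_i = 0 are skipped, following the limit of p log p) are our assumptions, not part of the paper.

```python
import math

def shannon_entropy(probs):
    """Expected information H = sum(p_i * log2(1 / p_i)), in bits."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Events with p_i = 0 are skipped: they never occur and add no information.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# A fair coin is maximally uncertain for two outcomes; a certain event is not
# uncertain at all.
print(shannon_entropy([0.5, 0.5]))
print(shannon_entropy([1.0]))
```

A uniform distribution over four events gives two bits, which matches the intuition that two yes/no questions are needed to identify the outcome.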

Information gain is a measure of change in entropy when new information is obtained. Thus it aims at reducing the entropy of system with availability of new information. We leverage the concept of information gain to find the significant attributes of a subspace cluster by partitioning the partially labelled dataset in an optimal way as explained in later sections. Formally, we define information gain as follows.

Information Gain: Let D denote a set of N examples and let P denote a set of c classes. Pr (p_i, D) denotes the fraction of instances in D that belong to class p_i ∈ P. The information contained in D is given by: $Inf (D) = - \sum_{i = 1}^{c} \Pr (p_{i}, D) \times \log_{2} (\Pr (p_{i}, D))$

Let v be the number of distinct values of attribute a_i, and let D_j be the set of examples in D that take the jth value v_j of a_i. Then the information obtained by considering a_i as a significant attribute, that is, the expected information that remains after D is partitioned on a_i, is: ${Inf}_{a_{i}} (D) = \sum_{j = 1}^{v} \frac{| D_{j} |}{| D |} \times Inf (D_{j})$

The information gain is obtained by, $Gain (a_{i}) = Inf (D) - {Inf}_{a_{i}} (D)$
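The gain of a single discrete attribute can be sketched as follows. The function names `inf` and `gain` and the toy data are hypothetical, and a base-2 logarithm is assumed so that the result agrees with the entropy of Equation 2.

```python
import math
from collections import Counter, defaultdict

def inf(labels):
    """Inf(D): entropy, in bits, of the class labels of the examples in D."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def gain(values, labels):
    """Gain(a) = Inf(D) - Inf_a(D) for one discrete attribute a."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)  # one group of labels per distinct attribute value
    n = len(labels)
    # Size-weighted entropy that remains after partitioning D on the attribute.
    inf_a = sum(len(g) / n * inf(g) for g in groups.values())
    return inf(labels) - inf_a

labels = ["a", "a", "b", "b"]
print(gain(["x", "x", "y", "y"], labels))  # attribute separates the classes
print(gain(["x", "y", "x", "y"], labels))  # attribute is uninformative
```

An attribute whose values split the classes perfectly recovers the full entropy of D as gain, while an attribute that is independent of the labels has zero gain.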

3.2 Review of projected clustering

Given a set of data points D, algorithms that aim at finding a unique assignment of each point to exactly one subspace cluster are called projected clustering algorithms [1]. PROCLUS [8], FIND-IT [10] and δ-CLUSTERS [11] are projected clustering algorithms. We adapt PROCLUS [8] to make it suitable for our proposed model. PROCLUS finds axis-parallel subspaces by following a top-down approach, thereby avoiding the exponential search space of all possible subspaces. The principle of top-down approaches is to perform an initial clustering in the full-dimensional space based on a potential set of representative elements, such that the points meet the given cluster criterion when projected onto the corresponding subspace. Most top-down approaches assume that the subspace of a cluster can be derived from the local neighbourhood (in the full-dimensional data space) of the cluster representative. Broadly, there are three main phases: first, identifying the potential cluster representatives to form initial clusters in full dimensions; second, iteratively forming the local neighbourhoods and finding the subspaces until the objective function is met; third, refining the subspace clusters formed.

Our proposed model differs from the above strategy in two aspects. The first is that information gain is computed in the local neighbourhood to identify the important attributes that help in forming subspace clusters. The second is the objective function, which aims at reducing the entropy of the subspace clusters. We thus ensure that label information, if available, is used to form subspace clusters consistent with the knowledge about the dataset.

4 Semi-supervised subspace clustering

Semi-supervised subspace clustering can not only help in forming subspace clusters consistent with the labeled dataset, but can also aid in modifying and extending the labels to reflect the hidden structures in low dimensional data. The proposed model searches for a good partition by formulating an objective function that aims at minimizing the entropy of each cluster.

4.1 Problem formulation

Let D be a dataset in d dimensions with each element t ∈ D defined by a set of attributes A = {a₁, . . . , a_d} and a label y. Each instance t is either labeled with y = p_i or unlabeled with y = 0, where p_i ∈ P = {p₁, . . . , p_c} and c is the total number of class labels. We define a subspace cluster S_i as a cluster of elements that are most similar to each other with respect to a representative element m_i ∈ D, called the medoid, with label p_i, along dimensions A_i ⊂ A, where |A_i| = l and l ≤ d. In this context, an objective function F is devised to find an optimal set of K subspace clusters S = {S₁, . . . , S_K} so as to reduce the entropy of each subspace cluster S_i in dimensions A_i with m_i as the medoid.

Let K denote the number of clusters, M ⊂ D denote the set of potential representative elements called medoids and P denote the set of class labels with |P| = c. Each subspace cluster S_i is defined by a subset of dimensions A_i ⊂ A and a medoid m_i ∈ M. Then, the objective function F is defined as a minimization function given by Equation 1. $F (K, M, P) = minimize \sum_{i = 1}^{K} H (S_{i}, P) \frac{| S_{i} |}{| D |}$ (1) where, $H (S_{i}, P) = - \sum_{p_{j} \in P} \Pr (p_{j}, S_{i}) \log_{2} \Pr (p_{j}, S_{i})$ (2)

Here, H (S_i, P) is the entropy of the cluster S_i with dimensions A_i and medoid m_i with respect to the available set of labels P. Pr (p_j, S_i) is the probability of an instance being classified as p_j, in cluster S_i. The notations used in our paper are listed in Table 1.
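Equations 1 and 2 can be evaluated for a candidate clustering with the sketch below. Two points are our assumptions, since the text does not spell them out: unlabeled instances (y = 0) are ignored when estimating Pr (p_j, S_i), while |S_i| counts every member of the cluster. The names `cluster_entropy` and `objective` are hypothetical.

```python
import math
from collections import Counter

UNLABELED = 0  # the marker y = 0 used for unlabeled instances

def cluster_entropy(labels):
    """H(S_i, P) of Equation 2, estimated from the labeled members only."""
    known = [y for y in labels if y != UNLABELED]
    n = len(known)
    return sum((c / n) * math.log2(n / c) for c in Counter(known).values())

def objective(clusters, dataset_size):
    """Size-weighted sum of cluster entropies, the quantity minimized in Equation 1."""
    return sum(cluster_entropy(s) * len(s) / dataset_size for s in clusters)

# Two clusters over |D| = 5 instances, each given as the list of member labels.
pure = [1, 1, UNLABELED]   # all labeled members agree
mixed = [1, 2]             # labels split evenly
print(objective([pure, mixed], dataset_size=5))
```

A clustering in which every cluster contains a single class among its labeled members reaches the minimum value of zero.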

Table 1

Notations used in the Algorithm

Notation	Description
D	Numerical dataset
l	Average number of dimensions
K	Number of clusters
S_i	ith subspace cluster
A_i	Set of dimensions associated with S_i
M	Maximum distant medoid set
Med_current	Medoid set from the current iteration
Med_best	Best medoids found so far
dimensionSets	Final dimensions and medoids
δ	Distance between two instances
O, B	Constant integers

The minimization of the objective function is primarily driven by the importance of features in each subspace cluster. Here we leverage the information about labels, if available, to seek important features that can aid in exploring hidden structures in subsets of features of dataset. This is achieved by finding the role of each attribute in labelling each instance. We use the theory of entropy and information gain to weight the significance of each attribute.

4.2 Proposed model

We propose an information theoretic approach, IGSC (Information Gain based Semi-supervised Subspace Clustering), for forming subspace clusters from a partially labeled dataset. Our approach consists of three phases, namely medoid selection, iterative semi-supervised subspace extraction and cluster refinement, as shown in Fig. 1. Finding semi-supervised subspace clusters involves two challenges. First, a potential set of elements must be found to serve as representatives of the subspace clusters. Second, the most informative attributes that form the subspace clusters must be identified. The solution follows a top-down approach to subspace clustering.

Fig.1

Block Diagram of proposed model for semi-supervised subspace clustering based on Information Gain (IGSC).

A) Initial Clustering: This phase finds a potential set of elements that can serve as representatives of the subspace clusters. We ensure that only labeled instances from D are used as representatives. These potential representatives, called medoids, are chosen so that they are as far from each other as possible. We use the L₁ norm to find the farthest medoids in a high-dimensional dataset because it is robust to outliers and promotes sparsity. For high-dimensional data, the meaningfulness of the L_p norm worsens faster as p increases [23]. Hence we use the L₁ norm, which is preferable to the L₂ norm in high-dimensional data. A greedy technique [8] is adopted to extract a maximum distant medoid set M of size |M| > K. This set is the basis for the initial clustering in full dimensions.
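One way to realize the greedy medoid extraction is a farthest-first traversal under the L₁ norm, sketched below. The function names, the fixed starting point (`start`, where a random start could equally be used) and the toy data are assumptions made for illustration; the paper only states that a greedy technique from [8] is applied to labeled instances.

```python
def l1(a, b):
    """Manhattan (L1) distance between two equal-length points."""
    return sum(abs(x - y) for x, y in zip(a, b))

def greedy_medoids(points, labels, size, start=0):
    """Farthest-first choice of `size` labeled points; returns their indices."""
    cand = [i for i, y in enumerate(labels) if y != 0]  # labeled instances only
    if size > len(cand):
        raise ValueError("not enough labeled points")
    chosen = [cand[start]]
    # dist[i]: distance from candidate i to its nearest already chosen medoid
    dist = {i: l1(points[i], points[chosen[0]]) for i in cand}
    while len(chosen) < size:
        rest = [i for i in cand if i not in chosen]
        nxt = max(rest, key=lambda i: dist[i])  # farthest from the chosen set
        chosen.append(nxt)
        for i in cand:
            dist[i] = min(dist[i], l1(points[i], points[nxt]))
    return chosen

points = [[0, 0], [1, 0], [10, 0], [0, 9], [5, 5]]
labels = [1, 1, 2, 2, 0]  # the last point is unlabeled and is never selected
print(greedy_medoids(points, labels, size=3))
```

Each new medoid maximizes its distance to the nearest medoid already selected, so the resulting set is spread over the labeled part of the data.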

B) Iterative Semi-supervised Subspace Extraction: As stated at the beginning of this section, the proposed model searches for a good partition by minimizing the entropy of each cluster. The relevant features are found by measuring the uncertainty of each feature and the information gained by including each feature in deciding the label of an instance.

In this phase, K medoids are selected at random from the medoid set M. Each of the K medoids is used to generate a cluster in full dimensions. Based on this initial clustering, we apply the information gain approach to rank each feature. For each cluster generated, we discretise the features to transform continuous values into discrete ones. Discretization generally helps to improve the classification performance of algorithms that are sensitive to the dimensionality of the data [13]. We use Fayyad and Irani’s MDL method [12], an entropy-based supervised discretization algorithm. Discretization enhances the amount of information we get from the data. For each discretised feature, we compute the information gain. The feature with the highest information gain is given the highest weight (rank). The algorithm then advances as follows: Med_current stores the randomly selected K medoids, and dimensionSets stores the relevant dimensions that generate the subspace for each medoid. Each medoid is associated with its subspace, and the resulting semi-supervised subspace clusters are evaluated using the entropy objective of Equation 1. Clusters that contain fewer than 1% of |D| data points are considered outliers, and the corresponding medoids are replaced with new medoids selected at random from M. The best medoids found so far are stored in Med_best. This process continues iteratively until a termination criterion is reached, as given in Algorithm 1. Algorithm 2 describes the selection of a set of attributes that contribute significant information about the dataset. We restrict the attribute selection using a threshold l; thus, instead of a full-dimensional comparison, we only need to consider a subset of important attributes.
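The bookkeeping of one iteration can be sketched as follows: keep the l highest-ranked features, assign points to their nearest medoid, and flag medoids whose clusters are too small. All names are hypothetical. Dividing the L₁ distance by |A_k| is an assumption borrowed from the segmental distance of PROCLUS so that subspaces of different size remain comparable; the text does not state how distances are normalized.

```python
def top_l(gains, l):
    """Indices of the l features with the highest information gain."""
    return sorted(range(len(gains)), key=lambda u: gains[u], reverse=True)[:l]

def assign(points, medoids, dims):
    """Index of the nearest medoid for every point.

    medoids[k] is a point and dims[k] lists the attribute indices A_k.
    """
    def dist(p, k):
        return sum(abs(p[u] - medoids[k][u]) for u in dims[k]) / len(dims[k])
    return [min(range(len(medoids)), key=lambda k: dist(p, k)) for p in points]

def bad_medoids(assignment, n_clusters, min_frac=0.01):
    """Clusters holding fewer than min_frac * |D| points (1% by default)."""
    sizes = [0] * n_clusters
    for k in assignment:
        sizes[k] += 1
    return [k for k, s in enumerate(sizes) if s < min_frac * len(assignment)]

points = [[0, 0], [0, 1], [1, 0], [9, 9]]
medoids = [[0, 0], [9, 9]]
dims = [[0, 1], [0, 1]]
assignment = assign(points, medoids, dims)
print(assignment, bad_medoids(assignment, n_clusters=2, min_frac=0.5))
```

In a full iteration the flagged medoids would be swapped for other members of M and the objective of Equation 1 re-evaluated; those steps are omitted here.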

C) Final Refinement: The dimensions and cluster assignments from the iterative phase are passed through a refinement phase (see Algorithm 3) to improve cluster quality. For each medoid m_i and its corresponding dimension set A_i, we find the least Manhattan distance δ_i, measured in A_i, to any of the other (K - 1) medoids. Points outside the sphere of influence with radius δ_i in dimensions A_i are considered outliers for the respective cluster.
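The outlier test of the refinement phase can be sketched as below. This is not Algorithm 3 itself: the function name `refine`, the use of -1 as an outlier marker and the choice to treat points exactly on the boundary as inliers are assumptions. At least two medoids are required.

```python
def refine(points, assignment, medoids, dims):
    """Replace the cluster index of every outlier by -1."""
    def l1(p, q, A):
        return sum(abs(p[u] - q[u]) for u in A)
    # delta[i]: least L1 distance, measured in A_i, from m_i to another medoid
    delta = [
        min(l1(m, other, dims[i]) for j, other in enumerate(medoids) if j != i)
        for i, m in enumerate(medoids)
    ]
    # A point stays in its cluster only if it lies within radius delta of its medoid.
    return [
        k if l1(p, medoids[k], dims[k]) <= delta[k] else -1
        for p, k in zip(points, assignment)
    ]

medoids = [[0, 0], [10, 0]]
dims = [[0], [0]]                 # both clusters live in the first dimension
points = [[1, 5], [30, 0], [9, 0]]
print(refine(points, [0, 1, 1], medoids, dims))
```

The second point is 20 units from its medoid in the relevant dimension, which exceeds the radius of 10, so it is the only one marked as an outlier.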

5 Experimental analysis

In this section, we present our experimental results and comparison of our semi-supervised subspace clustering algorithm with two subspace clustering algorithms namely, PROCLUS and CLIQUE as well as with a traditional clustering approach K-means, to show the effectiveness of our proposed work.

Experimental Setup:

The experiments were conducted on a machine with a 2.8 GHz/1.9 GHz AMD Quad-Core processor and 6 GB of RAM. We used Weka 3.8 together with OpenSubspace (Weka Subspace-Clustering Integration) to compare various algorithms with our approach. Performance is measured in terms of scalability and cluster quality.

Performance Parameters:

We used the following parameters in our empirical study to demonstrate the effectiveness of the proposed algorithm: the size of the dataset |D|, the number of subspace clusters K and the average dimensionality of each subspace cluster l. For scalability, we replicated the Pen digits and Synthetic datasets in terms of both dimensions and number of instances. The proposed approach was then applied to these datasets to determine how the running time and the cluster quality vary with |D|, K and l.

Data Sets: To evaluate the effectiveness and performance of our semi-supervised subspace clustering algorithm, 13 publicly available data sets from the UCI repository were used.

These 12 real world data sets are high dimensional, with dimensionality varying from 9 to 27,680 and the number of instances from 34 to 8000, as shown in Fig. 2. For the scalability experiments, we replicated the instances of the Pendigits and Synthetic datasets up to 400000 records. Multi-interval discretization using the MDL method [12] was used to discretize the continuous-valued attributes of the datasets.

Fig.2

Data sets used for experimentation.

Experimental procedure: Each experiment was repeated 15 times and the average values were used for the final evaluation. The proposed algorithm is compared with the other algorithms on the different datasets. A confusion matrix is generated for each data set. The metrics used for validation are purity, accuracy based on the F1 measure, entropy and DBI.

5.1 Results and analysis

In this section the experimental results based on the observations are discussed.

Cluster Quality: DBindex (Davies-Bouldin index) [24] is a metric used for evaluating clustering algorithms. It is based on the ratio of the scatter of points within clusters (SC) to the separation between clusters (SP), so lower values indicate better clustering. We modified DBindex to calculate the inter-cluster distances and the within-cluster scatter in subsets of dimensions using the Manhattan distance.

For a given clustering, our custom DBindex is defined as follows, $DB (K, M) = \frac{1}{| K |} \sum_{i = 1}^{| K |} \max_{j \neq i} \frac{SC (S_{i}) + SC (S_{j})}{SP (S)_{i, j}}$ where, $\begin{matrix} SC (S_{i}) & = & \frac{1}{| S_{i} |} \sum_{j = 1}^{| S_{i} |} \| t_{j}^{A_{i}} - m_{i}^{A_{i}} \|_{1} \\ SP (S)_{i, j} & = & \| m_{i}^{A_{i} \cup A_{j}} - m_{j}^{A_{i} \cup A_{j}} \|_{1} = \sum_{u \in (A_{i} \cup A_{j})} | a_{u, i} - a_{u, j} | \end{matrix}$

Here, |K| is the total number of clusters, $t_{j}^{A_{i}}$ is the jth instance assigned to S_i restricted to the dimensions A_i, m_i is the medoid of cluster S_i, and a_{u,i} denotes the value of the uth attribute of m_i.
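The modified index can be computed directly from the definitions of SC and SP, as in the sketch below. The function name `modified_dbi` and the data layout (one list of points, one medoid and one list of dimension indices per cluster) are assumptions; the sketch also assumes that no two medoids coincide on the union of their dimensions, since SP would then be zero.

```python
def modified_dbi(clusters, medoids, dims):
    """clusters[i]: points of S_i; medoids[i]: m_i; dims[i]: indices in A_i."""
    def l1(p, q, A):
        return sum(abs(p[u] - q[u]) for u in A)
    K = len(clusters)
    # SC(S_i): mean L1 distance of the members to the medoid, measured in A_i
    sc = [
        sum(l1(t, medoids[i], dims[i]) for t in clusters[i]) / len(clusters[i])
        for i in range(K)
    ]
    total = 0.0
    for i in range(K):
        ratios = []
        for j in range(K):
            if j == i:
                continue
            union = set(dims[i]) | set(dims[j])
            sp = l1(medoids[i], medoids[j], union)  # SP(S)_{i,j}
            ratios.append((sc[i] + sc[j]) / sp)
        total += max(ratios)  # worst-case neighbour of cluster i
    return total / K

clusters = [[[0, 0], [2, 0]], [[10, 0], [12, 0]]]
medoids = [[0, 0], [10, 0]]
dims = [[0], [0]]
print(modified_dbi(clusters, medoids, dims))
```

Compact clusters with well separated medoids give a small value, which is why lower DBindex values in Table 2 indicate better clusterings.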

Purity: Purity measures the quality of clusters as the fraction of instances that belong to the dominant class of their cluster. To compute purity, we generate a confusion matrix whose [i, j]th entry is the number of records assigned to the ith output cluster that belong to the jth input class. For a clustering S = {S₁, ⋯ , S_K} and labels P = {p₁, ⋯ , p_c}, purity is calculated as $purity (S, P) = \frac{1}{N} \sum_{i} \max_{j} | S_{i} \cap p_{j} |$, where N is the total number of instances.
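Purity can be computed without materializing the full confusion matrix, as in this sketch. The function name `purity` and the toy labels are hypothetical, and every instance is assumed to carry a ground-truth label.

```python
from collections import Counter

def purity(cluster_ids, labels):
    """Fraction of points that carry the majority label of their cluster."""
    per_cluster = {}
    for k, y in zip(cluster_ids, labels):
        per_cluster.setdefault(k, Counter())[y] += 1  # one confusion-matrix row
    # For each cluster take the largest class count, then divide by N.
    return sum(max(c.values()) for c in per_cluster.values()) / len(labels)

cluster_ids = [0, 0, 0, 1, 1]
labels = ["a", "a", "b", "b", "b"]
print(purity(cluster_ids, labels))
```

Four of the five points agree with the majority class of their cluster here, so the purity is 0.8; a value of 1 means every cluster contains a single class.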

Figures 3 and 4 show the confusion matrices generated for the Synthetic and Pen digits datasets, with the number of clusters equal to the original number of classes defined for each dataset.

Fig.3

Confusion matrix with rows representing class labels p_i and columns representing the number of instances in each cluster with class label p_i of Synthetic dataset, with avg. dimensions l = 15.

Fig.4

Confusion matrix with rows representing class labels p_i and columns representing the number of instances in each cluster with class label p_i of Pendigits dataset, with avg. dimensions l = 15.

Table 2 shows the purity and DBI measures of the subspace clusters formed. IGSC outperforms the subspace clustering algorithm PROCLUS when both the number of clusters and the number of features are high, as for the ECML and B-cell2 datasets. The purity of IGSC exceeds that of PROCLUS and K-means for datasets with very high dimensionality such as ECML and GCM. For the low-dimensional datasets Pen digits and Glass, however, the information gain approach does not give a significant improvement, and K-means attains higher purity on both. This is because, for low-dimensional datasets, the attributes contribute almost equally to determining the labels of instances.

Table 2

Comparison of the IGSC, PROCLUS and K-means clustering methods on 13 datasets from the UCI repository. The parameter K is set to the number of classes |P| of each dataset, and the average number of dimensions per subspace cluster is l = d/3. † and ‡ indicate that IGSC outperforms the other methods in purity and DBI, respectively.

Dataset	K	l	Method	Purity	DBindex
B-cell1.arff	2	1342	IGSC	0.84†	1.67
	2	1342	PROCLUS	0.6	1.60
	2	n/a	KMEANS	0.6	2.80
B-cell2.arff	11	1342	IGSC	0.95†	1.33‡
	11	1342	PROCLUS	0.43	1.45
	11	n/a	KMEANS	0.56	1.95
B-cell3.arff	9	1342	IGSC	0.96†	1.45
	9	1342	PROCLUS	0.76	1.33
	9	n/a	KMEANS	0.73	1.84
Colon.arff	2	667	IGSC	0.77†	1.76
	2	667	PROCLUS	0.64	0.59
	2	n/a	KMEANS	0.64	1.27
Embryonal.arff	2	2376	IGSC	0.63	1.52
	2	2376	PROCLUS	0.63	1.25
	2	n/a	KMEANS	0.65	2.10
Leukemia1.arff	2	2376	IGSC	0.92†	1.73
	2	2376	PROCLUS	0.60	1.57
	2	n/a	KMEANS	0.58	2.58
Leukemia2.arff	2	2376	IGSC	0.91†	1.55‡
	2	2376	PROCLUS	0.72	1.77
	2	n/a	KMEANS	0.71	3.12
Pen digits.arff	10	5	IGSC	0.61	2.20
	10	5	PROCLUS	0.49	1.90
	10	n/a	KMEANS	0.92	1.69
Synthetic.arff	13	15	IGSC	0.85	2.09‡
	13	15	PROCLUS	0.75	2.29
	13	n/a	KMEANS	0.87	3.46
ECML.arff	43	9226	IGSC	0.93†	0.63‡
	43	9226	PROCLUS	0.54	1.17
	43	n/a	KMEANS	0.65	1.05
GCM.arff	14	5354	IGSC	0.71†	1.07‡
	14	5354	PROCLUS	0.66	1.22
	14	n/a	KMEANS	0.45	1.81
Diabetes.arff	2	3	IGSC	0.77†	1.17‡
	2	3	PROCLUS	0.60	1.58
	2	n/a	KMEANS	0.65	4.24
Glass.arff	6	3	IGSC	0.50	2.86
	6	3	PROCLUS	0.47	2.33
	6	n/a	KMEANS	0.87	4.21

Accuracy: Since the approach is semi-supervised, F1 measure has been used to analyze the classification accuracy. We compute the precision and recall from the given confusion matrix where number of clusters generated is taken as equal to the number of classes. Figure 5 shows the F1 measure for 13 data sets.

Fig.5

Comparison of Proposed approach with PROCLUS, KMEANS, and CLIQUE based on Accuracy measure F1-measure.

These results provide empirical support for the claim that information gain plays an important role in improving the accuracy of subspace clustering.

Entropy: Total entropy of a clustering defines its quality. As the total entropy decreases, the clustering gets stabilized. From Fig. 6 it can be observed that our approach gives better results for most of the datasets used for experimentation.

Fig.6

Comparison of Proposed approach with PROCLUS, KMEANS, and CLIQUE based on Cluster evaluation metric Entropy.

Execution time: From Fig. 7 it can be observed that PROCLUS takes more time for clustering as the number of dimensions increases. Compared to PROCLUS, our approach consumes less execution time even when the number of dimensions increases. The execution time of KMEANS is low since it does not find subspace clusters and is not concerned with the significance of attributes. The execution time of CLIQUE is low because each dimension is partitioned into ξ = 10 intervals and the density threshold is τ = 0.4. With these parameters very few clusters are formed, and the result is not guaranteed to be optimal.

Fig.7

Comparison of Proposed approach with PROCLUS, KMEANS, and CLIQUE based on Execution time in seconds.

Scalability of algorithm: To identify how our algorithm performs when the number of attributes and instances is scaled, we carried out a comparative study of PROCLUS, K-means and IGSC. We scaled the Pen digits and Synthetic datasets by duplicating instances and features arbitrarily. The following aspects were considered: 1) change in execution time while varying the number of instances, 2) effect of the number of instances on DBI, 3) effect of the threshold l on DBI and 4) effect of the number of clusters on DBI. The results are shown in Figs. 8 and 9. We observe that, as the number of features increases, PROCLUS needs more execution time than KMEANS, while the execution time of our approach improves comparatively. IGSC shows better cluster quality (DBI), even though its execution time is comparatively high for large numbers of instances.

Fig.8

(a): Execution of IGSC by varying |D|, (b)–(d): Evaluation of subspace clusters formed by the proposed model IGSC for the Pendigits dataset, based on the modified DBI measure, by varying |D|, l, and K.

Fig.9

(a): Execution of IGSC by varying |D|, (b)–(d): Evaluation of subspace clusters formed by the proposed model IGSC for the Synthetic dataset, based on the modified DBI measure, by varying |D|, l, and K.

6 Conclusion

We presented a semi-supervised subspace clustering algorithm, based on a novel attribute selection strategy, for high-dimensional partially labeled datasets. A top-down approach to subspace clustering that ranks attributes for extraction in high dimensions is introduced. The attribute selection and subspace extraction are carried out in such a way that the data are neither transformed nor stripped of their structure. Experiments on 13 datasets compare the performance of the proposed method with other subspace clustering algorithms. The experimental results establish the feasibility of using information theory for subspace selection in high-dimensional data spaces. The empirical results indicate that our approach generates results comparable to those of traditional methods. Future work lies in improving the execution time and extending the approach to non-numerical datasets.

Acknowledgments

The authors would like to thank Dr. M.R. Kaimal of the department of Computer Science, Amrita Vishwa Vidyapeetham for his valuable suggestions in improving this paper.

References

1. Kriegel H.P., Kroger and Zimek, Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering, ACM Transactions on Knowledge Discovery from Data 3(1) (2009).

2. Kriegel H.P., Kroger and Zimek, Subspace clustering, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(4) (2012), 351–364.

3. Beyer, Goldstein, Ramakrishnan, Shaft, When is nearest neighbors meaningful, Proceedings International Conference on Database Theory (ICDT) (1999), 217–235.

4. John G.H., Kohavi and Pfleger, Irrelevant features and the subset selection problem, Machine Learning: Proceedings of the Eleventh International Conference, Morgan Kaufmann, 1994.

5. Langley and Blum A.L., Selection of relevant features and examples in machine learning, Special issue of Artificial Intelligence on Relevance (1994).

6. Ayan N.F., Using information gain as feature weight, TAINN’99 8th Turkish Symposium on Artificial Intelligence and Neural Networks, Istanbul (1999), 48–57.

7. Quinlan J.R., Induction of decision trees, Machine Learning 1 (1986).

8. Aggarwal C.C., Wolf J.L., Yu P.S., Procopiuc and Park J.S., Fast algorithms for projected clustering, Proceedings of the 1999 ACM SIGMOD international conference on Management of data (1999), 61–72. ACM Press.

9. Aggarwal C.C. and Yu P.S., Finding generalized projected clusters in high dimensional spaces, Proceedings of the 2000 ACM SIGMOD international conference on Management of data (2000), 70–81. ACM Press.

10. Woo K.G. and Lee J.H., FINDIT: A Fast and Intelligent Subspace Clustering Algorithm using Dimension Voting. PhD thesis, Korea Advanced Institute of Science and Technology, Taejon, Korea, 2002.

11. Yang, et al., δ-clusters: Capturing subspace correlation in a large data set. In ICDE (2002), pp. 517–528.

12. Fayyad U.M. and Irani K.B., Multi-interval discretization of continuous valued attributes for classification learning, 13th International Joint Conference on Artificial Intelligence (1993), 1022–1027.

13. Lustgarten J.L., Gopalakrishnan, Grover and Visweswaran, Improving Classification Performance with Discretization on Biomedical Datasets, in AMIA Annu Symp Proc (2008), 445–449.

14.

Kailing

, Kriegel

H.P.

and Kroger

, Density-connected subspace clustering for high dimensional data, in proceedings of the 4th SIAM International Conference on Data Mining (2004), 46–257Orlando, FL.

15.

Agrawal

, Gehrke

, Gunopulos

and Raghavan

, Automatic subspace clustering of high dimensional data for data mining applications, Proceedings of the 1998 ACM SIGMOD international conference on Management of data (1998) 94–105, ACM Press.

16.

Zhu

and Goldberg

, Introduction to Semi-Supervised Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning (2009).

17.

Deng

, Choi

K.-S.

, Jiang

, Wang

and Wang

, A survey on soft subspace clustering, Information Sciences (2016).

18.

Goil

, Nagesh

and Choudhary

, MAFIA: Efficient and scalable subspace clustering for very large data sets, Technical Report CPDC-TR-9906-010 Northwestern University, 1999.

19.

Modha

D.S.

and Spangler

W.S.

, Feature weighting in k-means clustering, Machine Learning52(3) (2003), 217–237.

20.

Gan

G.J.

and Wu

J.H.

, A convergence theorem for the fuzzy subspace clustering (FSC) algorithm, Pattern Recognition41 (2008), 1939–1947.

21.

Sandhya

and Roy

M.M.

, Data integration of heterogeneous data sources using QR decomposition, Advances in Intelligent Systems and Computing385 (2016), 333–344.

22.

Harikumar

and Dilipkumar

D.U.

, Apriori algorithm for association rule mining in high dimensional data, in Proceedings of the 2016 International Conference on Data Science and Engineering, ICDSE 2016, 2016.

23.

Aggarwal

, Hinneburg

, Keim

On the surprising behavior of distance metrics in high dimensional space

Database Theory-ICDT 2001, Lecture Notes in Computer Science (2001), 420–434 , Berlin , HeidelbergSpringer.

24.

Davies

D.L.

and Bouldin

D.W.

, A cluster separation measure, IEEE Transactions on Pattern Analysis and Machine Intelligence1(2) (1979).