A New semi-supervised clustering for incomplete data

Abstract

Semi-supervised clustering partitions unlabeled data using prior knowledge from labeled data. Most semi-supervised clustering algorithms are designed only for complete data, i.e., data sets with no missing features. In this paper, an effort has been made to check the effectiveness of semi-supervised clustering when applied to incomplete data sets. The novelty of this approach is that it considers the missing features along with the available knowledge (labels) of the data set. The linear interpolation imputation technique initially imputes the missing features of the data set, thus completing the data set. Semi-supervised clustering is then employed on this completed data set, and the missing features are updated regularly within the clustering process. In the proposed work, the labeled percentages used are 30, 40, 50, and 60% of the total data. The data is further altered by arbitrarily eliminating certain features of its components, which makes the data incomplete with partial labeling. The proposed algorithm utilizes both labeled and unlabeled data, along with certain missing values in the data. The proposed algorithm is evaluated using three performance indices, namely the misclassification rate, Rand index, and error rate. Despite the additional missing features, the proposed algorithm has been successfully implemented on real data sets and showed results that are better than or comparable to those of well-known standard semi-supervised clustering methods.

Keywords

Semi-supervised clustering, labeled and unlabeled data, incomplete data, interpolation

1 Introduction

Clustering, an unsupervised learning technique, is used to group and relate the patterns in data sets. The patterns in the data sets are grouped according to some pre-defined similarity metrics. In general, clustering is a technique to partition the data set into groups of related data objects. Several algorithms have been proposed for data clustering and find applications in pattern recognition, image processing, medical sciences, information retrieval, and others.

Depending upon the knowledge of the group of data points, clustering can be classified as supervised and unsupervised clustering. In supervised clustering, the group in which data points lie is known in advance, whereas in unsupervised clustering, no such information is provided beforehand.

Real-world data sets are becoming large, and obtaining labels for data points becomes a time-consuming, lengthy, and expensive task. Combining supervised and unsupervised clustering gives rise to semi-supervised clustering, in which the data set is partially labeled. In semi-supervised clustering, labeled data points are used to partition the unlabeled points, which improves the accuracy of unsupervised clustering. Several semi-supervised fuzzy clustering methods (SSFCM) have been proposed in the literature, which can be classified by whether they modify the objective function or the clustering method. A lot of research has been done on SSFCM algorithms based on the modification of the objective function [4, 22]. Pedrycz [22] presented a semi-supervised clustering algorithm that introduces a penalty factor in the conventional objective function. A binary vector and a scaling factor are further introduced to distinguish the labeled data and to keep the semi-supervised and unsupervised parts of the objective function balanced, respectively [5, 24]. Bensaid et al. [2] presented a semi-supervised clustering method based on a modified clustering process, where the membership value of the data point is chosen as 1 for labeled data and 0 for unlabeled data. Zhang et al. [29] presented a semi-supervised kernel-based fuzzy c-means by combining a semi-supervised clustering algorithm with the kernel approach. All semi-supervised algorithms explored in the literature explicitly assume a complete data set.

Missing or unknown data, with several samples containing missing features, is another challenge. Such incomplete data occur in most application domains, including data mining, pattern recognition, image analysis, and environmental and medical sciences. Reasons for missing data may be numerous, including an improper data collection process, missing responses in surveys, and the high cost involved in obtaining certain features. Dixon [7] and Jain and Dubes [9] carried out empirical studies on incomplete data. Several techniques have been proposed for the clustering of data with missing features. Incomplete-data problems are generally dealt with via imputation, i.e., missing features are filled with some specific values. Imputation techniques include filling the missing data with zeros, the mean, or the median, and can be viewed as single imputation schemes [8, 16]. Toutenburg and Nittner [25] presented linear regression models for the incomplete categorical covariate. Zhang et al. [30] proposed the imputation of missing values with a genetic algorithm in the fuzzy probabilistic c-means algorithm. Hathaway and Bezdek [11] presented several strategies for the clustering of incomplete data. Jinhua Li et al. [19] developed K-means and K-median clustering methods for incomplete data. Dan Li et al. [18] utilized nearest neighbor intervals to extend the fuzzy c-means algorithm to incomplete data.

As most of the existing algorithms use complete data sets for a semi-supervised clustering algorithm, it is proposed to reformulate the semi-supervised algorithm for the clustering of incomplete data.

The proposed work addresses two issues that need to be articulated for the clustering of available data, i.e.

how clustering is done when missing attributes are present in the dataset and

how clustering is improved by utilizing the knowledge of a few labeled data samples.

The paper is organized as follows: Section 2 sums up the different methods utilized for handling incomplete data in clustering, semi-supervised clustering, and kernel-based semi-supervised algorithms. Section 3 and Section 4 describe the proposed semi-supervised clustering algorithm for incomplete data and the experimental results, respectively. Section 5 presents the conclusions drawn.

2 Background/related work

This section introduces various statistical and non-statistical techniques for the clustering of incomplete data. It then discusses fuzzy c means clustering and semi-supervised clustering algorithms.

2.1 Traditional techniques to handle incomplete data

Conventional clustering algorithms cannot analyze the data with missing values directly. The missing features must be imputed by some means. Data pre-processing and application of existing fuzzy clustering methods or modification of existing clustering methods are two main strategies for handling missing data. Data pre-processing is done by various statistical techniques, which include the case deletion method and imputation methods. There are four powerful techniques to handle missing data: deletion/removal strategy, imputation techniques, model-based techniques, and machine learning techniques.

2.1.1 Deletion/removal strategy

The most common approach is the removal strategy, which removes the missing data from the dataset. However, the limitation of this approach is that it can be applied only when a small portion of data is incomplete [11].

2.1.2 Imputation techniques

Imputation is the most commonly used technique for missing data; it simply replaces the missing features with a specific value. Imputation techniques are classified as single imputation and multiple imputation [6]. A single imputation technique replaces the missing feature with a specific value calculated using various statistical methods. The simplest imputation method is mean imputation, where the missing features are substituted by the mean of all the available values of that feature. This method introduces a bias in the datasets. This limitation of mean imputation can be reduced by median imputation, in which the median of the existing values replaces the missing feature. Nearest neighbor based imputation [3] fills the missing feature with the nearest value in the neighborhood. This strategy is generally more precise yet computationally costly for large datasets. Interpolation is also an imputation technique, used to approximate the missing value between two data points [17, 20]. Linear interpolation is the simplest of the three interpolation techniques, i.e., linear, quadratic, and cubic interpolation. It imputes the missing values of incomplete data by assuming linearity between two data points and computes the missing feature from the equation of the straight line through them. All three methods are found to be equally suitable for the estimation of missing attributes [20]. In the present work, the linear interpolation method is investigated for handling incomplete data in clustering.
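As an illustration of the single imputation schemes described above, the following minimal Python sketch fills the gaps in one feature column with the mean or the median of its observed values. The function name `impute_column` and the use of `None` to mark a missing value are assumptions made for this example only; they are not part of the cited methods.

```python
from statistics import mean, median


def impute_column(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


# One feature column with a single missing entry (None is an assumed marker).
column = [2.0, None, 3.0, 10.0]
print(impute_column(column, "mean"))
print(impute_column(column, "median"))
```

In this small column the outlying value 10.0 pulls the mean upward, whereas the median stays at a typical value, which is the kind of bias the text refers to.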

Regression imputation is a data-driven model, which preserves the data by imputing the missing value with an estimate from available data. The missing features are replaced with the regression of the available features [25]. In linear regression for modeling n data points, the dependent variable y_i is given as: $y_{i}^{*} = β_{0} + β_{i} x_{i} + ε_{i}, \quad i = 1, 2, \dots, n$ (1)

Where x_i is an independent variable, β₀ and β_i are fixed parameters and ε_i is termed as an error.

The multiple imputation approach offers different imputed values for each missing feature and estimates the missing features by suitably combining all imputed values [10]. Multiple imputation by chained equations (MICE) is commonly used to estimate missing features in incomplete data [26]. MICE utilizes regression techniques to approximate missing features.

2.1.3 Model-based techniques

Model-based techniques model the probability density function of incomplete data using the expectation maximization (EM) algorithm [14, 15]. EM is an iterative method that starts from an initial estimate of the missing values and refines it by repeating the estimation process until convergence. This algorithm is relevant for data missing completely at random (MCAR) or missing at random (MAR) but is not suitable for data missing not at random (MNAR).

2.1.4 Machine learning techniques

These techniques are non-imputation based techniques that modify the conventional fuzzy c means algorithm [11, 30]. In machine learning techniques, data is not pre-processed before the clustering and clustering process is modified in such a way that it can handle incomplete data in the clustering process itself.

Figure 1 describes the different techniques used for handling missing data in clustering.

Fig. 1

Techniques to handle missing data in fuzzy clustering.

2.2 Fuzzy c-means clustering

Fuzzy c-means clustering partitions a dataset Y = {y₁, y₂, …, y_k} into p fuzzy clusters with centres V = {v₁, v₂, …, v_p} by minimizing the objective function: $J (U, V) = \sum_{j = 1}^{p} \sum_{i = 1}^{k} u_{ji}^{m} ∥ y_{i} - v_{j} ∥^{2}$ (2)

Where

k - number of samples in the data set,

v_j - j^th cluster centre,

u_ji - membership of sample y_i in the j^th cluster (an element of the fuzzy partition matrix U),

m > 1 - fuzzification factor.

u_ji ∈ [0, 1] and satisfies the condition $\sum_{j = 1}^{p} u_{ji} = 1, \quad i = 1, 2, \dots, k$

FCM is an unsupervised clustering technique in which cluster centres and distances can be calculated only when all features of each data sample are available. Various modifications have been incorporated in conventional FCM to make it applicable to the clustering of data with missing features. Hathaway and Bezdek [11] utilized the fuzzy c-means algorithm to handle missing features in data by developing four strategies. The first one is the whole data strategy (WDS), which omits all data samples with missing features. This approach is applicable only when a small portion of the data is incomplete. When the fraction of incomplete data samples is large, the partial distance strategy (PDS) can be used, in which a partial distance function [7] is calculated using all available features for the clustering of incomplete data. In the optimal completion strategy (OCS), missing values are viewed as additional variables, and better estimates of them are determined by optimizing them together with the clustering. The nearest prototype strategy (NPS) is an iterative method that replaces missing features with the corresponding features of the nearest prototype.
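The partial distance used in PDS can be sketched as follows. This is a minimal Python illustration that assumes the commonly cited form of the partial distance, in which the squared differences over the observed features are rescaled by the ratio of the total number of features to the number of observed features. The function name `partial_distance` and the `None` marker for a missing feature are assumptions of this example.

```python
def partial_distance(x, v):
    """Squared partial distance between sample x (None = missing) and centre v.

    Only the observed features of x contribute; the sum is rescaled by
    (total features / observed features), which is the assumed PDS scaling.
    """
    present = [(xi, vi) for xi, vi in zip(x, v) if xi is not None]
    if not present:
        raise ValueError("sample has no observed features")
    scale = len(x) / len(present)
    return scale * sum((xi - vi) ** 2 for xi, vi in present)


# Sample with its second feature missing, compared with a cluster centre.
print(partial_distance([1.0, None, 3.0], [0.0, 5.0, 1.0]))
```

For a sample without missing features the scale factor is 1 and the function reduces to the ordinary squared Euclidean distance.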

2.3 Semi-supervised fuzzy c-means clustering (SSFCM1 & SSFCM2) and semi-supervised kernel-based fuzzy c-means (SSKFCM) algorithms

The main idea behind the semi-supervised clustering algorithm is to improve the clustering performance with prior knowledge of labeled data. Semi-supervised clustering methods are based on modifying the clustering process [2] or modifying the objective function [22]. This section presents a semi-supervised clustering algorithm (SSFCM1) based on the modification of the clustering process [2] and a semi-supervised clustering algorithm (SSFCM2) based on the modification of the objective function [22], where few data samples are labeled and a large number of data samples are unlabeled. Data set Y is arranged as: $Y = \{ \underbrace{y_{1}^{l}, \dots, y_{n_{l}}^{l}}_{\text{labeled}}, \underbrace{y_{1}^{u}, \dots, y_{n_{u}}^{u}}_{\text{unlabeled}} \}$ (3)

The labeled and unlabeled data are represented as follows: $Y = Y^{l} \cup Y^{u}$ (3a)

Here labeled and unlabeled data are represented by the superscripts l and u, respectively, whereas the numbers of labeled and unlabeled data points are n_l and n_u, respectively.

Partition matrix U is randomly initialized and is described as: $U = \{ \underbrace{U^{l} = [u_{ji}^{l}]}_{\text{labeled}} \mid \underbrace{U^{u} = [u_{ji}^{u}]}_{\text{unlabeled}} \}$ (4)

Typically, the value of $u_{ji}^{l}$ in U^l is chosen as 1 if labeled sample i belongs to class j and 0 otherwise, and it is known in advance.

Initial cluster centres are calculated taking labeled data into account as follows: $v_{j}^{0} = \frac{\sum_{i = 1}^{n_{l}} (u_{ji}^{l})^{m} Y_{i}^{l}}{\sum_{i = 1}^{n_{l}} (u_{ji}^{l})^{m}}, \quad 1 \leq j \leq p$ (5)

The membership $u_{ji}^{u}$ in U^u (unlabeled data) is then updated as follows: $u_{ji}^{u} = \frac{d_{ji}^{- \frac{1}{(m - 1)}}}{\sum_{k = 1}^{p} d_{ki}^{- \frac{1}{(m - 1)}}}, \quad 1 \leq j \leq p, \ 1 \leq i \leq n_{u}$ (6)

Where $d_{ji} = ∥ Y_{i}^{u} - v_{j} ∥^{2}$

Finally, cluster centres are updated as: $v_{j} = \frac{\sum_{i = 1}^{n_{l}} (u_{ji}^{l})^{m} Y_{i}^{l} + \sum_{i = 1}^{n_{u}} (u_{ji}^{u})^{m} Y_{i}^{u}}{\sum_{i = 1}^{n_{l}} (u_{ji}^{l})^{m} + \sum_{i = 1}^{n_{u}} (u_{ji}^{u})^{m}}$ (7)
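One iteration of the SSFCM1 updates can be sketched in Python as follows. The sketch assumes the standard inverse-distance form of the membership update in Eq. (6), with d_ji the squared Euclidean distance, and the weighted-mean form of Eq. (7). The function names and the list-of-lists data layout (`u[j][i]` is the membership of sample i in cluster j) are assumptions made for this illustration and are not taken from [2].

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def update_memberships(samples, centres, m=2.0):
    """Eq. (6): memberships u[j][i] of each unlabeled sample i in cluster j."""
    p = len(centres)
    u = [[0.0] * len(samples) for _ in range(p)]
    for i, y in enumerate(samples):
        d = [sq_dist(y, v) for v in centres]
        if min(d) == 0.0:
            # Sample coincides with a centre: assign it fully to that cluster.
            hit = d.index(0.0)
            for j in range(p):
                u[j][i] = 1.0 if j == hit else 0.0
            continue
        w = [dj ** (-1.0 / (m - 1.0)) for dj in d]
        total = sum(w)
        for j in range(p):
            u[j][i] = w[j] / total
    return u


def update_centres(labeled, u_l, unlabeled, u_u, m=2.0):
    """Eq. (7): weighted mean over labeled and unlabeled samples."""
    samples = labeled + unlabeled
    dim = len(samples[0])
    centres = []
    for j in range(len(u_l)):
        w = [uji ** m for uji in u_l[j] + u_u[j]]
        total = sum(w)
        centres.append(
            [sum(wi * y[c] for wi, y in zip(w, samples)) / total for c in range(dim)]
        )
    return centres


# Two labeled points (one per class) and one unlabeled point.
labeled = [[0.0, 0.0], [4.0, 0.0]]
u_l = [[1.0, 0.0], [0.0, 1.0]]          # crisp memberships of labeled data
unlabeled = [[1.0, 0.0]]
u_u = update_memberships(unlabeled, labeled, m=2.0)
print(u_u)
print(update_centres(labeled, u_l, unlabeled, u_u, m=2.0))
```

The memberships of the labeled samples stay fixed at their crisp values throughout, so only the unlabeled block of the partition matrix changes between iterations.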

Pedrycz and Waletzky [22] developed a semi-supervised clustering algorithm (SSFCM2) based on the following modified objective function:

$J = \sum_{j = 1}^{p} \sum_{i = 1}^{k} u_{ji}^{m} ∥ y_{i} - v_{j} ∥^{2} + α \sum_{j = 1}^{p} \sum_{i = 1}^{k} (u_{ji} - f_{ji} g_{i})^{m} ∥ y_{i} - v_{j} ∥^{2}$ (8)

Here, a binary vector g is introduced (g_i = 0 if y_i is unlabeled, else g_i = 1) to differentiate between labeled and unlabeled samples. The membership values of labeled data are represented in a matrix: $F = [f_{ji}], \quad 1 \leq j \leq p, \ 1 \leq i \leq k$ (8a)

The scaling factor α keeps the balance between the supervised and unsupervised parts of the objective function. The detailed SSFCM2 algorithm is given in [22].

Semi-supervised kernel-based fuzzy c-means uses the kernel version of fuzzy c-means within semi-supervised FCM, where the distance metric is the kernel-induced distance measure [28], given as follows:

$d (x, y) = ∥ ϕ (x) - ϕ (y) ∥ = \sqrt{K (x, x) - 2 K (x, y) + K (y, y)}$ (9)

Here ϕ is a non-linear function that maps y_k from the input space Y to a new space with high dimensions. The kernel function K (x, y) is defined as the inner product in the new space for (x, y) in the input space Y as follows: $K (x, y) = ϕ (x)^{T} ϕ (y)$ (10)

Several kernel functions are explored in the literature. In this work, Gaussian kernel function [27] is used for simplicity, which is given as: $K (x, y) = \exp (- \frac{∥ x - y ∥^{2}}{σ^{2}})$ (11)

Where parameter σ is given by $σ = \frac{1}{p} (\sqrt{\frac{\sum_{i = 1}^{k} ∥ y_{i} - \bar{y} ∥^{2}}{k}})$ (12)

Where $\bar{y}$ is the centroid of k data points
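The kernel quantities in Eqs. (9), (11), and (12) can be computed as in the following minimal Python sketch. The function names are assumptions made for this illustration; the formulas follow the equations as stated above, with p the number of clusters.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def sigma_from_data(data, p):
    """Eq. (12): (1/p) times the root mean squared distance to the centroid."""
    k, dim = len(data), len(data[0])
    centroid = [sum(y[c] for y in data) / k for c in range(dim)]
    return math.sqrt(sum(sq_dist(y, centroid) for y in data) / k) / p


def gaussian_kernel(x, y, sigma):
    """Eq. (11): K(x, y) = exp(-||x - y||^2 / sigma^2)."""
    return math.exp(-sq_dist(x, y) / sigma ** 2)


def kernel_distance(x, y, sigma):
    """Eq. (9): distance induced by the kernel in the mapped space."""
    value = (
        gaussian_kernel(x, x, sigma)
        - 2.0 * gaussian_kernel(x, y, sigma)
        + gaussian_kernel(y, y, sigma)
    )
    return math.sqrt(max(0.0, value))  # guard against tiny negative round-off


data = [[0.0], [2.0]]
sigma = sigma_from_data(data, p=1)
print(sigma, kernel_distance(data[0], data[1], sigma))
```

Because K(x, x) = 1 for the Gaussian kernel, the squared kernel-induced distance reduces to 2(1 - K(x, y)), which is why the factor 1 - K appears in the SSKFCM membership update.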

Finally, the partition matrix for unlabeled data and the cluster centres in SSKFCM [29] are updated as follows: $u_{ji}^{u} = \frac{(1 / (1 - K (Y_{i}^{u}, v_{j})))^{1 / (m - 1)}}{\sum_{k = 1}^{p} (1 / (1 - K (Y_{i}^{u}, v_{k})))^{1 / (m - 1)}}, \quad 1 \leq j \leq p, \ 1 \leq i \leq n_{u}$ (13)

$v_{j} = \frac{\sum_{i = 1}^{n_{l}} (u_{ji}^{l})^{m} K (Y_{i}^{l}, v_{j}) Y_{i}^{l} + \sum_{i = 1}^{n_{u}} (u_{ji}^{u})^{m} K (Y_{i}^{u}, v_{j}) Y_{i}^{u}}{\sum_{i = 1}^{n_{l}} (u_{ji}^{l})^{m} K (Y_{i}^{l}, v_{j}) + \sum_{i = 1}^{n_{u}} (u_{ji}^{u})^{m} K (Y_{i}^{u}, v_{j})}$ (14)

3 Proposed semi-supervised fuzzy c-means clustering algorithm for incomplete data (SSFCM_ID)

A semi-supervised clustering algorithm utilizes labeled as well as unlabeled data for clustering. In this proposed work, a semi-supervised FCM clustering algorithm [2] is extended for the clustering of incomplete data.

Initially, a complete data set Y which contains both labeled and unlabeled data samples is chosen. Therefore Y can be represented as a combination of labeled data Y^l and unlabeled data Y^u: $Y = \{ \underbrace{y_{1}^{l}, \dots, y_{n_{l}}^{l}}_{\text{labeled}}, \underbrace{y_{1}^{u}, \dots, y_{n_{u}}^{u}}_{\text{unlabeled}} \} = Y^{l} \cup Y^{u}$ (15)

Here n_l and n_u are the number of labeled and unlabeled data points in the dataset Y.

Both labeled and unlabeled data points in Y may contain some missing features also. So Y^l consists of labeled data with both missing and non-missing features i.e. $Y^{l} = Y_{M}^{l} \cup Y_{W}^{l}$ (15a)

Where Y_W denotes the whole data, i.e., the data points with all features present, and Y_M denotes the data points with missing features in the given dataset.

Similarly, Y^u is taken as the combination of both unlabeled data points with missing features and unlabeled data points without missing features. $Y^{u} = Y_{M}^{u} \cup Y_{W}^{u}$ (15b)

The missing features in both labeled and unlabeled data are initially filled by the linear interpolation imputation technique. Linear interpolation is used to handle missing data initially because it is fast and computationally efficient. However, it can create artifacts when data points are sparse. If the point y₂ contains a missing feature, then the points y₁ and y₃, which are the actual measurements taken immediately before and after y₂, respectively, are used to compute the value at y₂. Let (y₁₁, y₁₂), (y₂₁, y₂₂) and (y₃₁, y₃₂) be the coordinates of points y₁, y₂ and y₃, respectively, as shown in Fig. 2. The equation that computes the missing feature y₂₂ of point y₂ is as follows: $y_{22} = y_{12} + \frac{(y_{32} - y_{12})}{(y_{31} - y_{11})} (y_{21} - y_{11})$ (16)

Fig. 2

Linearity between missing and non-missing observations.
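Eq. (16) can be illustrated with the short Python sketch below. The function `interpolate` evaluates Eq. (16) directly. The helper `impute_series` is an assumption of this example: it takes the sample index as the first coordinate and copies the nearest observed value when a gap lies at the start or end of the column, since the text does not specify how such edge cases are handled.

```python
def interpolate(p1, p3, x2):
    """Eq. (16): value at x2 on the straight line through p1 and p3."""
    (x1, y1), (x3, y3) = p1, p3
    if x3 == x1:
        raise ValueError("the two reference points must differ in x")
    return y1 + (y3 - y1) / (x3 - x1) * (x2 - x1)


def impute_series(values):
    """Fill None entries from the nearest observed neighbours on each side."""
    obs = [i for i, v in enumerate(values) if v is not None]
    if not obs:
        raise ValueError("series has no observed values")
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        prev = max((j for j in obs if j < i), default=None)
        nxt = min((j for j in obs if j > i), default=None)
        if prev is None:            # gap at the start: copy the next value
            out[i] = values[nxt]
        elif nxt is None:           # gap at the end: copy the previous value
            out[i] = values[prev]
        else:
            out[i] = interpolate((prev, values[prev]), (nxt, values[nxt]), i)
    return out


print(interpolate((1.0, 2.0), (3.0, 6.0), 2.0))
print(impute_series([1.0, None, 3.0, None, None, 9.0]))
```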

Partition matrix U is initialized randomly and is described as: $U = \{ \underbrace{U^{l}}_{p \times n_{l}} \mid \underbrace{U^{u}}_{p \times n_{u}} \}$ (17)

Where $U^{l} = U_{M}^{l} \cup U_{W}^{l}$ and $U^{u} = U_{M}^{u} \cup U_{W}^{u}$

Cluster centres are initialized taking the whole labeled data into account from U^l as follows: $v_{j}^{0} = \frac{\sum_{i = 1}^{n_{l}} ((u_{W}^{l})_{ji})^{m} ((Y_{W}^{l})_{i})}{\sum_{i = 1}^{n_{l}} ((u_{W}^{l})_{ji})^{m}}, \quad 1 \leq j \leq p$ (18)

The membership grades in $U_{M}^{l}$ and $U_{W}^{l}$ are chosen as 1 if the data point is labeled with class j and 0 otherwise. Further, the partition matrices $U_{M}^{u}$ and $U_{W}^{u}$ are updated as: $(U_{M}^{u})_{ji} = \frac{1}{\sum_{k = 1}^{p} (\frac{∥ (Y_{M}^{u})_{i} - v_{j} ∥}{∥ (Y_{M}^{u})_{i} - v_{k} ∥})^{2 / (m - 1)}}, \quad 1 \leq j \leq p, \ 1 \leq i \leq n_{um}$ (19)

$(U_{W}^{u})_{ji} = \frac{1}{\sum_{k = 1}^{p} (\frac{∥ (Y_{W}^{u})_{i} - v_{j} ∥}{∥ (Y_{W}^{u})_{i} - v_{k} ∥})^{2 / (m - 1)}}, \quad 1 \leq j \leq p, \ 1 \leq i \leq n_{uw}$ (20)

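Where n_um is the number of unlabeled data points with missing features and n_uw is the number of unlabeled data points without missing features. Therefore n_u = n_um + n_uw. Similarly, n_lm and n_lw denote the numbers of labeled data points with and without missing features, so that n_l = n_lm + n_lw.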

Finally, the cluster centres and the missing vectors $Y_{M}^{u}$ and $Y_{M}^{l}$ are updated as: $v_{j} = \frac{\sum_{i = 1}^{n_{lw}} ((U_{W}^{l})_{ji})^{m} (Y_{W}^{l})_{i} + \sum_{i = 1}^{n_{lm}} ((U_{M}^{l})_{ji})^{m} (Y_{M}^{l})_{i} + \sum_{i = 1}^{n_{uw}} ((U_{W}^{u})_{ji})^{m} (Y_{W}^{u})_{i} + \sum_{i = 1}^{n_{um}} ((U_{M}^{u})_{ji})^{m} (Y_{M}^{u})_{i}}{\sum_{i = 1}^{n_{lw}} ((U_{W}^{l})_{ji})^{m} + \sum_{i = 1}^{n_{lm}} ((U_{M}^{l})_{ji})^{m} + \sum_{i = 1}^{n_{uw}} ((U_{W}^{u})_{ji})^{m} + \sum_{i = 1}^{n_{um}} ((U_{M}^{u})_{ji})^{m}}$ (21)

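$(Y_{M}^{l})_{i} = \frac{\sum_{j = 1}^{p} ((U_{M}^{l})_{ji})^{m} v_{j}}{\sum_{j = 1}^{p} ((U_{M}^{l})_{ji})^{m}}, \quad 1 \leq i \leq n_{lm}$ (22)

$(Y_{M}^{u})_{i} = \frac{\sum_{j = 1}^{p} ((U_{M}^{u})_{ji})^{m} v_{j}}{\sum_{j = 1}^{p} ((U_{M}^{u})_{ji})^{m}}, \quad 1 \leq i \leq n_{um}$ (23)

In Eqs. (22) and (23), only the missing components of each data point are replaced by the corresponding components of the right-hand side.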

The above discussion is summarized in the following algorithm.

Algorithm SSFCM_ID

Input: Y - a d-dimensional incomplete dataset that contains both labeled and unlabeled data points; p - number of classes; m > 1 - fuzzification factor; ε > 0 - termination accuracy

Output: Cluster centres v and the missing features of both labeled and unlabeled data, i.e., $Y_{M}^{l}$ and $Y_{M}^{u}$, respectively.

(1) Represent the given data set as $Y = \{Y^{l} \mid Y^{u}\} = \{Y_{M}^{l} \cup Y_{W}^{l} \mid Y_{M}^{u} \cup Y_{W}^{u}\}$

(2) Initialize all missing features, i.e., $Y_{M}^{l}$ and $Y_{M}^{u}$, with linear interpolation imputation using Equation (16).

(3) Initialize the fuzzy partition matrix U randomly and divide it as $U = \{U^{l} \mid U^{u}\} = \{U_{M}^{l} \cup U_{W}^{l} \mid U_{M}^{u} \cup U_{W}^{u}\}$

(4) Initialize v by applying standard FCM on the whole labeled data using Equation (18).

(5) Repeat

(6) Save the current cluster centres: v₁ = v

(7) Take the membership degree of labeled data as 1, separate the partition matrix of unlabeled data from U, and update it using Equations (19)-(20).

(8) Update the cluster centres v using Equation (21).

(9) Update all missing features, i.e., $Y_{M}^{l}$ and $Y_{M}^{u}$, using Equations (22)-(23).

(10) If ∥v − v₁∥ < ε then stop, else go to (5)
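The steps of the algorithm can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: missing values are marked with `None`, unlabeled samples carry the label `None`, interpolation runs over the sample index within each feature column (edge gaps copy the nearest observed value), every class is assumed to have at least one labeled sample without missing features, and the missing-feature update takes the membership-weighted combination of the cluster centres over the clusters. The names `fill_column` and `ssfcm_id` are hypothetical.

```python
def fill_column(col):
    """Linear interpolation over the sample index; edges copy the nearest value."""
    obs = [i for i, v in enumerate(col) if v is not None]
    out = list(col)
    for i, v in enumerate(col):
        if v is None:
            lo = max((j for j in obs if j < i), default=None)
            hi = min((j for j in obs if j > i), default=None)
            if lo is None:
                out[i] = col[hi]
            elif hi is None:
                out[i] = col[lo]
            else:
                out[i] = col[lo] + (col[hi] - col[lo]) * (i - lo) / (hi - lo)
    return out


def ssfcm_id(data, labels, p, m=2.0, eps=1e-5, max_iter=100):
    """Sketch of SSFCM_ID; returns (centres, completed data, memberships)."""
    n, dim = len(data), len(data[0])
    missing = [[v is None for v in row] for row in data]
    # Step 2: initial imputation of every feature column.
    cols = [fill_column([row[c] for row in data]) for c in range(dim)]
    y = [[cols[c][i] for c in range(dim)] for i in range(n)]
    # Crisp memberships for labeled samples; unlabeled entries start at 0.
    u = [[1.0 if labels[i] == j else 0.0 for i in range(n)] for j in range(p)]
    # Step 4: initial centres from labeled samples without missing features.
    v = []
    for j in range(p):
        whole = [y[i] for i in range(n) if labels[i] == j and not any(missing[i])]
        v.append([sum(s[c] for s in whole) / len(whole) for c in range(dim)])
    for _ in range(max_iter):
        # Step 7: update memberships of unlabeled samples.
        for i in range(n):
            if labels[i] is not None:
                continue
            d = [sum((a - b) ** 2 for a, b in zip(y[i], vj)) for vj in v]
            if min(d) == 0.0:
                w = [1.0 if dj == 0.0 else 0.0 for dj in d]
            else:
                w = [dj ** (-1.0 / (m - 1.0)) for dj in d]
            total = sum(w)
            for j in range(p):
                u[j][i] = w[j] / total
        # Step 8: update the cluster centres from all samples.
        new_v = []
        for j in range(p):
            w = [u[j][i] ** m for i in range(n)]
            total = sum(w)
            new_v.append([sum(w[i] * y[i][c] for i in range(n)) / total for c in range(dim)])
        # Step 9: update only the missing components of each sample.
        for i in range(n):
            w = [u[j][i] ** m for j in range(p)]
            total = sum(w)
            for c in range(dim):
                if missing[i][c]:
                    y[i][c] = sum(w[j] * new_v[j][c] for j in range(p)) / total
        # Step 10: stop when the centres no longer move.
        shift = max(abs(a - b) for vj, nj in zip(v, new_v) for a, b in zip(vj, nj))
        v = new_v
        if shift < eps:
            break
    return v, y, u


data = [[0.0, 0.0], [0.2, None], [0.1, 0.1], [5.0, 5.0], [5.2, 5.1], [None, 4.9]]
labels = [0, None, None, 1, None, None]
centres, completed, memberships = ssfcm_id(data, labels, p=2)
print(centres)
print(completed)
```

The sketch keeps a single partition matrix and a Boolean mask of missing entries instead of the four separate blocks of Eq. (17); the two representations carry the same information.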

Figure 3 describes the methodology of the proposed work graphically.

Fig. 3

Algorithm of proposed work.

4 Experimental analysis

This section explains the data sets, performance indices, and methodology used to evaluate the proposed semi-supervised clustering for incomplete data. All the experiments are carried out in a MATLAB 2019a environment. The relevant parameters used in all experiments are: fuzzification factor m = 2, stopping criterion ε = 10^-5, and scaling factor α = 0.5.

4.1 Data sets

The proposed algorithm is tested on four real datasets for classification and pattern recognition problems, namely IRIS, WINE, SEED, and Wisconsin Breast Cancer, from the UCI Machine Learning Repository [1].

The IRIS data set comprises 150 samples of three species of iris flower with four different features. The SEED data set includes three sets of wheat seeds categorized into three different varieties, where each group consists of 70 samples.

WINE data set depicts the outcome of a chemical investigation of WINE grown in a specific area of Italy. It consists of 178 objects, and each object has 13 attributes found in each of the three classes of WINE data.

Wisconsin Breast Cancer data set is an incomplete real-world data set. The objective of this data is to identify benign or malignant classes. This data set contains 699 observations, with 16 missing attribute values.

4.2 Evaluation criterion

The accuracy of results for each algorithm is evaluated using three performance indices: misclassification rate [12], Rand index [13, 23], and error rate.

4.2.1 Misclassification rate

The misclassification rate [12] is the most widely recognized index used to determine the clustering algorithm’s effectiveness. It is defined as the sum of the incorrectly assigned samples over all clusters. Let A_i be the number of incorrectly assigned data points in cluster i, for p clusters. $\text{Misclassification rate} = \sum_{i = 1}^{p} A_{i}$ (24)
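Eq. (24) amounts to counting, per cluster, the samples whose assigned cluster differs from their true class. The Python sketch below assumes that cluster indices are already aligned with class indices, which is plausible in the semi-supervised setting because the labeled samples fix the identity of each cluster; the function name is an assumption of this example.

```python
def misclassification(true_labels, predicted, p):
    """Eq. (24): per-cluster counts A_i of wrong assignments and their sum."""
    per_cluster = [0] * p
    for t, c in zip(true_labels, predicted):
        if t != c:
            per_cluster[c] += 1   # sample wrongly placed in cluster c
    return per_cluster, sum(per_cluster)


true_labels = [0, 0, 1, 1, 2]
predicted = [0, 1, 1, 1, 0]
print(misclassification(true_labels, predicted, p=3))
```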

4.2.2 Rand index

It is an external index used to validate clustering algorithms by measuring the similarity between two clusterings of the same data. A Rand index (RI) value of 0 indicates that there is no similarity between the two clusterings, whereas 1 indicates that both clusterings agree perfectly.

Let U = {u₁, u₂, …, u_p} be the partition given by the proposed algorithm and V = {v₁, v₂, …, v_s} be the partition defined by the standard clustering algorithm. Then RI can be determined [23] as follows:

$RI = \frac{\sum_{i = 1}^{p} \sum_{j = 1}^{s} \binom{n_{ij}}{2} - [\sum_{i} \binom{n_{i \cdot}}{2} \sum_{j} \binom{n_{\cdot j}}{2}] / \binom{n}{2}}{\frac{1}{2} [\sum_{i} \binom{n_{i \cdot}}{2} + \sum_{j} \binom{n_{\cdot j}}{2}] - [\sum_{i} \binom{n_{i \cdot}}{2} \sum_{j} \binom{n_{\cdot j}}{2}] / \binom{n}{2}}$ (25)

Where n_ij represents the number of samples common to cluster u_i of U and cluster v_j of V, $n_{i \cdot}$ is the number of samples in cluster u_i, $n_{\cdot j}$ is the number of samples in cluster v_j, and n is the total number of samples. Also $\binom{n_{ij}}{2} = \frac{n_{ij} (n_{ij} - 1)}{2}$ (25a)
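The index in Eq. (25) can be computed from two label vectors as in the following Python sketch. It builds the counts n_ij from the pairs of labels and the cluster sizes from each label vector separately. The function name and the choice to return 1.0 when the denominator vanishes (both partitions trivial) are assumptions of this example.

```python
from collections import Counter
from math import comb


def rand_index(labels_u, labels_v):
    """Eq. (25): agreement between two partitions of the same n samples."""
    n = len(labels_u)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_u, labels_v)).values())
    sum_i = sum(comb(c, 2) for c in Counter(labels_u).values())
    sum_j = sum(comb(c, 2) for c in Counter(labels_v).values())
    expected = sum_i * sum_j / comb(n, 2)
    maximum = 0.5 * (sum_i + sum_j)
    if maximum == expected:       # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_ij - expected) / (maximum - expected)


print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, renamed clusters
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # groupings that cut across each other
```

Because of the correction term in Eq. (25), the value depends only on how samples are grouped, not on the cluster names, and it can fall below 0 when the two partitions agree less than expected by chance.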

4.2.3 Error rate

The error rate is the ratio of incorrectly assigned unlabeled data points to the total number of unlabeled samples in the data set, expressed as a percentage. Let n be the number of unlabeled data points and N_un be the number of incorrectly assigned unlabeled data points. $ER = \frac{N_{un} \times 100}{n}$ (26)
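Eq. (26) restricts the count of wrong assignments to the unlabeled samples. A minimal Python sketch follows; the function name and the Boolean mask `is_labeled` are assumptions made for this illustration.

```python
def error_rate(true_labels, predicted, is_labeled):
    """Eq. (26): percentage of unlabeled samples assigned to the wrong class."""
    unlabeled = [
        (t, c) for t, c, lab in zip(true_labels, predicted, is_labeled) if not lab
    ]
    if not unlabeled:
        raise ValueError("there are no unlabeled samples")
    wrong = sum(1 for t, c in unlabeled if t != c)
    return 100.0 * wrong / len(unlabeled)


true_labels = [0, 0, 1, 1, 1]
predicted = [0, 1, 1, 0, 1]
is_labeled = [True, False, False, False, False]
print(error_rate(true_labels, predicted, is_labeled))
```

As a check against the reported results, with 30% of the 150 IRIS samples labeled, 105 samples remain unlabeled, and 6 incorrectly assigned points give 6 × 100 / 105 ≈ 5.71, which agrees with the corresponding entry of Table 2.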

The proposed algorithm (SSFCM_ID), which applies the semi-supervised clustering concept to incomplete datasets, is compared with semi-supervised FCM (SSFCM1 and SSFCM2) and semi-supervised kernel-based FCM (SSKFCM).

Standard semi-supervised FCM (SSFCM1 and SSFCM2) clustering algorithms and SSKFCM algorithms utilize both labeled and unlabeled data. Our proposed algorithm utilizes all available data, i.e., incomplete data, complete data, labeled and unlabeled data. Four different portions of labeling and six randomly selected missing data patterns are used for determining the accuracy of the proposed algorithm.

4.3 Methodology to create missing feature values

Initially, a data set with complete information is selected, and then a portion of the data is randomly chosen to be labeled. This divides the data into two parts, i.e., unlabeled data and labeled data. The labeled percentages used are 30, 40, 50, and 60% of the total data. The partially labeled data is then made incomplete by arbitrarily removing certain feature values. The missing rates used in the proposed work are 5, 10, 15, 20, 25, and 30% of the total number of features in a given data set (Table 1). The proposed algorithm utilizes both labeled and unlabeled data, along with incomplete data. The data are assumed to be missing completely at random (MCAR), subject to the following constraints:

Each original element vector retains at least one feature. For example, given three-dimensional data $Y = \begin{matrix} y_{1} \\ y_{2} \\ y_{3} \end{matrix} \begin{bmatrix} 2 & 5 & 7 \\ ? & 4 & ? \\ 3 & 5 & 6 \end{bmatrix}$

Here the element vector y₂ still has one feature value, 4, present in the data.

Each feature has at least one value present in the incomplete data set. For example, given three-dimensional data, $Y = \begin{matrix} y_{1} \\ y_{2} \\ y_{3} \end{matrix} \begin{bmatrix} 2 & 4 & ? \\ 7 & 2 & 2 \\ 9 & 4 & ? \end{bmatrix}$

Here, the third component of samples y₁ and y₃ is missing, and only one value, 2, is present for the third component, in data sample y₂.
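A procedure that removes feature values at random while respecting the two constraints above can be sketched in Python as follows. This is a hypothetical illustration, not the procedure used in the experiments: the function name, the `None` marker for a removed value, the fixed random seed, and the choice to pass the number of values to remove directly (as listed in Table 1) are all assumptions.

```python
import random


def make_incomplete(data, n_missing, seed=0):
    """Remove n_missing feature values at random under both constraints."""
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    out = [list(row) for row in data]          # leave the input untouched
    cells = [(i, c) for i in range(n) for c in range(dim)]
    rng.shuffle(cells)                         # random order of candidate cells
    removed = 0
    for i, c in cells:
        if removed == n_missing:
            break
        row_present = sum(v is not None for v in out[i])
        col_present = sum(out[r][c] is not None for r in range(n))
        # Constraint 1: every sample keeps at least one feature.
        # Constraint 2: every feature keeps at least one observed value.
        if row_present > 1 and col_present > 1:
            out[i][c] = None
            removed += 1
    return out


complete = [[float(i + c) for c in range(3)] for i in range(5)]
incomplete = make_incomplete(complete, n_missing=4, seed=1)
for row in incomplete:
    print(row)
```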

Table 1 describes the number of missing features with different missing rates in datasets.

Table 1
Number of missing features with different missing rates in datasets

Name of the data set	IRIS	SEED	WINE	WISCONSIN BREAST CANCER
Total data feature vectors	150	210	178	699
Dimension of data vector	4	7	13	9
Total number of features	600	1470	2314	6291
Number of missing features at 5% missing rate	30	74	116	16 (same for all missing rates)
Number of missing features at 10% missing rate	60	147	232	
Number of missing features at 15% missing rate	90	221	347	
Number of missing features at 20% missing rate	120	294	463	
Number of missing features at 25% missing rate	150	368	579	
Number of missing features at 30% missing rate	180	441	694	

4.4 Results and discussion

In the proposed work, 20 independent trials are conducted for each set of missing and labeling data, and the same data sample created in every trial is utilized for all the calculations with the goal that the outcomes can be effectively examined and compared. Results are then calculated by taking the average of all independent trials. Experiments are performed on different data sets, and results are stated in Tables 2-5 for misclassification rate (Mis), Rand Index (RI), and error rate (Err). Results for IRIS and SEED data reported in Tables 2-3 show that when the fraction of missing data is between 5% –20%, the proposed approach outperforms SSFCM1, SSFCM2 and SSKFCM for all performance indices. As the

Table 2
Performance of different algorithms on IRIS data

% age of labeled data SSFCM1 SSFCM2 SSKFCM Proposed Algorithm (SSFCM_ID)

Mis RI Err Mis RI Err MIS RI Err % of Missing Data

5% 10% 15% 20% 25% 30%

Mis RI Err Mis RI Err Mis RI Err Mis RI Err Mis RI Err Mis RI Err

30 10.5 0.74 10.0 11 0.80 10.476 9.2 0.77 8.76 6 0.87 5.71 7 0.85 6.67 9 0.82 8.57 10 0.76 9.5 11 0.71 10.48 12 0.69 11.43

40 8.7 0.75 9.66 7 0.87 7.778 8.1 0.77 9.0 5 0.90 5.56 6 0.88 6.67 7 0.85 7.78 8 0.78 8.9 8 0.78 8.9 10 0.74 11.11

50 6.9 0.76 9.20 4 0.92 5.330 6.2 0.79 8.27 4 0.93 5.33 5 0.90 6.67 5 0.90 6.94 6 0.84 8.5 7 0.79 9.3 8 0.75 10.66

60 5.2 0.79 8.62 4 0.92 6.667 3.85 0.83 6.42 3 0.96 5.0 4 0.93 6.67 4 0.93 6.35 5 0.87 8.3 6 0.85 10 6 0.79 10

% age of labeled data	SSFCM1	SSFCM2	SSKFCM	Proposed Algorithm (SSFCM_ID)
30	10.5	0.74	10.0	11	0.80	10.476	9.2	0.77	8.76	6	0.87	5.71	7	0.85	6.67	9	0.82	8.57	10	0.76	9.5	11	0.71	10.48	12	0.69	11.43
40	8.7	0.75	9.66	7	0.87	7.778	8.1	0.77	9.0	5	0.90	5.56	6	0.88	6.67	7	0.85	7.78	8	0.78	8.9	8	0.78	8.9	10	0.74	11.11
50	6.9	0.76	9.20	4	0.92	5.330	6.2	0.79	8.27	4	0.93	5.33	5	0.90	6.67	5	0.90	6.94	6	0.84	8.5	7	0.79	9.3	8	0.75	10.66
60	5.2	0.79	8.62	4	0.92	6.667	3.85	0.83	6.42	3	0.96	5.0	4	0.93	6.67	4	0.93	6.35	5	0.87	8.3	6	0.85	10	6	0.79	10

Table 3

Performance of different algorithms on SEED data

% age of labeled data	SSFCM1			SSFCM2			SSKFCM			Proposed Algorithm (SSFCM_ID)
	Mis	RI	Err	Mis	RI	Err	MIS	RI	Err	% of Missing Data
											5%			10%			15%			20%			25%			30%
										Mis	RI	Err	Mis	RI	Err	Mis	RI	Err	Mis	RI	Err	Mis	RI	Err	Mis	RI	Err
30	15	0.71	10.5	16	0.75	10.884	16.25	0.70	11.05	11	0.79	7.48	12	0.77	8.16	12	0.77	8.16	13	0.75	8.84	15	0.72	10.2	16	0.70	10.9
40	12.65	0.73	10.0	13	0.76	10.3175	13.75	0.70	10.91	8	0.83	6.35	9	0.81	7.14	9	0.81	7.14	11	0.77	8.73	13	0.73	10.3	14	0.71	11.1
50	9.9	0.74	9.43	9	0.78	8.5714	10.9	0.72	10.38	6	0.85	5.71	7	0.82	6.70	7	0.82	6.67	9	0.78	8.51	10	0.76	9.5	11	0.73	10.47
60	7.5	0.77	8.51	7	0.78	8.333	8.5	0.72	10.12	5	0.87	5.49	5	0.84	5.90	5	0.84	5.81	6	0.81	7.06	7	0.78	8.23	9	0.72	10.58

Table 4

Performance of different algorithms on WINE data

% age of labeled data	SSFCM1			SSFCM2			SSKFCM			Proposed Algorithm (SSFCM_ID)
	Mis	RI	Err	Mis	RI	Err	Mis	RI	Err	% of Missing Data
										5%			10%			15%			20%			25%			30%
										Mis	RI	Err	Mis	RI	Err	Mis	RI	Err	Mis	RI	Err	Mis	RI	Err	Mis	RI	Err
30	3.95	0.90	3.16	5	0.87	4.302	8.85	0.80	6.84	4	0.90	3.2	5	0.87	4.0	8	0.80	6.4	9	0.79	7.2	10	0.77	8.2	12	0.72	9.6
40	3.20	0.91	2.99	5	0.87	4.717	6.1	0.83	5.70	3	0.91	2.8	4	0.88	3.74	5	0.85	4.7	7	0.81	6.5	8	0.78	7.6	10	0.73	9.4
50	2.6	0.91	2.92	4	0.89	4.494	4.7	0.84	5.28	2	0.93	2.3	2	0.93	2.25	3	0.89	3.4	5	0.84	5.6	7	0.81	7.5	8	0.73	9.4
60	2.05	0.91	2.84	4	0.89	5.634	3.80	0.84	5.27	2	0.92	2.7	2	0.93	2.70	2	0.92	2.78	4	0.84	5.55	5	0.81	6.9	6	0.77	8.33

Table 5

Performance of different algorithms on Wisconsin Breast Cancer Data set

% age of labeled data	SSFCM1			SSFCM2			SSKFCM			Proposed Algorithm (SSFCM_ID)
	Mis	RI	Err	Mis	RI	Err	Mis	RI	Err	Mis	RI	Err
30	19.3	0.84	4.03	22	0.87	4.60	22.25	0.82	4.84	22	0.88	4.50
40	16.4	0.84	4.0	22	0.87	5.33	19.5	0.82	4.75	19	0.89	4.50
50	13.8	0.85	3.80	16	0.90	4.69	16.10	0.82	4.71	16	0.92	4.30
60	11	0.86	3.50	15	0.91	5.49	13.25	0.83	4.70	12	0.93	4.28

percentage of labeling increases, the performance of all clustering algorithms improves. When the missing percentage is increased to 25%, the results of the proposed approach are comparable to those of the competing algorithms. Even when the missing rate is raised to 30%, the results remain comparable to, if not better than, those of SSFCM1, SSFCM2, and SSKFCM, indicating the robustness of the proposed clustering algorithm in the presence of missing data.

Next, results in terms of misclassification rate, Rand index (RI), and error rate for different percentages of missing and labeled data on the WINE data set are presented in Table 4. The proposed algorithm performs better than SSKFCM and comparably to SSFCM1 and SSFCM2 when the missing rate is below 15%. Once the missing rate exceeds 15%, however, the proposed algorithm no longer outperforms the other techniques.
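As an illustration of how the reported indices can be computed, the following is a minimal sketch that assumes RI denotes the standard (unadjusted) Rand index, i.e., the fraction of sample pairs on which the predicted partition and the true partition agree. The function names and the toy labels are hypothetical, and the error rate is assumed here to be the number of misclassified samples expressed as a percentage of the evaluated samples; the exact definitions used for the tables may differ.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two partitions agree."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        # A pair counts as an agreement when both partitions put the two
        # samples together, or both put them apart.
        agree += same_true == same_pred
        total += 1
    if total == 0:
        raise ValueError("need at least two samples")
    return agree / total


def misclassified(labels_true, labels_pred):
    """Count of samples whose predicted class differs from the true class.

    Assumes cluster ids have already been matched to class ids.
    """
    return sum(t != p for t, p in zip(labels_true, labels_pred))


def error_rate(labels_true, labels_pred):
    """Misclassified samples as a percentage of evaluated samples (assumed)."""
    return 100.0 * misclassified(labels_true, labels_pred) / len(labels_true)


# Hypothetical toy labels: one sample of class 0 is assigned to cluster 1.
truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
ri = rand_index(truth, pred)
mis = misclassified(truth, pred)
err = error_rate(truth, pred)
```

In this toy case, 10 of the 15 sample pairs are treated consistently by both partitions, so a single misassigned sample already lowers the index noticeably.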

Comparisons are also made between SSFCM1, SSFCM2, SSKFCM, and the proposed SSFCM_ID on the WISCONSIN BREAST CANCER data set, which is an incomplete real-world data set, and the results are reported in Table 5. This data set contains 699 observations, with 16 missing attribute values. The SSFCM1, SSFCM2, and SSKFCM algorithms can use only complete data; therefore, all three are run on the 683 complete data points, excluding the 16 observations with missing attribute values. The proposed SSFCM_ID, in contrast, performs on par with them, if not better, even when applied to all 699 data points, including those with missing features. This analysis supports our claim that the proposed semi-supervised clustering algorithm can handle incomplete data efficiently.
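The difference between the two ways of treating incomplete observations can be sketched as follows. This is only an illustration under stated assumptions: missing features are encoded as NaN, interpolation is carried out along the row order within each feature column, and gaps at either end of a column copy the nearest observed value. The helper names and the toy data are hypothetical, and the sketch covers only the initial imputation step, not the update of missing features inside the clustering loop.

```python
import math


def drop_incomplete(rows):
    """Complete-case analysis: keep only rows without any missing feature."""
    return [r for r in rows if not any(math.isnan(v) for v in r)]


def interpolate_column(values):
    """Fill NaNs by linear interpolation between the nearest observed
    neighbours in the column; edge gaps copy the nearest observed value."""
    known = [i for i, v in enumerate(values) if not math.isnan(v)]
    if not known:
        raise ValueError("column has no observed values")
    filled = list(values)
    for i, v in enumerate(values):
        if not math.isnan(v):
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def impute(rows):
    """Impute every feature column independently, then restore row layout."""
    columns = [interpolate_column(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*columns)]


nan = float("nan")
# Hypothetical toy data: 4 observations, 2 features, 2 missing values.
data = [[1.0, 10.0], [2.0, nan], [nan, 30.0], [4.0, 40.0]]
complete_only = drop_incomplete(data)  # what a complete-data method sees
imputed = impute(data)                 # every observation is retained
```

Complete-case deletion discards half of the toy observations here, whereas imputation keeps all of them available for clustering, which mirrors the 683 versus 699 data points discussed above.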

Figure 4 presents the experimental results of the different clustering algorithms in terms of the Rand index for each data set. The proposed algorithm SSFCM_ID with missing percentages of 5%, 10%, 15%, 20%, 25%, and 30% is denoted SSFCM-ID05%, SSFCM-ID10%, SSFCM-ID15%, SSFCM-ID20%, SSFCM-ID25%, and SSFCM-ID30%, respectively. The Rand index of SSFCM1, SSFCM2, and SSKFCM increases as the percentage of labeled data increases. However, the Rand index obtained by the proposed algorithm, even with missing data, is better than that of the other algorithms for the IRIS, SEED, and Wisconsin Breast Cancer data. For the WINE data, the performance of the proposed algorithm is on par with the other algorithms. Hence, the proposed algorithm works on data that is both incomplete and partially labeled, and is competitive with SSFCM1, SSFCM2, and SSKFCM.

Fig. 4

Experimental results for different algorithms in terms of the Rand index for (a) IRIS data, (b) SEED data, (c) WINE data, and (d) Wisconsin Breast Cancer data.

5 Conclusion

This paper presents a new semi-supervised clustering algorithm for incomplete data (SSFCM_ID). The work is motivated by the question of how clustering can reveal the inner structure of unlabeled data when missing features are present in the data set. The algorithm utilizes all the available data, both complete and incomplete, as well as labeled and unlabeled. The importance of a linear interpolation imputation method for computing the missing values is discussed and demonstrated. Despite the additional missing features, the proposed semi-supervised algorithm has been applied successfully to various real-life data sets. Different percentages of missing data and labeled data are considered to evaluate the performance of the proposed work. Experimental outcomes show that, in comparison to the SSFCM1, SSFCM2, and SSKFCM algorithms, the proposed algorithm performs better on the IRIS and SEED data sets and gives comparable results on the WINE and Wisconsin Breast Cancer data sets.
