Towards improving machine learning algorithms accuracy by benefiting from similarities between cases

Abstract

Data preprocessing is a core step in data mining. It involves handling missing values, removing outliers and noise, normalizing data, and so on. Existing methods for handling missing values operate on the whole dataset and ignore its characteristics (e.g., the similarities and differences between cases). This paper focuses on handling missing values with machine learning methods that take the characteristics of the data into account. The proposed preprocessing method clusters the data and then imputes the missing values in each cluster using only the data belonging to that cluster rather than the whole dataset. The author performed a comparative study of the proposed method and ten popular imputation methods, namely mean, median, mode, KNN, IterativeImputer, IterativeSVD, Softimpute, Mice, Forimp, and Missforest. The experiments were done on four datasets with different numbers of clusters, sizes, and shapes. The empirical study showed better effectiveness in terms of imputation time, Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and coefficient of determination (R² score) (i.e., the similarity of the original removed value to the imputed one).

Keywords

Data preprocessing, missing data, imputation, missingness mechanisms

1 Introduction

In real-world datasets, missing values are unavoidable [1–4]. Values may be missing for a variety of reasons, but common causes are equipment faults, human errors, and other environmental conditions [5]. Statistical analyses must therefore cope with missing values in the datasets. Data-dependent tools such as machine learning are used in statistical learning studies to find a predictive model. Better data quality yields a more effective model and, therefore, better prediction and analysis. Data quality depends on several features (e.g., similarities between objects and completeness of the values of the cases). Considering the similarities between objects and dealing with the missing values are important stages of data preprocessing [6].

In clustering, the main interest is organizing a collection of objects into reasonable groups, which allows one to detect similarities and differences and to draw useful and important inferences about them [7]. Clustering problems arise when datasets lack labels, so that there are no predefined classes or examples. Unlike classification, in which the groups are predefined and the procedure assigns each object to one of them, clustering creates the initial groups into which the objects of a dataset are placed. The essential steps of the clustering procedure are described in Table 1.

Table 1
Clustering steps and their descriptions

Step	Description
1- Feature selection	•Clustering the cases depends on a set of features. The main goal is to select the appropriate features of the cases to be clustered.
2- Choice of clustering algorithm	•Choosing the algorithm that is most appropriate for the data at hand. The similarity measure and the clustering criterion are selected in tandem:
	–Optimizing the cost function (clustering criterion) while taking into account the type of clusters expected.
	–Ensuring that the selected attributes contribute equally to the computation of the closeness measure and that no attribute dominates the others.
3- Validation of the results	•Verifying the correctness of the clustering results using appropriate techniques and criteria.
4- User decision	•Accepting the results obtained, or starting again with other parameters or perhaps a different algorithm.

1.1 The contributions of this paper

This paper summarizes the studies related to handling missing data and shows that the performance metrics (e.g., imputation time, accuracy, and error) depend on: i) the nature of the algorithm; for example, some methods, such as Mode, Mean, and Median, depend only on the data in the same variable, whereas other methods, such as interpolation imputation, draw on other variables, and ii) the characteristics of the data; for example, data size (i.e., number of attributes and records), shapes, and number of clusters. This paper discusses the effect of imputing missing data at the cluster level instead of over the whole data. The proposed preprocessing step utilizes the nature of the clustered data to impute the missing data effectively. This paper compares common imputation packages at the whole-data level and at the cluster level in terms of RMSE, MAE, and R² score.

1.2 Missingness data mechanism

Complete data are considered a hypothetical entity because, in practice, some of the values may be missing. Rubin [8] introduced a classification of missingness mechanisms that is widely used today. A missingness mechanism describes how the probability of a value being missing relates to the data. The common missing data mechanisms in the literature are: Missing Not at Random (MNAR), Missing at Random (MAR), and Missing Completely at Random (MCAR) [9–19].

a. Missing not at random

Data are MNAR when the probability of the missing data on a variable X is related to the values of X itself. The probability distribution can be written as: $p (R | Y_{obs}, Y_{mis}, φ)$ (1)

where p is a generic symbol for the probability distribution, R is the indicator for the missing data, Y_mis and Y_obs are the missing and observed parts of the data, respectively, and φ denotes the parameter(s) that describe the relation between the data and R. Equation 1 says that the probability that R takes a given value can depend on both Y_mis and Y_obs.

b. Missing at random

Data are MAR when the probability of the missing data on a variable X is related to some other variable(s) but not to the values of X itself. MAR means that there is a systematic relationship between one or more variables and the probability of the missingness. This implies that R depends on Y_obs but not on Y_mis. The probability distribution can be written as: $p (R | Y_{obs}, φ)$ (2)

Equation 2 says: the probability of the missingness is dependent on the observed part of the data via some parameter(s) φ that relates Y_obs to R.

c. Missing completely at random

Data are MCAR when the probability of the missing data on a variable X is unrelated to the values of X itself and is unrelated to other variables. The missingness of the data is completely unrelated to the data; this makes MCAR a more restrictive condition than MAR. Consequently, both Y_mis and Y_obs are unrelated to R. The probability distribution can be written as: $p (R | φ)$ (3)

Equation 3 says that the probability that R takes a given value is governed by some parameter(s) φ, but the missingness is not related to the data.
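The three mechanisms can be illustrated with a small simulation. The following is a minimal sketch and not the procedure used in this paper: it assumes two numeric variables x1 and x2, deletes values of x2 only, and models the missingness probability with a logistic function. The helper names `zscores` and `make_missing`, the `slope` parameter, and the logistic form are assumptions made for illustration.

```python
import math
import random
import statistics


def zscores(values):
    """Standardize a list of numbers to zero mean and unit variance."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]


def make_missing(rows, mechanism, rate=0.2, slope=2.0, seed=0):
    """Return a copy of rows [(x1, x2), ...] in which some x2 are None.

    MCAR: p(R) is the constant `rate` and ignores the data.
    MAR:  p(R) depends on x1, which is always observed (Y_obs).
    MNAR: p(R) depends on x2 itself, the value that goes missing (Y_mis).
    `rate` is the exact expected fraction only for MCAR (assumption).
    """
    rng = random.Random(seed)
    z1 = zscores([r[0] for r in rows])
    z2 = zscores([r[1] for r in rows])
    base = math.log(rate / (1.0 - rate))  # logit of the base rate
    out = []
    for (x1, x2), a, b in zip(rows, z1, z2):
        if mechanism == "MCAR":
            p = rate
        elif mechanism == "MAR":
            p = 1.0 / (1.0 + math.exp(-(base + slope * a)))
        elif mechanism == "MNAR":
            p = 1.0 / (1.0 + math.exp(-(base + slope * b)))
        else:
            raise ValueError("mechanism must be MCAR, MAR or MNAR")
        out.append((x1, None if rng.random() < p else x2))
    return out


# Synthetic data: x1 and x2 are drawn independently.
gen = random.Random(42)
data = [(gen.gauss(0.0, 1.0), gen.gauss(5.0, 2.0)) for _ in range(2000)]
mcar = make_missing(data, "MCAR")
mar = make_missing(data, "MAR")
mnar = make_missing(data, "MNAR")
```

Under MAR the rows that lose x2 tend to have larger x1, while under MNAR they tend to have larger x2 itself, which cannot be checked from the observed data alone.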

1.3 Dealing with missing data

Data-dependent tools assume that the data are complete (i.e., with no missing values) [11]. Many methods exist for dealing with missing values [12]. These methods are categorised into: i) deletion (i.e., removing the instances with incomplete data), and ii) imputation (i.e., filling in the missing values) [13, 14].

1.3.1 Deletion approach

Pairwise and listwise deletion are the most common approaches for handling missing data. Their major advantage is that they are standard options in statistical software packages and are convenient to implement. Listwise deletion (also referred to as complete-case analysis (CCA)) [24–29] discards every instance that has at least one missing value. The major benefit of listwise deletion is its convenience. Pairwise deletion (also referred to as available-case analysis (ACA)) attempts to alleviate the loss of data by eliminating instances on an analysis-by-analysis basis.
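The difference between the two deletion approaches can be seen on a toy example. The sketch below is illustrative only; the records, the variable names, and the helper functions `listwise` and `pairwise` are hypothetical, and missing values are represented by `None`.

```python
# Five hypothetical records; None marks a missing value.
rows = [
    {"age": 25, "income": 30.0, "score": 7.0},
    {"age": 32, "income": None, "score": 6.5},
    {"age": None, "income": 45.0, "score": 8.0},
    {"age": 41, "income": 52.0, "score": None},
    {"age": 38, "income": 48.0, "score": 7.5},
]


def listwise(records):
    """Complete-case analysis: drop every record with any missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def pairwise(records, a, b):
    """Available-case analysis for one analysis that uses variables a and b:
    keep every record in which both of these variables are observed."""
    return [(r[a], r[b]) for r in records
            if r[a] is not None and r[b] is not None]


complete = listwise(rows)                   # records usable by every analysis
age_income = pairwise(rows, "age", "income")
income_score = pairwise(rows, "income", "score")

print(len(complete), len(age_income), len(income_score))
```

Here listwise deletion keeps two of the five records, whereas each pairwise analysis keeps three, at the cost of every analysis being based on a different subset of the data.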

1.3.2 Imputation approach

Imputation means replacing the missing values with appropriate values with the help of the available information. The amount of missingness in the data and the missingness mechanism affect the quality of the imputed data; therefore, the imputation approach should be chosen carefully. Improper imputation of the missing data could produce incorrect data analysis, erroneous predictions, and false conclusions [17]. Imputation can be done in two ways: i) the variable that contains the missing values is imputed from its own observed values only (e.g., Mode, Mean, Median), and ii) the variable that contains the missing values is imputed with the help of other variables (e.g., interpolation and regression). An imputation technique is either single imputation or multiple imputation. In single imputation, each missing value is filled with one plausible value [18]. In multiple imputation, proposed by Rubin [19], each missing value is filled n times, n > 1, so that n versions of the complete dataset are generated [20, 21]. Missing values can also be estimated using regression methods [22], hot-deck imputation [23, 24], K-nearest neighbours (KNN) [25, 26], etc. Some of the common imputation methods are: i) Arithmetic mean imputation (also referred to as unconditional mean imputation and mean substitution) imputes the missing values with the arithmetic mean of the instances in which the values are actually observed. ii) Interpolation imputation replaces the missing values by creating new data points within the range of the known data points. iii) Regression imputation (also referred to as conditional mean imputation) uses a regression equation to predict the missing values. If the relationship between the response (i.e., dependent variable y) and a single predictor (i.e., independent variable X) is linear, the regression is said to be simple linear regression and the mathematical equation can be written as below: $y = β_{0} + β_{1} X + ɛ$ (4)

where β₀ is the value of y when X is equal to zero (the intercept), β₁ is the regression coefficient (the slope), and ɛ is the error term. Equation 4 expresses the regression of the response y on the predictor X. The unknown value of y for a given value of the predictor X can be estimated by: $\hat{y} = \hat{β}_{0} + \hat{β}_{1} X$ (5)

The hat symbol $\hat{}$ denotes the estimated value of an unknown coefficient or parameter. When more than one independent variable is used to achieve better prediction, the regression is said to be multiple linear regression and the mathematical equation can be written as below: $y = β_{0} + β_{1} x_{1} + β_{2} x_{2} + \dots + β_{p} x_{p} + ɛ$ (6)

iv) Stochastic regression imputation uses regression equations to predict the missing values from the complete variables. However, after the missing values are predicted, a random error term is added to each predicted score, and the resulting sum replaces the missing value [27]. v) Multiple imputation (MI), proposed by Rubin [28] and further detailed in [29, 30], substitutes each missing value with a list of n > 1 simulated values. vi) K-Nearest Neighbor (KNN) [31, 32] finds the k nearest neighbors and uses a value derived from them for the imputation: for numerical data, the average value is taken; for nominal data, the most common value is taken. vii) Weighted K-Nearest Neighbor (WKNN) [33] selects the cases with values similar (i.e., in terms of distance) to the considered value; the distances between the value to be estimated and its neighbors are taken into account, and the most frequently repeated value, weighted according to distance, is used. viii) K-means Clustering (KM) [34]. ix) Hot deck [19] replaces the missing attribute value with a value from the estimated distribution of the current values. x) Cold deck [35] is similar to hot deck, except that the source of the data is different from the present dataset. In summary, the most commonly used approaches fall into three main categories (see Fig. 1).
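Regression imputation (Equations 4 and 5) and its stochastic variant can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of any package compared in this paper: it uses one fully observed predictor, ordinary least squares, and, for the stochastic variant, a Gaussian error term whose standard deviation is estimated from the residuals. The function names are hypothetical.

```python
import random
import statistics


def fit_simple_regression(xs, ys):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x + e."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def regression_impute(xs, ys, stochastic=False, seed=0):
    """Fill None entries of ys using the regression of y on x.

    With stochastic=True a random error term is added to each prediction
    (Gaussian noise is an assumption of this sketch).
    Needs at least three complete (x, y) pairs.
    """
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    b0, b1 = fit_simple_regression([x for x, _ in observed],
                                   [y for _, y in observed])
    # Residual standard deviation with n - 2 degrees of freedom.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in observed)
    sigma = (sse / (len(observed) - 2)) ** 0.5
    filled = []
    for x, y in zip(xs, ys):
        if y is None:
            y = b0 + b1 * x                  # conditional mean (Equation 5)
            if stochastic:
                y += rng.gauss(0.0, sigma)   # extra random error term
        filled.append(y)
    return filled


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, None, 8.2, 9.8, None]
deterministic = regression_impute(xs, ys)
stochastic = regression_impute(xs, ys, stochastic=True)
```

The deterministic variant places every imputed value exactly on the fitted line, which understates the variability of the variable; the stochastic variant restores some of that variability.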

Fig. 1

Main categories for handling missing data methods.

1.4 Objective

This paper reviews the most common imputation algorithms, uses Python and R packages in the comparisons, and evaluates the imputation accuracy in terms of R² score, RMSE, and MAE (see Fig. 2).

Fig. 2

Imputation algorithms and validity measures used in this study.

1.5 Paper organisation

The rest of the paper is organized as follows: Section 2 reviews the literature; Section 3 discusses the proposed preprocessing step and the experimental implementation; Section 4 gives the summary and conclusion. Figure 3 shows the structure of this paper.

Fig. 3

Organization of the paper.

2 Literature review

This section gives an overview of the studies related to handling missing values.

Cismondi et al. [36] presented a method for dealing with missing values in intensive care unit (ICU) databases to improve modelling performance. Using a statistical classifier followed by fuzzy modelling, the authors determined which missing values should be imputed. They estimated the performance of the method by developing a test bed. Although their approach improved classification accuracy, specificity, and sensitivity, it may fail to fill in all missing values.

Hapfelmeier et al. [37] treated imputation as an optimisation problem. Their proposed framework uses decision trees, KNN, and support vector machines (SVM). Two composite methods (i.e., opt.cv and benchmark.cv) are incorporated in the framework: opt.cv selects the best method from opt.svm, opt.tree, and opt.knn, and benchmark.cv selects the best method from iterative KNN, KNN, Bayesian principal component analysis, mean, and predictive-mean matching. Although their proposed method performed better than other methods, picking out the method that gives the lowest error took a long time, and the datasets used in the experiments were small.

Batista and Monard [38] compared several common methods (i.e., KNN, C4.5, and CN2). The authors performed the experiments at different ratios of missing values. Although the experiments indicate that KNN performed better than C4.5 and CN2, C4.5 is competitive with 10-nearest neighbour in some cases. To establish the significance of the analysis, the value of the parameter K should be increased.

Aydilek and Arslan [39] combined two methods (support vector regression (SVR) and genetic algorithm (GA)) with fuzzy clustering to fill in missing data. The authors compared their proposed method with the SvrGa, FcmGa, and Zeroimpute methods. Although the imputation accuracy was better, the size of the complete dataset affects the efficiency of the SVR training phase: if many variables contain many missing values, many instances will be discarded.

Qin et al. [40] proposed a stochastic semi-parametric regression for imputing missing values in semi-parametric data. The authors compared their method experimentally with deterministic semi-parametric regression imputation in terms of RMSE using simulated and real data. Although the effectiveness and efficiency of their method were better than those of deterministic semi-parametric imputation, both RMSE and mean squared error (MSE) are susceptible to outliers since they give extra weight to large errors [41].

Li et al. [34] took advantage of fuzzy K-means for imputing missing values. Each object belongs to a cluster with a fuzzy membership degree. The authors evaluated the performance of the algorithm in terms of RMSE. Depending on the fuzzifier value, K-means may perform better than fuzzy K-means and vice versa, so determining an adequate value of the fuzzifier parameter is an issue.

Acuña and Rodriguez [42] compared four popular approaches (mean imputation, KNN, median imputation, and complete case analysis) in supervised classification problems. The comparison was done using 12 datasets.

Muñoz and Rueda [43] proposed two imputation approaches based on quantiles. The first algorithm does not use auxiliary information, whereas the second does. Determining the relationship between the variable of interest and the auxiliary variable is an issue.

Batista and Monard [31] applied the Euclidean distance to find the k most similar cases. Their method then fills in the missing categorical values in the variable of interest using the most frequent value among the k nearest cases. The missing numerical values are imputed with the unconditional mean. Although the KNN imputation algorithm is simple and performs better than the mean/mode approaches, it is expensive on large datasets because the nearest neighbours must be searched for once for every instance that contains missing value(s).

Honghai et al. [44] used a support vector machine (SVM) to impute missing values. Their work did not include any comparison with other imputation algorithms. Furthermore, the training data with no missing values were too few, which affects the accuracy of the regression.

Pelckmans et al. [45] applied a maximum-likelihood approach to obtain the estimates for the models that their method assumes for the covariates of the missing data. The classification rules can be learnt from the data even when the input variables contain missing values; however, their method aims at high classification accuracy rather than high imputation accuracy.

Samih [11] proposed a linear-regression-based algorithm called cumulative linear regression (CLR). The algorithm incorporates the already imputed variables into a linear regression equation to fill in the missing values in the subsequent incomplete variable. The author compared the algorithm with eight algorithms, namely Mice, ForImp, missForest, Simputation, VIM, IterativeImputer, KNN, and SoftImpute. In some cases CLR outperforms the other algorithms, and in others it does not.

Samih and Hirofumi [46] presented a novel technique to improve prediction accuracy by clustering the data using the K-means clustering method and applying the selected prediction algorithm to every cluster. Their approach benefits from the similarities between the cases to improve the prediction accuracy.

Samih [47] proposed a new method for imputing missing values with the aid of other features (donors). The method depends on the correlation between the donor and the variable of interest. In addition to the correlation, the number of cases that will be used in the imputation is another factor in the selection. The method is not efficient on datasets with categorical attributes, where it behaves somewhat like the unconditional mean. The missing values that have not been imputed by the donors are imputed using another imputation method (i.e., unconditional mean).

3 Data preprocessing phase

This section describes the proposed preprocessing phase in more depth. The proposed phase exploits the similarity between cases to improve the accuracy of the imputation. Instead of applying the imputation method to the whole data, it applies the imputation method to each cluster. The proposed method assumes that the dataset is already clustered and that the number of clusters is known in advance; the selected imputation algorithm is then applied to every cluster.
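The idea of imputing per cluster rather than over the whole data can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes that a cluster label is available for every case (as stated above), uses mean imputation only as a stand-in for whichever imputation algorithm is selected, and the names `mean_impute` and `clusterwise_impute` as well as the toy values are hypothetical.

```python
import statistics


def mean_impute(values):
    """Fill None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mu = statistics.fmean(observed)  # raises if nothing is observed
    return [mu if v is None else v for v in values]


def clusterwise_impute(values, labels, impute=mean_impute):
    """Apply the chosen imputation function separately within each cluster."""
    out = list(values)
    for cluster in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == cluster]
        filled = impute([values[i] for i in idx])
        for i, v in zip(idx, filled):
            out[i] = v
    return out


# Toy data: two well-separated clusters with one missing value in each.
truth = [1.0, 1.2, 0.8, 1.1, 9.0, 9.3, 8.7, 9.1]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
values = [1.0, 1.2, None, 1.1, 9.0, None, 8.7, 9.1]

whole = mean_impute(values)                     # imputation on the whole data
clustered = clusterwise_impute(values, labels)  # imputation per cluster

for i in (2, 5):
    print(truth[i], round(whole[i], 3), round(clustered[i], 3))
```

On this toy data the whole-data mean falls between the two clusters and is far from both removed values, whereas the per-cluster means stay close to them. A cluster in which a variable has no observed value at all would need a fallback, which this sketch does not provide.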

To clarify this further, an illustrative example is discussed in the next subsection.

3.1 Illustrative Example

An artificial dataset with 130 observations in two clusters has been created. The observations with missing data are labeled, which means that these observations are already assigned to clusters. The positions of the missing values in the figure do not mean that the values are known; they are positioned only to show that these observations belong to the given cluster. The missing data in each cluster are filled in by the chosen algorithm using the data in that cluster rather than the whole data. Figure 4 shows the data with missing values and Fig. 5 shows the data after imputation.

Fig. 4

Observations with missing values.

Fig. 5

Observations with filled values.

3.2 Experimental implementation

3.2.1 Benchmark datasets

Four datasets that are commonly used in data repositories and in the literature are used in this comparative study (Table 2). To assess the performance and generalisation of the methods, the datasets vary in the types and numbers of missing values. The author generated the missing values in each dataset under the three mechanisms (MAR, MNAR, and MCAR), each with 10%, 20%, and 30% missingness ratios (MRs).

Table 2
Datasets specifications

Dataset name	#Instances	#Features	#Clusters	References
aggregation	788	2	7	[48, 49]
flame	240	2	2	[50, 51]
Jain	373	2	2	[52, 53]
pathbased	300	2	3	[54, 55]

3.2.2 Imputation algorithms

The proposed preprocessing technique benefits from the similarity between records in the same cluster, which in turn improves the accuracy of the imputation algorithm. Both Python and R provide imputation algorithms. The comparison is done among ten common imputation algorithms, namely mean, median, mode, KNN, IterativeImputer, IterativeSVD, Softimpute, Mice, Forimp, and Missforest. Table 3 gives short descriptions of the packages and functions used.

Table 3
Packages and functions used for experiments

Function name	Package	Programming language	Description	References
ampute	Mice	R	generates missing data under the MCAR, MAR, and MNAR mechanisms.	[56]
Mice	Mice	R	implements the Multivariate Imputation by Chained Equations algorithm.	[56, 57]
ForImp	ForImp	R	implements the Forward Imputation algorithm.	[58]
MissForest	MissForest	R	uses a random forest trained on the observed values to predict the missing values.	[59]
IterativeImputer	Fancyimpute	Python	implements the Multivariate Imputation by Chained Equations algorithm.	[60]
KNN	Fancyimpute	Python	imputes missing values from the nearest neighbours, weighting samples by the mean squared difference on the features for which both rows have observed data.	[60]
SoftImpute	Fancyimpute	Python	implements Spectral Regularization Algorithms for Learning Large Incomplete Matrices.	[60, 61]
IterativeSVD	Fancyimpute	Python	a strategy for imputing missing values by iterative low-rank SVD decomposition. Should be similar to SVDimpute [33].	[60]

3.2.3 Performance evaluation

The imputation performance is evaluated using R², MAE, RMSE and the imputation time in seconds (t).

•RMSE: given by (7), in which y_i and $\hat{y}_{i}$ are the real value and the predicted value of the i-th instance, respectively, and n is the number of samples [39, 62]. $RMSE = \sqrt{\frac{\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}{n}}$ (7)

•MAE: given by (8) $MAE (y, \hat{y}) = \frac{1}{n} \sum_{i = 1}^{n} | y_{i} - \hat{y}_{i} |$ (8)

•R²: given by (9), in which $\bar{y} = \frac{1}{n} \sum_{i = 1}^{n} y_{i}$ is the mean of the real values. $R^{2} (y, \hat{y}) = 1 - \frac{\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}$ (9)

•Time of imputation in seconds (t).
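The three accuracy measures in Equations 7 to 9 can be computed directly from the removed (real) values and the imputed values. The sketch below is a plain restatement of those formulas; the function names and the sample values are illustrative and are not taken from the experiments.

```python
import math


def rmse(y, y_hat):
    """Root mean square error (Equation 7)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mae(y, y_hat):
    """Mean absolute error (Equation 8)."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def r2_score(y, y_hat):
    """Coefficient of determination (Equation 9)."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical removed values and their imputed counterparts.
y_true = [3.0, 5.0, 7.0, 9.0]
y_imputed = [2.0, 5.0, 8.0, 9.0]

print(rmse(y_true, y_imputed), mae(y_true, y_imputed),
      r2_score(y_true, y_imputed))
```

Note that R² is not bounded below by zero: it becomes negative whenever the imputed values fit the removed values worse than their mean $\bar{y}$ would, which is why negative R² scores can appear in comparisons of imputation methods.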

The imputation algorithm is considered efficient if it imputes in little time with high accuracy and small error. The experiments were executed on a computer with the following specifications: Intel Core i5-2400 (3.10 GHz) processor, 16 GB memory, 1 TB HDD, GNU/Linux Fedora 28 OS, R (version 3.5.2), and Python (version 3.7).

3.2.4 Experimental results and discussion

Tables 4, 5, 6, and 7 show the imputation time, R² score, RMSE, and MAE of the ten imputation algorithms with and without considering the clustering for the flame, Jain, pathbased, and aggregation datasets, respectively. Figures 6 and 7 show how much improvement is achieved by the proposed algorithm on the aggregation dataset in terms of R² score and of RMSE and MAE, respectively. Figures 8 and 9 belong to the flame dataset, Figs. 10 and 11 belong to the Jain dataset, and Figs. 12 and 13 belong to the pathbased dataset.

Table 4
Imputation time, R² score, RMSE, and MAE comparison between the ten imputation algorithms with and without considering the clustering (flame dataset)

Flame

mean median

Whole data Clustered data Whole data Clustered data

Mech. MR% Time R²Score RMSE MAE Time R²Score RMSE MAE Time R²Score RMSE MAE Time R²Score RMSE MAE

MAR 10 0.002 0.814 3.074 3.074 0.366 0.811 3.097 2.399 0.007 0.812 3.087 3.087 0.383 0.808 3.124 2.453

20 0.002 0.864 2.625 2.056 0.350 0.914 2.084 1.693 0.003 0.865 2.617 2.050 0.375 0.912 2.111 1.721

30 0.005 0.747 3.542 2.981 0.355 0.821 2.982 2.355 0.003 0.747 3.542 2.984 0.371 0.822 2.972 2.344

Average 0.003 0.808 3.080 2.703 0.357 0.849 2.721 2.149 0.004 0.808 3.082 2.707 0.376 0.847 2.736 2.173

Improvement –99.07% 4.99% 13.22% 25.79% –98.94% 4.84% 12.66% 24.59%

MCAR 10 0.002 0.664 3.655 3.042 0.353 0.731 3.271 2.830 0.006 0.670 3.627 3.022 0.376 0.729 3.283 2.854

20 0.003 0.795 3.082 2.399 0.358 0.836 2.755 2.268 0.003 0.794 3.094 2.417 0.374 0.834 2.777 2.299

30 0.002 0.763 3.459 2.894 0.356 0.840 2.846 2.352 0.003 0.764 3.452 2.887 0.375 0.840 2.839 2.344

Average 0.002 0.741 3.399 2.778 0.356 0.802 2.957 2.483 0.004 0.742 3.391 2.775 0.375 0.801 2.966 2.499

Improvement –99.33% 8.30% 14.92% 11.88% –99.01% 7.91% 14.31% 11.06%

MNAR 10 0.003 0.674 4.316 3.741 0.355 0.766 3.657 3.019 0.003 0.671 4.338 3.765 0.373 0.774 3.598 2.964

20 0.002 0.700 4.216 3.692 0.362 0.866 2.815 2.385 0.003 0.695 4.253 3.729 0.380 0.868 2.797 2.374

30 0.002 0.713 3.836 3.131 0.359 0.819 3.043 2.445 0.003 0.699 3.929 3.220 0.378 0.812 3.106 2.490

Average 0.003 0.696 4.123 3.521 0.358 0.817 3.172 2.616 0.003 0.688 4.174 3.571 0.377 0.818 3.167 2.609

Improvement –99.32% 17.46% 29.99% 34.59% –99.27% 18.84% 31.79% 36.88%

Flame

mode KNN

Whole data Clustered data Whole data Clustered data

Mech. MR% Time R²Score RMSE MAE Time R²Score RMSE MAE Time R²Score RMSE MAE Time R²Score RMSE MAE

MAR 10 0.011 0.714 3.808 3.808 0.189 0.533 4.867 3.532 0.019 0.597 4.519 4.519 0.021 0.721 3.762 2.348

20 0.011 0.806 3.139 2.377 0.193 0.808 3.124 2.321 0.020 0.721 3.763 2.973 0.021 0.887 2.396 1.924

30 0.012 0.710 3.794 3.017 0.192 0.805 3.109 2.523 0.020 0.452 5.217 4.080 0.024 0.561 4.671 3.178

Average 0.011 0.743 3.581 3.067 0.191 0.715 3.700 2.792 0.019 0.590 4.500 3.857 0.022 0.723 3.610 2.483

Improvement –94.07% –3.78% –3.24% 9.86% –11.54% 22.50% 24.65% 55.34%

MCAR 10 0.012 0.668 3.636 3.146 0.188 0.723 3.319 2.842 0.018 0.611 3.933 3.177 0.020 0.784 2.9341 2.2026

20 0.016 0.207 6.065 5.029 0.193 0.560 4.517 3.488 0.019 0.634 4.120 3.150 0.021 0.756 3.3636 2.4066

30 0.010 0.719 3.764 3.019 0.191 0.836 2.877 2.369 0.021 0.644 4.236 3.478 0.023 0.795 3.2181 2.4534

Average 0.012 0.531 4.488 3.731 0.191 0.706 3.571 2.900 0.019 0.630 4.096 3.268 0.021 0.778 3.1719 2.3542

Improvement –93.49% 32.96% 25.68% 28.67% –9.86% 23.53% 29.14% 38.82%

MNAR 10 0.011 0.522 5.225 4.517 0.195 0.244 6.573 5.148 0.018 0.547 5.089 4.316 0.020 0.775 3.588 2.741

20 0.011 0.512 5.381 4.555 0.186 0.676 4.382 3.026 0.025 0.465 5.635 4.590 0.021 0.752 3.839 2.862

30 0.010 0.053 6.969 6.219 0.192 0.415 5.478 4.571 0.020 0.562 4.738 3.660 0.023 0.812 3.108 2.319

Average 0.011 0.363 5.858 5.097 0.191 0.445 5.478 4.248 0.021 0.525 5.154 4.189 0.021 0.779 3.512 2.641

Improvement –94.38% 22.79% 6.94% 19.97% –1.67% 48.54% 46.76% 58.62%

MAR 10 0.03 0.81 3.09 3.09 0.06 0.80 3.19 2.47 0.05 –0.72 9.35 9.35 0.09 0.02 7.06 5.72

20 0.03 0.86 2.62 2.06 0.06 0.91 2.11 1.70 0.04 –0.42 8.48 6.19 0.09 0.18 6.47 4.86

30 0.03 0.74 3.56 3.01 0.06 0.81 3.09 2.45 0.05 0.02 6.97 5.16 0.11 0.35 5.66 4.46

Average 0.03 0.81 3.09 2.72 0.06 0.84 2.79 2.20 0.05 –0.37 8.27 6.90 0.10 0.18 6.40 5.01

Improvement –53.14% 4.15% 10.71% 23.44% –54.92% 148.68% 29.22% 37.57%

MCAR 10 0.03 0.66 3.66 3.04 0.06 0.74 3.24 2.86 0.05 0.15 5.81 4.71 0.08 0.15 5.81 4.46

20 0.03 0.80 3.08 2.41 0.06 0.83 2.80 2.30 0.04 –0.55 8.48 6.17 0.09 –0.22 7.52 5.45

30 0.03 0.76 3.47 2.90 0.06 0.84 2.84 2.36 0.04 –0.12 7.52 5.32 0.09 0.01 7.07 4.97

Average 0.03 0.74 3.40 2.78 0.06 0.80 2.96 2.51 0.04 –0.17 7.27 5.40 0.09 –0.02 6.80 4.96

Improvement –50% 8% 15% 11% –51% 89% 7% 9%

MNAR 10 0.03 0.67 4.35 3.80 0.06 0.76 3.72 3.11 0.04 0.34 6.13 4.72 0.09 0.54 5.12 4.20

20 0.03 0.70 4.22 3.69 0.06 0.87 2.79 2.32 0.04 –0.15 8.28 6.45 0.09 0.19 6.91 5.19

30 0.03 0.71 3.88 3.18 0.06 0.82 3.08 2.51 0.04 –0.23 7.95 5.95 0.10 –0.10 7.50 5.52

Average 0.03 0.69 4.15 3.56 0.06 0.81 3.20 2.65 0.04 –0.01 7.45 5.71 0.09 0.21 6.51 4.97

Improvement –51.67% 17.70% 29.89% 34.34% –57.01% 1538.96% 14.47% 14.82%

Flame

Softimpute Mice

Whole data Clustered data Whole data Clustered data

Mech. MR% Time R²Score RMSE MAE Time R²Score RMSE MAE Time R²Score RMSE MAE Time R²Score RMSE MAE

MAR 10 0.07 –0.53 8.79 8.79 0.04 –0.52 8.78 7.33 0.13 0.82 3.04 3.04 0.21 0.83 2.91 2.91

20 0.02 –0.57 8.92 6.97 0.06 –0.52 8.79 7.10 0.14 0.73 3.69 3.69 0.21 0.79 3.29 3.29

30 0.04 –0.56 8.81 7.04 0.04 –0.37 8.24 6.83 0.14 0.59 4.49 4.49 0.23 0.58 4.59 4.59

Average 0.040 –0.552 8.840 7.600 0.050 –0.471 8.604 7.086 0.136 0.714 3.743 3.743 0.219 0.732 3.598 3.598

Improvement –19.51% 14.73% 2.75% 7.24% –38.18% 2.48% 4.04% 4.04%

MCAR 10 0.03 –1.38 9.72 8.22 0.04 –1.15 9.25 7.84 0.14 0.66 3.65 3.65 0.21 0.73 3.27 3.27

20 0.02 –1.48 10.72 8.49 0.05 –1.33 10.41 8.35 0.14 0.49 4.86 4.86 0.21 0.71 3.66 3.66

30 0.04 –1.10 10.29 7.75 0.07 –0.83 9.60 7.29 0.15 0.69 3.94 3.94 0.21 0.70 3.91 3.91

Average 0.030 –1.318 10.245 8.154 0.052 –1.104 9.751 7.828 0.142 0.615 4.153 4.153 0.211 0.713 3.613 3.613

Improvement –42.03% 16.27% 5.07% 4.16% –32.90% 15.88% 14.96% 14.96%

MNAR 10 0.04 –0.80 10.15 7.98 0.05 –0.35 8.77 7.07 0.14 0.81 3.29 3.29 0.23 0.52 5.24 5.24

20 0.03 –2.14 13.66 11.46 0.05 –1.79 12.86 10.87 0.14 0.84 3.05 3.05 0.21 0.80 3.43 3.43

30 0.02 –2.24 12.90 10.67 0.06 –2.11 12.63 10.45 0.14 0.78 3.33 3.33 0.22 0.78 3.33 3.33

Average 0.031 –1.731 12.238 10.035 0.055 –1.414 11.419 9.462 0.139 0.812 3.224 3.224 0.219 0.702 4.000 4.000

Improvement –43.06% 18.31% 7.17% 6.05% –36.59% –13.64% –19.41% –19.41%

MAR 10 0.80 0.58 4.64 4.64 1.56 0.66 4.15 4.15 0.03 0.80 3.16 3.16 0.03 0.82 3.06 3.06

20 0.99 0.71 3.83 3.83 1.46 0.81 3.09 3.09 0.02 0.93 1.87 1.87 0.03 0.89 2.41 2.41

30 1.03 0.37 5.58 5.58 2.24 0.48 5.08 5.08 0.02 0.81 3.09 3.09 0.02 0.78 3.28 3.28

Average 0.941 0.554 4.680 4.680 1.756 0.650 4.110 4.110 0.022 0.847 2.709 2.709 0.027 0.828 2.915 2.915

Improvement –46.40% 17.47% 13.89% 13.89% –18.63% –2.23% –7.08% –7.08%

MCAR 10 0.85 0.45 4.70 4.70 1.99 0.55 4.23 4.23 0.02 0.74 3.21 3.21 0.02 0.65 3.75 3.75

20 1.05 0.51 4.79 4.79 2.36 0.66 3.98 3.98 0.03 0.85 2.61 2.61 0.03 0.81 2.94 2.94

30 1.11 0.61 4.44 4.44 2.02 0.69 3.94 3.94 0.02 0.86 2.69 2.69 0.03 0.80 3.22 3.22

Average 1.002 0.520 4.640 4.640 2.124 0.634 4.049 4.049 0.022 0.817 2.837 2.837 0.028 0.752 3.303 3.303

Improvement –52.82% 21.80% 14.61% 14.61% –22.40% –7.99% –14.09% –14.09%

MNAR 10 0.85 0.50 5.36 5.36 1.50 0.71 4.06 4.06 0.02 0.80 3.41 3.41 0.03 0.86 2.83 2.83

20 1.00 0.56 5.11 5.11 2.12 0.67 4.44 4.44 0.02 0.88 2.71 2.71 0.02 0.90 2.49 2.49

30 1.12 0.58 4.67 4.67 1.24 0.73 3.73 3.73 0.03 0.83 2.93 2.93 0.03 0.82 3.05 3.05

Average 0.988 0.544 5.046 5.046 1.624 0.703 4.077 4.077 0.023 0.835 3.015 3.015 0.029 0.858 2.792 2.792

Improvement –39.14% 29.17% 23.79% 23.79% –21.70% 2.70% 8.00% 8.00%

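	Flame
	mean	median
	Whole data	Clustered data	Whole data	Clustered data
Mech.	MR%	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE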
MAR	10	0.002	0.814	3.074	3.074	0.366	0.811	3.097	2.399	0.007	0.812	3.087	3.087	0.383	0.808	3.124	2.453
	20	0.002	0.864	2.625	2.056	0.350	0.914	2.084	1.693	0.003	0.865	2.617	2.050	0.375	0.912	2.111	1.721
	30	0.005	0.747	3.542	2.981	0.355	0.821	2.982	2.355	0.003	0.747	3.542	2.984	0.371	0.822	2.972	2.344
	Average	0.003	0.808	3.080	2.703	0.357	0.849	2.721	2.149	0.004	0.808	3.082	2.707	0.376	0.847	2.736	2.173
	Improvement					–99.07%	4.99%	13.22%	25.79%					–98.94%	4.84%	12.66%	24.59%
MCAR	10	0.002	0.664	3.655	3.042	0.353	0.731	3.271	2.830	0.006	0.670	3.627	3.022	0.376	0.729	3.283	2.854
	20	0.003	0.795	3.082	2.399	0.358	0.836	2.755	2.268	0.003	0.794	3.094	2.417	0.374	0.834	2.777	2.299
	30	0.002	0.763	3.459	2.894	0.356	0.840	2.846	2.352	0.003	0.764	3.452	2.887	0.375	0.840	2.839	2.344
	Average	0.002	0.741	3.399	2.778	0.356	0.802	2.957	2.483	0.004	0.742	3.391	2.775	0.375	0.801	2.966	2.499
	Improvement					–99.33%	8.30%	14.92%	11.88%					–99.01%	7.91%	14.31%	11.06%
MNAR	10	0.003	0.674	4.316	3.741	0.355	0.766	3.657	3.019	0.003	0.671	4.338	3.765	0.373	0.774	3.598	2.964
	20	0.002	0.700	4.216	3.692	0.362	0.866	2.815	2.385	0.003	0.695	4.253	3.729	0.380	0.868	2.797	2.374
	30	0.002	0.713	3.836	3.131	0.359	0.819	3.043	2.445	0.003	0.699	3.929	3.220	0.378	0.812	3.106	2.490
	Average	0.003	0.696	4.123	3.521	0.358	0.817	3.172	2.616	0.003	0.688	4.174	3.571	0.377	0.818	3.167	2.609
	Improvement					–99.32%	17.46%	29.99%	34.59%					–99.27%	18.84%	31.79%	36.88%
	Flame
	mode	KNN
	Whole data	Clustered data	Whole data	Clustered data
Mech.	MR%	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE
MAR	10	0.011	0.714	3.808	3.808	0.189	0.533	4.867	3.532	0.019	0.597	4.519	4.519	0.021	0.721	3.762	2.348
	20	0.011	0.806	3.139	2.377	0.193	0.808	3.124	2.321	0.020	0.721	3.763	2.973	0.021	0.887	2.396	1.924
	30	0.012	0.710	3.794	3.017	0.192	0.805	3.109	2.523	0.020	0.452	5.217	4.080	0.024	0.561	4.671	3.178
	Average	0.011	0.743	3.581	3.067	0.191	0.715	3.700	2.792	0.019	0.590	4.500	3.857	0.022	0.723	3.610	2.483
	Improvement					–94.07%	–3.78%	–3.24%	9.86%					–11.54%	22.50%	24.65%	55.34%
MCAR	10	0.012	0.668	3.636	3.146	0.188	0.723	3.319	2.842	0.018	0.611	3.933	3.177	0.020	0.784	2.9341	2.2026
	20	0.016	0.207	6.065	5.029	0.193	0.560	4.517	3.488	0.019	0.634	4.120	3.150	0.021	0.756	3.3636	2.4066
	30	0.010	0.719	3.764	3.019	0.191	0.836	2.877	2.369	0.021	0.644	4.236	3.478	0.023	0.795	3.2181	2.4534
	Average	0.012	0.531	4.488	3.731	0.191	0.706	3.571	2.900	0.019	0.630	4.096	3.268	0.021	0.778	3.1719	2.3542
	Improvement					–93.49%	32.96%	25.68%	28.67%					–9.86%	23.53%	29.14%	38.82%
MNAR	10	0.011	0.522	5.225	4.517	0.195	0.244	6.573	5.148	0.018	0.547	5.089	4.316	0.020	0.775	3.588	2.741
	20	0.011	0.512	5.381	4.555	0.186	0.676	4.382	3.026	0.025	0.465	5.635	4.590	0.021	0.752	3.839	2.862
	30	0.010	0.053	6.969	6.219	0.192	0.415	5.478	4.571	0.020	0.562	4.738	3.660	0.023	0.812	3.108	2.319
	Average	0.011	0.363	5.858	5.097	0.191	0.445	5.478	4.248	0.021	0.525	5.154	4.189	0.021	0.779	3.512	2.641
	Improvement					–94.38%	22.79%	6.94%	19.97%					–1.67%	48.54%	46.76%	58.62%
MAR	10	0.03	0.81	3.09	3.09	0.06	0.80	3.19	2.47	0.05	–0.72	9.35	9.35	0.09	0.02	7.06	5.72
	20	0.03	0.86	2.62	2.06	0.06	0.91	2.11	1.70	0.04	–0.42	8.48	6.19	0.09	0.18	6.47	4.86
	30	0.03	0.74	3.56	3.01	0.06	0.81	3.09	2.45	0.05	0.02	6.97	5.16	0.11	0.35	5.66	4.46
	Average	0.03	0.81	3.09	2.72	0.06	0.84	2.79	2.20	0.05	–0.37	8.27	6.90	0.10	0.18	6.40	5.01
	Improvement					–53.14%	4.15%	10.71%	23.44%					–54.92%	148.68%	29.22%	37.57%
MCAR	10	0.03	0.66	3.66	3.04	0.06	0.74	3.24	2.86	0.05	0.15	5.81	4.71	0.08	0.15	5.81	4.46
	20	0.03	0.80	3.08	2.41	0.06	0.83	2.80	2.30	0.04	–0.55	8.48	6.17	0.09	–0.22	7.52	5.45
	30	0.03	0.76	3.47	2.90	0.06	0.84	2.84	2.36	0.04	–0.12	7.52	5.32	0.09	0.01	7.07	4.97
	Average	0.03	0.74	3.40	2.78	0.06	0.80	2.96	2.51	0.04	–0.17	7.27	5.40	0.09	–0.02	6.80	4.96
	Improvement					–50%	8%	15%	11%					–51%	89%	7%	9%
MNAR	10	0.03	0.67	4.35	3.80	0.06	0.76	3.72	3.11	0.04	0.34	6.13	4.72	0.09	0.54	5.12	4.20
	20	0.03	0.70	4.22	3.69	0.06	0.87	2.79	2.32	0.04	–0.15	8.28	6.45	0.09	0.19	6.91	5.19
	30	0.03	0.71	3.88	3.18	0.06	0.82	3.08	2.51	0.04	–0.23	7.95	5.95	0.10	–0.10	7.50	5.52
	Average	0.03	0.69	4.15	3.56	0.06	0.81	3.20	2.65	0.04	–0.01	7.45	5.71	0.09	0.21	6.51	4.97
	Improvement					–51.67%	17.70%	29.89%	34.34%					–57.01%	1538.96%	14.47%	14.82%
	Flame
	Softimpute	Mice
	Whole data	Clustered data	Whole data	Clustered data
Mech.	MR%	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE
MAR	10	0.07	–0.53	8.79	8.79	0.04	–0.52	8.78	7.33	0.13	0.82	3.04	3.04	0.21	0.83	2.91	2.91
	20	0.02	–0.57	8.92	6.97	0.06	–0.52	8.79	7.10	0.14	0.73	3.69	3.69	0.21	0.79	3.29	3.29
	30	0.04	–0.56	8.81	7.04	0.04	–0.37	8.24	6.83	0.14	0.59	4.49	4.49	0.23	0.58	4.59	4.59
	Average	0.040	–0.552	8.840	7.600	0.050	–0.471	8.604	7.086	0.136	0.714	3.743	3.743	0.219	0.732	3.598	3.598
	Improvement					–19.51%	14.73%	2.75%	7.24%					–38.18%	2.48%	4.04%	4.04%
MCAR	10	0.03	–1.38	9.72	8.22	0.04	–1.15	9.25	7.84	0.14	0.66	3.65	3.65	0.21	0.73	3.27	3.27
	20	0.02	–1.48	10.72	8.49	0.05	–1.33	10.41	8.35	0.14	0.49	4.86	4.86	0.21	0.71	3.66	3.66
	30	0.04	–1.10	10.29	7.75	0.07	–0.83	9.60	7.29	0.15	0.69	3.94	3.94	0.21	0.70	3.91	3.91
	Average	0.030	–1.318	10.245	8.154	0.052	–1.104	9.751	7.828	0.142	0.615	4.153	4.153	0.211	0.713	3.613	3.613
	Improvement					–42.03%	16.27%	5.07%	4.16%					–32.90%	15.88%	14.96%	14.96%
MNAR	10	0.04	–0.80	10.15	7.98	0.05	–0.35	8.77	7.07	0.14	0.81	3.29	3.29	0.23	0.52	5.24	5.24
	20	0.03	–2.14	13.66	11.46	0.05	–1.79	12.86	10.87	0.14	0.84	3.05	3.05	0.21	0.80	3.43	3.43
	30	0.02	–2.24	12.90	10.67	0.06	–2.11	12.63	10.45	0.14	0.78	3.33	3.33	0.22	0.78	3.33	3.33
	Average	0.031	–1.731	12.238	10.035	0.055	–1.414	11.419	9.462	0.139	0.812	3.224	3.224	0.219	0.702	4.000	4.000
	Improvement					–43.06%	18.31%	7.17%	6.05%					–36.59%	–13.64%	–19.41%	–19.41%
MAR	10	0.80	0.58	4.64	4.64	1.56	0.66	4.15	4.15	0.03	0.80	3.16	3.16	0.03	0.82	3.06	3.06
	20	0.99	0.71	3.83	3.83	1.46	0.81	3.09	3.09	0.02	0.93	1.87	1.87	0.03	0.89	2.41	2.41
	30	1.03	0.37	5.58	5.58	2.24	0.48	5.08	5.08	0.02	0.81	3.09	3.09	0.02	0.78	3.28	3.28
	Average	0.941	0.554	4.680	4.680	1.756	0.650	4.110	4.110	0.022	0.847	2.709	2.709	0.027	0.828	2.915	2.915
	Improvement					–46.40%	17.47%	13.89%	13.89%					–18.63%	–2.23%	–7.08%	–7.08%
MCAR	10	0.85	0.45	4.70	4.70	1.99	0.55	4.23	4.23	0.02	0.74	3.21	3.21	0.02	0.65	3.75	3.75
	20	1.05	0.51	4.79	4.79	2.36	0.66	3.98	3.98	0.03	0.85	2.61	2.61	0.03	0.81	2.94	2.94
	30	1.11	0.61	4.44	4.44	2.02	0.69	3.94	3.94	0.02	0.86	2.69	2.69	0.03	0.80	3.22	3.22
	Average	1.002	0.520	4.640	4.640	2.124	0.634	4.049	4.049	0.022	0.817	2.837	2.837	0.028	0.752	3.303	3.303
	Improvement					–52.82%	21.80%	14.61%	14.61%					–22.40%	–7.99%	–14.09%	–14.09%
MNAR	10	0.85	0.50	5.36	5.36	1.50	0.71	4.06	4.06	0.02	0.80	3.41	3.41	0.03	0.86	2.83	2.83
	20	1.00	0.56	5.11	5.11	2.12	0.67	4.44	4.44	0.02	0.88	2.71	2.71	0.02	0.90	2.49	2.49
	30	1.12	0.58	4.67	4.67	1.24	0.73	3.73	3.73	0.03	0.83	2.93	2.93	0.03	0.82	3.05	3.05
	Average	0.988	0.544	5.046	5.046	1.624	0.703	4.077	4.077	0.023	0.835	3.015	3.015	0.029	0.858	2.792	2.792
	Improvement					–39.14%	29.17%	23.79%	23.79%					–21.70%	2.70%	8.00%	8.00%
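
The Improvement rows can be reproduced approximately from the Average rows. The sketch below is an inference from the reported numbers, not a formula stated in this section: it assumes that time, RMSE, and MAE gains are expressed relative to the clustered value, and that R² gains are expressed relative to the absolute whole-data value. The function names are hypothetical. Small deviations from the tabulated percentages are expected because the averages shown are rounded.

```python
def lower_is_better_improvement(whole: float, clustered: float) -> float:
    """Percentage gain for time, RMSE, or MAE (assumed: relative to the clustered value)."""
    return (whole - clustered) / clustered * 100.0


def r2_improvement(whole: float, clustered: float) -> float:
    """Percentage gain for R² (assumed: relative to the absolute whole-data value)."""
    return (clustered - whole) / abs(whole) * 100.0


# Flame dataset, mean imputation, MAR averages taken from the table above.
rmse_gain = lower_is_better_improvement(3.080, 2.721)  # reported: 13.22%
mae_gain = lower_is_better_improvement(2.703, 2.149)   # reported: 25.79%
r2_gain = r2_improvement(0.808, 0.849)                 # reported: 4.99%

print(f"RMSE gain: {rmse_gain:.2f}%")
print(f"MAE gain:  {mae_gain:.2f}%")
print(f"R2 gain:   {r2_gain:.2f}%")
```

Under these assumptions, a negative time improvement means the clustered variant was slower, and a whole-data R² close to zero inflates the percentage, which would explain the very large values in some rows.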

Table 5

Imputation time, R² score, RMSE, and MAE comparison between the ten imputation algorithms with and without clustering (jain dataset)

	jain
	mean								median
	Whole data				Clustered data				Whole data				Clustered data
Mech.	MR%	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE
MAR	10	0.003	0.170	7.578	0.003	0.368	0.614	5.170	4.093	0.003	0.253	7.188	6.013	0.384	0.609	5.199	3.985
	20	0.003	0.377	7.915	0.003	0.363	0.591	6.417	5.263	0.004	0.385	7.867	6.087	0.407	0.578	6.511	5.242
	30	0.002	0.301	8.905	0.002	0.368	0.629	6.493	5.341	0.003	0.325	8.750	6.879	0.413	0.599	6.744	5.513
	Average	0.003	0.283	8.133	0.003	0.367	0.611	6.027	4.899	0.004	0.321	7.935	6.326	0.402	0.596	6.151	4.913
	Improvement					–99.31%	116.02%	34.95%	–99.31%					–99.14%	85.53%	29.00%	28.76%
MCAR	10	0.003	0.195	9.294	0.003	0.365	0.593	6.609	5.545	0.003	0.150	9.550	7.673	0.401	0.585	6.673	5.589
	20	0.002	0.289	9.285	0.002	0.363	0.605	6.922	5.712	0.009	0.268	9.425	7.639	0.432	0.589	7.061	5.772
	30	0.003	0.401	7.912	0.003	0.363	0.666	5.910	4.861	0.004	0.355	8.212	6.582	0.389	0.654	6.016	4.921
	Average	0.003	0.295	8.830	0.003	0.364	0.621	6.481	5.373	0.005	0.258	9.062	7.298	0.407	0.609	6.583	5.427
	Improvement					–99.32%	110.45%	36.26%	–99.32%					–98.70%	136.54%	37.66%	34.46%
MNAR	10	0.003	–0.220	9.907	0.003	0.373	0.604	5.643	4.632	0.003	–0.453	10.811	9.535	0.381	0.574	5.853	4.705
	20	0.003	0.259	8.403	0.003	0.366	0.639	5.864	4.738	0.004	0.100	9.259	7.753	0.381	0.584	6.296	5.072
	30	0.003	0.132	9.336	0.003	0.358	0.547	6.743	5.536	0.003	–0.054	10.289	8.579	0.389	0.451	7.427	5.993
	Average	0.003	0.057	9.215	0.003	0.366	0.597	6.083	4.969	0.003	–0.136	10.120	8.622	0.384	0.536	6.526	5.257
	Improvement					–99.25%	946.99%	51.48%	–99.25%					–99.17%	495.23%	55.08%	64.03%
	jain
	mode								KNN
	Whole data				Clustered data				Whole data				Clustered data
Mech.	MR%	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE
MAR	10	0.02	0.21	7.38	4.69	0.19	–0.18	9.04	6.62	0.03	0.07	8.04	5.07	0.03	0.09	7.95	4.85
	20	0.02	0.04	9.81	6.90	0.20	–0.34	11.61	8.82	0.04	0.21	8.92	5.97	0.03	0.31	8.34	5.45
	30	0.02	0.12	9.99	7.34	0.19	0.14	9.86	7.06	0.04	0.25	9.24	6.26	0.04	0.50	7.55	4.78
	Average	0.02	0.13	9.06	6.31	0.20	–0.13	10.17	7.50	0.03	0.17	8.73	5.76	0.03	0.30	7.95	5.03
	Improvement					–91.12%	–200.42%	–10.92%	–15.86%					0.75%	70.81%	9.88%	14.70%
MCAR	10	0.02	–0.32	11.91	9.93	0.19	0.00	10.34	7.87	0.03	0.23	9.10	6.23	0.03	0.38	8.16	5.36
	20	0.02	–0.07	11.41	8.82	0.19	0.46	8.11	6.27	0.04	0.11	10.37	7.45	0.04	0.29	9.31	6.36
	30	0.02	–0.18	11.10	8.67	0.20	0.05	9.98	7.22	0.04	0.30	8.53	5.61	0.03	0.39	7.97	5.00
	Average	0.02	–0.19	11.47	9.14	0.20	0.17	9.48	7.12	0.04	0.22	9.33	6.43	0.03	0.35	8.48	5.57
	Improvement					–91.48%	188.93%	21.07%	28.42%					4.21%	63.96%	10.09%	15.37%
MNAR	10	0.02	–1.68	14.68	13.06	0.19	–0.40	10.60	7.73	0.03	0.29	7.54	5.43	0.03	0.46	6.61	4.27
	20	0.02	–0.75	12.91	11.01	0.20	–0.29	11.10	8.60	0.04	–0.05	10.01	8.46	0.04	0.30	8.19	6.75
	30	0.02	–0.85	13.64	11.66	0.19	–0.54	12.44	9.90	0.04	–0.06	10.32	7.21	0.04	0.26	8.61	5.55
	Average	0.02	–1.09	13.74	11.91	0.19	–0.41	11.38	8.75	0.04	0.06	9.29	7.03	0.04	0.34	7.80	5.52
	Improvement					–90.60%	62.52%	20.78%	36.17%					–0.08%	463.67%	19.12%	27.34%
MAR	10	0.032	0.518	5.774	4.481	0.059	0.612	5.180	4.088	0.033	–6.713	23.102	17.142	0.063	0.302	6.949	4.421
	20	0.033	0.419	7.647	6.037	0.064	0.584	6.466	5.299	0.038	–3.220	20.600	15.312	0.067	0.419	7.644	5.178
	30	0.034	0.435	8.008	6.246	0.057	0.619	6.578	5.413	0.036	–2.428	19.727	14.072	0.068	0.130	9.941	7.429
	Average	0.033	0.457	7.143	5.588	0.060	0.605	6.075	4.933	0.036	–4.121	21.143	15.508	0.066	0.284	8.178	5.676
	Improvement					–44.73%	32.33%	17.58%	13.27%					–46.08%	106.88%	158.54%	173.23%
MCAR	10	0.0335	0.3642	8.2610	6.3685	0.0568	0.5888	6.6435	5.6134	0.0339	–1.9023	17.6506	14.6365	0.0668	–0.2864	11.7510	9.6862
	20	0.0311	0.3723	8.7255	6.9290	0.0558	0.5994	6.9707	5.7482	0.0373	–1.6282	17.8548	14.0525	0.0671	–0.0305	11.1804	8.8155
	30	0.0320	0.5263	7.0361	5.6890	0.0601	0.6676	5.8944	4.8173	0.0289	–1.3182	15.5659	12.5928	0.0761	–0.0944	10.6950	8.5917
	Average	0.032	0.421	8.008	6.329	0.058	0.619	6.503	5.393	0.033	–1.616	17.024	13.761	0.070	–0.137	11.209	9.031
	Improvement					–44.04%	46.95%	23.14%	17.35%					–52.30%	91.52%	51.88%	52.37%
MNAR	10	0.0350	0.1732	8.1545	6.6524	0.0604	0.6091	5.6072	4.6593	0.0311	–2.0054	15.5469	14.1162	0.0663	–0.1655	9.6816	8.2147
	20	0.0314	–0.1397	10.4219	8.3814	0.0632	0.5835	6.3000	5.0987	0.0375	–1.1017	14.1525	11.8669	0.0764	0.0014	9.7555	8.3935
	30	0.0314	0.1535	9.2199	7.5581	0.0578	0.4849	7.1919	5.8362	0.0349	–1.4989	15.8411	12.6243	0.0837	0.0105	9.9682	7.9411
	Average	0.033	0.062	9.265	7.531	0.060	0.559	6.366	5.198	0.035	–1.535	15.180	12.869	0.075	–0.051	9.802	8.183
	Improvement					–46.10%	797.15%	45.54%	44.87%					–54.25%	96.67%	54.87%	57.27%
	jain
	Softimpute								Mice
	Whole data				Clustered data				Whole data				Clustered data
Mech.	MR%	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE
MAR	10	0.036	–0.617	10.579	6.561	0.044	–0.161	8.963	5.859	0.132	–0.631	10.624	10.624	0.231	–0.230	9.226	9.226
	20	0.024	–0.933	13.942	9.632	0.047	–0.458	12.107	8.711	0.145	0.419	7.645	7.645	0.213	0.121	9.401	9.401
	30	0.041	–0.921	14.768	10.316	0.042	–0.435	12.761	8.983	0.144	0.226	9.371	9.371	0.216	0.271	9.097	9.097
	Average	0.034	–0.824	13.096	8.836	0.044	–0.351	11.277	7.851	0.140	0.005	9.213	9.213	0.220	0.054	9.241	9.241
	Improvement					–23.99%	57.38%	16.13%	12.55%					–36.16%	1058.52%	–0.30%	–0.30%
MCAR	10	0.021	–2.076	18.171	15.484	0.043	–1.250	15.542	12.911	0.134	0.102	9.816	9.816	0.230	0.248	8.984	8.984
	20	0.031	–1.727	18.186	14.475	0.042	–1.079	15.880	12.384	0.144	0.254	9.516	9.516	0.235	0.293	9.264	9.264
	30	0.019	–2.047	17.845	14.324	0.052	–1.368	15.732	12.258	0.145	0.322	8.421	8.421	0.220	0.246	8.877	8.877
	Average	0.024	–1.950	18.067	14.761	0.045	–1.232	15.718	12.518	0.141	0.226	9.251	9.251	0.228	0.262	9.042	9.042
	Improvement					–47.50%	36.79%	14.95%	17.92%					–38.13%	16.12%	2.31%	2.31%
MNAR	10	0.023	–4.447	20.930	18.829	0.059	–2.995	17.925	15.979	0.134	–0.112	9.458	9.458	0.213	–0.175	9.722	9.722
	20	0.029	–3.005	19.538	16.634	0.052	–2.050	17.050	14.347	0.139	0.401	7.554	7.554	0.240	0.151	8.995	8.995
	30	0.047	–3.404	21.029	18.056	0.042	–2.553	18.888	16.014	0.145	0.019	9.927	9.927	0.221	–0.150	10.746	10.746
	Average	0.033	–3.619	20.499	17.840	0.051	–2.533	17.955	15.447	0.139	0.103	8.980	8.980	0.224	–0.058	9.821	9.821
	Improvement					–35.69%	30.01%	14.17%	15.49%					–37.92%	–156.59%	–8.57%	–8.57%
MAR	10	1.843	–0.101	8.727	8.727	3.035	–0.090	8.685	8.685	0.028	0.682	4.688	4.688	0.027	0.460	6.112	6.112
	20	2.306	0.064	9.704	9.704	2.708	0.132	9.342	9.342	0.027	0.625	6.142	6.142	0.029	0.664	5.816	5.816
	30	2.507	0.083	10.205	10.205	3.256	0.346	8.616	8.616	0.032	0.636	6.432	6.432	0.027	0.633	6.457	6.457
	Average	2.219	0.015	9.545	9.545	3.000	0.129	8.881	8.881	0.029	0.648	5.754	5.754	0.028	0.586	6.128	6.128
	Improvement					–26.03%	754.32%	7.48%	7.48%					5.22%	–9.59%	–6.11%	–6.11%
MCAR	10	2.215	0.146	9.574	9.574	2.393	0.196	9.292	9.292	0.028	0.608	6.487	6.487	0.031	0.590	6.631	6.631
	20	2.457	0.032	10.834	10.834	3.060	0.132	10.263	10.263	0.028	0.594	7.017	7.017	0.034	0.614	6.841	6.841
	30	2.629	0.077	9.823	9.823	2.635	0.195	9.171	9.171	0.031	0.684	5.745	5.745	0.023	0.657	5.990	5.990
	Average	2.434	0.085	10.077	10.077	2.696	0.174	9.575	9.575	0.029	0.629	6.416	6.416	0.029	0.620	6.487	6.487
	Improvement					–9.73%	104.83%	5.24%	5.24%					–1.18%	–1.32%	–1.09%	–1.09%
MNAR	10	1.976	0.148	8.277	8.277	3.724	0.250	7.768	7.768	0.026	0.660	5.226	5.226	0.023	0.475	6.499	6.499
	20	2.305	0.223	8.605	8.605	3.134	0.456	7.201	7.201	0.027	0.709	5.264	5.264	0.029	0.529	6.698	6.698
	30	2.608	–0.145	10.723	10.723	2.948	–0.003	10.035	10.035	0.031	0.532	6.857	6.857	0.024	0.565	6.608	6.608
	Average	2.296	0.075	9.202	9.202	3.269	0.234	8.335	8.335	0.028	0.634	5.782	5.782	0.025	0.523	6.602	6.602
	Improvement					–29.75%	210.87%	10.41%	10.41%					10.16%	–17.48%	–12.42%	–12.42%
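
Several R² entries in these tables are negative, which can look surprising. The sketch below uses the standard textbook definitions of R², RMSE, and MAE, computed between the originally removed values and the imputed ones. It is an illustration only, not the author's evaluation code; the function name and the toy values are hypothetical.

```python
import math


def imputation_scores(true_values, imputed_values):
    """Return (R², RMSE, MAE) between removed true values and imputed values."""
    n = len(true_values)
    residuals = [t - p for t, p in zip(true_values, imputed_values)]
    ss_res = sum(r * r for r in residuals)             # residual sum of squares
    mean_true = sum(true_values) / n
    ss_tot = sum((t - mean_true) ** 2 for t in true_values)  # must be non-zero
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    return r2, rmse, mae


true_values = [2.0, 4.0, 6.0, 8.0]       # toy "removed" values
close_imputation = [2.5, 3.5, 6.5, 7.5]  # small errors
poor_imputation = [8.0, 6.0, 4.0, 2.0]   # worse than predicting the mean

print(imputation_scores(true_values, close_imputation))
print(imputation_scores(true_values, poor_imputation))
```

R² drops below zero whenever the imputed values fit the removed values worse than a constant equal to their mean would, which is consistent with the large RMSE and MAE that accompany the negative R² rows.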

Table 6

Imputation time, R² score, RMSE, and MAE comparison between the ten imputation algorithms with and without clustering (pathbased dataset)

	pathbased
	mean								median
	Whole data				Clustered data				Whole data				Clustered data
Mech.	MR%	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE
MAR	10	0.003	–0.046	6.734	5.628	0.357	0.081	6.313	4.818	0.004	–0.077	6.834	5.591	0.420	0.036	6.464	4.896
	20	0.357	0.081	6.313	4.818	0.372	0.218	5.523	3.963	0.003	0.014	6.199	4.829	0.385	0.126	5.836	4.194
	30	0.006	0.027	7.270	6.009	0.365	0.275	6.276	4.573	0.005	0.034	7.243	5.939	0.385	0.278	6.263	4.571
	Average	0.122	0.021	6.773	5.485	0.365	0.191	6.037	4.451	0.004	–0.010	6.759	5.453	0.397	0.147	6.188	4.554
	Improvement					–66.53%	826.69%	12.18%	23.22%					–99.00%	1633.44%	9.22%	19.74%
MCAR	10	0.003	–0.009	7.302	6.168	0.364	0.179	6.588	5.100	0.003	–0.026	7.363	6.202	0.389	0.152	6.695	5.191
	20	0.002	–0.043	7.136	5.846	0.363	0.248	6.060	4.391	0.015	–0.112	7.370	5.913	0.389	0.128	6.527	4.681
	30	0.006	0.014	6.907	5.902	0.359	0.362	5.558	4.114	0.003	0.006	6.937	5.911	0.384	0.354	5.591	4.145
	Average	0.004	–0.012	7.115	5.972	0.362	0.263	6.069	4.535	0.007	–0.044	7.223	6.009	0.387	0.211	6.271	4.672
	Improvement					–98.94%	2252.66%	17.24%	31.69%					–98.19%	581.01%	15.18%	28.60%
MNAR	10	0.003	–0.451	9.180	7.945	0.367	–0.171	8.246	6.595	0.007	–0.593	9.620	8.307	0.386	–0.155	8.190	6.593
	20	0.003	–0.616	8.681	7.432	0.385	–0.284	7.741	6.054	0.002	–1.094	9.883	8.404	0.386	–0.360	7.965	6.173
	30	0.002	–0.488	8.197	6.884	0.366	–0.121	7.115	5.392	0.005	–0.949	9.379	7.805	0.376	–0.341	7.781	5.747
	Average	0.003	–0.518	8.686	7.421	0.373	–0.192	7.701	6.014	0.005	–0.878	9.627	8.172	0.383	–0.285	7.979	6.171
	Improvement					–99.29%	62.93%	12.80%	23.39%					–98.74%	67.52%	20.66%	32.42%
	pathbased
	mode								KNN
	Whole data				Clustered data				Whole data				Clustered data
Mech.	MR%	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE
MAR	10	0.024	–0.127	6.990	5.489	0.287	–1.479	10.368	7.483	0.026	–0.348	7.646	5.924	0.028	–0.036	6.702	4.587
	20	0.014	–0.479	7.591	5.480	0.285	–2.798	12.167	9.143	0.026	–0.253	6.988	5.202	0.028	–0.190	6.810	4.392
	30	0.016	–0.305	8.421	6.452	0.290	–1.043	10.534	7.201	0.028	–0.806	9.906	7.812	0.030	–0.022	7.450	4.936
	Average	0.018	–0.304	7.667	5.807	0.287	–1.773	11.023	7.942	0.027	–0.469	8.180	6.313	0.029	–0.083	6.988	4.638
	Improvement					–93.80%	–484.30%	–30.45%	–26.88%					–6.95%	82.41%	17.07%	36.11%
MCAR	10	0.015	–0.291	8.262	6.238	0.286	–1.729	12.011	8.693	0.026	–0.761	9.650	7.256	0.028	–0.962	10.185	6.957
	20	0.014	–0.551	8.704	6.587	0.286	–1.586	11.240	7.656	0.027	–0.672	9.036	6.694	0.031	–0.143	7.471	5.058
	30	0.015	–1.161	11.205	9.398	0.288	–3.338	15.875	12.302	0.028	–0.761	9.232	7.184	0.031	–0.149	7.456	4.991
	Average	0.015	–0.668	9.390	7.408	0.287	–2.218	13.042	9.550	0.027	–0.731	9.306	7.045	0.030	–0.418	8.371	5.668
	Improvement					–94.95%	–232.11%	–28.00%	–22.43%					–9.44%	42.87%	11.17%	24.28%
MNAR	10	0.015	–1.161	11.205	9.398	0.288	–3.338	15.875	12.302	0.025	–1.419	11.854	9.400	0.026	–0.929	10.584	7.212
	20	0.014	–1.605	11.023	9.243	0.292	–3.812	14.982	11.468	0.029	–1.607	11.027	8.647	0.028	–1.127	9.962	6.592
	30	0.013	–1.297	10.183	8.420	0.296	–3.079	13.570	10.043	0.029	–1.360	10.322	8.040	0.030	–0.971	9.434	5.998
	Average	0.014	–1.354	10.804	9.020	0.292	–3.410	14.809	11.271	0.028	–1.462	11.068	8.696	0.028	–1.009	9.993	6.601
	Improvement					–95.15%	–151.76%	–27.05%	–19.97%					–2.29%	30.98%	10.75%	31.73%
MAR	10	0.032	–0.051	6.749	5.626	0.083	0.071	6.348	4.840	0.016	–1.937	11.286	9.268	0.049	–1.221	9.813	7.324
	20	0.032	0.023	6.170	4.877	0.090	0.208	5.558	3.991	0.014	–2.101	10.994	9.135	0.050	–1.425	9.722	7.030
	30	0.031	0.027	7.272	6.004	0.085	0.275	6.276	4.569	0.015	–1.050	10.552	8.768	0.053	–0.574	9.245	6.724
	Average	0.031	0.000	6.730	5.502	0.086	0.184	6.060	4.467	0.015	–1.696	10.944	9.057	0.051	–1.073	9.594	7.026
	Improvement					–63.40%	85960.23%	11.05%	23.18%					–70.58%	36.73%	14.08%	28.91%
MCAR	10	0.029	–0.023	7.353	6.205	0.086	0.167	6.636	5.134	0.016	–1.948	12.485	10.274	0.053	–1.745	12.047	9.492
	20	0.032	–0.056	7.183	5.886	0.085	0.234	6.119	4.432	0.012	–1.298	10.596	8.949	0.052	–0.822	9.434	6.930
	30	0.031	0.008	6.928	5.932	0.082	0.347	5.622	4.190	0.014	–0.809	9.357	7.739	0.052	–0.324	8.003	5.650
	Average	0.031	–0.024	7.154	6.008	0.085	0.249	6.126	4.585	0.014	–1.352	10.812	8.987	0.052	–0.964	9.828	7.357
	Improvement					–63.79%	1160.69%	16.79%	31.02%					–73.27%	28.72%	10.01%	22.15%
MNAR	10	0.031	–0.452	9.182	7.954	0.084	–0.167	8.234	6.587	0.013	–1.288	11.529	9.459	0.049	–1.072	10.970	8.516
	20	0.031	–0.628	8.714	7.478	0.082	–0.310	7.816	6.178	0.014	–1.342	10.453	8.827	0.055	–0.963	9.570	7.192
	30	0.029	–0.489	8.198	6.884	0.087	–0.121	7.113	5.395	0.013	–1.530	10.686	8.825	0.057	–1.170	9.898	7.514
	Average	0.030	–0.523	8.698	7.438	0.085	–0.199	7.721	6.053	0.013	–1.387	10.890	9.037	0.054	–1.068	10.146	7.741
	Improvement					–64.15%	61.90%	12.66%	22.89%					–74.94%	22.96%	7.33%	16.75%
	pathbased
	Softimpute								Mice
	Whole data				Clustered data				Whole data				Clustered data
Mech.	MR%	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE
MAR	10	0.022	–0.283	7.459	6.073	0.045	–0.232	7.309	5.613	0.133	–0.275	7.434	7.434	0.311	–1.154	9.664	9.664
	20	0.043	–1.436	9.744	7.020	0.056	–1.723	10.303	7.195	0.142	0.020	6.182	6.182	0.336	–0.522	7.701	7.701
	30	0.022	–0.839	9.995	7.643	0.045	–0.589	9.292	6.468	0.133	–0.092	7.701	7.701	0.314	–0.077	7.649	7.649
	Average	0.029	–0.853	9.066	6.912	0.049	–0.848	8.968	6.425	0.136	–0.116	7.106	7.106	0.320	–0.584	8.338	8.338
	Improvement					–40.68%	0.52%	1.10%	7.58%					–57.52%	–405.49%	–14.78%	–14.78%
MCAR	10	0.030	–1.907	12.396	9.550	0.051	–1.689	11.923	8.236	0.139	–0.124	7.710	7.710	0.333	–0.263	8.171	8.171
	20	0.044	–2.245	12.590	9.961	0.063	–1.866	11.832	8.473	0.138	–0.109	7.361	7.361	0.309	–0.259	7.843	7.843
	30	0.029	–1.736	11.507	9.200	0.054	–1.102	10.086	7.354	0.145	–0.358	8.106	8.106	0.313	0.022	6.880	6.880
	Average	0.034	–1.963	12.164	9.570	0.056	–1.552	11.280	8.021	0.141	–0.197	7.726	7.726	0.318	–0.167	7.631	7.631
	Improvement					–38.38%	20.90%	7.84%	19.32%					–55.80%	15.41%	1.24%	1.24%
MNAR	10	0.016	–3.126	15.482	13.644	0.062	–3.036	15.312	12.498	0.130	–2.220	13.677	13.677	0.329	–1.272	11.488	11.488
	20	0.022	–3.915	15.142	13.645	0.057	–3.517	14.516	12.010	0.138	–0.694	8.888	8.888	0.315	–0.621	8.697	8.697
	30	0.043	–3.877	14.838	13.009	0.073	–3.361	14.030	11.239	0.140	–1.090	9.712	9.712	0.329	–0.068	6.942	6.942
	Average	0.027	–3.640	15.154	13.432	0.064	–3.305	14.619	11.916	0.136	–1.335	10.759	10.759	0.324	–0.654	9.042	9.042
	Improvement					–57.52%	9.20%	3.66%	12.73%					–58.04%	51.02%	18.99%	18.99%
MAR	10	1.542	–0.736	8.675	8.675	2.315	–0.418	7.840	7.840	0.030	0.060	6.383	6.383	0.039	–1.082	9.502	9.502
	20	1.667	–0.651	8.021	8.021	2.328	–0.685	8.105	8.105	0.029	0.185	5.638	5.638	0.039	0.074	6.009	6.009
	30	1.716	–1.250	11.057	11.057	2.174	–0.560	9.205	9.205	0.035	0.322	6.070	6.070	0.040	–0.174	7.987	7.987
	Average	1.642	–0.879	9.251	9.251	2.272	–0.554	8.383	8.383	0.031	0.189	6.030	6.030	0.040	–0.394	7.832	7.832
	Improvement					–27.75%	36.94%	10.35%	10.35%					–20.88%	–308.70%	–23.01%	–23.01%
MCAR	10	1.330	–1.415	11.300	11.300	2.342	–1.250	10.905	10.905	0.030	0.055	7.067	7.067	0.054	–0.044	7.430	7.430
	20	1.856	–0.840	9.479	9.479	1.597	–0.623	8.905	8.905	0.027	0.190	6.290	6.290	0.041	0.273	5.959	5.959
	30	1.773	–0.789	9.304	9.304	2.284	–0.356	8.099	8.099	0.033	0.243	6.051	6.051	0.033	0.293	5.848	5.848
	Average	1.653	–1.015	10.028	10.028	2.074	–0.743	9.303	9.303	0.030	0.163	6.470	6.470	0.043	0.174	6.412	6.412
	Improvement					–20.31%	26.79%	7.79%	7.79%					–29.67%	6.88%	0.89%	0.89%
MNAR	10	1.571	–1.627	12.352	12.352	2.111	–0.916	10.549	10.549	0.030	–0.171	8.247	8.247	0.043	–0.387	8.975	8.975
	20	1.910	–1.793	11.414	11.414	2.650	–1.142	9.996	9.996	0.028	–0.465	8.267	8.267	0.035	–0.071	7.068	7.068
	30	1.744	–1.833	11.309	11.309	3.061	–1.561	10.751	10.751	0.034	–0.241	7.484	7.484	0.046	–0.261	7.545	7.545
	Average	1.742	–1.751	11.692	11.692	2.607	–1.206	10.432	10.432	0.031	–0.292	7.999	7.999	0.042	–0.240	7.863	7.863
	Improvement					–33.20%	31.12%	12.07%	12.07%					–26.36%	18.01%	1.74%	1.74%

Table 7

Imputation time, R² score, RMSE, and MAE comparison between the ten imputation algorithms with and without clustering (aggregation dataset)

	aggregation
	mean								median
	Whole data				Clustered data				Whole data				Clustered data
Mech.	MR%	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE
MAR	10	0.049	0.015	9.061	8.242	0.725	0.903	2.846	2.296	0.003	0.023	9.027	8.041	0.721	0.903	2.839	2.296
	20	0.003	0.038	9.632	8.717	0.657	0.908	2.973	2.466	0.003	–0.032	9.976	8.792	0.711	0.905	3.028	2.514
	30	0.004	0.057	9.559	8.391	0.641	0.913	2.903	2.434	0.003	–0.027	9.973	8.523	0.708	0.913	2.907	2.440
	Average	0.019	0.037	9.417	8.450	0.674	0.908	2.907	2.399	0.003	–0.012	9.659	8.452	0.713	0.907	2.924	2.417
	Improvement					–97.21%	2379.94%	223.96%	252.23%					–99.55%	7625.53%	230.29%	249.71%
MCAR	10	0.003	0.042	8.959	7.960	0.650	0.911	2.723	2.155	0.003	–0.007	9.184	7.888	0.689	0.911	2.736	2.167
	20	0.003	0.066	9.304	8.310	0.659	0.914	2.815	2.361	0.003	0.022	9.518	8.162	0.703	0.914	2.823	2.373
	30	0.006	0.113	8.324	7.271	0.669	0.891	2.917	2.344	0.003	0.025	8.727	7.333	0.699	0.891	2.919	2.350
	Average	0.004	0.073	8.862	7.847	0.659	0.906	2.818	2.287	0.003	0.013	9.143	7.794	0.697	0.905	2.826	2.297
	Improvement					–99.40%	1133.19%	214.47%	243.16%					–99.57%	6640.97%	223.50%	239.39%
MNAR	10	0.004	–0.688	11.243	10.228	0.658	0.878	3.026	2.487	0.003	–1.140	12.658	11.562	0.704	0.876	3.048	2.506
	20	0.005	–0.522	10.627	9.401	0.652	0.891	2.839	2.293	0.003	–0.966	12.078	10.629	0.706	0.890	2.863	2.313
	30	0.003	–0.545	10.400	9.267	0.660	0.902	2.617	2.050	0.003	–1.009	11.860	10.475	0.699	0.902	2.625	2.059
	Average	0.004	–0.585	10.757	9.632	0.657	0.890	2.827	2.277	0.003	–1.038	12.198	10.889	0.703	0.889	2.846	2.293
	Improvement					–99.45%	252.22%	280.45%	323.08%					–99.55%	185.63%	328.67%	374.94%
	aggregation
	mode								KNN
	Whole data				Clustered data				Whole data				Clustered data
Mech.	MR%	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE
MAR	10	0.024	–1.747	15.134	11.822	0.716	0.854	3.490	2.777	0.125	–0.487	11.135	9.198	0.071	0.868	3.320	2.786
	20	0.025	–0.841	13.323	10.648	0.729	0.838	3.952	3.194	0.105	–0.504	12.044	9.532	0.076	0.855	3.742	3.073
	30	0.023	–1.638	15.983	12.947	0.717	0.855	3.745	2.983	0.114	–0.604	12.464	9.980	0.077	0.874	3.500	2.888
	Average	0.024	–1.408	14.813	11.806	0.721	0.849	3.729	2.984	0.115	–0.532	11.881	9.570	0.075	0.865	3.521	2.916
	Improvement					–96.65%	160.29%	297.25%	295.58%					54.02%	262.74%	237.47%	228.22%
MCAR	10	0.027	–1.617	14.803	11.898	0.719	0.819	3.898	3.027	0.103	–0.616	11.634	9.299	0.070	0.877	3.205	2.542
	20	0.026	–1.465	15.111	12.078	0.719	0.826	4.013	3.173	0.107	–0.571	12.065	9.719	0.072	0.878	3.366	2.766
	30	0.027	–1.702	14.531	11.902	0.718	0.800	3.957	3.210	0.113	–0.271	9.967	7.916	0.078	0.800	3.955	3.078
	Average	0.027	–1.595	14.815	11.959	0.719	0.815	3.956	3.137	0.108	–0.486	11.222	8.978	0.073	0.852	3.509	2.795
	Improvement					–96.27%	151.09%	274.48%	281.25%					46.67%	275.14%	219.82%	221.17%
MNAR	10	0.027	–1.943	14.844	12.122	0.712	0.683	4.872	4.095	0.101	–1.017	12.288	10.326	0.068	0.798	3.891	3.070
	20	0.028	–4.139	19.527	17.741	0.750	0.782	4.026	3.233	0.107	–0.858	11.743	9.713	0.074	0.815	3.702	2.811
	30	0.026	–4.157	19.002	17.247	0.721	0.776	3.960	3.095	0.113	–1.160	12.297	10.024	0.078	0.854	3.201	2.462
	Average	0.027	–3.413	17.791	15.704	0.728	0.747	4.286	3.474	0.107	–1.012	12.109	10.021	0.073	0.822	3.598	2.781
	Improvement					–96.32%	121.89%	315.09%	352.02%					46.24%	181.30%	236.57%	260.34%
MAR	10	0.122	0.010	9.084	8.258	0.201	0.901	2.868	2.332	0.064	–1.532	14.529	11.592	0.174	0.808	4.001	3.212
	20	0.031	0.037	9.635	8.724	0.168	0.908	2.982	2.476	0.022	–1.442	15.345	11.756	0.369	0.789	4.516	3.455
	30	0.034	0.057	9.556	8.397	0.175	0.907	3.006	2.527	0.022	–1.336	15.041	12.067	0.193	0.767	4.748	3.643
	Average	0.062	0.035	9.425	8.460	0.181	0.905	2.952	2.445	0.036	–1.436	14.972	11.805	0.245	0.788	4.422	3.437
	Improvement					–65.75%	2490.90%	219.27%	246.03%					–85.18%	154.85%	238.59%	243.50%
MCAR	10	0.033	0.040	8.969	7.959	0.203	0.912	2.709	2.147	0.020	–1.228	13.658	10.514	0.190	0.651	5.407	3.790
	20	0.033	0.052	9.369	8.377	0.205	0.915	2.808	2.362	0.027	–1.299	14.593	11.704	0.206	0.719	5.106	3.862
	30	0.032	0.113	8.326	7.266	0.211	0.890	2.927	2.353	0.021	–1.034	12.606	9.950	0.209	0.533	6.041	4.468
	Average	0.033	0.068	8.888	7.867	0.206	0.906	2.815	2.287	0.023	–1.187	13.619	10.723	0.202	0.634	5.518	4.040
	Improvement					–84.10%	1228.42%	215.79%	243.97%					–88.82%	153.44%	146.81%	165.43%
MNAR	10	0.033	–0.691	11.254	10.226	0.202	0.878	3.021	2.499	0.023	–2.006	15.002	12.417	0.198	0.388	6.772	5.056
	20	0.032	–0.524	10.635	9.407	0.187	0.889	2.877	2.329	0.020	–1.232	12.868	10.495	0.192	0.429	6.511	5.061
	30	0.030	–0.552	10.424	9.296	0.195	0.901	2.635	2.066	0.021	–1.408	12.985	10.542	0.228	0.439	6.269	4.876
	Average	0.032	–0.589	10.771	9.643	0.195	0.889	2.844	2.298	0.021	–1.548	13.618	11.151	0.206	0.418	6.517	4.997
	Improvement					–83.71%	250.91%	278.70%	319.66%					–89.77%	127.02%	108.96%	123.15%
	aggregation
	Softimpute								Mice
	Whole data				Clustered data				Whole data				Clustered data
Mech.	MR%	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE	Time	R²Score	RMSE	MAE
MAR	10	0.307	–0.289	10.367	7.812	0.171	0.743	4.625	3.282	0.169	0.226	8.034	8.034	0.800	0.788	4.207	4.207
	20	0.058	–0.685	12.747	10.185	0.123	0.549	6.596	4.890	0.190	–0.554	12.240	12.240	0.690	0.818	4.193	4.193
	30	0.028	–1.185	14.546	11.412	0.157	0.088	9.400	6.924	0.222	0.046	9.614	9.614	0.751	0.865	3.611	3.611
	Average	0.131	–0.720	12.553	9.803	0.150	0.460	6.874	5.032	0.194	–0.094	9.963	9.963	0.747	0.824	4.004	4.004
	Improvement					–12.78%	163.93%	82.63%	94.82%					–74.08%	976.06%	148.84%	148.84%
MCAR	10	0.026	–0.921	12.684	9.875	0.145	0.271	7.813	5.396	0.171	0.673	5.236	5.236	0.846	0.803	4.063	4.063
	20	0.038	–0.959	13.469	10.681	0.117	0.266	8.246	5.991	0.189	–1.010	13.646	13.646	0.902	0.840	3.846	3.846
	30	0.025	–1.535	14.074	11.620	0.155	–0.175	9.581	7.191	0.218	–0.172	9.571	9.571	0.853	0.816	3.789	3.789
	Average	0.029	–1.138	13.409	10.725	0.139	0.121	8.547	6.193	0.193	–0.170	9.484	9.484	0.867	0.820	3.899	3.899
	Improvement					–78.82%	110.61%	56.89%	73.20%					–77.78%	582.23%	143.23%	143.23%
MNAR	10	0.051	–4.063	19.472	17.608	0.177	–1.333	13.217	10.755	0.165	–0.594	10.924	10.924	0.888	0.758	4.259	4.259
	20	0.024	–3.723	18.721	16.757	0.116	–1.257	12.942	10.443	0.189	–0.152	9.245	9.245	0.807	0.791	3.939	3.939
	30	0.043	–3.731	18.200	16.187	0.116	–1.447	13.090	10.518	0.209	–0.160	9.013	9.013	0.794	0.840	3.343	3.343
	Average	0.039	–3.839	18.798	16.850	0.136	–1.346	13.083	10.572	0.188	–0.302	9.727	9.727	0.830	0.796	3.847	3.847
	Improvement					–71.36%	64.95%	43.68%	59.39%					–77.38%	363.79%	152.84%	152.84%
MAR	10	8.647	–0.123	9.678	9.678	7.410	0.848	3.564	3.564	0.093	0.901	2.867	2.867	0.219	0.905	2.815	2.8149
	20	11.011	–0.467	11.892	11.892	10.282	0.821	4.160	4.160	0.092	0.907	3.000	3.000	0.232	0.909	2.961	2.9612
	30	12.181	–0.715	12.886	12.886	5.405	0.836	3.983	3.983	0.086	0.909	2.972	2.972	0.354	0.912	2.927	2.9265
	Average	10.613	–0.435	11.485	11.485	7.699	0.835	3.902	3.902	0.090	0.906	2.946	2.946	0.269	0.909	2.901	2.9009
	Improvement					37.85%	292.00%	194.32%	194.32%					–66.37%	0.32%	1.57%	1.57%
MCAR	10	8.0030	–0.5278	11.3114	11.3114	5.7206	0.8025	4.0665	4.0665	0.0980	0.9046	2.8259	2.8259	0.2823	0.9097	2.7493	2.7493
	20	11.2430	–0.5266	11.8914	11.8914	6.3866	0.8598	3.6033	3.6033	0.0910	0.9144	2.8154	2.8154	0.5126	0.9153	2.8008	2.8008
	30	12.8170	–0.4341	10.5859	10.5859	5.4358	0.7278	4.6115	4.6115	0.0890	0.8825	3.0302	3.0302	0.2302	0.8870	2.9708	2.9708
	Average	10.6877	–0.4962	11.2629	11.2629	5.8477	0.7967	4.0938	4.0938	0.0927	0.9005	2.8905	2.8905	0.3417	0.9040	2.8403	2.8403
	Improvement					82.77%	260.58%	175.12%	175.12%					–72.88%	0.39%	1.77%	1.77%
MNAR	10	8.5870	–1.2400	12.9510	12.9510	6.5083	0.7821	4.0392	4.0392	0.0990	0.8615	3.2204	3.2204	0.2569	0.8760	3.0468	3.0468
	20	11.6740	–0.7613	11.4316	11.4316	6.5144	0.7390	4.4010	4.4010	0.0910	0.8803	2.9806	2.9806	0.2272	0.8852	2.9180	2.9180
	30	11.4090	–1.5395	13.3346	13.3346	7.9338	0.8096	3.6514	3.6514	0.0890	0.8959	2.6998	2.6998	0.2480	0.9012	2.6307	2.6307
	Average	10.5567	–1.1802	12.5724	12.5724	6.9855	0.7769	4.0305	4.0305	0.0930	0.8792	2.9669	2.9669	0.2440	0.8875	2.8652	2.8652
	Improvement					51.12%	165.82%	211.93%	211.93%					–61.89%	0.94%	3.55%	3.55%
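
The "Improvement" rows above can be reproduced from the "Average" rows. The sketch below is inferred from the reported numbers and is not a formula stated in this section: for time, RMSE, and MAE it assumes the reduction is taken relative to the clustered value, and for the R² score it assumes the gain is taken relative to the magnitude of the whole-data score. The function name `improvement_pct` is illustrative only. Small differences from the printed percentages are expected, because the averages in the table are rounded.

```python
def improvement_pct(whole, clustered, higher_is_better=False):
    """Percentage improvement of the clustered result over the whole-data result.

    Assumed convention, inferred from the table values:
    - lower-is-better metrics (Time, RMSE, MAE): (whole - clustered) / clustered
    - higher-is-better metrics (R2 score):       (clustered - whole) / |whole|
    """
    if higher_is_better:
        return (clustered - whole) / abs(whole) * 100.0
    return (whole - clustered) / clustered * 100.0


# "Average" row for Softimpute under MAR (aggregation dataset).
whole = {"Time": 0.131, "R2": -0.720, "RMSE": 12.553, "MAE": 9.803}
clustered = {"Time": 0.150, "R2": 0.460, "RMSE": 6.874, "MAE": 5.032}

result = {
    name: improvement_pct(whole[name], clustered[name],
                          higher_is_better=(name == "R2"))
    for name in whole
}

for name, value in result.items():
    print(f"{name}: {value:.2f}%")
```

Under this convention a negative time improvement means that imputation on the clustered data was slower than on the whole data.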

Fig. 6

R² score Improvement percentage of the proposed algorithm over the compared algorithms (aggregation dataset).

Fig. 7

RMSE and MAE Improvement percentage of the proposed algorithm over the compared algorithms (aggregation dataset).

Fig. 8

R² score Improvement percentage of the proposed algorithm over the compared algorithms (flame dataset).

Fig. 9

RMSE and MAE Improvement percentage of the proposed algorithm over the compared algorithms (flame dataset).

Fig. 10

R² score Improvement percentage of the proposed algorithm over the compared algorithms (Jain dataset).

Fig. 11

RMSE and MAE Improvement percentage of the proposed algorithm over the compared algorithms (Jain dataset).

Fig. 12

R² score Improvement percentage of the proposed algorithm over the compared algorithms (pathbased dataset).

Fig. 13

RMSE and MAE Improvement percentage of the proposed algorithm over the compared algorithms (pathbased dataset).

The prominent observations are:

For most algorithms, imputation on the whole data takes less time than imputation on the clustered data.

For the flame dataset: Applying the imputation algorithms to the clusters gives better results than applying them to the whole data, with three exceptions: the mode algorithm gives slightly better results with the whole data in MAR, the Mice algorithm gives slightly better results in MNAR, and the Missforest algorithm gives slightly better results in MCAR and MAR. In MAR, the R² score of the proposed technique is better than that of the compared algorithms except for the mode and Missforest algorithms, its RMSE is better than that of all the compared algorithms except for the mode algorithm, and its MAE is better than that of all the compared algorithms except for the mode and Missforest algorithms. In MCAR, the R² score, RMSE, and MAE of the proposed technique are better than those of all the compared algorithms except for the Missforest algorithm. In MNAR, the R² score, RMSE, and MAE of the proposed technique are better than those of all the compared algorithms except for the Mice algorithm.

For the Jain dataset: The mean algorithm gives better MAE in the three mechanisms. The median algorithm gives better results with the clustered data. The mode algorithm gives better results with the whole data in MAR. The KNN, IterativeImputer, IterativeSVD, and Softimpute algorithms give better results with the clustered data. The Mice algorithm gives better RMSE and MAE in MAR and MNAR, and a better R² score in MAR. The Forimp algorithm gives better results with the clustered data in the three mechanisms. The Missforest algorithm gives better results with the whole data in the three mechanisms. In MAR, the R² score, RMSE, and MAE of the proposed technique are better than those of the compared algorithms except for the mode and Missforest algorithms; in addition, the RMSE and MAE of the Mice algorithm are better than those of the proposed technique. In MCAR, the R² score, RMSE, and MAE of the proposed technique are better than those of all the compared algorithms except for the Missforest algorithm. In MNAR, the R² score, RMSE, and MAE of the proposed technique are better than those of all the compared algorithms except for the Mice and Missforest algorithms.

For the pathbased dataset: The mean, median, KNN, IterativeImputer, IterativeSVD, Softimpute, and Forimp algorithms give better results with the clustered data in the three mechanisms. The mode algorithm gives better results with the whole data in the three mechanisms. The Mice and Missforest algorithms give better results with the whole data in MAR and with the clustered data in MCAR and MNAR. In MAR, the R² score, RMSE, and MAE of the proposed technique are better than those of the compared algorithms except for the mode, Mice, and Missforest algorithms. In MCAR and in MNAR, the R² score, RMSE, and MAE of the proposed technique are better than those of all the compared algorithms except for the mode algorithm.

For the aggregation dataset: All ten algorithms (mean, median, mode, KNN, IterativeImputer, IterativeSVD, Softimpute, Mice, Forimp, and Missforest) give better results with the clustered data in the three mechanisms. In MAR, MCAR, and MNAR, the R² score, RMSE, and MAE of the proposed technique are better than those of the compared algorithms.

4 Summary and conclusion

Data preprocessing has a considerable impact on statistical analysis. Handling missing data is an important stage of data preprocessing and is therefore of great significance in data analysis. The problem with existing methods for handling missing values is that they deal with the whole data and ignore its characteristics (e.g., clustered data). This paper investigated the effect of imputing the missing values using ten popular imputation algorithms, and proposed a preprocessing phase that benefits from the similarities within the data to improve the accuracy of the imputation. The proposed method applies the imputation algorithm within each cluster. The comparative study covered ten imputation algorithms under two settings, applied within each cluster and applied to the whole data, on four datasets with different numbers of clusters, sizes, and shapes. The empirical study showed better effectiveness from the point of view of imputation time, RMSE, MAE, and R² score.
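
The idea of imputing within each cluster rather than over the whole data can be illustrated with a minimal sketch. It is not the implementation used in the experiments: it assumes the cluster labels are already known, uses plain mean imputation as the per-cluster imputer, and runs on a made-up one-feature toy column. The helper names (`impute_mean`, `impute_per_cluster`, `rmse`) are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def impute_mean(column):
    """Replace None entries with the mean of the observed entries.

    Assumes at least one observed value in the column.
    """
    fill = mean([v for v in column if v is not None])
    return [fill if v is None else v for v in column]


def impute_per_cluster(column, labels):
    """Apply impute_mean separately to the rows of each cluster."""
    result = list(column)
    for label in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == label]
        filled = impute_mean([column[i] for i in idx])
        for i, value in zip(idx, filled):
            result[i] = value
    return result


def rmse(truth, imputed, missing_idx):
    """RMSE between the removed original values and the imputed ones."""
    return math.sqrt(mean([(truth[i] - imputed[i]) ** 2 for i in missing_idx]))


# Toy column with two well-separated groups (assumed labels 0 and 1).
truth = [1.0, 2.0, 3.0, 2.5, 21.0, 22.0, 23.0, 21.5]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
missing_idx = [3, 7]  # values removed to simulate missingness
column = [None if i in missing_idx else v for i, v in enumerate(truth)]

rmse_whole = rmse(truth, impute_mean(column), missing_idx)
rmse_clustered = rmse(truth, impute_per_cluster(column, labels), missing_idx)
print(f"whole data RMSE: {rmse_whole:.3f}")
print(f"per-cluster RMSE: {rmse_clustered:.3f}")
```

On this toy column the global mean falls between the two groups, so the whole-data error is much larger than the per-cluster error. Real data need not behave this way, as the mode and Missforest results above show.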

References

1. Norazian Ramli M.N., Yahaya A.S., Ramli N.A., Yusof N.F.F.M. and Abdullah M.M.A., Roles of imputation methods for filling the missing values: A review, Adv Environ Biol 7(12) (2013), 3861–3869.

2. Luengo, García and Herrera, On the choice of the best imputation methods for missing values considering three groups of classification methods, Knowl Inf Syst 32(1) (2012), 77–108, doi: 10.1007/s10115-011-0424-2

3. Choudhury S.J. and Pal N.R., Imputation of missing data with neural networks for classification, Knowledge-Based Syst 182 (2019), doi: 10.1016/j.knosys.2019.07.009

4. Razavi-Far, Cheng, Saif and Ahmadi, Similarity-learning information-fusion schemes for missing data imputation, Knowledge-Based Syst 187 (2020), 9–12, doi: 10.1016/j.knosys.2019.06.013

5. Jordanov, Petrov and Petrozziello, Classifiers Accuracy Improvement Based on Missing Data Imputation, J Artif Intell Soft Comput Res 8(1) (2018), 31–48, doi: 10.1515/jaiscr-2018-0002

6. Enders C.K., Applied Missing Data Analysis, Guilford Press, New York London, 2010.

7. Mostafa S.M. and Amano, Dynamic Round Robin CPU Scheduling Algorithm Based on K-Means Clustering Technique, Appl Sci 10(15) (2020), 1–14, doi: 10.3390/app10155134

8. Rubin D.B., Inference and missing data, Biometrika 63(3) (1976), 581–592, doi: 10.2307/2335739

9. Wei, Wang, Jia, Chen, Ni and Jia, GSimp: A Gibbs sampler based left-censored missing value imputation approach for metabolomics studies, PLoS Comput Biol 14(1) (2018), 1–14, doi: 10.1371/journal.pcbi.1005973

10. […] A.W., Siah K.W. and Wong C.H., Machine Learning with Statistical Imputation for Predicting Drug Approval, Harvard Data Sci Rev, 2019, doi: 10.1162/99608f92.5c5f0525

11. Mostafa S.M., Imputing missing values using cumulative linear regression, CAAI Trans Intell Technol 4(3) (2019), 182–200, doi: 10.1049/trit.2019.0032

12. Pigott T.D., A Review of Methods for Missing Data, Educ Res Eval 7(4) (2001), 353–383, doi: 10.1076/edre.7.4.353.8937

13. Kalkan Ö.K., Kara and Kelecioğlu, Evaluating Performance of Missing Data Imputation Methods in IRT Analyses, Int J Assess Tools Educ 5(3) (2018), 403–416, doi: 10.21449/ijate.430720

14. Masconi K.L., Matsha T.E., Erasmus R.T. and Kengne A.P., Effects of different missing data imputation techniques on the performance of undiagnosed diabetes risk prediction models in a mixed-ancestry population of South Africa, PLoS One 10(9) (2015), 1–12, doi: 10.1371/journal.pone.0139210

15. Lakshminarayan, Harp S.A. and Samad, Imputation of missing data in industrial databases, Appl Intell 11(3) (1999), 259–275, doi: 10.1023/A:1008334909089

16. Horton N.J. and Kleinman K.P., Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models, Am Stat 6(1) (2007), 79–90, doi: 10.1198/000313007X172556

17. Chaudhry, Li, Basri and Patenaude, A Method for Improving Imputation and Prediction Accuracy of Highly Seasonal Univariate Data with Large Periods of Missingness, Wirel Commun Mob Comput 2019 (2019), doi: 10.1155/2019/4039758

18. Farhangfar, Kurgan L.A. and Pedrycz, A Novel Framework for Imputation of Missing Values in Databases, IEEE Trans Syst Man, Cybern - Part A Syst. Humans 37(5) (2007), 692–709, doi: 10.1109/TSMCA.2007.902631

19. Little and Rubin, Statistical Analysis with Missing Data, Second Edition, John Wiley Sons, Inc B Ser Ser Probab Stat, 2012, doi: 10.1002/9781119013563

20. Royston, Multiple imputation of missing values, Stata J 4(3) (2004), 227–241, [Online]. Available: https://econpapers.repec.org/RePEc:tsj:stataj:v:4:y:2004:i:3:p:227-241.

21. Storlie C.B., Therneau T.M., Carter R.E., Chia, Bergquist J.R., Huddleston J.M. and Romero-Brufau, Prediction and Inference With Missing Data in Patient Alert Systems, J Am Stat Assoc 115(529) (2020), 32–46, doi: 10.1080/01621459.2019.1604359

22. Albayrak A.M., Turhan and Kurt, A Missing Data Imputation Approach Using Clustering and Maximum Likelihood Estimation, 2017 Med Technol Natl Congr (TIPTEKNO), Trabzon, (2017), 1–4, doi: 10.1109/TIPTEKNO.2017.8238064

23. Scheffer, Dealing with Missing Data, Res Lett Inf Math Sci 3 (2002), 153–160.

24. Mander and Clayton, Hotdeck imputation, Stata Tech Bull Repr 9(51) (2000), 196–199.

25. Mucherino, Papajorgji P.J. and Pardalos P.M., K-nearest neighbor classification, Data Min Agric Springer, (2009), 83–106.

26. Kim, Ko and Kim, Analysis and impact evaluation of missing data imputation in day-ahead PV generation forecasting, Appl Sci 9(1) (2019), 1–18, doi: 10.3390/app9010204

27. Baraldi A.N. and Enders C.K., An introduction to modern missing data analyses, J Sch Psychol 48(1) (2010), 5–37, doi: 10.1016/j.jsp.2009.10.001

28. Rubin D.B., Formalizing subjective notions about the effect of nonrespondents in sample surveys, J Am Stat Assoc 72(359) (1977), 538–543, doi: 10.2307/2286214

29. Campion W.M. and Rubin D.B., Multiple Imputation for Nonresponse in Surveys, J Mark Res 26(4) (1989), 485, doi: 10.2307/3172772

30. Knorr Held, Analysis of Incomplete Multivariate Data, Schafer J.L., Chapman Hall, London, Stat Med 19(7) (2000), 1006–1008, doi: 10.1002/(SICI)1097-0258(20000415)19:7 1006::AID-SIM3843.0.CO;2-T.

31. Batista G.E.A.P.A. and Monard M.C., An analysis of four missing data treatment methods for supervised learning, Appl Artif Intell 17(5–6) (2003), 519–533, doi: 10.1080/713827181

32. Aieb, Madani, Scarpa, Bonacorso and Lefsih, A new approach for processing climate missing databases applied to daily rainfall data in Soummam watershed, Algeria, Heliyon 5(2) (2019), e01247, doi: 10.1016/j.heliyon.2019.e01247

33. Troyanskaya, Cantor, Sherlock, Brown, Hastie, Tibshirani, Botstein and Altman R.B., Missing value estimation methods for DNA microarrays, Bioinformatics 17(6) (2001), 520–525, doi: 10.1093/bioinformatics/17.6.520

34. […], Deogun, Spaulding and Shuart, Towards missing data imputation: A study of fuzzy K-means clustering method, Int Conf Rough Sets Curr Trends Comput RSCTC 2004 Lect Notes Comput Sci Springer, Berlin, Heidelberg. 3066, pp. 573–579, 2004, doi: 10.1007/978-3-540-25929-9_70

35. Shao, Cold Deck and Ratio Imputation, Surv Methodol 26(1) (2000), 79–85.

36. Cismondi, Fialho A.S., Vieira S.M., Reti S.R., Sousa J.M.C. and Finkelstein S.N., Missing data in medical databases: Impute, delete or classify? Artif Intell Med 58(1) (2013), 63–72, doi: 10.1016/j.artmed.2013.01.003

37. Hapfelmeier, Hothorn, Ulm and Strobl, A new variable importance measure for random forests with missing data, Stat Comput 24(1) (2014), 21–34, doi: 10.1007/s11222-012-9349-1

38. Batista and Monard M.C., A Study of K-Nearest Neighbour as an Imputation Method, HIS’02 2nd Int Conf Hybrid Intell Syst 87 (2002), 251–260, [Online]. Available: http://conteudo.icmc.usp.br/pessoas/gbatista/files/his2002.pdf.

39. Aydilek I.B. and Arslan, A hybrid method for imputation of missing values using optimized fuzzy c-means with support vector regression and a genetic algorithm, Inf Sci (Ny) 233 (2013), 25–35, doi: 10.1016/j.ins.2013.01.021

40. Qin, Zhang, Zhu, Zhang and Zhang, Semi-parametric optimization for missing data imputation, Appl Intell 27(1) (2007), 79–88, doi: 10.1007/s10489-006-0032-0

41. Chen, Twycross and Garibaldi J.M., A new accuracy measure based on bounded relative error for time series forecasting, PLoS One 12(3) (2017), 1–23, doi: 10.1371/journal.pone.0174202

42. Acuña and Rodriguez, The Treatment of Missing Values and its Effect on Classifier Accuracy, Banks D., McMorris F.R., Arab. P., Gaul W. Classif. Clust. Data Min. Appl. Stud. Classif. Data Anal. Knowl. Organ. Springer, Berlin, Heidelberg, (2004), 639–647.

43. Muñoz J.F. and Rueda, New imputation methods for missing data using quantiles, J Comput Appl Math 232(2) (2009), 305–317, doi: 10.1016/j.cam.2009.06.011

44. Honghai, Guoshun, Cheng, Bingru and Yumei, A SVM Regression Based Approach to Filling in Missing Values, Proc Khosla R, Howlett R.J., Jain L.C. Knowledge-Based Intell. Inf. Eng. Syst. KES 2005, Lect. Notes Comput. Sci. Springer, Berlin, Heidelberg 3683 (2005), 581–587.

45. Pelckmans, De Brabanter, Suykens J.A.K. and De Moor, Handling missing values in support vector machine classifiers, Neural Networks 18(5–6) (2005), 684–692, doi: 10.1016/j.neunet.2005.06.025

46. Mostafa S.M. and Amano, Effect of clustering data in improving machine learning model accuracy, J Theor Appl Inf Technol 97(21) (2019), 2973–2981.

47. Mostafa S.M., Missing Data Imputation by the Aid of Features Similarities, Int. J. Big Data Manag 1(1) (2020), 81–103, doi: 10.1504/ijbdm.2019.10025856

48. F. P. and S. S, Clustering basic benchmark-Aggregation. http://cs.joensuu.fi/sipu/datasets/Aggregation.txt (accessed Mar. 15, 2020).

49. Gionis, Mannila and Tsaparas, Clustering aggregation, ACM Trans. Knowl. Discov. from Data 1(1) (2007), 1–30, doi: 10.1109/ICDE.2005.34

50. Fränti and Sieranoja, Clustering basic benchmark-flame. http://cs.joensuu.fi/sipu/datasets/flame.txt (accessed May 25, 2020).

51. […] and Medico, FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data, BMC Bioinformatics 8(3) (2007), 1–15, doi: 10.1186/1471-2105-8-3

52. Fränti and Sieranoja, Clustering basic benchmark-jain. http://cs.joensuu.fi/sipu/datasets/jain.txt (accessed May 25, 2020).

53. Jain A.K. and Law M.H.C., Data clustering: A user’s dilemma, Proc. International Conf. Pattern Recognit. Mach. Intell. Springer, Berlin, Heidelberg 3776 (2005), 1–10, doi: 10.1007/11590316_1

54. Fränti and Sieranoja, Clustering basic benchmark-pathbased. http://cs.joensuu.fi/sipu/datasets/pathbased.txt (accessed May 25, 2020).

55. Chang and Yeung D.Y., Robust path-based spectral clustering, Pattern Recognit 41(1) (2008), 191–203, doi: 10.1016/j.patcog.2007.04.010

56. van Buuren and Groothuis-Oudshoorn, Multivariate Imputation by Chained Equations in, J Stat Softw 45(3) (2011), doi: 10.18637/jss.v045.i03

57. Donders, van der Heijden, Stijnen and Moons K.G.M., Review: A gentle introduction to imputation of missing values, J Clin Epidemiol 59(10) (2006), 1087–1091, doi: 10.1016/j.jclinepi.2006.01.014

58. Alessandro Barbiero G.M. and Alda Ferrari, ForImp: Imputation of Missing Values Through a Forward Imputation Algorithm, cran.r-project, 2015. https://cran.r-project.org/web/packages/ForImp/ (accessed Mar. 15, 2019).

59. Stekhoven D.J., missForest: Nonparametric Missing Value Imputation using Random Forest, cran.r-project, 2013. https://cran.r-project.org/web/packages/missForest/ (accessed Mar. 01, 2019).

60. Iskandr, fancyimpute, GitHub, Inc, 2018. https://github.com/iskandr/fancyimpute (accessed Mar. 17, 2019).

61. Mazumder, Hastie and Tibshirani, Spectral Regularization Algorithms for Learning Large Incomplete Matrices, J Mach Learn Res 18 (2010), 2287–2322, doi: 10.1016/j.surg.2006.10.010.Use

62. Abd Rani N.L., Azid, Abdullah Sani M.S., Samsudin M.S., Ku Yusof K.M.K., Muhammad Amin S.N.S. and Khalit S.I., Development of missing data prediction model for carbon monoxide, Malaysian J Fundam Appl Sci 15(1) (2019), 13–17, doi: 10.11113/mjfas.v15n2019.969


Table 2
Datasets specifications
Dataset name	#Instances	#Features	#Clusters	References
aggregation	788	2	7	[48, 49]
flame	240	2	2	[50, 51]
Jain	373	2	2	[52, 53]
pathbased	300	2	3	[54, 55]