Missing data imputation using correlation coefficient and min-max normalization weighting

Abstract

Missing data is one of the challenges a researcher encounters when attempting to draw information from data, so handling it is a first step in getting the data ready for processing. Much effort has been made in this area. Removing instances with missing data is a popular approach, but it has drawbacks, including bias, that negatively affect the results. How missing values should be handled depends on several factors, including data types, missing rates, missing data patterns, and missing mechanisms: missing at random, missing completely at random, and missing not at random. Other approaches use imputation techniques, which fall into categories such as statistical and machine learning methods. One strategy to improve a model’s output is to weight the feature values to improve the performance of classification or regression approaches. This research developed a new imputation technique called correlation coefficient min-max weighted imputation (CCMMWI). It combines the correlation coefficient and min-max normalization techniques to balance the feature values. The proposed technique seeks to increase the contribution of features by considering how they relate to the target feature. To assess the findings, we compared it with several established techniques: the statistical techniques mean and EM imputation, the machine learning techniques k-NNI and MICE, and the imputation techniques CBRL, CBRC, and ExtraImpute. We used datasets of various sizes, several missing rates, and random missing patterns. Finally, to compare the imputed datasets with the original data, we assess the findings using the root mean squared error (RMSE), mean absolute error (MAE), and $R^{2}$ . According to the findings, the proposed CCMMWI performs better than most other solutions in practically all missing-rate scenarios.

Keywords

Missing data; imputation method; feature weighting; correlation coefficient; data standardization; Min-Max normalization; classification method; regression method

1. Introduction

In the field of data mining and machine learning, the data preprocessing stage is a critical step for preparing the data and addressing any difficulties present in the raw data. Missing data is one of the most prevalent problems encountered when working with data [1]. The dataset’s quality matters because, when the dataset is incomplete, obtaining knowledge becomes a complicated task [2]. Therefore, this issue is dealt with in the preprocessing stage of data mining. It is also essential to understand the reasons for missing data, such as questions left unanswered by respondents because they prefer to avoid answering or consider it unnecessary [3]. Deleting instances or variables with missing data is a well-known way to handle missing data, but this solution can cause issues, such as bias, and negatively affect the results [4]. The classical deletion strategies are listwise and pairwise. Listwise deletion removes any instance with at least one missing value, whereas pairwise deletion uses all the available data and discards only the missing values in each analysis [5]. Another approach is to use imputation methods. The primary goal of missing data imputation is to improve data quality by using observed data to estimate missing values and fill them in using statistical or machine learning methods [6, 7, 8]. Numerous studies on imputing missing values have found that filling in missing data improves data quality and increases the accuracy of model output [9]. The choice of imputation method depends on various factors, such as data size, type of data, missing rate, and the missing mechanism, which refers to how the missing values occur: missing at random, missing completely at random, or missing not at random. Statistical methods include mean, median, and mode strategies, which predict the missing values based on the observed values of each feature.
Other imputation strategies are based on machine learning algorithms where the observed feature values are used to train the model and then predict the missing values. Techniques such as k-NN, SVM, and regression have been proposed to address the missing data problem. Feature normalization is a crucial step that adjusts the range of values to ensure that each feature contributes equally to the ML process, preventing any single feature from exerting undue influence solely due to its magnitude [10, 11]. The min-max method scales the values of each feature to lie between 0 and 1. On the other hand, the z-score method uses the mean and standard deviation to rescale the data, resulting in features with a mean of zero and unit variance. Many studies have explored the various effects of standardization methods on data to improve model performance [12, 13]. Ahsan, et al. [14] reported that no single normalization method can be considered the best, while Sinsomboonthong [15], Henderi, et al. [16], Raju, et al. [17] obtained good results using min-max scaling (0,1) compared to other methods. However, in the literature, several factors play a significant role in choosing the appropriate normalization method, such as the type of data and the specific application. Pires, et al. [9] studied how the imputation of missing data improves human activity monitoring accuracy using k-NN. The researchers found that imputation of missing values and normalization of the data increase the accuracy rate and improve the classification results. Benhar, et al. [18] have studied the impact of preprocessing methods on the classification methods used for heart disease data. One of the preprocessing impacts was handling missing data, while the other was the impact of normalization methods on the classification accuracy. 
This study aims to propose a numeric missing data imputation method called correlation coefficient min-max weighted imputation (CCMMWI) that weights the values of the features using the correlation coefficient and the min-max normalization method. Section 2 discusses missing value issues, such as the missingness mechanism, missing patterns, and missing models, and reviews the imputation methods used in this paper and how those algorithms work. Section 3 presents the proposed method. Section 4 describes the datasets, the experimental setup, and the evaluation criteria. Section 5 presents the analysis and discussion of the results, followed by the research conclusion in Section 6.

2. Related work

Missing data has several mechanisms of missingness, depending on how the values are missing, and it is crucial to comprehend why values are missing [19]. There are three types of missing data: (i) missing completely at random (MCAR), where the probability of missingness depends neither on the missing values themselves nor on the rest of the values [4]; (ii) missing at random (MAR), where the missingness depends only on the observed values [20]; and (iii) missing not at random (MNAR), where the missingness may depend on other variables but also on the missing value itself [4, 20]. The missing data patterns include univariate and multivariate. The univariate pattern occurs when only one variable has missing values; if more than one variable has missing values, the pattern is called multivariate. The missing pattern is called monotone when a variable has no observed values after its first missing value. Otherwise, the missing pattern is called non-monotone or general. The monotone pattern usually occurs in longitudinal data, where the data are collected from the same participants over the time of the study. A pattern of missing data is considered connected if any recorded data point can be reached from any other recorded data point by a series of horizontal or vertical moves [21]. If the dataset’s missing values are distributed randomly across both instances and variables, the pattern is called arbitrary [22]. The missing data pattern types are shown in Fig. 1.

Table 1
Missing data percentage categories [25]

Percentage of data missing	Categories
1%–2%	Negligible
5%–10%	Minor
10%–25%	Moderate
25%–50%	High
$>$ 50%	Excessive

Figure 1.

Missing data pattern (gray is observed, and black is missing).

The missing model is defined by how the ratio of missing values is distributed among the features in the dataset. Two missing models exist: the uniformly distributed (UD) model and the overall model. In the UD model, the percentage of missing values is the same for each feature. In contrast, in the overall model the proportion of missing values differs from feature to feature [23, 24]. The missing rate indicates how many values are missing in a dataset. Based on the missing percentage, the missing rate is categorized into five groups [25], as described in Table 1.
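As an illustration of the MCAR mechanism combined with the UD model, the following minimal sketch removes the same fraction of values from every feature, choosing the rows uniformly at random. The function name `make_mcar_ud` and the toy array are hypothetical and are not taken from the original experiments.

```python
import numpy as np


def make_mcar_ud(data, missing_rate, seed=0):
    """Return a copy of `data` with the same share of NaNs in every column.

    Hypothetical helper: rows are drawn uniformly at random (MCAR) and each
    feature loses the same number of values (UD model).
    """
    rng = np.random.default_rng(seed)
    out = np.asarray(data, dtype=float).copy()
    n_rows, n_cols = out.shape
    n_missing = int(round(missing_rate * n_rows))  # identical count per feature
    for j in range(n_cols):
        # The choice of rows does not depend on any observed or missing value.
        rows = rng.choice(n_rows, size=n_missing, replace=False)
        out[rows, j] = np.nan
    return out


# Toy example: 40 records, 5 features, 25% missing rate ("Moderate" in Table 1).
complete = np.arange(200, dtype=float).reshape(40, 5)
incomplete = make_mcar_ud(complete, missing_rate=0.25)
per_feature_rate = np.isnan(incomplete).mean(axis=0)
print(per_feature_rate)
```

Under the overall model, the per-feature counts would instead be allowed to differ while the total share of missing values is fixed.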

2.1 Missing data imputation methods

Listwise and pairwise deletion of features or instances containing missing values are the simplest ways to handle such values. Listwise deletion eliminates instances with missing values, while pairwise deletion uses all observed data in the analysis, excluding only the missing values [22, 26]. Imputation methods fall into two categories based on the features they use. The first is intransitive, where imputation depends solely on the feature itself (e.g., mean or mode) and not on other features. The second is transitive, where imputation depends on additional features, as in k-NN imputation, EM imputation, and regression imputation [27]. Mean imputation, which replaces missing values with the mean of the observed values, is the most straightforward approach for dealing with missing data. The median and mode can also be used [28]. Mean imputation is acceptable if the missing rate is minor; otherwise, it may introduce bias [29, 30]. Another method is expectation–maximization (EM) [31]. The EM imputation method iterates two steps: the expectation step fills in the missing data using the current parameter estimates, and the maximization step re-estimates the parameters from the completed data. This methodology performs well with massive missing data and a small sample [20]. k-NN (k nearest neighbors) is a widely used nonparametric method in machine learning for classifying data due to its simplicity, generality, and relatively high accuracy [32, 33]. The k-NN imputation (k-NNI) algorithm handles incomplete data by averaging the values of the k nearest neighbors among all complete instances [34, 35]. Another nonparametric imputation method is missForest (MF), proposed by Stekhoven and Bühlmann [36], based on the random forest algorithm. It can handle both numeric and categorical variable types simultaneously and is computationally effective for high-dimensional data [37]. Buuren and Groothuis-Oudshoorn [38] proposed multivariate imputation by chained equations (MICE). 
MICE estimates incomplete datasets through three steps: data imputation, data analysis, and data pooling, employing one of the regression methods in the estimation step [39, 40]. In literature on data analysis, many researchers have explored the impact of missing data on machine learning model output and how to address this issue [6, 41, 42, 43]. Various imputing methods aim to deal with missing values using different techniques, where the primary goal is to increase data availability with good quality. As with many machine learning techniques, data with varying feature value scales must be addressed during preprocessing to prevent feature dominance. This issue has been resolved using one of the standardization and normalization techniques. Rahman and Islam [44] proposed FIMUS, emphasizing the correlation between features when estimating missing values. Sezer and Başeğmez [45] recommend a technique for missing value imputation based on dataset consistency-based feature selection. This technique identifies subsets of features with contradictory values (i.e., identical feature values but different class labels) using a multivariable measure rather than unary measures such as distance, information, or dependence. Alabadla, et al. [46] introduced the ExtraImpute method, using extremely randomized trees (extra trees) to address missing numeric values. CCMVI, proposed by [47], imputes missing values by computing the class center, similar to determining the cluster center in the k-means technique. Nugroho, et al. [48] extended this work by proposing a class-center imputation technique that evaluates the association of features with the imputation stage. Liu et al. [49] proposed a correlation-based hierarchical k-NNI method, utilizing the correlation coefficient among features to impute missing values. Sefidian and Daneshpour [50] introduced ten correlation-based imputation techniques that maximize the connection between a missing feature and the remaining features. 
Manna and Pati [51] proposed an imputation technique based on the similarity of gene expression and the correlation coefficient. Razavi-Far et al. [52] proposed two missing data imputation techniques, kEMI and kEMI+, which are based on the k-NN and EM methods, using k-NN for the pre-imputation stage and the EM algorithm for the posterior-imputation stage. Mostafa et al. [53] proposed two imputation methods based on the Bayesian ridge model: CBRL and CBRC. In CBRL, features with fewer missing values are imputed first, while in CBRC, the priority is based on high correlation with the target. The authors also introduced another method named CBRG, which uses gain ratio feature selection to establish the priority of imputation [54]. Nonetheless, a significant limitation of these methods is that they build the estimator model on raw data, assigning equal weight to all features regardless of their varying importance. Furthermore, prior research estimates the correlation coefficient on a sub-dataset containing only non-missing values. This practice is inadequate for capturing the relationships between features, chiefly because it reduces the dataset size. The smaller number of samples can substantially influence the correlation matrix, particularly for datasets with moderate to high missing rates. Moreover, while prior efforts have concentrated on prioritizing the features with the highest correlation, the strength of feature relationships has not yet been exploited when constructing the model.
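Several of the baseline methods reviewed in this section are available in scikit-learn, and the sketch below shows how they might be configured. It is only an assumption about a reasonable setup, not the configuration used in the experiments: the synthetic data, `n_neighbors=5`, and `max_iter=10` are illustrative choices, and `IterativeImputer` is a MICE-inspired single-imputation routine rather than a full multiple-imputation implementation.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before importing IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.linear_model import BayesianRidge

# Synthetic numeric data (illustrative only) with one strongly correlated feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)

# Remove roughly 10% of the entries completely at random.
mask = rng.random(X.shape) < 0.1
X_missing = X.copy()
X_missing[mask] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),   # intransitive: uses the feature itself
    "k-NNI": KNNImputer(n_neighbors=5),       # transitive: uses neighboring instances
    "MICE-style": IterativeImputer(           # transitive: chained regressions
        estimator=BayesianRidge(), max_iter=10, random_state=0
    ),
}
imputed = {name: imp.fit_transform(X_missing) for name, imp in imputers.items()}

for name, filled in imputed.items():
    rmse = np.sqrt(np.mean((filled[mask] - X[mask]) ** 2))
    print(f"{name}: RMSE on the removed entries = {rmse:.4f}")
```

All three imputers leave the observed entries untouched and only fill the positions that were `NaN`.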

3. Proposed method

The proposed method is divided into two stages. The first stage is called CCMMW, which involves Weighting feature values based on the Correlation Coefficient and the Min-Max normalization method. CCMMW combines two preprocessing strategies to prepare the raw data and enhance its quality. First, it equalizes the contribution of features, and second, it weights the features based on the correlation between the dependent feature (label feature) and the independent features (the rest of the features). Using weighted feature values instead of raw data to build the estimator may enhance the model’s performance and reduce the error rate. The parameter C in CCMMW adjusts the new maximum value of the features in the weighted dataset. Algorithm 1 provides the pseudocode of CCMMW, and the primary mechanism of CCMMW is divided into three steps as follows:

Step 1: Normalize the original data to avoid any dominating features. For this step, the proposed method used min-max (0,1) to rescale feature values to a new range between 0 and 1. Equation (1) is used as the normalization method for this step.

\begin{aligned} x_{i, n}^{'} = \frac{x_{i, n} - \min (x_{i})}{\max (x_{i}) - \min (x_{i})} \cdot (new_{Max} - new_{Min}) + new_{Min} \end{aligned}

(1)

Step 2: Calculate the correlation coefficient between the label feature and each independent feature, so that the values can be weighted based on their correlation with the target. The Pearson correlation coefficient, given in Eq. (2), was chosen because of the numerical nature of the datasets involved in this study [55]. Although normality is an important assumption for Pearson correlation, this study adopts a pragmatic approach and uses Pearson correlation without conducting individual normality tests, for computational efficiency [56]. The decision is grounded in the acknowledgment that Pearson correlation is robust to deviations from normality, especially with large sample sizes. The analysis involves all available pairs in the datasets, recognizing the potential computational burden of testing each pair for normality. Future work may assess the normality of individual pairs and adjust the correlation method accordingly, in order to better understand and adapt to the linear relationships between continuous variables [57, 58].

\begin{aligned} r = \frac{\sum_{i = 1}^{m} (x_{i} - \bar{x}) (y_{i} - \bar{y})}{\sqrt{\sum_{i = 1}^{m} {(x_{i} - \bar{x})}^{2}} \sqrt{\sum_{i = 1}^{m} {(y_{i} - \bar{y})}^{2}}} \end{aligned}

(2)

Step 3: Use Eq. (3) to weight the values and create a new weighted dataset.

\begin{aligned} v^{'} = \left( \frac{V_{i, j} - {mi}_{j}}{{ma}_{j} - {mi}_{j}} (n_{ma} - n_{mi}) + n_{mi} \right) + C \left( \frac{V_{i, j} - {mi}_{j}}{{ma}_{j} - {mi}_{j}} (n_{ma} - n_{mi}) + n_{mi} \right) \cdot Corr (V_{j}, V_{label}) \end{aligned}

(3)

where $v^{'}$ is the new weighted value; $V_{i, j}$ is each original value; ${mi}_{j}$ is the minimum value of column $j$ ; ${ma}_{j}$ is the maximum value of column $j$ ; $C$ is the parameter to adjust the new range; $n_{mi}$ is the new minimum value (0); $n_{ma}$ is the new maximum value (1); and $Corr (V_{j}, V_{label})$ is the CC between column $j$ and the label column using Pearson.
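The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration of Eqs (1) to (3), not the authors' Algorithm 1: the function name `ccmmw`, the guard for constant columns, and the toy data are assumptions made for the example.

```python
import numpy as np


def ccmmw(values, label, c=1.0, n_mi=0.0, n_ma=1.0):
    """Weight feature values following Eqs (1)-(3); hypothetical helper."""
    values = np.asarray(values, dtype=float)
    label = np.asarray(label, dtype=float)

    # Step 1, Eq. (1): min-max normalization of every column to [n_mi, n_ma].
    mi = values.min(axis=0)
    ma = values.max(axis=0)
    span = np.where(ma > mi, ma - mi, 1.0)  # assumption: avoid dividing by zero
    normalized = (values - mi) / span * (n_ma - n_mi) + n_mi

    # Step 2, Eq. (2): Pearson correlation of every column with the label.
    xc = values - values.mean(axis=0)
    yc = label - label.mean()
    denom = np.sqrt((xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    corr = np.divide(xc.T @ yc, denom, out=np.zeros_like(denom), where=denom > 0)

    # Step 3, Eq. (3): add the correlation-scaled copy of the normalized value.
    weighted = normalized + c * normalized * corr
    return weighted, corr


# Toy data: column 0 rises with the label, column 1 falls with it.
toy_values = np.array([[1.0, 10.0], [2.0, 8.0], [3.0, 6.0], [4.0, 4.0]])
toy_label = np.array([0.0, 1.0, 2.0, 3.0])
toy_weighted, toy_corr = ccmmw(toy_values, toy_label, c=1.0)
print(toy_corr)
print(toy_weighted)
```

With $n_{mi} = 0$ and $n_{ma} = 1$, Eq. (3) reduces to the normalized value multiplied by $1 + C \cdot Corr (V_{j}, V_{label})$, so the maximum of a weighted column becomes $1 + C \cdot Corr (V_{j}, V_{label})$. In the toy data, the perfectly correlated column is stretched to the range 0 to 2, while the perfectly anti-correlated column collapses to 0 when $C = 1$.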

In the proposed work, we employ the CCMMW approach to impute the missing values of each feature. After the missing values across the entire dataset have been initially filled, features are arranged in descending order of their correlation. We assume that the summation of the correlations of each feature serves as an indicator, where a higher summation signifies a stronger correlation with the other features. An overview of the CCMMWI approach is provided before delving into each step. The CCMMWI pseudocode is given in Algorithm 2. Assume that the input $D$ is a complete dataset containing $n$ records, each described by $m$ features, $F_{1}, F_{2}, \dots, F_{m}$ . $R_{i}^{j}$ denotes the $j$ th feature of the $i$ th record in $D$ . $D_{missing}$ is an incomplete dataset in which the missing values are generated randomly using the MCAR mechanism and the UD model, in which each feature has the same missing rate. $D_{initial}$ is $D_{missing}$ with the missing data filled in using meanImpute. To predict the missing values in $D_{missing}$ , we calculate Corr using Eq. (2). ${Corr}_{i}^{j}$ is the CC between $F_{i}$ and $F_{j}$ . ${Corr}_{sum}$ denotes, for each feature, the summation of its CC values with the other features; a higher summation indicates that the feature has many highly correlated features. ${att}_{index}$ holds the indices of the features in descending order of their values in ${Corr}_{sum}$ , starting with the feature with the highest summation. CCMMWI is a label-based method in which the weight of each feature is based on its correlation with the target. For each feature in this order, CCMMW is used to weight $D_{initial}$ to produce $D_{weighted}$ , with $F_{i}$ set as Y, where $i \in \{1, 2, 3, \dots, m\}$ , and the rest of the features as X.
Next, $D_{weighted}$ is split into $X_{training}$ and $X_{testing}$ , where $X_{testing}$ includes all instances with missing values in $F_{i}$ . Finally, the model is fitted on $X_{training}$ and $y_{training}$ and used to predict the missing values from $X_{testing}$ . The estimated values are updated in $D_{imputed}$ , and the process is repeated until the feature with the lowest correlation summation has been processed. Fig. 2 shows the flow of the method.
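The loop described above can be sketched as follows. This is one possible reading of the description, not the authors' Algorithm 2, and it makes several assumptions that the text leaves open: the correlations are computed on the mean-initialized data, ${Corr}_{sum}$ is the signed sum excluding the self-correlation, the target column is kept on its original scale, and values imputed for earlier features are reused when later features are processed. The names `weight_features` and `ccmmwi_sketch` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge


def weight_features(x, y, c=1.0):
    """Eq. (3) with n_mi = 0 and n_ma = 1: scaled + c * scaled * Corr(column, y)."""
    mi, ma = x.min(axis=0), x.max(axis=0)
    scaled = (x - mi) / np.where(ma > mi, ma - mi, 1.0)
    xc, yc = x - x.mean(axis=0), y - y.mean()
    denom = np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = np.divide(xc.T @ yc, denom, out=np.zeros_like(denom), where=denom > 0)
    return scaled + c * scaled * corr


def ccmmwi_sketch(d_missing, c=1.0):
    """Impute NaNs feature by feature in descending order of correlation sum."""
    d_missing = np.asarray(d_missing, dtype=float)
    mask = np.isnan(d_missing)
    # D_initial: fill every missing entry with the mean of its feature.
    d_imputed = np.where(mask, np.nanmean(d_missing, axis=0), d_missing)
    # Corr_sum: per feature, the sum of its CCs with the other features.
    corr = np.corrcoef(d_imputed, rowvar=False)
    corr_sum = corr.sum(axis=0) - 1.0      # drop the self-correlation
    att_index = np.argsort(-corr_sum)      # highest summation first
    for i in att_index:
        rows = mask[:, i]                  # instances with F_i missing
        if not rows.any() or rows.all():
            continue
        y = d_imputed[:, i]                # F_i plays the role of the label Y
        weighted = weight_features(np.delete(d_imputed, i, axis=1), y, c)
        model = BayesianRidge().fit(weighted[~rows], y[~rows])
        d_imputed[rows, i] = model.predict(weighted[rows])
    return d_imputed


# Synthetic check (illustrative only): four features driven by one latent factor.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))
truth = z * np.array([1.0, 2.0, -1.0, 0.5]) + 0.05 * rng.normal(size=(200, 4))
missing_mask = rng.random(truth.shape) < 0.1
holes = np.where(missing_mask, np.nan, truth)

filled = ccmmwi_sketch(holes)
mean_filled = np.where(missing_mask, np.nanmean(holes, axis=0), holes)
rmse_ccmmwi = np.sqrt(np.mean((filled - truth)[missing_mask] ** 2))
rmse_mean = np.sqrt(np.mean((mean_filled - truth)[missing_mask] ** 2))
print(rmse_ccmmwi, rmse_mean)
```

Because the synthetic features are strongly correlated, the regression-based estimates should land much closer to the removed values than the column means do; on weakly correlated data the gap would be smaller.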

Figure 2.

The flow of CCMMWI method.

4. Experiments

4.1 Datasets

This study employed the ten UCI datasets listed in Table 2 to assess the effectiveness of the proposed imputation method. Only numeric datasets, with integer and continuous values, are used.

Table 2
Datasets with various sizes

No	Dataset	Type of data	# Instances	# Features	# Classes
1	Breast Cancer 1	Real	569	30	2
2	QSAR	Real	1055	41	2
3	SONAR	Real	208	60	2
4	PARKINSON	Real	195	22	2
5	Wine	Integer $+$ real	6463	12	2
6	Musk	Int	6598	166	2
7	wholesale	Int	440	8	2
8	Spam	Real	4601	57	2
9	Heart 2	Integer $+$ real	1190	11	2
10	Vehicle	Integer	846	18	4

4.2 Running the experiments

In this experiment, the proposed method is compared with MeanImp, k-NNI, EMI, and MICE imputation. Also, the evaluation section used three other imputation methods: CBRL, CBRC [53], and ExtraImpute [46]. The experiment steps are shown in Fig. 3.

Figure 3.

The experiment steps.

Based on Fig. 3, three evaluation metrics are used to evaluate the generated imputed datasets by comparing the original complete dataset with the imputed dataset [59]. The first two are error metrics, mean absolute error (MAE) and root mean squared error (RMSE), as in Eqs (4) and (5). MAE is the statistical error measure between paired observations that reflect the same phenomenon. RMSE is the standard way in regression analysis to measure the quality of the model’s fit. Finally, the coefficient of determination ( $R^{2}$ ) is a statistical measure representing the proportion of variance in the dependent variable explained by the independent variable(s) in a regression model. $R^{2}$ quantifies the goodness of fit of a regression model and typically ranges between 0 and 1. A value of 0 indicates that the model does not explain any variance in the dependent variable, while a value of 1 signifies that it explains all of the variance. A value between 0 and 1 indicates that the model explains only part of the variance. When computed with Eq. (6), $R^{2}$ can also be negative if the estimates fit worse than the mean of the observed values. $R^{2}$ is often used to evaluate the strength and usefulness of a regression model. A high $R^{2}$ value indicates that the model can accurately predict the dependent variable based on the independent variable(s). $R^{2}$ is obtained by Eq. (6), where RSS is the sum of squared residuals and TSS is the total sum of squares, which measures the variation in the dependent variable.

\begin{aligned} MAE & = \frac{1}{n} \sum_{i = 1}^{n} | y_{i} - x_{i} | = \frac{1}{n} \sum_{i = 1}^{n} | e_{i} | \end{aligned}

(4)

\begin{aligned} RMSE & = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - \hat{y}_{i})}^{2}} \end{aligned}

(5)

\begin{aligned} R^{2} & = 1 - \frac{R S S}{T S S} \end{aligned}

(6)
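Eqs (4) to (6) translate directly into NumPy, as in the minimal sketch below. The function names and the toy vectors are hypothetical, and whether the metrics are computed over the whole dataset or only over the imputed entries is left to the caller, since the text does not specify it.

```python
import numpy as np


def mae(original, imputed):
    """Eq. (4): mean absolute error."""
    return np.mean(np.abs(original - imputed))


def rmse(original, imputed):
    """Eq. (5): root mean squared error."""
    return np.sqrt(np.mean((original - imputed) ** 2))


def r2(original, imputed):
    """Eq. (6): 1 - RSS / TSS."""
    rss = np.sum((original - imputed) ** 2)           # sum of squared residuals
    tss = np.sum((original - original.mean()) ** 2)   # total sum of squares
    return 1.0 - rss / tss


original = np.array([1.0, 2.0, 3.0, 4.0])
close_estimate = np.array([1.0, 2.0, 3.0, 6.0])
poor_estimate = np.array([4.0, 3.0, 2.0, 1.0])

print(mae(original, close_estimate))   # 0.5
print(rmse(original, close_estimate))  # 1.0
print(r2(original, close_estimate))    # approximately 0.2
print(r2(original, poor_estimate))     # -3.0
```

The last line shows that Eq. (6) yields a negative value when RSS exceeds TSS, which is how an entry such as the MICE result for the Wine dataset in Table 7 can fall below zero.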

In this experiment, all code was written in Python 3.7 using Jupyter (Anaconda 3), and the scikit-learn library was used to implement all the techniques. All experiments were performed on numeric datasets. The results were evaluated by the RMSE, MAE, and $R^{2}$ , with each experiment repeated ten times and the results averaged for further analysis. In both the MICE and CCMMWI methods, the Bayesian ridge regressor was used to predict the missing values.

5. Analysis and discussion

In this section, the results are presented in two ways. First, we present the RMSE, MAE, and $R^{2}$ for each dataset. Second, we report the average performance at each missing rate to evaluate the overall effectiveness of the methods. Six missing rates, covering a range of categories (5%, 10%, 25%, 40%, 50%, and 65%), were employed. The experiment at each missing rate was repeated ten times.

5.1 Results based on RMSE

This section displays the results of applying various imputation techniques to datasets with varying levels of missing data, as measured by the RMSE. It reflects the discrepancy between the predicted and actual values in the data and is a commonly used metric for evaluating the performance of statistical models, including imputation techniques.

5.1.1 RMSE of each dataset

This section discusses the outcomes for each dataset where different missing rates are applied. The mean, EM, k-NNI, MICE, CBRL, CBRC, ExtraImpute (EXT.Imp), and CCMMWI techniques were used to impute the missing values.

Table 3
Average RMSE based on datasets

Dataset	Mean	EM	k-NNI	MICE	CBRL	CBRC	EXT.Imp	CCMMWI
Breast	0.0746	0.1038	0.0632	0.0343	0.0420	0.0437	0.0449	0.0343
QSAR	0.0679	0.0919	0.0561	0.0524	0.0554	0.0574	0.0509	0.0486
Sonar	0.1101	0.1543	0.0855	0.0739	0.0834	0.0868	0.0836	0.0697
Parkinson	0.0953	0.1320	0.0879	0.0689	0.0694	0.0729	0.0665	0.0624
Wine	0.0546	0.0764	0.0510	0.0952	0.0455	0.0456	0.0452	0.0437
Musk	0.1183	0.1602	0.0392	0.0402	0.0579	0.0590	0.0332	0.0309
Wholesale	0.1051	0.1466	0.1126	0.0712	0.0876	0.0906	0.0989	0.0743
Spambase	0.0298	0.0396	0.0298	0.0297	0.0279	0.0279	0.0283	0.0287
Heart	0.1558	0.2145	0.1549	0.1585	0.1446	0.1445	0.1519	0.1540
Vehicle	0.1009	0.1400	0.0702	0.0576	0.0658	0.0680	0.0642	0.0550

As depicted in Table 3, CCMMWI and MICE tie for the lowest RMSE on the Breast dataset, both returning 0.0343. On the QSAR, Sonar, Parkinson, Wine, Musk, and Vehicle datasets, CCMMWI yields a lower RMSE than all other methods. Conversely, MICE achieves the lowest RMSE on the Wholesale dataset (0.0712), CBRL and CBRC share the lowest RMSE on the Spambase dataset (0.0279), and CBRC provides the best value on the Heart dataset (0.1445). Overall, CCMMWI achieves the lowest RMSE on seven of the ten datasets, counting the tie with MICE on Breast. The EM method returns the highest RMSE on every dataset except Wine, where MICE is higher (0.0952 versus 0.0764), making it the least effective method overall.

Table 4

Average RMSE based on missing rates

MR	Mean	EM	k-NNI	MICE	CBRL	CBRC	EXT.Imp	CCMMWI
5%	0.0384	0.0528	0.0277	0.0228	0.0248	0.0263	0.0218	0.0188
10%	0.0542	0.0739	0.0397	0.0340	0.0364	0.0379	0.0328	0.0280
25%	0.0856	0.1180	0.0656	0.0601	0.0602	0.0621	0.0571	0.0504
40%	0.1091	0.1499	0.0875	0.0817	0.0805	0.0828	0.0788	0.0716
50%	0.1218	0.1682	0.1025	0.0942	0.0932	0.0951	0.0941	0.0845
65%	0.1384	0.1931	0.1271	0.1160	0.1126	0.1136	0.1159	0.1076

5.1.2 RMSE of missing rates

According to the findings in Table 4, the CCMMWI method yields the lowest RMSE of all the imputation techniques at every missing rate tested, covering the minor, moderate, high, and excessive categories. This indicates that it may be more accurate at imputing missing values, since a low error rate corresponds to estimated values that closely align with the original data. Although MICE, CBRL, and CBRC use the same estimator model as CCMMWI, Bayesian ridge, their error rates differ because CCMMWI adjusts the contribution of the features based on their correlation with the label feature.

5.2 Results based on MAE

This section displays the results of applying various imputation techniques to datasets with varying levels of missing data, as measured by the MAE. The MAE measures the difference between a dataset’s predicted and observed values. It is commonly employed to evaluate the performance of statistical models, including those used for imputing missing data.

Table 5
Average MAE based on datasets

Dataset	Mean	EM	k-NNI	MICE	CBRL	CBRC	EXT.Imp	CCMMWI
Breast	0.0326	0.0455	0.0261	0.0139	0.0169	0.0175	0.0178	0.0138
QSAR	0.0236	0.0364	0.0177	0.0177	0.0182	0.0185	0.0155	0.0165
Sonar	0.0520	0.0715	0.0391	0.0350	0.0387	0.0402	0.0385	0.0326
Parkinson	0.0417	0.0586	0.0370	0.0258	0.0280	0.0298	0.0259	0.0243
Wine	0.0234	0.0331	0.0215	0.0286	0.0189	0.0189	0.0185	0.0182
Musk	0.0570	0.0755	0.0134	0.0174	0.0240	0.0244	0.0088	0.0117
Wholesale	0.0365	0.0519	0.0347	0.0222	0.0282	0.0288	0.0287	0.0207
Spambase	0.0073	0.0141	0.0061	0.0070	0.0064	0.0064	0.0061	0.0072
Heart	0.0710	0.0913	0.0631	0.0673	0.0618	0.0619	0.0605	0.0644
Vehicle	0.0475	0.0638	0.0315	0.0241	0.0275	0.0285	0.0250	0.0223

5.2.1 MAE of datasets

According to Table 5, CCMMWI yields the lowest MAE on six of the ten datasets (Breast, Sonar, Parkinson, Wine, Wholesale, and Vehicle), indicating that it may be more accurate at imputing missing values. Table 5 also shows that the EXT.Imp method achieves the lowest MAE on the QSAR, Musk, Spambase, and Heart datasets, sharing the lowest value on Spambase (0.0061) with k-NNI. As with RMSE, the EM method performs worst, producing the highest MAE on every dataset.

Table 6
Average MAE based on missing rates

MR	Mean	EM	k-NNI	MICE	CBRL	CBRC	EXT.Imp	CCMMWI
5%	0.0061	0.0083	0.0037	0.0028	0.0033	0.0036	0.0025	0.0025
10%	0.0121	0.0165	0.0074	0.0059	0.0069	0.0073	0.0055	0.0051
25%	0.0302	0.0414	0.0200	0.0171	0.0186	0.0192	0.0158	0.0148
40%	0.0484	0.0665	0.0339	0.0304	0.0320	0.0329	0.0285	0.0270
50%	0.0606	0.0835	0.0447	0.0402	0.0418	0.0428	0.0388	0.0360
65%	0.0783	0.1089	0.0643	0.0591	0.0585	0.0592	0.0560	0.0538

5.2.2 MAE of missing rates

Table 6 reports the MAE of each imputation technique at each missing rate, averaged over the datasets. CCMMWI attains the lowest MAE at every missing rate; only at the 5% missing rate does EXT.Imp match it (0.0025).
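As a rough illustration of how a per-missing-rate MAE of this kind can be computed, the following Python sketch removes values completely at random at each rate, imputes them, and compares the imputed matrix with the original. It is a sketch under stated assumptions, not the pipeline used in this study: the data are synthetic, column-mean imputation stands in for the compared methods, the MAE is taken over all cells of the matrix, and the helper names `mask_mcar`, `mean_impute`, and `mae` are hypothetical.

```python
import numpy as np

def mask_mcar(X, rate, rng):
    """Copy X and set roughly `rate` of its cells to NaN, uniformly at random (MCAR)."""
    X_missing = X.astype(float).copy()
    mask = rng.random(X.shape) < rate
    X_missing[mask] = np.nan
    return X_missing

def mean_impute(X_missing):
    """Fill each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(X_missing, axis=0)
    return np.where(np.isnan(X_missing), col_means, X_missing)

def mae(X_true, X_imputed):
    """Mean absolute error over all cells (assumption: observed cells are included)."""
    return float(np.mean(np.abs(X_true - X_imputed)))

rng = np.random.default_rng(0)
X = rng.random((200, 5))  # synthetic complete data in [0, 1], not one of the study's datasets

rates = [0.05, 0.10, 0.25, 0.40, 0.50, 0.65]  # missing rates used in Table 6
mae_by_rate = {r: mae(X, mean_impute(mask_mcar(X, r, rng))) for r in rates}

for r, err in mae_by_rate.items():
    print(f"{r:.0%}: MAE = {err:.4f}")
```

Under this assumption the observed cells contribute zero error, so the MAE grows with the missing rate, which is consistent with the trend in Table 6.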

5.3 Based on $R^{2}$ results

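The coefficient of determination ($R^{2}$) measures how well the imputed values fit the original data. Its maximum value is 1, which indicates a perfect fit; a value of 0 corresponds to a fit no better than the mean of the original values, and negative values indicate a fit that is worse than that. In this section, we compare all methods by their $R^{2}$ values, where higher values indicate better performance.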
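For reference, the standard definition is $R^{2} = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared differences between the original and estimated values and $SS_{tot}$ is the sum of squared deviations of the original values from their mean. The short Python sketch below uses a hypothetical helper `r_squared` and made-up toy values (not data from this study) to show the three regimes: a perfect fit, a mean-only fit, and a fit poor enough to give a negative value.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])              # toy "original" values

perfect = r_squared(y, y)                        # estimates equal the originals: 1.0
mean_only = r_squared(y, np.full(4, y.mean()))   # every estimate is the mean: 0.0
poor = r_squared(y, y[::-1])                     # estimates in reversed order: -3.0

print(perfect, mean_only, poor)
```

The last case shows that $R^{2}$ has no lower bound: estimates that are worse than the mean make $SS_{res}$ exceed $SS_{tot}$.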

Table 7
Average $R^{2}$ based on datasets

Dataset	Mean	EM	k-NNI	MICE	CBRL	CBRC	EXT.Imp	CCMMWI
Breast	0.6935	0.4023	0.7556	0.9106	0.8774	0.8703	0.8602	0.9080
QSAR	0.6828	0.4161	0.7495	0.7820	0.7728	0.7621	0.7922	0.8088
Sonar	0.6760	0.3606	0.7723	0.8144	0.7850	0.7741	0.7840	0.8290
Parkinson	0.6821	0.3867	0.7256	0.8495	0.8409	0.8198	0.8437	0.8671
Wine	0.7001	0.4077	0.7325	-1.4137	0.7926	0.7915	0.7838	0.7984
Musk	0.6768	0.4655	0.9505	0.9411	0.9028	0.9000	0.9613	0.9670
Wholesale	0.7030	0.4564	0.6857	0.7511	0.7650	0.7587	0.7249	0.7734
Spambase	0.6812	0.4404	0.6612	0.6659	0.7119	0.7112	0.6845	0.6878
Heart	0.7019	0.4237	0.6980	0.6690	0.7441	0.7437	0.7015	0.7063
Vehicle	0.6896	0.4035	0.8083	0.8741	0.8499	0.8428	0.8546	0.8860

5.3.1 $R^{2}$ of datasets

According to Table 7, three methods attain the highest $R^{2}$ on at least one dataset: MICE on the Breast dataset, CBRL on the Spambase and Heart datasets, and CCMMWI on the remaining seven of the ten datasets. MICE yields a negative $R^{2}$ (-1.4137) on the Wine dataset. A negative $R^{2}$ means that the imputed values fit the original data worse than their mean would, which indicates a poor fit; this can occur when the model is overly complex and fails to capture the patterns in the data.
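The counts above can be checked directly against Table 7. The following Python snippet is only a convenience check: the numbers are transcribed from the table, and the variable names are illustrative.

```python
from collections import Counter

methods = ["Mean", "EM", "k-NNI", "MICE", "CBRL", "CBRC", "EXT.Imp", "CCMMWI"]

# Average R^2 per dataset, transcribed from Table 7 (same column order as `methods`).
r2_by_dataset = {
    "Breast":    [0.6935, 0.4023, 0.7556, 0.9106, 0.8774, 0.8703, 0.8602, 0.9080],
    "QSAR":      [0.6828, 0.4161, 0.7495, 0.7820, 0.7728, 0.7621, 0.7922, 0.8088],
    "Sonar":     [0.6760, 0.3606, 0.7723, 0.8144, 0.7850, 0.7741, 0.7840, 0.8290],
    "Parkinson": [0.6821, 0.3867, 0.7256, 0.8495, 0.8409, 0.8198, 0.8437, 0.8671],
    "Wine":      [0.7001, 0.4077, 0.7325, -1.4137, 0.7926, 0.7915, 0.7838, 0.7984],
    "Musk":      [0.6768, 0.4655, 0.9505, 0.9411, 0.9028, 0.9000, 0.9613, 0.9670],
    "Wholesale": [0.7030, 0.4564, 0.6857, 0.7511, 0.7650, 0.7587, 0.7249, 0.7734],
    "Spambase":  [0.6812, 0.4404, 0.6612, 0.6659, 0.7119, 0.7112, 0.6845, 0.6878],
    "Heart":     [0.7019, 0.4237, 0.6980, 0.6690, 0.7441, 0.7437, 0.7015, 0.7063],
    "Vehicle":   [0.6896, 0.4035, 0.8083, 0.8741, 0.8499, 0.8428, 0.8546, 0.8860],
}

# Method with the highest R^2 on each dataset, then the number of wins per method.
best = {name: methods[row.index(max(row))] for name, row in r2_by_dataset.items()}
wins = Counter(best.values())

print(best)
print(wins)
```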

Table 8
Average $R^{2}$ based on missing rates

MR	Mean	EM	k-NNI	MICE	CBRL	CBRC	EXT.Imp	CCMMWI
5%	0.9516	0.9098	0.9713	0.9335	0.9763	0.9744	0.9798	0.9833
10%	0.9044	0.8224	0.9422	0.8484	0.9511	0.9469	0.9583	0.9645
25%	0.7618	0.5529	0.8403	0.5410	0.8687	0.8626	0.8767	0.8936
40%	0.6138	0.2762	0.7116	0.4593	0.7698	0.7579	0.7706	0.7943
50%	0.5181	0.0992	0.6175	0.4323	0.6962	0.6833	0.6811	0.7265
65%	0.3825	-0.1628	0.4408	0.2926	0.5634	0.5598	0.5281	0.5769

5.3.2 $R^{2}$ of missing rates

Table 8 presents the $R^{2}$ of each imputation technique at each missing rate, averaged over the datasets. CCMMWI produces the highest average $R^{2}$ at every missing rate, which suggests that it is the most effective of the evaluated techniques for filling in missing data in the datasets examined. The normalized weighted values contribute positively to this result: the normalization procedure prevents any value from dominating, and the weighting boosts the contribution of the feature values to the estimator.
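To make the idea of normalized correlation weights concrete, here is a minimal Python sketch. It is an assumption-laden illustration rather than the CCMMWI algorithm itself, whose exact formulation is given in the proposed-method section: it assumes Pearson correlation on complete data, takes absolute values, and rescales them to [0, 1] with min-max normalization. The function name `minmax_correlation_weights` and the synthetic data are hypothetical.

```python
import numpy as np

def minmax_correlation_weights(X, target_col):
    """Min-max normalized |Pearson r| between each feature and the target column.

    Returns one weight per non-target column, in column order.
    """
    corr = np.corrcoef(X, rowvar=False)                   # columns are variables
    r = np.abs(np.delete(corr[target_col], target_col))   # drop the target's self-correlation
    span = r.max() - r.min()
    if span == 0:                                         # all features equally correlated
        return np.ones_like(r)
    return (r - r.min()) / span

# Synthetic, complete data (hypothetical): column 0 plays the role of the target feature.
rng = np.random.default_rng(42)
target = rng.normal(size=500)
strong = target + 0.1 * rng.normal(size=500)   # strongly related to the target
medium = target + 1.0 * rng.normal(size=500)   # moderately related
noise = rng.normal(size=500)                   # unrelated
X = np.column_stack([target, strong, medium, noise])

weights = minmax_correlation_weights(X, target_col=0)
print(dict(zip(["strong", "medium", "noise"], weights.round(3))))
```

Because every weight lies in the same [0, 1] range, no single feature can dominate through scale alone. Plain min-max scaling, as used in this sketch, maps the least correlated feature to a weight of exactly 0.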

6. Conclusions

Handling missing data is a crucial part of real-world data analysis, and the main goal is to retain all observed data for subsequent analysis; effective strategies for estimating the missing values are therefore essential. In this study, we examined several previously published approaches to handling missing data and evaluated them on diverse datasets with varying missing rates. We introduced CCMMWI, a new algorithm that imputes missing data using normalized weights derived from correlation coefficients. This weighting strategy seeks to increase the contribution of features by considering their relationships with the target feature. The results show that CCMMWI generally outperforms Mean, EM, k-NNI, MICE, CBRL, CBRC, and ExtraImpute in terms of RMSE and MAE, and that it achieves higher $R^{2}$ values, which further supports its effectiveness in imputing missing data. For future research, we recommend applying the proposed algorithm with additional regression methods, since this study used only Bayesian ridge regression to estimate the missing values. The algorithm could also be applied to other missing data mechanisms, as this study focused exclusively on MCAR. We also intend to explore alternative scaling methods beyond min-max normalization in future work.

Acknowledgments

We are very grateful to Universiti Kebangsaan Malaysia for supporting this research, especially the Data Mining and Optimization (DMO) Research Lab of the Fakulti Teknologi dan Sains Maklumat (FTSM). This research was funded by Universiti Kebangsaan Malaysia under grant code GGP-2020-032.

References

1. Khan I.U., Javaid, Taylor C.J., Gamage K.A., Big data analytics for electricity theft detection in smart grids, in 2021 IEEE Madrid PowerTech (2021), 1–6.
2. Santos M.S., Pereira R.C., Costa A.F., Soares J.P., Santos, Abreu P.H., Generating synthetic missing data: A review by missing mechanism, IEEE Access 7 (2019), 11651–11667.
3. S.F., Chang C.Y., Lee S.J., Time series forecasting with missing values, in 2015 1st International Conference on Industrial Networks and Intelligent Systems (INISCom) (2015), 151–156.
4. Camino R.D., Hammerschmidt C.A., State, Improving missing data imputation with deep generative models, arXiv preprint arXiv:1902.10666 (2019).
5. MIT Critical Data, Secondary analysis of electronic health records, Springer Nature (2016).
6. Chlioui, Abnane, Idri, Comparing statistical and machine learning imputation techniques in breast cancer classification, in Computational Science and Its Application–ICCSA 2020: 20th International Conference, Cagliari, Italy, July 1-4, 2020, Proceedings, Part IV 20 (2020), 61–76.
7. Yan, Yuan, Yang, A Discrete Missing Data Imputation Method Based on Improved Multi-layer Perceptron, in 2021 11th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS) 1 (2021), 480–484.
8. Raja P.S., Thangavel K.J.S.C., Missing value imputation using unsupervised machine learning techniques, Soft Computing 24(6) (2020), 4361–4392.
9. Pires I.M., Hussain, Garcia N.M., Zdravevski, Improving human activity monitoring by imputation of missing sensory data: experimental study, Future Internet 12(9) (2020), 155.
10. Manimekalai, Kavitha, Missing value imputation and normalization techniques in myocardial infarction, ICTACT Journal on Soft Computing 8(3) (2018), 8.
11. Alshdaifat E.A., Alshdaifat D.A., Alsarhan, Hussein, El-Salhi S.M.D.F.S., The effect of preprocessing techniques, applied to numeric features, on classification algorithms’ performance, Data 6(2) (2021), 11.
12. Rajeswari, Thangavel, The performance of data normalization techniques on heart disease datasets, International Journal of Advanced Research in Engineering and Technology 11(12) (2020), 2350–2357.
13. Singh, Investigating the impact of data normalization on classification performance, Applied Soft Computing 97 (2020), 105524.
14. Ahsan M.M., Mahmud, Saha P.K., Gupta K.D., Siddique, Effect of data scaling methods on machine learning algorithms and model performance, Technologies 9(3) (2021), 52.
15. Sinsomboonthong, Performance Comparison of New Adjusted Min-Max with Decimal Scaling and Statistical Column Normalization Methods for Artificial Neural Network Classification, International Journal of Mathematics and Mathematical Sciences 2022(1) (2022), 3584406.
16. Henderi, Wahyuningsih, Rahwanto, Comparison of Min-Max normalization and Z-Score Normalization in the K-nearest neighbor (kNN) Algorithm to Test the Accuracy of Types of Breast Cancer, International Journal of Informatics and Information Systems 4(1) (2021), 13–20.
17. Raju V.G., Lakshmi K.P., Jain V.M., Kalidindi, Padma, Study the influence of normalization/transformation process on the accuracy of supervised classification, in 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT) (2020), 729–735.
18. Benhar, Idri, Fernández-Aleán J.L., Data preprocessing for heart disease classification: A systematic literature review, Computer Methods and Programs in Biomedicine 195 (2020), 105635.
19. Gelman, Missing-data imputation, Data analysis using regression and multilevel/hierarchical models (2006), 529–543.
20. Pratama, Permanasari A.E., Ardiyanto, Indrayani, A review of missing values handling methods on time-series data, in 2016 international conference on information technology systems and innovation (ICITSI) (2016), 1–6.
21. Van Buuren, Flexible imputation of missing data, CRC Press (2018).
22. Yenduri, Iyengar S.S., Performance evaluation of imputation methods for incomplete datasets, International Journal of Software Engineering and Knowledge Engineering 17(1) (2007), 127–152.
23. Deb, Wee-Chung Liew, A correlation based imputation method for incomplete traffic accident data, in PRICAI 2014: Trends in Artificial Intelligence: 13th Pacific Rim International Conference on Artificial Intelligence, Gold Coast, QLD, Australia, December 1–5, 2014. Proceedings 13 (2014), 905–912.
24. Rahman M.G., Islam M.Z., A decision tree-based missing value imputation technique for data preprocessing, in The 9th Australasian Data Mining Conference: AusDM 2011 (2011), 41–50.
25. Widaman K.F., Best practices in quantitative methods for developmentalists: III. Missing data: What to do with or without them, Monographs of the Society for Research in Child Development (2006).
26. Kang, The prevention and handling of the missing data, Korean Journal of Anesthesiology 64(5) (2013), 402.
27. Farhangfar, Kurgan L.A., Pedrycz, A novel framework for imputation of missing values in databases, IEEE Transactions on Systems, Man and Cybernetics-Part A: Systems and Humans 37(5) (2007), 692–709.
28. Dhevi A.S., Imputing missing values using Inverse Distance Weighted Interpolation for time series data, in 2014 Sixth international conference on advanced computing (ICoAC) (2014), 255–259.
29. Eekhout, De Vet H.C., Twisk J.W., Brand J.P., de Boer M.R., Heymans M.W., Missing data in a multiitem instrument were best handled by multiple imputation at the item score level, Journal of clinical epidemiology 67(3) (2014), 335–342.
30. Lin W.C., Tsai C.F., Missing value imputation: a review and analysis of the literature (2006–2017), Artificial Intelligence Review 53 (2020), 1487–1509.
31. Bańbura, Modugno, Maximum likelihood estimation of factor models on datasets with arbitrary pattern of missing data, Journal of Applied Econometrics 29(1) (2014), 133–160.
32. Hassanat A.B., Abbadi M.A., Altarawneh G.A., Alhasanat A.A., Solving the problem of the K parameter in the KNN classifier using an ensemble learning approach, arXiv preprint arXiv:1409.0919 (2014).
33. García-Pedrajas, Del Castillo J.A.R., Cerruela-Garía, A proposal for local k values for k-nearest neighbor rule, IEEE Transactions on Neural Networks and Learning Systems 28(2) (2015), 470–475.
34. Chen, Shao, Nearest neighbor imputation for survey data, Journal of Official Statistics 16(2) (2000), 113.
35. Zhang, Nearest neighbor selection for iteratively kNN imputation, Journal of Systems and Software 85(11) (2012), 2541–2552.
36. Stekhoven D.J., Bühlmann, MissForest – non-parametric missing value imputation for mixed-type data, Bioinformatics 28(1) (2012), 112–118.
37. Muharemi, Logofătu, Leon, Review on general techniques and packages for data imputation in R on a real world dataset, in Computational Collective Intelligence: 10th International Conference, ICCCI 2018, Bristol, UK, September 5–7, 2018, Proceedings, Part II (2018), 386–395.
38. Van Buuren, Groothuis-Oudshoorn, mice: Multivariate imputation by chained equations in R, Journal of Statistical Software 45 (2011), 1–67.
39. Chhabra, Vashisht, Ranjan, A comparison of multiple imputation methods for data with missing values, Indian Journal of Science and Technology (2017).
40. Rubin D.B., Statistical matching using file concatenation with adjusted weights and multiple imputations, Journal of Business & Economic Statistics 4(1) (1986), 87–94.
41. Batista G.E., Monard M.C., An analysis of four missing data treatment methods for supervised learning, Applied Artificial Intelligence 17(5–6) (2003), 519–533.
42. Little R.J., Rubin D.B., Statistical Analysis with Missing Data 793 (2019).
43. Platias, Petasis, A comparison of machine learning methods for data imputation, in 11th Hellenic Conference on Artificial Intelligence (2020), 150–159.
44. Rahman M.G., Islam M.Z., Fimus: A framework for imputing missing values using co-appearance, correlation and similarity analysis, Knowledge-Based Systems 56 (2014), 311–327.
45. Sezer, Başeğmez, An approach based on feature selection for missing value imputation, in International Conference on Intelligent and Fuzzy Systems (2021), 945–950.
46. Alabadla, Sidi, Ishak, Ibrahim, Affendey L.S., Hamdan, ExtraImpute: A Novel Machine Learning Method for Missing Data Imputation, Journal of Advances in Information Technology (2022), url: https://api.semanticscholar.org/CorpusID:252774169.
47. Tsai C.F., M.L., Lin W.C., A class center based approach for missing value imputation, Knowledge-Based Systems 151 (2018), 124–135.
48. Nugroho, Utama N.P., Surendro, Class center-based firefly algorithm for handling missing data, Journal of Big Data 8(1) (2021), 37.
49. Liu, Lai, Zhang, A hierarchical missing value imputation method by correlation-based K-nearest neighbors, in Intelligent Systems and Applications: Proceedings of the 2019 Intelligent Systems Conference (IntelliSys) Volume 1 (2020), 486–496.
50. Sefidian A.M., Daneshpour, Estimating missing data using novel correlation maximization based methods, Applied Soft Computing 91 (2020), 106249.
51. Manna, Pati S.K., Missing value imputation using correlation coefficient, in Computational Intelligence in Pattern Recognition: Proceedings of CIPR 2020 (2020), 551–558.
52. Razavi-Far, Cheng, Saif, Ahmadi, Similarity-learning information-fusion schemes for missing data imputation, Knowledge-Based Systems 187 (2020), 104805.
53. Mostafa S.M., Eladimy A.S., Hamad, Amano, CBRL and CBRC: Novel algorithms for improving missing value imputation accuracy based on Bayesian ridge regression, Symmetry 12(10) (2020), 1594.
54. Mostafa S.M., Eladimy A.S., Hamad, Amano, CBRG: A novel algorithm for handling missing data using bayesian ridge regression and feature selection based on gain ratio, IEEE Access 8 (2020), 216969–216985.
55. Schober, Boer, Schwarte L.A., Correlation coefficients: appropriate use and interpretation, Anesthesia & Analgesia 126(5) (2018), 1763–1768.
56. Gogtay N.J., Thatte U.M., Principles of correlation analysis, Journal of the Association of Physicians of India 65(3) (2017), 78–81.
57. Witte R.S., Witte J.S., Statistics, John Wiley & Sons (2017).
58. Bonett D.G., Wright T.A., Sample size requirements for estimating Pearson, Kendall and Spearman correlations, Psychometrika 65 (2000), 23–28.
59. Chai, Draxler R.R., Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature, Geoscientific Model Development 7(3) (2014), 1247–1250.
