Enhanced regression model using cluster based sampling techniques

Abstract

This paper develops a methodology for enhancing regression models using cluster-based sampling techniques (CST) to achieve high predictive accuracy; the methodology can also be used to handle large datasets. Hard clustering (KMeans clustering) or soft clustering (Fuzzy C-Means) is used to generate samples, called clusters, which in turn are used to generate Local Regression Models (LRMs) for the given dataset. These LRMs are combined to create a Global Regression Model. This methodology is known as the Enhanced Regression Model (ERM). The performance of the proposed approach is tested on five different datasets. The experimental results reveal that the proposed methodology yields better predictive accuracy than the non-hybrid MLR model; in addition, Fuzzy C-Means performs better than the KMeans clustering algorithm for sample selection. Thus, ERM has the potential to handle data with uncertainty and complex patterns while producing high prediction accuracy.

Keywords

Clustering, KMeans, fuzzy c-means, multiple linear regression, regression, sampling methods

1 Introduction

The Regression Model (RM) is a common, frequently used statistical predictive technique in many research fields. A Multiple Linear Regression (MLR) model is a statistical tool that models the linear relationship between a set of predictors and a dependent variable, and predicts the dependent variable from the fitted relationship [1–3]. In general, choosing the right sample size for RM development is critical to achieving high prediction accuracy. An insufficient sample size is a further problem, because it leads to a partial regression model rather than a generalized regression model for the entire dataset. Choosing a suitable data sampling technique for a given dataset is therefore key to robust regression model development.

Data sampling is the process of using a small amount of data to obtain the overall characteristics of the whole dataset. This process allows for the selection of a subgroup of individual items from a population in order to assess the characteristics of the entire population. The sampling techniques frequently used in research are Sequential sampling, Random sampling, Cluster sampling, Systematic sampling, and Stratified sampling [10].

In Sequential sampling, a sequence of one or more individuals is taken from some part of the dataset to form a sample for analysis [11].

Simple random sampling is the most basic form of probability sampling. In this technique, each individual item to be studied has an equal chance of being selected, and researchers use some random process to select members. It provides an unbiased and excellent estimate of the parameters [12].

Cluster sampling [12] is a probability sampling method in which the whole dataset is divided into groups (also called clusters). Some of these clusters are chosen at random and used for sampling. All individuals in the selected clusters are included in the sample. It is used to study large datasets.

In Systematic sampling [12], the first unit of the sample is randomly selected, and the subsequent units are systematically selected. This is a probability sampling technique in which members of the population are selected at regular intervals to create the sample.

In the Stratified sampling method [12], the total population is divided into smaller groups or strata for the sampling process. The strata are formed based on some common features in the dataset. In Oversampling, the minority class samples are repeated randomly, whereas in Undersampling, the majority class samples are removed randomly to balance the data [13]. In the Reservoir sampling method [14], individuals from a dataset of unknown size are inserted into a reservoir of predefined size.
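The sampling schemes described above can be illustrated with a minimal Python sketch. It is not taken from the cited works: the function names, the `seed` parameter, and the toy dataset are assumptions made for illustration, and the reservoir routine uses the standard "Algorithm R" formulation.

```python
import random
from collections import defaultdict

def simple_random_sample(items, n, seed=0):
    """Simple random sampling: every item has the same chance of selection."""
    return random.Random(seed).sample(items, n)

def systematic_sample(items, step, seed=0):
    """Systematic sampling: random first unit, then every `step`-th unit."""
    start = random.Random(seed).randrange(step)
    return items[start::step]

def stratified_sample(items, key, fraction, seed=0):
    """Stratified sampling: draw the same fraction from each stratum given by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

def reservoir_sample(stream, k, seed=0):
    """Reservoir sampling: keep k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = item     # replace with probability k / (i + 1)
    return reservoir

data = list(range(100))                  # hypothetical toy dataset
random_part = simple_random_sample(data, 10)
systematic_part = systematic_sample(data, 10)
stratified_part = stratified_sample(data, key=lambda x: x % 2, fraction=0.1)
reservoir_part = reservoir_sample(iter(data), 5)
```

In this sketch the stratified sample draws 10% from the even and the odd items separately, so both strata are represented in proportion to their size.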

The advantage of using sampling for a regression model is that the model can be developed from a subset of the data without affecting the characteristics and quality of the entire database. In this paper, a cluster-based sampling method is used to sample data in order to obtain better prediction accuracy. K-Means and Fuzzy C-Means clustering techniques are used to create the optimal subsets of data (i.e., samples) for building an efficient regression model.

The main contributions of this paper can be summarized below:

An Enhanced Regression Model (ERM) based on cluster sampling is introduced, using hard and soft clustering techniques.

To improve the predictive accuracy of MLR models, a hybrid methodology is proposed that combines MLR with popular clustering techniques (KMeans, Fuzzy C-Means, or a combination of both). The proposed ERM is less prone to prediction error.

Experimental results of the proposed methodology are presented in this paper to demonstrate its effectiveness over existing techniques.

The rest of this paper is structured as follows. Section 2 reviews related work. Section 3 introduces the background needed to understand the cluster-based sampling techniques and the metrics used to assess the performance of the proposed work. Section 4 gives a detailed description of the proposed regression approach using cluster-based sampling. Section 5 discusses the results for various datasets. Finally, Section 6 summarizes the proposed work and its possible future extension.

2 Literature review

This section briefly reviews current work related to this research. When modeling with MLR, the minimum required sample size must be determined [4]. Usually, the minimum sample size for almost all types of multivariate analyses is determined using a standard rule of thumb for MLR. The most commonly used principle for determining the minimum required sample size is based on the ratio between the number of cases and the number of independent variables [5, 6].

Tabachnick and Fidell have used the formula “50 + 8 × number of factors” to calculate the minimum required sample size [7]. Gregory and Daniel (2008) have formulated a rule to identify the minimum sample size required for an MLR-based forecasting system [8]. Nowadays, it is possible to use software to identify the exact sample size required for the analysis of multivariate data. The authors of [9] calculated the minimum sample sizes required for the MLR model using Power and Sample Size software (PASS). Most statisticians determine the minimum sample size required for a dataset from experience, whereas researchers from other domains may opt for simple principles or rules of thumb to calculate it.

Albattah [15] explores the importance of sampling in the analysis of big data. The author believes that there is no need to handle the whole dataset if the same task can be done with less data. Experimental results have shown that the samples were produced with reduced computational time, which in most cases led to better results. In the sampling process, fixing the sample size is a very important factor, and a number of techniques have been used to determine sample size in the literature [16–19].

The authors of [20] have proposed a Divided Regression model for larger datasets, in which a sampling technique divides the larger dataset into small subsets of data to reduce the computational burden.
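As a worked illustration of the rule of thumb attributed to [7], the following minimal Python sketch computes the minimum sample size from the number of predictors. The function name and the example value of 13 predictors (a dataset with 14 variables, one of which is the dependent variable) are assumptions for illustration only.

```python
def min_sample_size(num_predictors: int) -> int:
    """Rule of thumb: at least 50 + 8 * (number of predictors) cases."""
    return 50 + 8 * num_predictors

# Hypothetical example: 13 predictors need at least 154 cases under this rule.
required = min_sample_size(13)
```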

Josien K et al [21] have used Fuzzy C-Means (FCM) for cluster-based sampling. The authors have evaluated sampling methods with respect to the representativeness and performance of the sampled data. For that purpose, they have used statistical measures (such as mean and variance), and performance measures (such as accuracy rate, precision, and recall).

In [22], a multi-model modeling approach based on FCM and support vector regression was proposed, along with adaptive mutation particle swarm optimization. The results showed that the multi-model approach solved problems such as wide ranges of operating conditions, nonlinearity, and prediction difficulty.

In [23], the technique proposed by the authors has two main phases: clustering technique is used in the first phase, and then regression techniques are used for prediction of compressive strength on these clusters in the second phase. They have concluded that FCM with regression techniques gave minimum errors for concrete strength prediction.

The authors of [24] proposed an Enhanced Logistic Regression model for larger datasets using the K-Means clustering technique and stated that their model has higher predictive accuracy after the sampling process.

The literature review reveals that there is no benchmark sampling technique for multiple linear regression (MLR) modeling. Furthermore, there is no rule of thumb to identify the most appropriate sampling technique for developing an MLR model that achieves the best results. Therefore, the objective of this work is to investigate the feasibility of finding optimal data samples for generating MLR models using KMeans, Fuzzy C-Means, or a combination of both, in order to reduce errors and improve prediction accuracy.

3 Methods and materials

3.1 Multiple linear regression

It is a popular technique for the prediction task. It uses several independent (explanatory) variables to predict the outcome of a dependent (response) variable. Its purpose is to model the linear relationship between the dependent variable and independent variables.

The matrix version of MLR is shown in Fig. 1. In Fig. 1, X is a matrix of independent variables of size n × 2, Y is a column vector of the dependent variable of size n × 1, β is a column vector of coefficient values of size 2 × 1, and ɛ is a column vector of error values of size n × 1. The pseudo code of MLR is given in Algorithm 1.

Fig. 1

Components of MLR.

Algorithm 1: MLR
Sub Module MLR (X, Y)
Step 1. Calculate the coefficient matrix using
$B^{'} = [\begin{matrix} β_{0} \\ β_{1} \\ ⋮ \\ β_{k} \end{matrix}] = {(X^{'} X)}^{- 1} X^{'} Y$
where X′ is the transpose of X.
Step 2. The predicted value of Y for a given X is
Y′ = X B′
Step 3. The residuals are defined as: res_i = Y_i − Y′_i
End
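Algorithm 1 can be sketched in Python with NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, an intercept column of ones is prepended to X (an assumption about how β_0 is handled), and `numpy.linalg.lstsq` is used in place of the explicit inverse because it solves the same least-squares problem more stably.

```python
import numpy as np

def mlr_fit(X, y):
    """Step 1: least-squares coefficients, equivalent to (X'X)^-1 X'Y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xd = np.column_stack([np.ones(len(X)), X])   # column of ones for beta_0
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

def mlr_predict(X, beta):
    """Step 2: predicted values Y' = X B'."""
    X = np.asarray(X, dtype=float)
    Xd = np.column_stack([np.ones(len(X)), X])
    return Xd @ beta

# Synthetic, noise-free data (illustrative only): y = 1 + 2*x1 - 3*x2
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

beta = mlr_fit(X, y)
residuals = y - mlr_predict(X, beta)             # Step 3: res = Y - Y'
```

Because the synthetic data are noise-free, the fitted coefficients recover the generating values and the residuals vanish up to floating-point error.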

3.2 Cluster based sampling techniques

It is a type of probability sampling technique in which the entire dataset is divided into several clusters (groups). These clusters are known as sample units. These clusters have homogeneous properties and have an equal probability of being part of the sample.

3.3 Sampling using K-Means clustering algorithm

The KMeans clustering technique [25] is a simple and widely used partitional clustering algorithm. Initially, k centroids are fixed at random, where k is the user-specified number of clusters. Each tuple in the dataset is assigned to the closest cluster, so that the intra-cluster distance is minimal. After the tuples have been assigned, the centroid of each cluster is updated. The assignment and update steps are repeated until the cluster memberships or the centroids remain unchanged. Algorithm 2 gives the formal description of the K-Means algorithm.

Algorithm 2: KMeans Clustering: KMeans(D, k)
Input: ‘D’, dataset; ‘k’, number of clusters (samples)
Output: ‘k’ clusters (samples)
Step 1: Initialize ‘k’ centroids (centers of the clusters) randomly.
Step 2: Do
• Assign each individual item to the closest cluster
• Update the centroid of each cluster as the mean of all the individual items in the cluster
Until the centroids of the ‘k’ clusters converge
Step 3: Output the non-overlapping clusters as sample units
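The steps of Algorithm 2 can be sketched in NumPy as shown below. This is a minimal illustration under stated assumptions: Euclidean distance is used, the initial centroids are drawn from the data points, and `max_iter`, `seed`, and the two-blob toy data are hypothetical choices that do not come from the paper.

```python
import numpy as np

def kmeans(D, k, max_iter=100, seed=0):
    """Return hard cluster labels and centroids for dataset D."""
    D = np.asarray(D, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2a: assign every item to its closest centroid
        dist = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 2b: move each centroid to the mean of its members
        new_centroids = np.array([
            D[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # convergence check
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs of 20 points each (illustrative data)
rng = np.random.default_rng(42)
blobs = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
                   rng.normal(5.0, 0.1, size=(20, 2))])
labels, centroids = kmeans(blobs, k=2)
# Step 3: non-overlapping sample units, one index array per cluster
samples = [np.where(labels == j)[0] for j in range(2)]
```

Each row of the dataset appears in exactly one of the returned sample units, which is what distinguishes this hard-clustering sampler from the fuzzy variant.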

3.4 Sampling using fuzzy C-Means clustering algorithm

In Fuzzy C-Means clustering [21], each tuple may belong to more than one cluster. The relation of a tuple to each cluster is measured by fuzzy membership values. The matrix containing all of these fuzzy membership values is called the partition matrix U. The elements of the partition matrix range from 0 to 1, and the sum of the membership values of a tuple is always 1. The pseudo code of FCM is given in Algorithm 3.

Algorithm 3: Fuzzy C-Means Clustering: fcm(D, k)
Input: ‘D’, dataset; ‘k’, number of clusters (samples)
Output: ‘k’ clusters (samples)
Step 1. Initialize the cluster centers; set the number of clusters as k (denoted c in the formulas), the fuzzy index as m, and the maximum number of iterations as max_t
Step 2. For t = 1 to max_t
• Update the partition matrix using
$u_{ij} = \frac{1}{\sum_{l = 1}^{c} {(\frac{d_{ij}}{d_{lj}})}^{2 / (m - 1)}}$
where d_ij is the distance between cluster center c_i and data item X_j
• Update the cluster centers using
$c_{i} = \frac{\sum_{j = 1}^{n} u_{ij}^{m} X_{j}}{\sum_{j = 1}^{n} u_{ij}^{m}}$
End for
Step 3: Output the overlapping clusters as sample units
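A minimal NumPy sketch of Algorithm 3 is given below. Several details are assumptions, since the paper does not state them: Euclidean distance, initial centers drawn from the data points, a small floor on distances to avoid division by zero, and a stopping tolerance `tol`. The helper `overlapping_clusters` and its membership `threshold` are hypothetical; they show one possible way to turn the partition matrix into overlapping sample units.

```python
import numpy as np

def fcm(D, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Return the partition matrix U (c x n) and the cluster centers."""
    D = np.asarray(D, dtype=float)
    rng = np.random.default_rng(seed)
    centers = D[rng.choice(len(D), size=c, replace=False)]   # Step 1
    for _ in range(max_iter):                                # Step 2
        # d[i, j]: distance between center i and data item j
        d = np.linalg.norm(centers[:, None, :] - D[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)
        # u_ij = 1 / sum_l (d_ij / d_lj)^(2 / (m - 1))
        ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))
        U = 1.0 / ratio.sum(axis=1)
        # c_i = sum_j u_ij^m X_j / sum_j u_ij^m
        Um = U ** m
        new_centers = (Um @ D) / Um.sum(axis=1, keepdims=True)
        moved = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if moved < tol:
            break
    return U, centers

def overlapping_clusters(U, threshold):
    """Hypothetical rule: item j joins cluster i when u_ij >= threshold."""
    return [np.where(U[i] >= threshold)[0] for i in range(U.shape[0])]

rng = np.random.default_rng(42)
data = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
                  rng.normal(3.0, 1.0, size=(20, 2))])       # illustrative data
U, centers = fcm(data, c=2)
samples = overlapping_clusters(U, threshold=0.3)             # Step 3
```

With a threshold below 1/c, every item falls into at least one cluster and items with mixed memberships fall into several, so the sample units can together contain more rows than the dataset.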

3.5 Qualitative performance indicators

Root Mean Squared Error (RMSE) and the coefficient of determination (R2) are used to evaluate the performance of the regression model. Table 1 summarizes these measures.

Table 1
The qualitative performance measures

Qualitative performance	Formula	Meaning
RMSE	$RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(P_{i} - A_{i})}^{2}}$	It measures the error between a predicted value and an actual value. Lower values of RMSE indicate a better fit. Here n: number of data items; P: predicted values; A: actual values
R²	$R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(A_{i} - P_{i})}^{2}}{\sum_{i = 1}^{n} {(A_{i} - \bar{A})}^{2}}$	R-squared, also known as the coefficient of determination. The first sum of error is the residual sum of squares and the second is the total sum of squares about the mean of the actual values. R-squared typically ranges from 0 to 1 and measures the strength of the relationship between the model and the dependent variable on a 0–100% scale.
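The two measures in Table 1 can be computed with a few lines of plain Python. The sketch assumes the usual reading of R-squared as one minus the ratio of the residual sum of squares to the total sum of squares; the function names and the four-point example are illustrative.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual values."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

def r_squared(predicted, actual):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for p, a in zip(predicted, actual))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]        # only the last prediction is off by 1

error = rmse(predicted, actual)         # sqrt(1 / 4) = 0.5
fit = r_squared(predicted, actual)      # 1 - 1 / 5 = 0.8
```

Under this definition R-squared can fall below zero when a model fits worse than the mean of the actual values.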

4 Proposed approach: Enhanced Regression Model (ERM)

In this model, the first step is to automate the sampling technique for building the regression model. The proposed approach uses cluster-based sampling techniques to select samples (i.e., rows) from the dataset. The soft clustering technique, Fuzzy C-Means (FCM), is used to create overlapping samples (i.e., clusters). The hard clustering technique, KMeans clustering, is used to create non-overlapping samples (i.e., clusters). Secondly, each sample is regressed independently using the Multiple Linear Regression (MLR) model to generate a Local Regression Model (LRM). The pseudo code of MLR is given in Algorithm 1. Thirdly, the best sampling technique is selected using Equation (1) to generate suitable samples for the RM.

The sampling technique with the minimal ‘F’ value is selected as the best sampling technique. The weights w1 and w2 vary the degree of importance of the RMSE and R2 of the RM, respectively. Each weight ranges from 0 to 1, and w1 + w2 should be less than or equal to 1.

$F (x) = w_{1} f_{1} (x) + w_{2} f_{2} (x)$ (1)

$f_{1} (x) = \frac{\sum_{i = 1}^{nc} RMSE (LRM (Sample_{i}))}{nc}$ (2)

$f_{2} (x) = \frac{\sum_{i = 1}^{nc} R\text{-}Squared (LRM (Sample_{i}))}{nc}$ (3)

In Equations (2) and (3), nc denotes the number of clusters, RMSE(.) denotes the RMSE value, LRM denotes the Local Regression Model, Sample_i denotes the ith sample, and R-Squared(.) denotes the R-squared value. Finally, the mean of the coefficients of the LRMs is used to generate the Global Regression Model for the entire dataset. Figure 2 shows the graphical representation of the proposed approach. The pseudo code of the proposed ERM is given in Algorithm 4.
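The selection criterion can be sketched in Python as a weighted sum of the mean RMSE and the mean R-squared of the local models, following the description of the weights w1 and w2 given above. The function names and the `candidates` dictionary layout are hypothetical. As a sanity check, the example uses the RMSE, R-Squared, and weights listed for Dataset2 in Table 13 and compares the result with the F-value reported there.

```python
def mean(values):
    return sum(values) / len(values)

def selection_score(rmse_per_cluster, r2_per_cluster, w1, w2):
    """F = w1 * mean(RMSE of LRMs) + w2 * mean(R-Squared of LRMs)."""
    if not (0 <= w1 <= 1 and 0 <= w2 <= 1 and w1 + w2 <= 1 + 1e-12):
        raise ValueError("weights must lie in [0, 1] with w1 + w2 <= 1")
    return w1 * mean(rmse_per_cluster) + w2 * mean(r2_per_cluster)

def best_technique(candidates, w1, w2):
    """Return the candidate name with the minimal F value, plus all scores.

    `candidates` maps a technique name to (RMSE list, R-Squared list).
    """
    scores = {name: selection_score(rmses, r2s, w1, w2)
              for name, (rmses, r2s) in candidates.items()}
    return min(scores, key=scores.get), scores

# Dataset2 row of Table 13: RMSE 0.7103, R-Squared 0.6767, w1 = 0.7, w2 = 0.3
f_dataset2 = selection_score([0.7103], [0.6767], w1=0.7, w2=0.3)

# Hypothetical comparison of two techniques with made-up per-cluster values
winner, scores = best_technique(
    {"A": ([1.0, 1.2], [0.5, 0.5]), "B": ([0.5, 0.7], [0.5, 0.5])},
    w1=0.5, w2=0.5)
```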

Fig. 2

Flowchart for proposed methodology.

Algorithm 4: Enhanced Regression Model (ERM)
Input: Data (x1, x2, x3, x4, ..), set of independent variables
Output: Y, dependent variable; co-efficients of the MLR model
Begin
Step 1: Data pre-processing
Step 2: Divide the dataset D into Training and Testing sets
Step 3: For the Training dataset // Fit MLR model to the dataset
• Cluster 1 = KMeans(D, number of clusters)
• Cluster 2 = fcm(D, number of clusters)
• Local_Regression_Model (LRM) = MLR(Clusters)
• Global_Regression_Model (GRM) = Mean(Co-efficient of Local_Regression_Model)
End
Step 4: Predict the value using the trained GRM
• Validate GRM using Testing data
• Evaluate the performance of GRM using RMSE, R2
End
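The core of Algorithm 4, fitting one local model per cluster and averaging the coefficients into a global model, can be sketched as follows. This is an illustrative outline, not the authors' code. To keep the block self-contained, the clustering step is replaced by a stand-in split on the sign of the first feature; in the proposed method the index sets would come from KMeans or FCM. The synthetic data, the 150/50 train/test split, and all function names are assumptions.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares MLR coefficients, intercept first."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

def predict(X, beta):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def fit_grm(X, y, clusters):
    """Fit one LRM per cluster; the GRM is the mean of the LRM coefficients.

    `clusters` is a list of row-index arrays and may overlap (FCM case).
    """
    local_models = [fit_mlr(X[idx], y[idx]) for idx in clusters]
    return np.mean(local_models, axis=0), local_models

# Steps 1-2: synthetic data and a train/test split (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2.0 + X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
train, test = np.arange(150), np.arange(150, 200)

# Step 3: stand-in for the clustering step, then LRMs and the GRM
clusters = [train[X[train, 0] < 0], train[X[train, 0] >= 0]]
grm, lrms = fit_grm(X, y, clusters)

# Step 4: validate the GRM on the held-out rows
pred = predict(X[test], grm)
test_rmse = float(np.sqrt(np.mean((pred - y[test]) ** 2)))
```

Averaging coefficients treats every local model equally regardless of cluster size; weighting by cluster size would be a different design choice that the paper does not describe.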

The time complexity of the KMeans algorithm is O(ncdi) and the time complexity of FCM is O(ndc²i) [12–14]. The time complexity of the proposed methodology is therefore O(ndi(c + c²)). Here n indicates the count of data items in the dataset, c indicates the count of clusters, d indicates the dataset dimensions, and i denotes the iteration count.
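The combined bound follows from adding the two clustering costs, assuming both algorithms run for the same number of iterations i and that the cost of the regression step is not counted, as in the expressions above:

$O (ncdi) + O (ndc^{2} i) = O (ndi (c + c^{2})) = O (ndc^{2} i)$

The last equality holds because c ≤ c² for c ≥ 1, so the FCM term dominates. For example, with c = 3 the factor c + c² equals 12, of which 9 comes from FCM.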

5 Experimental results and discussion

5.1 Dataset description

The benchmark regression datasets are taken from the UCI Machine Learning Repository. Five datasets are used to study the performance of the proposed methodology: Dataset 1 is affairs, with 265 observations and 18 variables; Dataset 2 is bostonhousing, with 506 observations and 14 variables; Dataset 3 is ailerons, with 7154 observations and 40 variables; Dataset 4 is bostonhousingord, with 506 observations and 14 variables; and Dataset 5 is abaloneord, with 4177 observations and 11 variables. K-Means and Fuzzy C-Means are used to create non-overlapping and overlapping partitional clusters, respectively, with two different k values for the five datasets. In FCM, one record (sample) can belong to more than one cluster. Therefore, the clusters in FCM contain more samples than those of the KMeans algorithm. Table 2 shows the intra-cluster distance among the samples of each cluster for KMeans clustering with K = 2 and K = 3 for the five datasets.

Table 2

Intracluster distance of Clusters using KMeans

Datasets	KMeans Cluster Distance nc = 2		KMeans Cluster Distance nc = 3
	Cluster1	Cluster2	Cluster1	Cluster2	Cluster 3
Dataset1	2.0309	3.3818	1.9533	3.3975	2.0127
Dataset2	177.5425	99.6672	86.6228	95.0425	148.6258
Dataset3	186.2835	178.1267	183.6211	39.7705	175.3967
Dataset4	99.6400	177.5122	148.5934	86.5937	95.0340
Dataset5	1.0542	1.7706	1.4263	1.0590	1.4277

Table 3 shows the intra-cluster distance among the samples of each cluster for FCM clustering with K = 2 and K = 3 for the five datasets. The number of samples in each cluster for the KMeans and FCM clustering techniques is shown in Tables 4 and 5, respectively. The intracluster distance is slightly higher for the FCM clusters. This is due to the overlapping nature of the FCM clustering technique.

Table 3

Intracluster distance of Clusters using FCM

Datasets	FCM Cluster Distance nc = 2		FCM Cluster Distance nc = 3
	Cluster1	Cluster2	Cluster1	Cluster2	Cluster 3
Dataset1	2.3885	2.5780	2.2987	2.2987	2.3965
Dataset2	185.3292	190.6931	93.6841	75.1141	151.7577
Dataset3	107.7190	168.0013	153.0127	158.9691	118.5412
Dataset4	167.9695	107.6928	50.4372	102.5795	95.0340
Dataset5	1.4051	1.7409	0.9411	1.0726	1.6720

Table 4

Sample size for different dataset using KMeans

Datasets	Sample size for nc = 2		Sample size for nc = 3
	Cluster1	Cluster 2	Cluster1	Cluster 2	Cluster 3
Dataset1	204	61	50	155	60
Dataset2	149	357	313	38	155
Dataset3	3336	3818	3167	418	3569
Dataset4	357	149	155	313	38
Dataset5	1341	2836	1528	1342	1307

Table 5

Sample size for different dataset using FCM

Datasets	Sample size for nc = 2		Sample size for nc = 3
	Cluster1	Cluster 2	Cluster1	Cluster 2	Cluster 3
Dataset1	180	100	165	165	85
Dataset2	369	144	109	272	137
Dataset3	4244	3573	1985	1983	3621
Dataset4	144	369	137	109	272
Dataset5	2359	2509	1305	1345	1263

Once clustering is complete, the MLR technique is applied to each individual cluster of KMeans and FCM to create a Local Regression Model (LRM). The performance metrics, namely the RMSE and R-Squared values of each cluster, are tabulated. Table 6 reports the RMSE value of each LRM built on the clusters produced by the KMeans clustering technique with K = 2 and K = 3. Table 7 reports the RMSE value of each LRM built on the clusters produced by the FCM clustering technique with K = 2 and K = 3.

Table 6

RMSE value of LRM with KMeans for different dataset

Datasets	RMSE for nc = 2		RMSE for nc = 3
	Cluster1	Cluster 2	Cluster1	Cluster 2	Cluster 3
Dataset1	1.7966	0.5748			
Dataset2	0.6576	0.8661			
Dataset3	1.1521	1.2768			
Dataset4	0.3534	11.3010			
Dataset5	0.6282	0.8784			

Table 7

RMSE value of LRM with FCM for different dataset

Datasets	RMSE for nc = 2		RMSE for nc = 3
	Cluster1	Cluster 2	Cluster1	Cluster 2	Cluster 3
Dataset1	1.0444	0.6307			
Dataset2	0.8739	0.6572			
Dataset3	1.1587	1.2726			
Dataset4	0.3738	11.5440			
Dataset5	0.5526	0.8626			

When comparing the RMSE values of the LRMs of the KMeans and FCM clustering techniques, the LRMs of the FCM clusters generally have lower RMSE values than those of the KMeans clusters. Table 8 shows the mean RMSE value of LRMs for each dataset using KMeans clustering and FCM clustering techniques. Figure 3 displays the comparison of the mean RMSE value for each dataset using the KMeans and FCM clustering techniques. It can be seen that FCM clustering along with MLR has a lower RMSE value for almost all datasets.

Table 8

R-Squared Value of LRM with KMeans for different dataset

Datasets	R-Squared for nc = 2		R-Squared for nc = 3
	Cluster1	Cluster 2	Cluster1	Cluster 2	Cluster 3
Dataset1	0.2528	0.1341			
Dataset2	0.8035	0.6035			
Dataset3	0.8053	0.7647			
Dataset4	0.5948	0.2780			
Dataset5	0.5071	0.3820			

Table 8 displays the R-Squared value of each LRM built on the clusters produced by the KMeans clustering technique with K = 2 and K = 3. Table 9 shows the R-Squared value of each LRM built on the clusters produced by the FCM clustering technique with K = 2 and K = 3. When comparing the R-Squared values of the LRMs of the KMeans and FCM clustering techniques, the LRMs of the FCM clusters have lower values than those of the KMeans clusters. This means that the LRMs for KMeans clusters show a higher correlation between actual and fitted values than those for FCM clusters.

Table 9

R-Squared Value of LRM with FCM for different dataset

Datasets	R-Squared for nc = 2		R-Squared for nc = 3
	Cluster1	Cluster 2	Cluster1	Cluster 2	Cluster 3
Dataset1	0.01370	–0.0051			
Dataset2	0.5922	0.8064			
Dataset3	0.8021	0.8006			
Dataset4	0.6581	0.2667			
Dataset5	0.3326	0.3248			

From the above tables, it is observed that neither the RMSE nor the R-Squared performance metric alone is sufficient to judge the best sampling technique along with MLR. Therefore, the selection criteria function defined in Equation (1) is used in this work to identify the best cluster-based sampling technique along with MLR to generate the Global Regression Model (GRM) for each dataset.

Table 10 lists the ‘F’ values of all datasets for weights w1 = 0.5 and w2 = 0.5. For datasets 1 and 5, KMeans-based sampling is the best sampling technique. For datasets 2, 3, and 4, FCM-based sampling is the best sampling technique. Table 11 lists the ‘F’ values for weights w1 = 0.7 and w2 = 0.3. The results do not change: KMeans-based sampling remains the best sampling technique for datasets 1 and 5, and FCM-based sampling remains the best for datasets 2, 3, and 4. Table 12 lists the ‘F’ values for weights w1 = 0.3 and w2 = 0.7. In this case, FCM is the best sampling technique for datasets 1, 4, and 5, and KMeans is the best sampling technique for datasets 2 and 3. The minimum ‘F’ value for all datasets is reached when only the RMSE of the model is considered. Therefore, for datasets 1, 4, and 5, FCM-based LRMs are used for GRM development. Similarly, for datasets 2 and 3, KMeans-based LRMs are used for GRM development. Table 14 shows the comparison of the proposed GRM and existing MLR RMSE values for all datasets.

Table 10

F values of each dataset for w1 = 0.5 & w2 = 0.5

Dataset	KMeans Clustering		FCM Clustering		Best ‘F’ Value	Best Model
	nc=2	nc=3	nc=2	nc=3
Dataset1	3.1775	2.9584	116.6978	5.1632	2.9584	KMeans with nc = 3
Dataset2	1.0917	1.0940	1.0978	1.0289	1.0289	FCM with nc = 3
Dataset3	1.2442	1.2459	1.2318	1.2494	1.2318	FCM with nc = 2
Dataset4	4.0593	4.7265	4.0608	3.1375	3.1375	FCM with nc = 3
Dataset5	1.5014	1.5686	1.8749	1.7758	1.5014	KMeans with nc = 2

Table 11

F values of each dataset for w1 = 0.7 & w2 = 0.3

Dataset	KMeans Clustering		FCM Clustering		Best ‘F’ Value	Best Model
	nc=2	nc=3	nc=2	nc=3
Dataset1	2.3808	2.1236	70.3537	3.3042	2.1236	KMeans with nc = 3
Dataset2	0.9597	0.9405	0.9649	0.9033	0.9033	FCM with nc = 3
Dataset3	1.2323	1.2377	1.2253	1.2379	1.2253	FCM with nc = 2
Dataset4	4.7665	5.7629	4.8200	3.5215	3.5215	FCM with nc = 3
Dataset5	1.2022	1.2588	1.4080	1.3146	1.2021	KMeans with nc = 2

Table 12

F values of each dataset for w1 = 0.3 & w2 = 0.7

Dataset	KMeans Clustering		FCM Clustering		Best ‘F’ Value	Best Model
	nc=2	nc=3	nc=2	nc=3
Dataset1	1.1857	0.8714	0.8376	0.5157	0.5157	FCM with nc = 3
Dataset2	0.7619	0.7103	0.7656	0.7150	0.7103	KMeans with nc = 3
Dataset3	1.2145	1.2253	1.2157	1.2206	1.2145	KMeans with nc = 2
Dataset4	5.8272	7.3175	5.9589	4.0975	4.0975	FCM with nc = 3
Dataset5	0.7533	0.7941	0.7076	0.6228	0.6228	FCM with nc = 3

Table 13

Characteristics for GRM of all Datasets

Dataset	Best Sampling Model	No. of Clusters	Weightage	F-Value	RMSE	R-Squared
Dataset1	FCM	3	w1 = 0.7, w2 = 0.3	0.3915	0.5157	0.1019
Dataset2	KMeans	3	w1 = 0.7, w2 = 0.3	0.7002	0.7103	0.6767
Dataset3	FCM	3	w1 = 0.3, w2 = 0.7	1.0643	1.2206	0.7824
Dataset4	FCM	3	w1 = 0.3, w2 = 0.7	1.9293	4.0975	0.4592
Dataset5	FCM	3	w1 = 0.7, w2 = 0.3	0.5384	0.6228	0.3414

Figure 3 displays the comparison of the mean RMSE value for each dataset using the KMeans and FCM clustering techniques. It can be seen that FCM clustering along with MLR has a lower RMSE value for almost all datasets. Figure 4 displays the comparison of the mean R-Squared value for each dataset using the KMeans and FCM clustering techniques. It is evident that KMeans clustering along with MLR has a higher R-Squared value for almost all datasets. Table 13 shows the RMSE value and R-Squared value of the GRM for all datasets.

Fig. 3

Comparison of RMSE value for all datasets.

Fig. 4

Comparison of R-Squared value for all datasets.

Table 14

Comparison of RMSE values of the proposed GRM and existing MLR for all Datasets

Dataset	GRM RMSE	MLR RMSE
Dataset1	0.5157	2.7143
Dataset2	0.7103	1.2053
Dataset3	1.2145	2.5452
Dataset4	4.0975	5.0482
Dataset5	0.6228	6.0187
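The size of the improvement in Table 14 can be made concrete by computing the relative RMSE reduction of the GRM over plain MLR. The short Python sketch below uses only the values printed in Table 14; the function and variable names are illustrative.

```python
# (GRM RMSE, MLR RMSE) pairs copied from Table 14
table14 = {
    "Dataset1": (0.5157, 2.7143),
    "Dataset2": (0.7103, 1.2053),
    "Dataset3": (1.2145, 2.5452),
    "Dataset4": (4.0975, 5.0482),
    "Dataset5": (0.6228, 6.0187),
}

def reduction_pct(grm_rmse, mlr_rmse):
    """Relative RMSE reduction of the GRM over MLR, in percent."""
    return 100.0 * (mlr_rmse - grm_rmse) / mlr_rmse

reductions = {name: reduction_pct(grm, mlr)
              for name, (grm, mlr) in table14.items()}
```

For Dataset1, for instance, the reduction is about 81%, and the GRM has the lower RMSE on every dataset in the table.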

6 Conclusion

In this work, a cluster-based sampling technique has been hybridized with the MLR technique to build a Global Regression Model. For cluster-based sampling, the KMeans clustering (hard) and Fuzzy C-Means clustering (soft) techniques have been used along with the MLR technique to reduce errors and improve prediction accuracy. The empirical study has shown that if the optimum number of clusters can be created using either KMeans or FCM on a given dataset before applying MLR, prediction errors are effectively reduced. FCM has often been found to be a more efficient sampling technique than KMeans-based sampling when used with the MLR model, because it reduces the prediction error. However, it is not easy to judge in general whether FCM or KMeans is the best cluster-based sampling technique to use in conjunction with MLR, because this depends on the nature of the dataset. Therefore, the proposed methodology combines the benefits of both cluster-based sampling techniques for building the GRM.

On the basis of the empirical study, it has been found that the computation time of the GRM is slightly higher than that of the KMeans and FCM algorithms, which is a limitation of the proposed method. However, the proposed GRM methodology achieves good RMSE on all datasets, which means that the GRM can select more representative samples to build the MLR model with higher accuracy. As future work, swarm intelligence-based clustering techniques will be explored for sample selection for the regression model to reduce the computational complexity.

Acknowledgment

The first author acknowledges the UGC-Special Assistance Programme (SAP) for the financial support to her research under the UGC-SAP at the level of DRS-II (Ref.No.F.5-6/2018/DRS-II (SAP-II)), 26 July 2018 in the Department of Computer Science.

References

1. Cohen, Cohen, West S.G., Aiken L.S., Applied multiple regression/correlation analysis for the behavioral sciences. 3rd ed. Mahwah: Erlbaum; 2003.

2. Kutner M.H., Nachtsheim C.J., Neter, Li, Applied linear statistical models. 5th ed. New York: McGraw Hill; 2005.

3. Montgomery D.C., Peck E.A., Vining G.G., Introduction to linear regression analysis. 5th ed. Hoboken: Wiley; 2012.

4. Maxwell S.E., Sample size and multiple regression analysis, Psychological Methods 5 (2000), 434–458.

5. Pedhazur E.J., Schmelkin L.P., Measurement, design, and analysis: An integrated approach. Hillside, NJ: Lawrence Erlbaum. 1991.

6. Miller D.E. and Kunce J.T., Prediction and statistical overkill revisited, Measurement and Evaluation in Guidance 6 (1973), 157–63.

7. Tabachnick B.G., Fidell L.S., Using Multivariate Statistics. 6th ed. Boston: Pearson Education. 2013.

8. Gregory T.K. and Daniel, Sample sizes when using multiple linear regression for prediction, Educational and Psychological Measurement 68 (2008), 431.

9. Bujang M.A., Sa’at and Sidik T.M.A., Determination of minimum sample size requirement for multiple linear regression and analysis of covariance based on experimental and non-experimental studies, Epidemiol Biostat Public Health 14(3) (2017), e121171–e12117-9.

10. Teddlie, Charles, Fen Yu, Mixed methods sampling: a typology with examples, Journal of Mixed Methods Research 1(1) (2007), 77–100. https://doi.org/10.1177/1558689806292430.

11. James, Chromy, Variance estimators for a sequential sample selection procedure, D. Krewski, R. Platek, .N.K. Rao (Eds.), Current Topics in Survey Sampling, Academic Press (1981), 329–347.

12. Singh A.S. and Masuku M.B., Fundamentals of applied research and sampling techniques, International Journal of Medical and Applied Sciences 2(4) (2012), 124–132.

13. Jure Leskovec, Christos Faloutsos, In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA (2006), 631–636.

14. Bee Wah Yap, Khatijahhusna Abd Rani, Hezlin Aryani Abd Rahman, Simon Fong, Zuraida Khairudin, Nik Nik Abdullah, An application of oversampling, undersampling, bagging and boosting in handling imbalanced datasets. In Proceedings of the First International Conference on Advanced Data and Information Engineering, DaEng 2013, Kuala Lumpur, Malaysia (2013), 13–22.

15. Waleed Albattah, The role of sampling in big data analysis. In Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, BDAW 2016, Blagoevgrad, Bulgaria (2016), 28:1–28:5.

16. Ajay Singh and Micah Masuku, Sampling techniques & determination of sample size in applied statistics research: An overview, International Journal of Economics, Commerce and Management 2(11) (2014), 1–22.

17. Ashwin Satyanarayana, Intelligent sampling for big data using bootstrap sampling and chebyshev inequality. In IEEE 27th Canadian Conference on Electrical and Computer Engineering, CCECE 2014, Toronto, ON, Canada (2014), 1–6.

18. George John, Pat Langley, Static versus dynamic sampling for data mining. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon, USA (1996), 367–370.

19. Prashant Singh, Joachim Van Der Herten, Dirk Deschrijver, Ivo Couckuyt, Tom Dhaene, A sequential sampling strategy for adaptive classification of computationally expensive data, Structural & Multidisciplinary Optimization 55(4) (2017), 1–14.

20. Sunghae Jun, Seung-Joo Lee, Jea-Bok Ryu, A divided regression analysis for big data, International Journal of Software Engineering and Its Applications 9(5) (2015), 21–32.

21. Josien, Wang, Liao T.W., Triantaphyllou, Liu M.C., An Evaluation of Sampling Methods for Data Mining with Fuzzy C-Means. In: Braha D. (eds) Data Mining for Design and Manufacturing. Massive Computing, vol 3. Springer, Boston, MA (2001). https://doi.org/10.1007/978-1-4757-4911-3_15.

22. Zhang, Cai, Cheng, Multi-model quality prediction approach using fuzzy C-means clustering and support vector regression, Advances in Mechanical Engineering (2017). doi:10.1177/1687814017718474.

23. Nagwani N.K. and Deo S.V., Estimating the concrete compressive strength using hard clustering and fuzzy clustering based regression techniques, ScientificWorldJournal 2014 (2014), 381549. doi: 10.1155/2014/381549.

24. Dhamodharavadhani, Rathipriya, Enhanced logistic regression (ELR) model for big data. In: Fausto Pedro Garcia Marquez, editor. Handbook of Research on Big Data Clustering and Machine Learning. IGI Global (2020), 152–76.

25. Jain A.K., Data clustering: 50 years beyond k-means, Pattern Recogn. Lett. 31(8) (2010), 651–666.