Abstract
This paper develops a methodology for enhancing regression models using cluster-based sampling techniques (CST) to achieve high predictive accuracy; the methodology can also be used to handle large datasets. Hard clustering (KMeans clustering) or soft clustering (Fuzzy C-Means) is used to generate samples, called clusters, which in turn are used to generate Local Regression Models (LRMs) for the given dataset. These LRMs are used to create a Global Regression Model. This methodology is known as the Enhanced Regression Model (ERM). The performance of the proposed approach is tested with five different datasets. The experimental results revealed that the proposed methodology yielded better predictive accuracy than the non-hybrid multiple linear regression (MLR) model; also, Fuzzy C-Means performs better than the KMeans clustering algorithm for sample selection. Thus, ERM has the potential to handle data with uncertainty and complex patterns, and it produced a high prediction accuracy rate.
Introduction
The Regression Model (RM) is a common and frequently used statistical predictive technique in many research fields. A Multiple Linear Regression (MLR) model is a statistical tool that models the linear relationship between a set of predictors and a dependent variable. The fitted model is then used to predict the dependent variable [1–3]. In general, choosing the right sample size for RM development is critical to achieving high prediction accuracy. Another problem is insufficient sample size, which leads to the development of a partial regression model rather than a generalized regression model for the entire dataset. Choosing the most suitable data sampling technique for a given dataset is therefore key to robust regression model development.
Data sampling is the process of using a small amount of data to obtain the overall characteristics of the whole dataset. This process allows for the selection of a subgroup of individual items from a population in order to assess the characteristics of the entire population. The sampling techniques frequently used in research are Sequential sampling, Random sampling, Cluster sampling, Systematic sampling, and Stratified sampling [10].
In Sequential sampling, a sequence of one or more individuals is taken from some part of the dataset to form a sample for analysis [11].
Simple Random sampling is the most basic form of probability sampling. In this technique, each individual item to be studied has an equal chance of being selected, and researchers use some random process to select members. It provides an unbiased estimate of the parameters [12].
Cluster sampling [12] is a probability sampling method in which the whole dataset is divided into groups (also called clusters). Some of these clusters are then chosen at random, and all individuals in the selected clusters are included in the sample. It is used to study large datasets.
In the Systematic sampling method [12], the first unit of the sample is randomly selected, and the subsequent units are selected systematically. It is a form of probability sampling in which members of the population are selected at regular intervals to create the sample.
In the Stratified sampling method [12], the total population is divided into smaller groups, or strata, for the sampling process. The strata are formed based on some common features in the dataset. In Oversampling, the minority class samples are repeated randomly, whereas in Undersampling, the majority class samples are removed randomly to balance the data [13]. In the Reservoir sampling method [14], individuals from a dataset of unknown size are inserted into a reservoir of predefined size.
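Three of the sampling techniques described above can be illustrated with a minimal Python sketch. The function names, the seed parameter, and the choice of the classic "Algorithm R" for reservoir sampling are assumptions made for illustration; they are not taken from the cited works.

```python
import random

def simple_random_sample(items, k, seed=0):
    """Simple random sampling: every item has an equal chance of selection."""
    rng = random.Random(seed)
    return rng.sample(list(items), k)

def systematic_sample(items, k, seed=0):
    """Systematic sampling: random first unit, then units at a regular interval."""
    items = list(items)
    step = len(items) // k            # sampling interval (assumes k <= len(items))
    rng = random.Random(seed)
    start = rng.randrange(step)       # random first unit within the first interval
    return items[start::step][:k]

def reservoir_sample(stream, k, seed=0):
    """Reservoir sampling (Algorithm R): keep k items from a stream of unknown size."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)    # fill the reservoir first
        else:
            j = rng.randint(0, i)     # inclusive bounds
            if j < k:
                reservoir[j] = item   # replace an existing entry
    return reservoir
```

Calling `systematic_sample(range(100), 10)` returns ten items spaced ten positions apart, while `reservoir_sample` accepts any iterator and never needs its length.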
The advantage of using sampling for a regression model is that models can be developed using a subset of the data without affecting the characteristics and quality of the entire database. In this paper, a cluster-based sampling method is used for sampling data in order to get better prediction accuracy. K-Means and Fuzzy C-Means clustering techniques are used to create the optimal subsets of data (i.e., samples) for building an efficient regression model.
The main contributions of this paper can be summarized as follows:
An Enhanced Regression Model (ERM) using cluster-based sampling methods is introduced, built on hard and soft clustering techniques. To improve the predictive accuracy of MLR models, a hybrid methodology is proposed that combines MLR with popular clustering techniques (KMeans, Fuzzy C-Means, or a combination of both). The proposed ERM is less prone to prediction error. Experimental results of the proposed methodology are presented to demonstrate its effectiveness over existing techniques.
The rest of this paper is structured as follows. Section 2 introduces the background needed to understand the cluster-based sampling techniques and the metrics used to assess the performance of the proposed work. A detailed description of the proposed regression approach using cluster-based sampling is given in Section 3. The results for various datasets are discussed in Section 4. Finally, Section 5 summarizes the proposed work and outlines possible future extensions.
Literature review
This section provides a brief review of current work related to this research. When modeling with MLR, the minimum required sample size must be determined [4]. Usually, the minimum sample size for almost all types of multivariate analyses is determined using a standard rule of thumb for MLR. The most commonly used principle for determining the minimum required sample size is based on the ratio between the number of cases and the number of independent variables [5, 6].
Tabachnick and Fidell used the formula "50 + 8 × (number of factors)" to calculate the minimum required sample size [7]. Gregory and Daniel (2008) formulated a rule to identify the minimum sample size required for an MLR-based forecasting system [8]. Nowadays, it is possible to use software to identify the exact sample size required for the analysis of multivariate data. The authors of [9] calculated the minimum sample sizes required for the MLR model using Power Analysis and Sample Size software (PASS). Most statisticians settle on the minimum sample size required for a dataset based on their experience. However, researchers from other domains may opt for simple principles or rules of thumb to calculate it.
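As a small worked example of the rule of thumb quoted above, the sketch below computes 50 + 8 × m. The function name is hypothetical, and it assumes that "factors" means the number of independent variables m.

```python
def min_sample_size(num_predictors):
    """Rule-of-thumb minimum sample size: 50 + 8 * (number of predictors).

    Assumption: 'factors' in the quoted formula are the independent variables.
    """
    return 50 + 8 * num_predictors

# A model with 13 independent variables would need at least 154 cases.
required = min_sample_size(13)
```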
Albattah [15] explores the importance of sampling in the analysis of big data. The author argues that there is no need to process the whole dataset if the same analysis can be done with less data. Experimental results showed that working with samples reduced both the data volume and the computational time and, in most cases, led to better results. In the sampling process, fixing the sample size is a very important factor, and a number of techniques have been used in the literature to determine it [16–19].
The authors of [20] proposed a Divided Regression model for larger datasets, in which a sampling technique is used to divide a large dataset into small subsets of data to reduce the computational burden.
Josien K et al [21] have used Fuzzy C-Means (FCM) for cluster-based sampling. The authors have evaluated sampling methods with respect to the representativeness and performance of the sampled data. For that purpose, they have used statistical measures (such as mean and variance), and performance measures (such as accuracy rate, precision, and recall).
In [22], a multi-model modeling approach based on FCM and support vector regression was proposed, along with adaptive mutation particle swarm optimization. The results showed that the multi-model approach addressed problems such as wide ranges of operating conditions, nonlinearity, and prediction difficulty.
In [23], the technique proposed by the authors has two main phases: clustering technique is used in the first phase, and then regression techniques are used for prediction of compressive strength on these clusters in the second phase. They have concluded that FCM with regression techniques gave minimum errors for concrete strength prediction.
The authors of [24] proposed an Enhanced Logistic Regression model for larger datasets using the K-Means clustering technique and stated that their model has higher predictive accuracy after the sampling process.
The review of the literature revealed that there is no benchmark sampling technique for multiple linear regression (MLR) modeling. Furthermore, there is no rule of thumb to identify the most appropriate sampling technique for developing an MLR model that achieves the best results. Therefore, the objective of this work is to investigate the feasibility of finding optimal data samples for generating MLR models using the KMeans and Fuzzy C-Means algorithms, or a combination of both, to reduce errors and improve the prediction accuracy rate.
Methods and materials
Multiple linear regression
Multiple linear regression (MLR) is a popular technique for prediction tasks. It uses several independent (explanatory) variables to predict the outcome of a dependent (response) variable. Its purpose is to model the linear relationship between the dependent variable and the independent variables.
The matrix form of MLR is shown in Fig. 1. In Fig. 1, X is the matrix of independent variables of size n × 2, Y is the column vector of the dependent variable of size n × 1, β is the column vector of coefficient values of size 2 × 1, and ɛ is the column vector of error values of size n × 1. The pseudocode of MLR is given in Algorithm 1.
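The matrix form Y = Xβ + ɛ can be fitted by ordinary least squares. The following is a minimal sketch under stated assumptions, not the paper's Algorithm 1: it uses NumPy, the function names `fit_mlr` and `predict_mlr` are hypothetical, and it assumes that the first column of the design matrix is a column of ones for the intercept.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares estimate of beta in y = X_design @ beta + eps.

    X is a 2-D array of shape (n, p); an intercept column is added here
    (an assumption about how the design matrix is built).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X_design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return beta                      # beta[0] is the intercept

def predict_mlr(beta, X):
    """Predict the dependent variable for rows of X using fitted beta."""
    X = np.asarray(X, dtype=float)
    return beta[0] + X @ beta[1:]

# Usage: noise-free data generated from y = 1 + 2x
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = 1.0 + 2.0 * X_demo[:, 0]
beta_demo = fit_mlr(X_demo, y_demo)
```

With one predictor and an intercept, the design matrix has size n × 2 and β has size 2 × 1, which matches the dimensions given for Fig. 1.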

Components of MLR.
Cluster-based sampling is a type of probability sampling technique in which the entire dataset is divided into several clusters (groups). These clusters are known as sample units. The clusters have homogeneous properties and have an equal probability of being part of the sample.
Sampling using K-Means clustering algorithm
The KMeans clustering technique [25] is a simple and widely used partitional clustering algorithm. Initially, k centroids are chosen at random, where k is the user-specified number of clusters. Each tuple in the dataset is assigned to the closest cluster, so that the intra-cluster distance is minimal. The centroid of each cluster is then updated after the tuples have been assigned. The assignment and update steps are repeated until the cluster memberships or the centroids remain unchanged. Algorithm 2 gives the formal description of the K-Means algorithm.
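The assignment and update steps described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a reproduction of Algorithm 2; the function name, the `max_iter` and `seed` parameters, and the use of Euclidean distance are assumptions.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means: returns (labels, centroids) for data X of shape (n, d)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initial centroids: k distinct rows chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment step: each row goes to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                     # memberships unchanged: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Usage: two well-separated groups of three points each
X_demo = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels_demo, centroids_demo = kmeans(X_demo, k=2)
```

Each non-overlapping sample is then simply the set of rows sharing a label, for example `X_demo[labels_demo == 0]`.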
Sampling using fuzzy C-Means clustering algorithm
In Fuzzy C-Means clustering [21], each tuple may belong to more than one cluster. The relation of a tuple to each cluster is measured by its fuzzy membership values.
Qualitative performance indicators
Table 1 shows the measures, Root Mean Squared Error (RMSE) and Coefficient of Determination (R2), that are used to evaluate the performance of the regression model.
The qualitative performance measures
In the proposed model, the first step is to automate the sampling technique for building the regression model. The proposed approach uses cluster-based sampling techniques to select samples (i.e., rows) from the dataset. The soft clustering technique, Fuzzy C-Means (FCM), is used to create overlapping samples (i.e., clusters). The hard clustering technique, KMeans clustering, is used to create non-overlapping samples (i.e., clusters). Secondly, each sample is regressed independently using the Multiple Linear Regression (MLR) model to generate a Local Regression Model (LRM). The pseudocode of MLR is given in Algorithm 1. Thirdly, the best sampling technique is selected using Equation (1) to generate suitable samples for the RM.
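How FCM yields overlapping samples can be sketched as follows. This is a minimal sketch of the standard FCM update equations, not the paper's algorithm listing. The fuzzifier m = 2, the convergence tolerance, and in particular the membership threshold used to decide which rows enter each sample are assumptions; the text does not state how overlap is determined.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Standard FCM: returns (U, centers); U[i, j] is the membership of row i in cluster j."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)          # memberships of each row sum to 1
    for _ in range(max_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)            # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, centers

def overlapping_samples(U, threshold=0.3):
    """ASSUMPTION: a row joins every cluster whose membership is >= threshold."""
    return [np.flatnonzero(U[:, j] >= threshold) for j in range(U.shape[1])]

# Usage: two groups plus one point lying between them
X_demo = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [5.3, 5.3]])
U_demo, _ = fuzzy_c_means(X_demo, c=2)
samples_demo = overlapping_samples(U_demo)
```

In this toy example the in-between point has a membership close to 0.5 in both clusters, so it appears in both samples, which is why FCM samples can contain more rows in total than the dataset.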
The sampling technique with the minimal 'F' value is selected as the best sampling technique. The weights w1 and w2 are used to vary the degree of importance of the RMSE and R2 of the RM, respectively. Each weight ranges from 0 to 1, and w1 + w2 should be less than or equal to 1.
In Equations (2) and (3), nc denotes the number of clusters, RMSE(.) denotes the RMSE value, LRM denotes the Local Regression Model, Sample i denotes the ith sample, and R-Squared(.) denotes the R-squared value. Finally, the mean of the coefficients of the LRMs is used to generate the Global Regression Model for the entire dataset. Figure 2 shows the graphical representation of the proposed approach. The pseudocode of the proposed ERM is given in Algorithm 4.
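The two performance measures and a weighted selection score can be sketched as below. RMSE and R2 follow their standard definitions. The form of the score is an assumption, since Equation (1) is not reproduced in this text: the sketch uses F = w1 × RMSE + w2 × (1 − R2), chosen only because a lower F must correspond to a lower error and a higher R2. The function names are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """R-squared: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def selection_score(mean_rmse, mean_r2, w1=0.5, w2=0.5):
    """ASSUMED form of the selection criterion F (lower is better)."""
    if not (0 <= w1 <= 1 and 0 <= w2 <= 1 and w1 + w2 <= 1):
        raise ValueError("weights must lie in [0, 1] with w1 + w2 <= 1")
    return w1 * mean_rmse + w2 * (1.0 - mean_r2)
```

Under this assumed form, a sampling technique with mean RMSE 0.5 and mean R2 0.9 scores lower (better) than one with mean RMSE 1.0 and mean R2 0.8 for any admissible weights.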
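The step from Local Regression Models to the Global Regression Model can be sketched as follows. This is a sketch under stated assumptions and not the paper's Algorithm 4: each sample is given as an array of row indices, the LRMs are fitted by ordinary least squares with an intercept, and the coefficient mean is unweighted. The function names are hypothetical.

```python
import numpy as np

def fit_lrm(X, y):
    """Fit one Local Regression Model by least squares (intercept first)."""
    X_design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return beta

def build_grm(X, y, samples):
    """Fit an LRM per sample and average the coefficients into a GRM.

    samples: list of row-index arrays, one per cluster (may overlap for FCM).
    Returns (grm_coefficients, mean_rmse, mean_r2) over the nc samples.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    betas, rmses, r2s = [], [], []
    for idx in samples:
        Xs, ys = X[idx], y[idx]
        beta = fit_lrm(Xs, ys)
        resid = ys - (beta[0] + Xs @ beta[1:])
        rmses.append(np.sqrt(np.mean(resid ** 2)))
        r2s.append(1.0 - np.sum(resid ** 2) / np.sum((ys - ys.mean()) ** 2))
        betas.append(beta)
    grm = np.mean(betas, axis=0)          # unweighted mean of LRM coefficients
    return grm, float(np.mean(rmses)), float(np.mean(r2s))

# Usage: noise-free data y = 3 + 2x split into two samples
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 3.0 + 2.0 * X_demo[:, 0]
samples_demo = [np.arange(0, 5), np.arange(5, 10)]
grm_demo, mean_rmse_demo, mean_r2_demo = build_grm(X_demo, y_demo, samples_demo)
```

The mean RMSE and mean R-squared returned here correspond to the per-cluster averages over nc samples described for Equations (2) and (3).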

Flowchart for proposed methodology.
The time complexity of the KMeans algorithm is O(ncdi) and the time complexity of FCM is O(ndc²i) [12–14]. Here, n indicates the number of data items in the dataset, c the number of clusters, d the number of dimensions of the dataset, and i the iteration count. Since the proposed methodology runs both clustering algorithms, its time complexity is O(ndi(c + c²)).
Dataset description
The benchmark regression datasets are taken from the UCI Machine Learning Repository. Five datasets are used to study the performance of the proposed methodology: Dataset 1 is affairs, which contains 265 observations and 18 variables; Dataset 2 is bostonhousing, which contains 506 observations and 14 variables; Dataset 3 is ailerons, which contains 1754 observations and 40 variables; Dataset 4 is bostonhousingord, which contains 506 observations and 14 variables; and Dataset 5 is abaloneord, which contains 4177 observations and 11 variables. K-Means and Fuzzy C-Means are used to create non-overlapping and overlapping partitional clusters, respectively, with two different k values for each of the five datasets. In FCM, one record (sample) can belong to more than one cluster. Therefore, clusters in FCM contain more samples than those of the KMeans algorithm. Table 2 shows the intra-cluster distance among the samples of each cluster for KMeans clustering with K = 2 and K = 3 for the five datasets.
Intracluster distance of Clusters using KMeans
Table 3 shows the intra-cluster distance among the samples of each cluster for FCM clustering with K = 2 and K = 3 for the five datasets. The number of samples in each cluster of the KMeans and FCM clustering techniques is shown in Tables 4 and 5, respectively. The intra-cluster distance is slightly higher for the clusters of FCM. This is due to the overlapping nature of the FCM clustering technique.
Intracluster distance of Clusters using FCM
Sample size for different dataset using KMeans
Sample size for different dataset using FCM
Once clustering is complete, the MLR technique is applied to each individual cluster of KMeans and FCM to create a Local Regression Model (LRM). The performance metrics, such as the RMSE and R-Squared value of each cluster, are tabulated. Table 6 presents the RMSE value for each LRM that corresponds to the samples of clusters using the KMeans clustering technique with K = 2 and K = 3. Table 7 presents the RMSE value for each LRM that corresponds to the samples of clusters using the FCM clustering technique with K = 2 and K = 3.
RMSE value of LRM with KMeans for different dataset
RMSE value of LRM with FCM for different dataset
When comparing the RMSE values of each LRM for the KMeans and FCM clustering techniques, the LRMs of the FCM clusters have lower RMSE values than those of the KMeans clusters. Table 8 shows the mean RMSE value of the LRMs for each dataset using the KMeans and FCM clustering techniques. Figure 3 displays the comparison of the mean RMSE value for each dataset using the KMeans and FCM clustering techniques. It can clearly be seen that FCM clustering along with MLR has a lower RMSE value for almost all datasets.
R-Squared Value of LRM with KMeans for different dataset
Table 8 displays the R-Squared value for each LRM that corresponds to the sample clusters of the KMeans clustering technique with K = 2 and K = 3. Table 9 shows the R-Squared value for each LRM that corresponds to the sample clusters of the FCM clustering technique with K = 2 and K = 3. When comparing the R-Squared values of each LRM for the KMeans and FCM clustering techniques, the LRMs of the FCM clusters have lower values than those of the KMeans clusters. This means that the LRMs for the KMeans clusters have a higher correlation between actual and fitted values than those for the FCM clusters.
R-Squared Value of LRM with FCM for different dataset
From the above tables, it is observed that neither the RMSE nor the R-Squared performance metric alone is sufficient to judge the best sampling technique along with MLR. Therefore, the selection criterion function defined in Equation (1) is used in this work to identify the best cluster-based sampling technique along with MLR to generate the Global Regression Model (GRM) for each dataset.
Table 10 lists the 'F' values of all datasets for weights w1 = 0.5 and w2 = 0.5. For datasets 1 and 5, KMeans-based sampling is the best sampling technique. For datasets 2, 3, and 4, FCM-based sampling is the best sampling technique. Table 11 lists the best LRMs for weights w1 = 0.7 and w2 = 0.3. With these weights there is no change in the results: KMeans-based sampling remains the best sampling technique for datasets 1 and 5, and FCM-based sampling remains the best for datasets 2, 3, and 4. Table 12 lists the best LRMs for weights w1 = 0.3 and w2 = 0.7. In this case, FCM is the best sampling technique for datasets 1, 4, and 5, and KMeans is the best sampling technique for datasets 2 and 3. The minimum 'F' value for all datasets is reached when only the RMSE of the model is considered. Therefore, for datasets 1, 4, and 5, FCM-based LRMs are used for GRM development. Similarly, for datasets 2 and 3, KMeans-based LRMs are used for GRM development. Table 13 shows the comparison of the RMSE values of the proposed GRM and the existing MLR for all datasets.
F values of each dataset for w1 = 0.5 and w2 = 0.5
F values of each dataset for w1 = 0.7 and w2 = 0.3
F values of each dataset for w1 = 0.3 and w2 = 0.7
Characteristics for GRM of all Datasets
Figure 3 displays the comparison of the mean RMSE value for each dataset using the KMeans and FCM clustering techniques. It can be seen that FCM clustering along with MLR has a lower RMSE value for almost all datasets. Figure 4 displays the comparison of the mean R-Squared value for each dataset using the KMeans and FCM clustering techniques. It is evident that KMeans clustering along with MLR has a higher R-Squared value for almost all datasets. Table 14 shows the RMSE value and R-Squared value of the GRM for all datasets.

Comparison of RMSE value for all datasets.

Comparison of R-Squared value for all datasets.
Characteristics for GRM of all Datasets
In this work, a clustering-based sampling technique has been hybridized with the MLR technique to build a Global Regression Model. For cluster-based sampling, KMeans clustering (hard) and Fuzzy C-Means clustering (soft) have been used along with MLR to reduce errors and improve the prediction accuracy rate. The empirical study has shown that if the optimum number of clusters can be created using either KMeans or FCM on a given dataset before applying MLR, prediction errors are effectively reduced. FCM was found to be a more efficient sampling technique than KMeans-based sampling when used with the MLR model because it reduces the prediction error. Nevertheless, it is not easy to judge whether FCM or KMeans is the best cluster-based sampling technique to use in conjunction with MLR; this depends on the nature of the dataset. Therefore, the proposed methodology combines the benefits of both cluster-based sampling techniques for building the GRM.
On the basis of the empirical study, it has been found that the computation time of the GRM is slightly higher than that of the KMeans and FCM algorithms alone, which is a limitation of the proposed method. However, the proposed GRM methodology achieves good RMSE on all datasets, which indicates that it can select more representative samples to build an MLR model with higher accuracy. As future work, swarm intelligence-based clustering techniques will be explored for sample selection for the regression model to reduce the computational complexity.
Acknowledgment
The first author acknowledges the UGC-Special Assistance Programme (SAP) for the financial support to her research under the UGC-SAP at the level of DRS-II (Ref.No.F.5-6/2018/DRS-II (SAP-II)), 26 July 2018 in the Department of Computer Science.
