Abstract
This paper presents a new approach to clustering regression models with imprecise quantities. The response variable and the model parameters are assumed to be interval-valued fuzzy numbers. We introduce two indices, based on a similarity measure and on squared errors, to assess the goodness of fit of such models. In addition, the predictive ability of the proposed clustering models is evaluated by cross-validation. Finally, the proposed approach is applied to modeling some soil characteristics.
Introduction
In classical regression analysis, we model the relationship between a dependent variable and one or more independent variables. The observed data are assumed to be homogeneous and the model parameters to be precise quantities. In practice, however, regression models must often be formulated and analyzed when the observations are heterogeneous and some quantities are imprecise. The main goal of this paper is to apply clustering techniques to regression analysis when the response variable and the model parameters are interval-valued fuzzy numbers. We present a one-stage generalized method for interval-valued fuzzy linear regression. The goodness of fit and the predictive ability of the proposed models are assessed by several indices (similarity measure, squared errors, and cross-validation).
Regression analysis in non-precise environments has been studied by many authors. We briefly review some recent works. Arabpour and Tata [1], Chen and Hsueh [12], Coppi et al. [14], Kratschmer [30], Mohammadi and Taheri [32], Nasibov [34], and Tutmez and Kaymak [43] investigated approaches for estimating the coefficients of the linear regression model by the least squares method when the available data and/or the model parameters are fuzzy. Several methods for linear regression in a fuzzy environment based on the least-absolutes method were studied by Chachi and Taheri [11], Choi and Buckley [13], Kelkinnama and Taheri [27], Kim et al. [28], and Taheri and Kelkinnama [40, 41]. Hasanpour et al. [24, 25] and Pourahmad et al. [37] presented approaches that formulate linear regression models by linear, non-linear, or goal programming. Approaches to fuzzy linear regression based on fuzzy clustering procedures were investigated by Yang and Ko [46] and Yang and Lin [47]. Regression analysis in intuitionistic and type-2 fuzzy environments has been studied by Arefi and Taheri [3], Hosseinzadeh et al. [26], Parvathi et al. [35], and Poleshchuk and Komarov [36].
This paper is organized as follows. In Section 2, we recall some preliminary concepts about interval-valued fuzzy sets. A new distance between interval-valued fuzzy numbers is defined in Section 3. In Section 4, based on clustering techniques, we present a least squares approach to constructing a regression model when the dependent variable and the model parameters are assumed to be interval-valued fuzzy numbers. In the same section, we introduce two indices to evaluate the goodness of fit of such regression models, and we examine the predictive ability of the proposed clustering regression models by cross-validation. An application of the proposed approach to the analysis of some soil characteristics is presented in Section 5. In Section 6, we compare our method with six other methods for regression modeling in imprecise environments. A brief conclusion is provided in Section 7.
Preliminary concepts
In this section, we review some notations and preliminary concepts of interval-valued fuzzy sets. For more details, the reader is referred to Atanassov [5–7] and Atanassov and Gargov [8].
In the above definition, is the lower bound for degree of membership of x into , and is the lower bound for negation of membership of x into . Therefore, the degree of membership of x into the IVFS is characterized by the interval (see also [8, 20]).
The idea of intuitionistic fuzzy sets was introduced by Atanassov [5, 7]. Well-known similar generalizations of fuzzy sets include interval-valued fuzzy set theory, introduced by Sambuc [38] (see also Gorzalczany [20]), and vague set theory, defined by Gau and Buehrer [18]. These approaches are generally not independent, and relationships exist among them. Some are even mathematically equivalent, although they arose on different grounds and have different semantics. For instance, Atanassov’s construct is isomorphic to interval-valued fuzzy sets and other similar notions, even though their interpretive settings and motivations differ: the latter capture the idea of an ill-known membership grade, whereas the former starts from the idea of evaluating degrees of membership and non-membership independently (for more details, see [5, 18]).
In the special case, is called a triangular IVFN (TIVFN) if L(x) = R(x) = max{0, 1 - x} for all x ∈ [0, ∞). It is denoted by (see [22]).
We denote the set of all LR-IVFNs of R by LR-IVFN(R).
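As an illustration of the triangular shape function L(x) = R(x) = max{0, 1 - x}, the following Python sketch evaluates the membership interval of a point under a triangular IVFN. Since the formal definition is not reproduced in this text, the sketch assumes that the lower and upper membership functions share one center, both reach height 1 at that center, and differ only in their left and right spreads. The function names `shape`, `lr_membership`, and `tivfn_membership` are illustrative and do not come from the paper.

```python
def shape(x):
    """Triangular shape function L(x) = R(x) = max{0, 1 - x} for x >= 0."""
    return max(0.0, 1.0 - x)


def lr_membership(x, center, left_spread, right_spread):
    """Membership degree of x in an LR fuzzy number (assumed parameterisation)."""
    if x <= center:
        if left_spread == 0:
            # Degenerate left side: only the center itself has full membership.
            return 1.0 if x == center else 0.0
        return shape((center - x) / left_spread)
    if right_spread == 0:
        return 0.0
    return shape((x - center) / right_spread)


def tivfn_membership(x, center, lower_spreads, upper_spreads):
    """Return the interval (lower, upper) of membership degrees of x.

    Assumption: lower_spreads <= upper_spreads componentwise, so the lower
    membership function never exceeds the upper one.
    """
    lower = lr_membership(x, center, *lower_spreads)
    upper = lr_membership(x, center, *upper_spreads)
    return lower, upper


# Example: center 5, lower spreads (1, 1), upper spreads (2, 2).
interval = tivfn_membership(4.5, 5.0, (1.0, 1.0), (2.0, 2.0))
```

Under these assumptions, the point 4.5 receives a membership interval rather than a single degree, which is the defining feature of an interval-valued fuzzy number.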
Some arithmetic operations on LR-IVFNs are defined as follows (see [9, 42]).
Distance between interval-valued fuzzy numbers
In this section, we define a new distance between IVFNs, which is an extended version of the distance between fuzzy numbers introduced by Yang and Ko [45]. For more details on some other distances between IVFSs, see [7, 44].
,
,
,
.
Items (i), (ii), and (iii) clearly hold. We prove item (iv). We have
Hence, and the claim is proved.
Clustering interval-valued fuzzy regression
In this section, we introduce an approach to estimating the coefficients of clustering multivariate linear regression models based on interval-valued fuzzy output data. In the proposed models, the coefficients are also assumed to be interval-valued fuzzy quantities.
Therefore, we assume that we have a set of observed data , j = 1, 2, . . . , n, where , are LR-IVFNs. The aim is to fit clustering regression models with IVF coefficients to the data set, as follows
For simplicity, we consider some matrix forms as follows (x0j = 1, j = 1, . . . , n)
Estimation of the model parameters
Let G be a set of n interval-valued fuzzy subsets, and let c > 1 be a fixed integer. A partition of G into c parts can be represented by membership functions such that μ ij ∈ [0, 1] and μ 1j + . . . + μ cj = 1 for all j = 1, 2, . . . , n in G. Here, μ ij is the membership degree of the jth observation in the ith cluster.
For estimating the parameters of the clustering models, the sum of the squared errors based on the proposed distance is defined as follows:
Hence, we have
Considering , , , , , , and for i = 1, . . . , c, j = 1, . . . , n, and p = 0, 1, . . . , k, and using the matrix notations, the estimations of parameters are obtained as follows
The parameters of the proposed models are now obtained by the following algorithm (we use the R software in our calculations).
(A1) Fix m = 2, c ∈ {2, . . . , n - 1}, and ɛ > 0;
(A2) For each i = 1, 2, . . . , c and j = 1, 2, . . . , n, choose the initial values in matrix μ, denoted by μ0, and the initial spreads in vectors , , , and ;
(A3) Based on equations (1) - (5) and for each i = 1, 2, . . . , c, calculate the new values , , , , and ;
(A4) If any values of the spreads in vectors , , , and are negative, set them to zero;
(A5) Based on the values obtained in items (A3) and (A4) and equation (6), update the values in matrix μ0 to the values in μ1;
(A6) Using , compare μ0 to μ1. If the criterion is satisfied, then the algorithm stops. Otherwise, set , , , α′ a 0 (i) = α′ a 1 (i) , and β′ a 0 (i) = β′ a 1 (i) and go to item (A3).
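The alternating structure of the algorithm above (weighted least squares, then a membership update, repeated until the memberships stop changing) can be sketched in Python for the much simpler case of crisp data and a single predictor. This is not the paper's estimator: equations (1) - (6) are not reproduced in this text, so the sketch substitutes an ordinary squared error for the proposed IVFN distance, assumes the standard fuzzy c-means style membership update, and assumes a maximum absolute change below ɛ as the stopping rule. The step that sets negative spreads to zero has no counterpart for crisp coefficients and is omitted. All function names are illustrative.

```python
import random


def weighted_line_fit(x, y, w):
    """Weighted least squares for y = b0 + b1 * x; returns (b0, b1)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def update_memberships(dist2, m=2.0, tiny=1e-12):
    """Map squared errors dist2[i][j] to memberships mu[i][j] (assumed FCM-style rule)."""
    c, n = len(dist2), len(dist2[0])
    power = 1.0 / (m - 1.0)
    mu = [[0.0] * n for _ in range(c)]
    for j in range(n):
        # Floor the error at `tiny` to avoid division by zero on exact fits.
        inv = [(1.0 / max(dist2[i][j], tiny)) ** power for i in range(c)]
        total = sum(inv)
        for i in range(c):
            mu[i][j] = inv[i] / total
    return mu


def fuzzy_c_regression(x, y, c=2, m=2.0, eps=1e-6, max_iter=200, seed=0):
    """Alternate weighted fits and membership updates until memberships stabilise."""
    rng = random.Random(seed)
    n = len(x)
    # Random initial partition; each column is normalised to sum to 1.
    mu = [[rng.random() for _ in range(n)] for _ in range(c)]
    for j in range(n):
        s = sum(mu[i][j] for i in range(c))
        for i in range(c):
            mu[i][j] /= s
    coefs = []
    for _ in range(max_iter):
        # Weighted least squares per cluster, with weights mu_ij ** m.
        coefs = [weighted_line_fit(x, y, [u ** m for u in mu[i]]) for i in range(c)]
        dist2 = [[(yj - (b0 + b1 * xj)) ** 2 for xj, yj in zip(x, y)]
                 for b0, b1 in coefs]
        new_mu = update_memberships(dist2, m)
        # Assumed stopping rule: largest change in any membership degree.
        change = max(abs(new_mu[i][j] - mu[i][j])
                     for i in range(c) for j in range(n))
        mu = new_mu
        if change < eps:
            break
    return coefs, mu


# Toy data drawn from two different lines (heterogeneous observations).
xs = [float(t) for t in range(10)]
ys = [2.0 * t + 1.0 if t % 2 == 0 else 10.0 - t for t in xs]
coefs, mu = fuzzy_c_regression(xs, ys, c=2)
```

As with any alternating scheme of this kind, the result can depend on the initial partition, so running it from several seeds is a reasonable precaution.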
First, based on the proposed clustering regression models, we calculate the response values , ,..., under . Calculate d_1, . . . , d_n, and d_new as , j = 1, . . . , n, and , respectively. Also, we order d_1, d_2, . . . , d_n, denoted by d_(1), d_(2), . . . , d_(n). Estimate the membership degrees , i = 1, . . . , c, as follows:
if d_new ≤ d_(1), then
if d_(j-1) < d_new ≤ d_(j), then
if d_(n) < d_new, then
Finally, the response value is predicted as
To evaluate the goodness of fit of the proposed models, two indices are introduced based on the similarity measure between two IVFSs and also based on the distance between two IVFSs.
Cross-validation
Cross-validation [19, 29] is a model validation technique for assessing how the results of an analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a data set of known data on which training is run (the training data set) and a data set of unknown data against which the model is tested (the testing data set).
To investigate the performance of the proposed clustering models by cross-validation with the testing data set, we apply the following procedure:
(1) The jth observation is left out of the data set.
(2) Based on the remaining observations, we develop new clustering interval-valued fuzzy regression models.
(3) Based on the obtained models, the values are predicted.
(4) Based on the distances between the predicted values and the jth observed response value , the membership degrees μ1j, . . . , μcj are estimated as follows
(5) The value of the jth observation is predicted as follows
(6) Repeat items (1)– (5) to calculate .
(7) Calculate the similarity measures and distances between and , j = 1, . . . , n.
(8) Obtain the mean of similarity measures (MSM(1)) and the mean of distances (MD(1)) between the response observations and the predicted values as follows
Finally, if the value of MSM(1) (or MD(1)) for the testing data set is close to the value of MSM (or MD) for the model fitted to the full data set (see Definition 4.6), then the clustering regression models are considered optimal.
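The leave-one-out loop described above can be sketched generically in Python. The sketch is deliberately model-agnostic: `fit`, `predict`, and `distance` are hypothetical callables supplied by the user, and the toy usage plugs in a trivial mean predictor with an absolute-difference distance rather than the clustering IVF models and the distance proposed in this paper. The step in which membership degrees are estimated from the held-out response is therefore not represented.

```python
def leave_one_out(xs, ys, fit, predict, distance):
    """Return the mean held-out distance and the per-observation distances."""
    distances = []
    for j in range(len(xs)):
        # Leave out the jth observation and refit on the remaining ones.
        train_x = xs[:j] + xs[j + 1:]
        train_y = ys[:j] + ys[j + 1:]
        model = fit(train_x, train_y)
        # Predict the held-out response and record its distance to the observation.
        distances.append(distance(ys[j], predict(model, xs[j])))
    return sum(distances) / len(distances), distances


# Toy usage with a deliberately trivial model: predict the training mean.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x_new):
    return model


md1, per_obs = leave_one_out(
    [1, 2, 3], [1.0, 2.0, 3.0], fit_mean, predict_mean,
    lambda y, y_hat: abs(y - y_hat),
)
```

Keeping the per-observation distances alongside their mean is convenient, because unusually large individual values are exactly what one would inspect when looking for possible outliers.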
Note that if the data set contains outliers, clustering linear regression methods can still yield suitable models (even before the outliers are removed); however, if the outliers can be identified, the models obtained after removing them are better. To identify outliers, we can use the results of cross-validation as follows (see, for example, Tables 4 and 5).
Application in soil characteristics
One of the classical problems in soil science is the measurement of physical, chemical, and/or biological soil properties. The problem arises from the difficulty, time, and cost of direct measurements. In this section, some clustering pedomodels are studied to develop relationships between different chemical and physical soil properties. The study area is a part of the Silakhor plain (situated in a province in the west of Iran). A total of 24 core samples were obtained from a depth of 0 to 25 cm. Different soil physical and chemical properties were measured using standard procedures [32]. However, owing to some imprecision in the experimental environment, the observed response variables were reported as IVFNs (see Table 1).
Clustering regression models of CEC-OM-SAND
The data set in Table 1 shows cation exchange capacity (CEC) (as a triangular IVFN response variable), sand content percentage (SAND), and organic matter content (OM) (as two independent variables). Based on this data set, we want to model the relationship between the response variable (CEC) and the explanatory variables (SAND and OM) in three (c = 3) clustering models as follows
Estimation of the model parameters
Using the matrix forms given in Subsection 4.1, we have
Based on Algorithm 1 for m = 2, the parameters of clustering IVF regression models for c = 3 are obtained as follows:
Consequently, the optimal models for c = 3 are given as follows:
The membership degree of the jth observation in the ith cluster, i.e. μ ij , is given in Table 2. For example, the third observation belongs to clusters 1– 3 with the membership degrees μ13 = 0.9887, μ23 = 0.0042, and μ33 = 0.0071, respectively. Based on Definition 4.2, the estimated values are given in the second column of Table 3. The third and fourth columns of Table 3 show the goodness of fit of the models based on the mean of similarity measures (MSM) and the mean of distances (MD).
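As a quick consistency check on the reported degrees, the following Python snippet verifies that the three memberships of the third observation sum to one, as required of a fuzzy partition. The hard assignment to the cluster with the largest degree is added here only for illustration and is not a step of the proposed method.

```python
# Membership degrees of the third observation, as reported in the text.
mu_obs3 = {1: 0.9887, 2: 0.0042, 3: 0.0071}

# A fuzzy partition requires the degrees of each observation to sum to 1.
total = sum(mu_obs3.values())

# Illustrative hard assignment: the cluster with the largest membership degree.
dominant_cluster = max(mu_obs3, key=mu_obs3.get)
```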
Now, suppose that we observe the new values . Based on Definition 4.1, we obtain the response values for c = 1, 2, 3 as
Cross-validation and outliers
We apply cross-validation to the CEC-OM-SAND models. The results are given in Table 4. Since MSM(1) = 0.6349 is reasonably close to 1 (and MD(1) = 0.7720 is reasonably close to 0), the predictive ability of the models is acceptable.
Based on Remark 4.8, two observations (Nos. 10 and 17) with high values of (or low values of ) can be considered outliers. To investigate the effect of possible outliers on the performance of the clustering models, these points were removed from the data set. As shown in Table 5, after removing the outliers, MSM(1) increases from 0.6349 to 0.7098 and MD(1) decreases from 0.7720 to 0.5330, which indicates an improvement of the clustering models.
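To put these changes on a common scale, the short Python computation below expresses them as relative changes. The relative-change summary is an illustrative addition, not an index defined in the paper; the four input values are those quoted from Tables 4 and 5.

```python
# Cross-validation indices before and after removing observations 10 and 17.
msm_before, msm_after = 0.6349, 0.7098
md_before, md_after = 0.7720, 0.5330

# Relative increase in mean similarity and relative decrease in mean distance.
msm_gain = (msm_after - msm_before) / msm_before
md_drop = (md_before - md_after) / md_before

print(f"MSM(1) relative increase: {msm_gain:.1%}")  # about 11.8%
print(f"MD(1) relative decrease: {md_drop:.1%}")    # about 31.0%
```

The distance-based index thus responds more strongly to the removal than the similarity-based index does.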
Discussion
In some real systems, we may encounter heterogeneous data for which we wish to obtain suitable regression models. In such situations, we may model the data by clustering regression, a technique that improves the accuracy of classical regression by partitioning the training space into subspaces. For some approaches on this topic, see Ari and Güvenir [4], Lindgren and Ljung [31], and Motoyoshi et al. [33].
In addition to heterogeneous observations, suppose that we also encounter imprecise quantities. In such situations, we need suitable approaches for analyzing regression models under these restrictions. Two approaches, by Yang and Ko [46] and Yang and Lin [47], model linear regression in fuzzy environments when the observations are heterogeneous. In the following, we review these two approaches.
For comparison with the approach proposed in this paper, we review six approaches as follows. The results of the comparison are summarized in Table 6. Arefi and Taheri [3] presented a least squares approach to regression analysis when the input and output data and the model parameters are assumed to be intuitionistic fuzzy numbers. Hosseinzadeh et al. [26] studied a fuzzy linear regression model with type-2 fuzzy output data and type-2 fuzzy coefficients based on goal programming. Poleshchuk and Komarov [36] investigated a least squares regression model for interval type-2 fuzzy sets when the model coefficients are assumed to be triangular fuzzy numbers. The basic idea is to determine aggregation intervals for the type-1 fuzzy sets whose membership functions are the lower and upper membership functions of the interval type-2 fuzzy set. Parvathi et al. [35] studied a linear regression analysis based on a linear programming problem when the output data and the model coefficients are intuitionistic fuzzy numbers. Yang and Ko [46] applied fuzzy clustering techniques to fuzzy simple regression analysis when the observations are heterogeneous. They presented cluster-wise fuzzy regression analysis in two approaches: the two-stage weighted fuzzy regression and the one-stage generalized fuzzy regression. A least-squares linear regression analysis with fuzzy inputs and outputs and fuzzy parameters was studied by Yang and Lin [47]. Since the observations in that setting are heterogeneous, they use cluster-wise fuzzy regression analysis.
Conclusions
In this paper, a new approach to clustering linear regression models is presented. Some merits of the proposed approach are as follows. It formulates clustering linear regression models by a weighted least squares method in which the weights are the membership degrees of the jth observation in the ith cluster. It extends the approaches of Yang and Ko [46] and Yang and Lin [47] to clustering linear regression models in which the response variable and the model parameters are assumed to be interval-valued fuzzy numbers. To evaluate the goodness of fit of the clustering regression models, indices based on the similarity measure and the squared errors are provided. We have also introduced a new cross-validation method to evaluate the predictive ability of the proposed clustering models.
A possible extension of the proposed approach is to formulate clustering linear regression models in an interval-valued fuzzy environment using the least-absolutes method.
