Abstract
Regression models are a powerful analytical tool for estimating the relationships between explanatory variables and a response variable. Traditionally, it is assumed that the data are observed precisely and characterized by crisp values. However, in many cases, the data are collected in an imprecise way and are characterized in terms of uncertain variables. In this paper, the residual analysis of uncertain regression models is provided. Furthermore, an approach to obtain the forecast value and the confidence interval of the response variable for new explanatory variables is given. Finally, a numerical example of the uncertain regression model is presented.
Introduction
In many applications, we would like to know how changes in some variables affect another variable. In this case, the variables are usually divided into explanatory variables and the response variable, and a forecast function is built to predict the value of the response variable from the explanatory variables. Linear regression is a common method to derive a linear forecast function from the straight line fitted to the explanatory variables and the response variable. Although a straight-line relationship between explanatory variables and the response variable may not be exact, it can still be meaningful. The term “regression” was coined by Galton [9] for a simple linear regression model, in which a fitted straight line was plotted to illustrate the relationship between parents’ height and children’s height. However, the work of Galton had only biological meaning, and the concept of regression was later introduced to the statistical domain by Yule [29].
In statistics, it is important to have a method for estimating the unknown parameters from given observations. The earliest approach to point estimation of the parameters is the principle of least squares, which was first published by Legendre [11] and developed by Gauss [10]. Similar to least squares, least absolute deviations, which was modified by Edgeworth [6, 7], can be applied to estimate a single value for the unknown parameters. Another common point estimation method is maximum likelihood, which was widely popularized by Wilks [25]. In contrast to the single value calculated by point estimation, interval estimation, which was first proposed by Neyman [20], uses data to calculate an interval of possible values of an unknown parameter and is extensively applied to the estimation of regression models. Another important technique of statistics is hypothesis testing, such as the t-test (Student [22]) and the F-test (Fisher [8]). Furthermore, the likelihood ratio test was shown to be the most powerful test by Neyman and Pearson [19].
Traditionally, statisticians assume that explanatory variables and the response variable can be observed in a precise way. But in many cases, the data cannot be precisely estimated. For example, the data on factories’ carbon emissions are collected in an imprecise way. As another example, the social benefit of factories also cannot be estimated precisely. By handling the imprecise observations as fuzzy observed data, a fuzzy linear regression model with crisp explanatory variables and a fuzzy response variable was first proposed by Tanaka et al. [23]. Then a modified form of estimation for the parameters was suggested by Corral and Gil [3] by extending the maximum likelihood principle to the case of fuzzy observed data. Diamond [5] further introduced least squares fitting for crisp explanatory variables and a fuzzy response variable to estimate the unknown parameters of the regression models. Furthermore, Corral and Gil [4] worked on the problem of interval estimation with fuzzy observed data. For the case of fuzzy explanatory variables and a fuzzy response variable, the regression model was first employed by Sakawa and Yano [21]. Another application of fuzzy sets to imprecise observations is statistical hypothesis testing, which was first discussed by Casals et al. [1, 2].
However, many surveys have shown that uncertainty theory is better suited to modeling imprecise observations given by experts [16]. Thus we should take the imprecisely observed data as uncertain variables and describe them by uncertainty distributions (Liu [13]). The use of uncertain observed data was developed by many scholars, such as Wen et al. [24], Lio and Liu [12], Nejad and Ghaffari-Hadigheh [18], Yao [27] and Yang and Liu [26]. In particular, uncertain regression analysis was presented to model the relationship between explanatory variables and the response variable with uncertain observed data. To estimate the unknown parameters in uncertain regression models, the principle of least squares was suggested by Yao and Liu [28].
In this paper, we employ some regression models for analyzing the relationship between uncertain explanatory variables and the uncertain response variable. The rest of the paper is organized as follows: In Section 2, some preliminary knowledge of uncertainty theory is highlighted. In Section 3, some formulas are provided to estimate the parameters of the regression models based on uncertain observed data, and their residual analysis is given in Section 4. In Section 5, the confidence interval of the uncertain regression models is suggested, and in Section 6, a numerical example is provided to illustrate the application of the model. Finally, some conclusions are made in Section 7.
Preliminaries
Through many surveys, Liu [17] showed that human beings usually estimate a much wider range of values than the object actually takes. This conservative estimation of degrees of belief makes the distribution function deviate far from the frequency. This motivated Liu [13] to found uncertainty theory to deal with cases that rely on degrees of belief because precise observations or measurements are difficult to perform. In this section, some basic concepts and theorems in uncertainty theory, including uncertain measure, uncertain variable and expected value, are reviewed.
Axiom 1. (Normality Axiom) M {Γ} =1 for the universal set Γ.
Axiom 2. (Duality Axiom) M{Λ} + M{Λ^c} = 1 for any event Λ.
Axiom 3. (Subadditivity Axiom) For every countable sequence of events Λ1, Λ2, ⋯, we have
$$M\left\{\bigcup_{i=1}^{\infty}\Lambda_i\right\} \le \sum_{i=1}^{\infty} M\{\Lambda_i\}.$$
The triplet (Γ, ℒ, M) is called an uncertainty space. Furthermore, the product uncertain measure on the product σ-algebra ℒ is defined by the following fourth axiom.
Axiom 4. (Product Axiom) (Liu [14]) Let (Γk, ℒk, Mk) be uncertainty spaces for k = 1, 2, ⋯. The product uncertain measure M is an uncertain measure satisfying
$$M\left\{\prod_{k=1}^{\infty}\Lambda_k\right\} = \bigwedge_{k=1}^{\infty} M_k\{\Lambda_k\},$$
where Λk are arbitrarily chosen events from ℒk for k = 1, 2, ⋯, respectively.
As a real-valued function on the uncertainty space (Γ, ℒ, M), an uncertain variable is introduced to model a quantity with human uncertainty.
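The uncertainty distribution Φ of an uncertain variable ξ is defined by Φ(x) = M{ξ ≤ x} for any real number x. An uncertainty distribution Φ(x) is said to be regular if it is a continuous and strictly increasing function with respect to x at which 0 < Φ(x) < 1, and
$$\lim_{x\to-\infty}\Phi(x) = 0, \qquad \lim_{x\to+\infty}\Phi(x) = 1.$$
If ξ is an uncertain variable with regular uncertainty distribution Φ(x), the inverse function Φ^{-1}(α) is called the inverse uncertainty distribution of ξ (Liu [15]).
An uncertain variable ξ is called linear if it has an uncertainty distribution
$$\Phi(x) = \begin{cases} 0, & x \le a \\ \dfrac{x-a}{b-a}, & a \le x \le b \\ 1, & x \ge b, \end{cases}$$
denoted by ℒ(a, b), where a and b are real numbers with a < b.
An uncertain variable ξ is called zigzag if it has an uncertainty distribution
$$\Phi(x) = \begin{cases} 0, & x \le a \\ \dfrac{x-a}{2(b-a)}, & a \le x \le b \\ \dfrac{x+c-2b}{2(c-b)}, & b \le x \le c \\ 1, & x \ge c, \end{cases}$$
denoted by Z(a, b, c), where a, b, c are real numbers with a < b < c.
An uncertain variable ξ is called normal if it has an uncertainty distribution
$$\Phi(x) = \left(1 + \exp\left(\frac{\pi(e-x)}{\sqrt{3}\,\sigma}\right)\right)^{-1}, \quad x \in \mathbb{R},$$
denoted by N(e, σ), where e and σ are real numbers with σ > 0.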
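The three distributions above, together with their inverse uncertainty distributions, can be sketched in a few lines of Python. This is a minimal illustration assuming the standard closed forms from uncertainty theory; the function names are hypothetical and are not taken from the paper.

```python
import math

def linear_cdf(x, a, b):
    """Uncertainty distribution of a linear uncertain variable L(a, b), a < b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def zigzag_cdf(x, a, b, c):
    """Uncertainty distribution of a zigzag uncertain variable Z(a, b, c), a < b < c."""
    if x <= a:
        return 0.0
    if x <= b:
        return (x - a) / (2 * (b - a))
    if x <= c:
        return (x + c - 2 * b) / (2 * (c - b))
    return 1.0

def normal_cdf(x, e, sigma):
    """Uncertainty distribution of a normal uncertain variable N(e, sigma).
    Note: math.exp can overflow for x extremely far below e."""
    return 1.0 / (1.0 + math.exp(math.pi * (e - x) / (math.sqrt(3) * sigma)))

def linear_inv(alpha, a, b):
    """Inverse uncertainty distribution of L(a, b)."""
    return (1 - alpha) * a + alpha * b

def zigzag_inv(alpha, a, b, c):
    """Inverse uncertainty distribution of Z(a, b, c)."""
    if alpha < 0.5:
        return (1 - 2 * alpha) * a + 2 * alpha * b
    return (2 - 2 * alpha) * b + (2 * alpha - 1) * c

def normal_inv(alpha, e, sigma):
    """Inverse uncertainty distribution of N(e, sigma), valid for 0 < alpha < 1."""
    return e + sigma * math.sqrt(3) / math.pi * math.log(alpha / (1 - alpha))
```

Each inverse function undoes the corresponding distribution on the range where 0 < Φ(x) < 1, which is what regularity guarantees.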
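Assume that ξ1, ξ2, ⋯, ξn are independent uncertain variables with regular uncertainty distributions Φ1, Φ2, ⋯, Φn, respectively. Liu [15] showed that if f(x1, x2, ⋯, xn) is a strictly monotone function, then the inverse uncertainty distribution of the uncertain variable f(ξ1, ξ2, ⋯, ξn) can be calculated by the following theorem.
Theorem 2.1. (Liu [15]) Let ξ1, ξ2, ⋯, ξn be independent uncertain variables with regular uncertainty distributions Φ1, Φ2, ⋯, Φn, respectively. If f(x1, x2, ⋯, xn) is strictly increasing with respect to x1, x2, ⋯, xm and strictly decreasing with respect to xm+1, xm+2, ⋯, xn, then ξ = f(ξ1, ξ2, ⋯, ξn) has the inverse uncertainty distribution
$$\Psi^{-1}(\alpha) = f\left(\Phi_1^{-1}(\alpha), \cdots, \Phi_m^{-1}(\alpha), \Phi_{m+1}^{-1}(1-\alpha), \cdots, \Phi_n^{-1}(1-\alpha)\right).$$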
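As a small illustration of this rule for strictly monotone functions, the sketch below evaluates the inverse uncertainty distribution of a difference of two independent linear uncertain variables: inverse distributions of the increasing arguments are evaluated at α, and those of the decreasing arguments at 1 − α. The helper names and the example parameters are assumptions made for illustration only.

```python
def linear_inv(alpha, a, b):
    """Inverse uncertainty distribution of a linear uncertain variable L(a, b)."""
    return (1 - alpha) * a + alpha * b

def inverse_of_function(f, increasing, decreasing):
    """Build alpha -> f(inc^{-1}(alpha), ..., dec^{-1}(1 - alpha), ...).

    `increasing` and `decreasing` are lists of inverse distributions of the
    arguments in which f is strictly increasing or strictly decreasing;
    f takes the increasing arguments first."""
    def inv(alpha):
        args = [g(alpha) for g in increasing]
        args += [g(1 - alpha) for g in decreasing]
        return f(*args)
    return inv

# Hypothetical example: xi1 ~ L(1, 3), xi2 ~ L(0, 2), and f(u, v) = u - v.
xi1_inv = lambda alpha: linear_inv(alpha, 1.0, 3.0)
xi2_inv = lambda alpha: linear_inv(alpha, 0.0, 2.0)
diff_inv = inverse_of_function(lambda u, v: u - v, [xi1_inv], [xi2_inv])

print(diff_inv(0.5))
```

For these parameters the result simplifies by hand to 4α − 1, which gives a quick sanity check on the sign convention.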
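As the average value of an uncertain variable in the sense of uncertain measure, the expected value can represent the size of the uncertain variable. For an uncertain variable ξ, it is defined by
$$E[\xi] = \int_0^{+\infty} M\{\xi \ge x\}\,dx - \int_{-\infty}^{0} M\{\xi \le x\}\,dx,$$
provided that at least one of the two integrals is finite.
As another important feature of an uncertain variable, the variance is defined as follows: if ξ has a finite expected value e, then
$$V[\xi] = E[(\xi - e)^2].$$
Let ξ be an uncertain variable with regular uncertainty distribution Φ. Then we have
$$E[\xi] = \int_0^1 \Phi^{-1}(\alpha)\,d\alpha, \qquad E[\xi^2] = \int_0^1 \left(\Phi^{-1}(\alpha)\right)^2 d\alpha.$$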
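When the inverse uncertainty distribution is available, the expected value and variance can be approximated numerically. The sketch below assumes the standard inverse-distribution formulas E[ξ] = ∫₀¹ Φ⁻¹(α) dα and V[ξ] = ∫₀¹ (Φ⁻¹(α) − E[ξ])² dα and uses a simple midpoint rule; the function names and the grid size are illustrative choices, not part of the paper.

```python
def linear_inv(alpha, a, b):
    """Inverse uncertainty distribution of a linear uncertain variable L(a, b)."""
    return (1 - alpha) * a + alpha * b

def expected_value(inv, n=10_000):
    """Midpoint-rule approximation of the integral of inv(alpha) over (0, 1)."""
    h = 1.0 / n
    return sum(inv((k + 0.5) * h) for k in range(n)) * h

def variance(inv, n=10_000):
    """Midpoint-rule approximation of the integral of (inv(alpha) - e)^2 over (0, 1)."""
    e = expected_value(inv, n)
    h = 1.0 / n
    return sum((inv((k + 0.5) * h) - e) ** 2 for k in range(n)) * h

# Hypothetical example: a linear uncertain variable L(2, 6).
inv = lambda alpha: linear_inv(alpha, 2.0, 6.0)
print(expected_value(inv), variance(inv))
```

For a linear uncertain variable ℒ(a, b) these integrals have the closed forms (a + b)/2 and (b − a)²/12, which the numerical values should reproduce closely.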
The uncertainty distributions of the observed variables can be determined from the expert’s experience. To collect experimental data from an expert, Liu [15] suggested a questionnaire survey method. We first ask the domain expert for a possible value x that the social benefit ξ of a certain company may take, and then ask the expert “How likely is ξ less than or equal to x?”
Then denote the expert’s belief degree by α (say 0.4). An expert’s experimental data (x, α) is thus obtained from the domain expert.
To illustrate the process of determining the uncertainty distribution, we suppose that the social benefit is imprecise and a domain expert is invited to provide the experimental data. Then the consultation process can be as follows:
Q1: What do you think is the minimal value of the social benefit of the company?
A1: 0.5 billion dollars. (an expert’s experimental data (0.5, 0) is obtained)
Q2: What do you think is the maximal value?
A2: 1.2 billion dollars. (an expert’s experimental data (1.2, 1) is obtained)
Q3: What do you think is a likely value?
A3: 0.8 billion dollars.
Q4: To what degree do you think that the real value of the social benefit is less than 0.8 billion dollars?
A4: 40%. (an expert’s experimental data (0.8, 0.4) is obtained)
Q5: Is there another value the social benefit may be?
A5: 1 billion dollars.
Q6: To what degree do you think that the real value is less than 1 billion dollars?
A6: 80%. (an expert’s experimental data (1, 0.8) is obtained)
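Hence four expert’s experimental data of the imprecise social benefit of the company are obtained from the domain expert, i.e.,
(0.5, 0), (0.8, 0.4), (1, 0.8), (1.2, 1).
Take (0.5, 0) as (x1, α1), (0.8, 0.4) as (x2, α2), (1, 0.8) as (x3, α3) and (1.2, 1) as (x4, α4). Then, following the empirical uncertainty distribution suggested by Liu [15], the uncertainty distribution of the imprecise social benefit can be determined by
$$\Phi(x) = \begin{cases} 0, & x < x_1 \\ \alpha_i + \dfrac{(\alpha_{i+1}-\alpha_i)(x-x_i)}{x_{i+1}-x_i}, & x_i \le x \le x_{i+1},\ 1 \le i < 4 \\ 1, & x > x_4. \end{cases}$$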
Essentially, it is a type of linear interpolation method.
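The interpolation can be sketched as follows, using the four expert's experimental data from the consultation above. The function name `empirical_cdf` is hypothetical, and the code assumes the data points have distinct x values.

```python
def empirical_cdf(x, data):
    """Piecewise-linear empirical uncertainty distribution through the
    expert's experimental data [(x1, a1), ..., (xn, an)]."""
    pts = sorted(data)
    if x < pts[0][0]:
        return 0.0
    if x > pts[-1][0]:
        return 1.0
    # Linear interpolation between neighbouring data points.
    for (x0, a0), (x1, a1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            return a0 + (a1 - a0) * (x - x0) / (x1 - x0)
    return pts[-1][1]  # only reached when a single data point is given

# Expert's experimental data for the social benefit (billion dollars).
data = [(0.5, 0.0), (0.8, 0.4), (1.0, 0.8), (1.2, 1.0)]
print(empirical_cdf(0.9, data))
```

For instance, 0.9 lies halfway between 0.8 and 1, so the belief degree is halfway between 0.4 and 0.8.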
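Let (x1, x2, ⋯, xp) be a vector of explanatory variables, and let y be a response variable. Assume the relationship between (x1, x2, ⋯, xp) and y can be expressed by a function f, and the model is generally given as
$$y = f(x_1, x_2, \cdots, x_p \mid \boldsymbol{\beta}) + \varepsilon, \tag{6}$$
where β is a vector of unknown parameters and ε is a disturbance term.
Suppose that there is a set of imprecisely observed data,
$$(\tilde{x}_{i1}, \tilde{x}_{i2}, \cdots, \tilde{x}_{ip}, \tilde{y}_i), \quad i = 1, 2, \cdots, n,$$
where x̃i1, x̃i2, ⋯, x̃ip, ỹi are uncertain variables with uncertainty distributions Φi1, Φi2, ⋯, Φip, Ψi, i = 1, 2, ⋯, n, respectively.
To obtain an estimate β* of β, Yao and Liu [28] suggested that the least squares estimate of β in the regression model (6) is the solution of the minimization problem
$$\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} E\left[\left(\tilde{y}_i - f(\tilde{x}_{i1}, \tilde{x}_{i2}, \cdots, \tilde{x}_{ip} \mid \boldsymbol{\beta})\right)^2\right]. \tag{8}$$
Assume the optimal solution of the minimization problem (8) is β*. Then the fitted regression model can be denoted by
$$y = f(x_1, x_2, \cdots, x_p \mid \boldsymbol{\beta}^*).$$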
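Consider the linear regression model y = β0 + β1x1 + ⋯ + βpxp + ε, and assume that the observed uncertain variables are independent with regular uncertainty distributions. For each i, it follows from Theorem 2.1 that the inverse uncertainty distribution of ỹi − β0 − β1x̃i1 − ⋯ − βpx̃ip is
$$F_i^{-1}(\alpha) = \Psi_i^{-1}(\alpha) - \beta_0 - \sum_{j=1}^{p} \beta_j \Upsilon_{ij}^{-1}(\alpha, \beta_j),$$
where
$$\Upsilon_{ij}^{-1}(\alpha, \beta_j) = \begin{cases} \Phi_{ij}^{-1}(1-\alpha), & \beta_j \ge 0 \\ \Phi_{ij}^{-1}(\alpha), & \beta_j < 0. \end{cases}$$
Then from Equation (2), we obtain
$$E\left[\left(\tilde{y}_i - \beta_0 - \sum_{j=1}^{p} \beta_j \tilde{x}_{ij}\right)^2\right] = \int_0^1 \left(\Psi_i^{-1}(\alpha) - \beta_0 - \sum_{j=1}^{p} \beta_j \Upsilon_{ij}^{-1}(\alpha, \beta_j)\right)^2 d\alpha.$$
Thus the minimization problem (13) is equivalent to
$$\min_{\beta_0, \beta_1, \cdots, \beta_p} \sum_{i=1}^{n} \int_0^1 \left(\Psi_i^{-1}(\alpha) - \beta_0 - \sum_{j=1}^{p} \beta_j \Upsilon_{ij}^{-1}(\alpha, \beta_j)\right)^2 d\alpha.$$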
The theorem is verified.
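For a linear model, the least squares objective can be evaluated by integrating the squared inverse uncertainty distribution of each residual over (0, 1). The sketch below assumes linear uncertain observations ℒ(a, b), independence of the observed variables, and a midpoint rule for the integral. The data layout and function names are hypothetical, and the sign rule (level 1 − α for a non-negative coefficient, α otherwise) is the one implied by Theorem 2.1.

```python
def linear_inv(alpha, a, b):
    """Inverse uncertainty distribution of a linear uncertain variable L(a, b)."""
    return (1 - alpha) * a + alpha * b

def residual_inv(alpha, beta, x_params, y_params):
    """Inverse distribution at level alpha of y~ - beta0 - sum_j beta_j * x~_j.

    The residual is increasing in y~, decreasing in x~_j when beta_j >= 0
    and increasing in x~_j when beta_j < 0."""
    value = linear_inv(alpha, *y_params) - beta[0]
    for bj, (a, b) in zip(beta[1:], x_params):
        level = 1 - alpha if bj >= 0 else alpha
        value -= bj * linear_inv(level, a, b)
    return value

def least_squares_objective(beta, data, n=2000):
    """Sum over observations of the integral of residual_inv(alpha)^2 on (0, 1)."""
    h = 1.0 / n
    total = 0.0
    for x_params, y_params in data:
        total += sum(residual_inv((k + 0.5) * h, beta, x_params, y_params) ** 2
                     for k in range(n)) * h
    return total

# Hypothetical single observation: x~ ~ L(0, 2), y~ ~ L(1, 3).
data = [([(0.0, 2.0)], (1.0, 3.0))]
print(least_squares_objective((1.0, 1.0), data))
```

In practice this objective would be passed to a numerical optimizer to obtain β*; that step is omitted here because the choice of optimizer is not specified in the text.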
Since the function
Then from Equation (2), we obtain
Thus the minimization problem (13.1) is equivalent to
The theorem is verified.
Since the function
Then from Equation (2), we obtain
Thus the minimization problem (13.2) is equivalent to
The theorem is verified.
In the regression model (6), there is a disturbance term ε, an increment by which the response variable y may fall off the regression. Similar to β, ε is also unknown, and in fact it cannot be determined exactly since it changes for each observation. We are therefore interested in finding an estimate for ε from the given imprecisely observed data, (x̃i1, x̃i2, ⋯, x̃ip, ỹi), i = 1, 2, ⋯, n.
For each i, the difference between ỹi and f(x̃i1, x̃i2, ⋯, x̃ip | β*) represents the deviation of the response variable ỹi from the forecast variable f(x̃i1, x̃i2, ⋯, x̃ip | β*). Thus we propose a definition as follows:
Let (x̃i1, x̃i2, ⋯, x̃ip, ỹi), i = 1, 2, ⋯, n be a set of imprecisely observed data, and let the fitted regression model be y = f(x1, x2, ⋯, xp | β*). Then for each i (i = 1, 2, ⋯, n), the term
$$\hat{\varepsilon}_i = \tilde{y}_i - f(\tilde{x}_{i1}, \tilde{x}_{i2}, \cdots, \tilde{x}_{ip} \mid \boldsymbol{\beta}^*)$$
is called the i-th residual.
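Now assume that the disturbance term ε is an uncertain variable. Then we use the average of the expected values of the residuals, i.e.,
$$\hat{e} = \frac{1}{n} \sum_{i=1}^{n} E[\hat{\varepsilon}_i],$$
to estimate the expected value of ε, and
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} E\left[(\hat{\varepsilon}_i - \hat{e})^2\right]$$
to estimate the variance of ε, where ε̂i = ỹi − f(x̃i1, x̃i2, ⋯, x̃ip | β*) denotes the i-th residual.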
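For the linear regression model, let the fitted model be y = β0* + β1*x1 + ⋯ + βp*xp, and assume that the observed uncertain variables are independent with regular uncertainty distributions Φi1, Φi2, ⋯, Φip, Ψi. Then the estimated expected value of the disturbance term ε is
$$\hat{e} = \frac{1}{n} \sum_{i=1}^{n} \int_0^1 \left(\Psi_i^{-1}(\alpha) - \beta_0^* - \sum_{j=1}^{p} \beta_j^* \Upsilon_{ij}^{-1}(\alpha, \beta_j^*)\right) d\alpha,$$
and the estimated variance is
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \int_0^1 \left(\Psi_i^{-1}(\alpha) - \beta_0^* - \sum_{j=1}^{p} \beta_j^* \Upsilon_{ij}^{-1}(\alpha, \beta_j^*) - \hat{e}\right)^2 d\alpha,$$
where Υij^{-1}(α, βj*) equals Φij^{-1}(1 − α) if βj* ≥ 0 and Φij^{-1}(α) if βj* < 0.
The theorem follows from Equations (1) and (2) immediately.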
Then the estimated expected value of the disturbance term ∊ is
The theorem follows from Equations (1) and (2) immediately.
Then the estimated expected value of the disturbance term ∊ is
The theorem follows from Equations (1) and (2) immediately.
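For a fitted linear model, the residual-based estimates of the expected value and variance of the disturbance term can be sketched numerically as below. The code assumes linear uncertain observations, independence, and the averaging estimators described in this section (the mean of the residuals' expected values, and the mean of their expected squared deviations from that value). All names and the toy data are hypothetical.

```python
def linear_inv(alpha, a, b):
    """Inverse uncertainty distribution of a linear uncertain variable L(a, b)."""
    return (1 - alpha) * a + alpha * b

def residual_inv(alpha, beta, x_params, y_params):
    """Inverse distribution at level alpha of the residual y~ - beta0 - sum_j beta_j x~_j."""
    value = linear_inv(alpha, *y_params) - beta[0]
    for bj, (a, b) in zip(beta[1:], x_params):
        level = 1 - alpha if bj >= 0 else alpha
        value -= bj * linear_inv(level, a, b)
    return value

def residual_moments(beta, data, n=2000):
    """Return (e_hat, var_hat) for the disturbance term via midpoint integration."""
    h = 1.0 / n
    levels = [(k + 0.5) * h for k in range(n)]
    e_hat = sum(sum(residual_inv(a, beta, xs, ys) for a in levels) * h
                for xs, ys in data) / len(data)
    var_hat = sum(sum((residual_inv(a, beta, xs, ys) - e_hat) ** 2 for a in levels) * h
                  for xs, ys in data) / len(data)
    return e_hat, var_hat

# Hypothetical data: two observations with one explanatory variable each.
data = [([(0.0, 2.0)], (1.0, 3.0)),
        ([(1.0, 3.0)], (3.0, 5.0))]
e_hat, var_hat = residual_moments((1.0, 1.0), data)
print(e_hat, var_hat)
```

With these toy values the residual inverse distributions are 4α − 2 and 4α − 1, so the estimates can be checked by hand.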
Suppose (x̃1, x̃2, ⋯, x̃p) is a vector of new explanatory variables, where x̃1, x̃2, ⋯, x̃p are uncertain variables with regular uncertainty distributions Φ1, Φ2, ⋯, Φp, respectively. It is useful to forecast the response variable for the new explanatory vector from the given imprecisely observed data (x̃i1, x̃i2, ⋯, x̃ip, ỹi), i = 1, 2, ⋯, n. For example, suppose a new factory is founded and its social benefit needs to be forecast. Taking social benefit as the response variable, and average product quality, monthly salary of employees and carbon emission as explanatory variables, an uncertain regression model can be built from the data of existing factories. According to the model, the social benefit of the new factory can be forecast from the information on production, salary and carbon emission, and used to judge whether setting up the new factory is reasonable.
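Although the relationship between uncertain explanatory variables and the uncertain response variable may be complicated, it is still valuable to apply a linear regression model to the data. Now suppose the fitted linear regression model is
$$y = \beta_0^* + \sum_{j=1}^{p} \beta_j^* x_j,$$
and let ŷ denote the forecast uncertain variable of y with respect to the new explanatory vector, that is,
$$\hat{y} = \beta_0^* + \sum_{j=1}^{p} \beta_j^* \tilde{x}_j + \varepsilon.$$
A single value of y should be estimated from the forecast uncertain variable, and it is natural to define the forecast value of y as the expected value
$$\mu = E[\hat{y}].$$
Let Ψ̂ denote the uncertainty distribution of the forecast uncertain variable ŷ.
The forecast value μ is a point estimate of y. However, it is not convincing to claim that y always takes one precise value. Hence a confidence interval is proposed to estimate y. Although some precision is given up when applying a confidence interval, we gain some confidence that our inference is correct. Taking α (e.g., 95%) as a confidence level, we are interested in finding the minimum value b such that
$$M\{\mu - b \le \hat{y} \le \mu + b\} \ge \alpha.$$
Since M{μ − b ≤ ŷ ≤ μ + b} ≥ Ψ̂(μ + b) − Ψ̂(μ − b), the value b may be taken as the minimum value satisfying Ψ̂(μ + b) − Ψ̂(μ − b) ≥ α, and [μ − b, μ + b] is then the α confidence interval of y.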
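One way to compute such a half-width b numerically is sketched below. The sketch makes several assumptions that go beyond the text: the fitted model is linear with one explanatory variable, the new explanatory variable is a linear uncertain variable independent of the disturbance term, the disturbance term is a normal uncertain variable with the estimated expected value and standard deviation, and b is taken as the smallest value for which Ψ̂(μ + b) − Ψ̂(μ − b) reaches the confidence level, where Ψ̂ is the uncertainty distribution of the forecast uncertain variable. All names and numbers are hypothetical.

```python
import math

def linear_inv(alpha, a, b):
    """Inverse uncertainty distribution of a linear uncertain variable L(a, b)."""
    return (1 - alpha) * a + alpha * b

def normal_inv(alpha, e, sigma):
    """Inverse uncertainty distribution of a normal uncertain variable N(e, sigma)."""
    return e + sigma * math.sqrt(3) / math.pi * math.log(alpha / (1 - alpha))

def cdf_from_inverse(inv, x, eps=1e-12):
    """Recover the distribution value at x by bisection on an increasing inverse."""
    lo, hi = eps, 1 - eps
    if x <= inv(lo):
        return 0.0
    if x >= inv(hi):
        return 1.0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if inv(mid) < x:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def half_width(inv, mu, alpha, tol=1e-9):
    """Smallest b (up to tol) with cdf(mu + b) - cdf(mu - b) >= alpha."""
    coverage = lambda b: cdf_from_inverse(inv, mu + b) - cdf_from_inverse(inv, mu - b)
    hi = 1.0
    while coverage(hi) < alpha:  # grow the bracket until it covers level alpha
        hi *= 2
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if coverage(mid) >= alpha:
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical fitted model y = 1 + 2*x, new x~ ~ L(0, 1), disturbance ~ N(0, 1).
beta = (1.0, 2.0)
e_hat, sigma_hat = 0.0, 1.0
# beta[1] > 0, so the forecast is increasing in x~ and level alpha is used for it.
forecast_inv = lambda a: beta[0] + beta[1] * linear_inv(a, 0.0, 1.0) + normal_inv(a, e_hat, sigma_hat)
mu = beta[0] + beta[1] * 0.5 + e_hat  # expected value of the forecast variable
b = half_width(forecast_inv, mu, 0.95)
print(mu - b, mu + b)
```

Because the forecast distribution in this toy case is symmetric about μ, the half-width coincides with the distance from μ to the 0.975 level of the inverse distribution, which gives a simple consistency check.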
In this section, we consider an example to show how the regression model can be applied to forecast the response for a new explanatory vector with imprecise observations, and the calculation of the 95% confidence interval is also given.
Suppose (x̃i1, x̃i2, x̃i3, ỹi), i = 1, 2, ⋯, 24 is a set of imprecisely observed data, where x̃i1, x̃i2, x̃i3, ỹi are independent uncertain variables with linear uncertainty distributions Φi1, Φi2, Φi3, Ψi, respectively. The data are provided in Table 1.
Table 1. Imprecisely observed data, where ℒ(a, b) represents a linear uncertain variable
To forecast the response for a new explanatory vector, we employ the linear regression model
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon.$$
From Theorem 3.1, Equation (41) can be changed to an equivalent form, i.e.,
Hence the fitted linear regression model is
By applying Equations (26) and (27), i.e.,
For the confidence level α = 95%, if we suppose further that the disturbance term ∊ is a normal uncertain variable, then
Here
Since observed data are often collected in an imprecise way, this paper introduced some uncertain regression models for handling uncertain observed data reasonably. In order to study the disturbance term in the models, the concept of the i-th residual and the residual analysis of the models were proposed. Furthermore, it is necessary to provide an estimate when a vector of new explanatory variables is given. Hence the forecast value and the confidence interval of the response variable with respect to the new explanatory variables were presented, and a numerical example was provided to illustrate the calculation of the unknown parameters, the estimated expected value and variance of the disturbance term, the forecast value and the confidence interval in terms of some given observed data.
In future work, hypothesis testing for the unknown parameters in the uncertain regression model will be studied, and the concept of a multiple correlation coefficient will be proposed for assessing the regression fit with imprecise observations.
Acknowledgments
This work was supported by National Natural Science Foundation of China Grant No. 61573210.
