Residual and confidence interval for uncertain regression model with imprecise observations

Abstract

A regression model is a powerful analytical tool for estimating the relationship between explanatory variables and a response variable. Traditionally, the data are assumed to be observed precisely and characterized by crisp values. In many cases, however, the data are collected in an imprecise way and are better characterized as uncertain variables. This paper provides the residual analysis of uncertain regression models. It then gives an approach for obtaining the forecast value and the confidence interval of the response variable for new explanatory variables. Finally, a numerical example of the uncertain regression model is documented.

Keywords

Regression analysis, uncertainty theory, uncertain variable, residual, confidence interval

1 Introduction

In many applications, we would like to know how changes in some variables affect another variable. In this case, the variables are usually divided into explanatory variables and the response variable, and a forecast function is built to predict the value of the response variable from the explanatory variables. Linear regression is a common method for deriving a linear forecast function from a straight line fitted to the explanatory variables and the response variable. Although a straight-line relationship between explanatory variables and the response variable may not be exact, it can still be meaningful. The term “regression” was created by Galton [9] for a simple linear regression model, in which a fitted straight line was plotted to illustrate the relationship between parents’ height and children’s height. However, the work of Galton had only biological meaning, and the concept of regression was later introduced to the statistical domain by Yule [29].

In statistics, it is important to obtain a method for estimating the unknown parameters from given observations. The earliest approach to point estimation of the parameters is the principle of least squares, which was first published by Legendre [11] and developed by Gauss [10]. Similar to least squares, least absolute deviations, which was modified by Edgeworth [6, 7], can be applied to estimate a single value for the unknown parameters. Another common point estimation method is maximum likelihood, which was widely popularized by Wilks [25]. In contrast to the single value calculated by point estimation, interval estimation, which was first proposed by Neyman [20], uses data to calculate an interval of possible values of an unknown parameter and is extensively applied to the estimation of regression models. Another important technique of statistics is hypothesis testing, such as the t-test (Student [22]) and the F-test (Fisher [8]). Furthermore, the likelihood ratio test was shown to be the most powerful test by Neyman and Pearson [19].

Traditionally, statisticians assume explanatory variables and the response variable can be observed in a precise way. But in many cases, the data cannot be precisely estimated. For example, the data of the factories’ carbon emission are collected in an imprecise way. As another example, the data of the social benefit of factories are also impossible to be precisely estimated. By handling the imprecise observations as fuzzy observed data, a fuzzy linear regression model with crisp explanatory variables and a fuzzy response variable was first proposed by Tanaka et al. [23]. Then a modified form of estimation for the parameters was suggested by Corral and Gil [3] through extending the maximum likelihood principle into the case with fuzzy observed data. Diamond [5] further introduced least squares fitting for crisp explanatory variables and the fuzzy response variable to estimate the unknown parameters of the regression models. Furthermore, Corral and Gil [4] worked on the problem of interval estimation with fuzzy observed data. In the case of fuzzy explanatory variables and a fuzzy response variable, the regression model was first employed by Sakawa and Yano [21]. Another application of fuzzy sets to the imprecise observations is statistical hypothesis testing, which was first discussed by Casals et al. [1, 2].

However, many surveys have shown that uncertainty theory is better suited to modeling data with imprecise observations given by experts [16]. Thus we should take the imprecisely observed data as uncertain variables and describe them by uncertainty distributions (Liu [13]). The use of uncertain observed data was developed by many scholars, such as Wen et al. [24], Lio and Liu [12], Nejad and Ghaffari-Hadigheh [18], Yao [27] and Yang and Liu [26]. In particular, uncertain regression analysis was presented to model the relationship between explanatory variables and the response variable with uncertain observed data. To estimate the unknown parameters in the uncertain regression models, the principle of least squares was suggested by Yao and Liu [28].

In this paper, we employ some regression models for analyzing the relationship between uncertain explanatory variables and the uncertain response variable. The rest of the paper is organized as follows: In Section 2, some preliminary knowledge of uncertainty theory is highlighted. In Section 3, some formulas are provided to estimate the parameters of the regression models based on uncertain observed data, and their residual analysis is given in Section 4. In Section 5, the confidence interval of the uncertain regression models is suggested, and in Section 6, a numerical example is provided to illustrate the application of the model. Finally, some conclusions are made in Section 7.

2 Preliminaries

Through a lot of surveys, Liu [17] showed that human beings always estimate a much wider range of values than the object actually takes. This conservative estimation for degrees of belief makes the distribution function deviate far from the frequency. This provides a motivation for Liu [13] to found uncertainty theory to deal with the cases relying on degrees of belief when the precise observations or measurements are difficult to perform. In this section, some basic concepts and theorems in uncertainty theory including uncertain measure, uncertain variable and expected value are reviewed.

Definition 2.1. (Liu [13]) Let Ł be a σ-algebra on a nonempty set Γ. A set function ℳ : Ł → [0, 1] is called an uncertain measure if it satisfies the following axioms:

Axiom 1. (Normality Axiom) ℳ{Γ} = 1 for the universal set Γ.

Axiom 2. (Duality Axiom) ℳ{Λ} + ℳ{Λ^c} = 1 for any event Λ.

Axiom 3. (Subadditivity Axiom) For every countable sequence of events Λ₁, Λ₂, ⋯, we have $ℳ \{⋃_{i = 1}^{\infty} Λ_{i}\} \leq \sum_{i = 1}^{\infty} ℳ \{Λ_{i}\} .$

The triplet (Γ, Ł, ℳ) is called an uncertainty space. Furthermore, the product uncertain measure on the product σ-algebra Ł is defined by the following fourth axiom.

Axiom 4. (Product Axiom) (Liu [14]) Let (Γ_k, Ł_k, ℳ_k) be uncertainty spaces for k = 1, 2, ⋯. The product uncertain measure ℳ is an uncertain measure satisfying $ℳ \{\prod_{k = 1}^{\infty} Λ_{k}\} = ⋀_{k = 1}^{\infty} ℳ_{k} \{Λ_{k}\}$ where Λ_k are arbitrarily chosen events from Ł_k for k = 1, 2, ⋯, respectively.

As a real-valued function on the uncertainty space (Γ, Ł, ℳ), an uncertain variable is introduced to model a quantity with human uncertainty.

Definition 2.2. (Liu [13]) An uncertain variable ξ is a measurable function from an uncertainty space (Γ, Ł, ℳ) to the set of real numbers such that for any Borel set B of real numbers, the set $\{ξ \in B\} = \{γ \in Γ | ξ (γ) \in B\}$ is an event.

The uncertainty distribution Φ of an uncertain variable ξ is defined by Φ (x) = ℳ{ξ ≤ x} for any real number x. An uncertainty distribution Φ (x) is said to be regular if it is a continuous and strictly increasing function with respect to x at which 0 < Φ (x) < 1, and $\lim_{x \to - \infty} Φ (x) = 0, \lim_{x \to \infty} Φ (x) = 1 .$

If ξ is an uncertain variable with regular uncertainty distribution Φ (x), the inverse function $Φ^{- 1} (α)$ is called the inverse uncertainty distribution of ξ (Liu [15]).

An uncertain variable ξ is called linear if it has an uncertainty distribution $Φ (x) = \begin{cases} 0, & \text{if } x \leq a \\ (x - a) / (b - a), & \text{if } a < x \leq b \\ 1, & \text{if } x > b \end{cases}$ denoted by Ł (a, b), where a and b are real numbers satisfying a < b, and the inverse uncertainty distribution of linear uncertain variable Ł (a, b) is $Φ^{- 1} (α) = (1 - α) a + α b .$

An uncertain variable ξ is called zigzag if it has an uncertainty distribution $Φ (x) = \begin{cases} 0, & \text{if } x \leq a \\ (x - a) / [2 (b - a)], & \text{if } a < x \leq b \\ (x + c - 2 b) / [2 (c - b)], & \text{if } b < x \leq c \\ 1, & \text{if } x > c \end{cases}$ denoted by Z (a, b, c), where a, b and c are real numbers satisfying a < b < c, and the inverse uncertainty distribution of zigzag uncertain variable Z (a, b, c) is $Φ^{- 1} (α) = \begin{cases} (1 - 2 α) a + 2 α b, & \text{if } α < 0.5 \\ (2 - 2 α) b + (2 α - 1) c, & \text{if } α \geq 0.5 . \end{cases}$ (1)

An uncertain variable ξ is called normal if it has an uncertainty distribution $Φ (x) = {(1 + \exp (\frac{π (e - x)}{\sqrt{3} σ}))}^{- 1}, x \in ℛ$ denoted by N (e, σ), where e and σ are real numbers satisfying σ > 0, and the inverse uncertainty distribution of normal uncertain variable N (e, σ) is $Φ^{- 1} (α) = e + \frac{σ \sqrt{3}}{π} \ln \frac{α}{1 - α} .$ (2)
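The three inverse uncertainty distributions above translate directly into code. The following is a minimal sketch; the function names `linear_inv`, `zigzag_inv` and `normal_inv` are our own illustrative choices and do not come from the cited literature.

```python
import math

def linear_inv(a, b, alpha):
    """Inverse uncertainty distribution of a linear uncertain variable (a < b)."""
    return (1 - alpha) * a + alpha * b

def zigzag_inv(a, b, c, alpha):
    """Inverse uncertainty distribution of a zigzag uncertain variable, Equation (1)."""
    if alpha < 0.5:
        return (1 - 2 * alpha) * a + 2 * alpha * b
    return (2 - 2 * alpha) * b + (2 * alpha - 1) * c

def normal_inv(e, sigma, alpha):
    """Inverse uncertainty distribution of a normal uncertain variable, Equation (2).

    Requires 0 < alpha < 1, because the logarithm diverges at the endpoints.
    """
    return e + sigma * math.sqrt(3) / math.pi * math.log(alpha / (1 - alpha))

print(linear_inv(0.0, 2.0, 0.5))       # midpoint of the interval
print(zigzag_inv(0.0, 1.0, 3.0, 0.75))
print(normal_inv(10.0, 2.0, 0.5))      # equals e at alpha = 0.5
```

At α = 0.5 each function returns the "middle" of its distribution: (a + b)/2 for the linear case, b for the zigzag case and e for the normal case.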

Definition 2.3. (Liu [14]) The uncertain variables ξ₁, ξ₂, ⋯, ξ_n are said to be independent if $ℳ \{⋂_{i = 1}^{n} (ξ_{i} \in B_{i})\} = ⋀_{i = 1}^{n} ℳ \{ξ_{i} \in B_{i}\}$ for any Borel sets B₁, B₂, ⋯, B_n of real numbers.

Assume that ξ₁, ξ₂, ⋯, ξ_n are independent uncertain variables with regular uncertainty distributions Φ₁, Φ₂, ⋯, Φ_n, respectively. Liu [15] showed that if f (x₁, x₂, ⋯, x_n) is a strictly monotone function, then the inverse uncertainty distribution of the uncertain variable f (ξ₁, ξ₂, ⋯, ξ_n) can be calculated by the following theorem.

Theorem 2.1. (Liu [15]) Let ξ₁, ξ₂, ⋯, ξ_n be independent uncertain variables with regular uncertainty distributions Φ₁, Φ₂, ⋯, Φ_n, respectively. If f is strictly increasing with respect to ξ₁, ξ₂, ⋯, ξ_m and strictly decreasing with respect to ξ_{m+1}, ξ_{m+2}, ⋯, ξ_n, then ξ = f (ξ₁, ξ₂, ⋯, ξ_n) is an uncertain variable with inverse uncertainty distribution $Ψ^{- 1} (α) = f (Φ_{1}^{- 1} (α), \dots, Φ_{m}^{- 1} (α), Φ_{m + 1}^{- 1} (1 - α), \dots, Φ_{n}^{- 1} (1 - α)) .$
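As an illustration of Theorem 2.1, consider the difference ξ₁ − ξ₂ of two independent linear uncertain variables, which is strictly increasing in ξ₁ and strictly decreasing in ξ₂. The sketch below is only a hedged example: the helper `inverse_of_function` and the chosen distributions Ł(1, 3) and Ł(0, 1) are assumptions made for demonstration.

```python
def linear_inv(a, b, alpha):
    """Inverse uncertainty distribution of a linear uncertain variable."""
    return (1 - alpha) * a + alpha * b

def inverse_of_function(f, increasing_invs, decreasing_invs):
    """Build alpha -> Psi^{-1}(alpha) for f as in Theorem 2.1.

    increasing_invs: inverse distributions of the arguments in which f increases.
    decreasing_invs: inverse distributions of the arguments in which f decreases.
    """
    def psi_inv(alpha):
        inc = [inv(alpha) for inv in increasing_invs]       # evaluated at alpha
        dec = [inv(1 - alpha) for inv in decreasing_invs]   # evaluated at 1 - alpha
        return f(*inc, *dec)
    return psi_inv

# xi_1 ~ L(1, 3), xi_2 ~ L(0, 1), f(x1, x2) = x1 - x2
diff_inv = inverse_of_function(
    lambda x1, x2: x1 - x2,
    [lambda a: linear_inv(1.0, 3.0, a)],
    [lambda a: linear_inv(0.0, 1.0, a)],
)
print(diff_inv(0.5))  # 1.5
```

For these two distributions the result simplifies by hand to Ψ⁻¹(α) = (1 + 2α) − (1 − α) = 3α, which is a quick way to check the implementation.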

As the average value of an uncertain variable in the sense of uncertain measure, expected value can represent the size of the uncertain variable.

Definition 2.4. (Liu [13]) Let ξ be an uncertain variable. Then the expected value of ξ is defined as $E [ξ] = \int_{0}^{+ \infty} ℳ \{ξ \geq x\} d x - \int_{- \infty}^{0} ℳ \{ξ \leq x\} d x$ provided that at least one of the two integrals is finite.

As another important feature for an uncertain variable, variance is defined as follows:

Definition 2.5. (Liu [13]) Let ξ be an uncertain variable with finite expected value e. Then the variance of ξ is $V [ξ] = E [(ξ - e)^{2}] .$

Let ξ be an uncertain variable with regular uncertainty distribution Φ. Then we have $E [ξ] = \int_{0}^{1} Φ^{- 1} (α) d α,$ (3) $E [ξ^{2}] = \int_{0}^{1} (Φ^{- 1} (α))^{2} d α,$ (4) $V [ξ] = \int_{0}^{1} (Φ^{- 1} (α) - e)^{2} d α .$ (5)
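Equations (3) and (5) can be evaluated numerically once the inverse uncertainty distribution is available. The sketch below assumes a simple midpoint rule on (0, 1), which is our choice and not prescribed by the paper; the names `integrate01`, `expected_value` and `variance` are illustrative.

```python
def integrate01(g, steps=10_000):
    """Midpoint-rule approximation of the integral of g over (0, 1)."""
    h = 1.0 / steps
    return sum(g((k + 0.5) * h) for k in range(steps)) * h

def expected_value(inv):
    """Equation (3): integral of the inverse uncertainty distribution."""
    return integrate01(inv)

def variance(inv):
    """Equation (5): integral of the squared deviation from the expected value."""
    e = expected_value(inv)
    return integrate01(lambda alpha: (inv(alpha) - e) ** 2)

def linear_inv(a, b, alpha):
    return (1 - alpha) * a + alpha * b

inv = lambda alpha: linear_inv(1.0, 3.0, alpha)   # xi ~ L(1, 3)
print(expected_value(inv), variance(inv))
```

For a linear uncertain variable Ł(a, b), carrying out the integrals by hand gives E[ξ] = (a + b)/2 and V[ξ] = (b − a)²/12, so the example should return values close to 2 and 1/3. The midpoint rule never evaluates the integrand at α = 0 or α = 1, which matters for distributions such as the normal one whose inverse is unbounded at the endpoints.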

2.1 How to determine uncertainty distributions

The uncertainty distributions of the observed variables can be determined from an expert’s experience. To collect the expert’s experimental data, Liu [15] suggested a questionnaire survey. We first ask the domain expert for a possible value x that the social benefit ξ of a certain company may take, and then ask the expert “How likely is ξ less than or equal to x?”

Then denote the expert’s belief degree by α (say 0.4). An expert’s experimental data $(x, α) = (1, 0.4)$ (6) is thus obtained. Repeating the questionnaire survey, we acquire a sequence of expert’s experimental data, i.e., $(x_{1}, α_{1}), (x_{2}, α_{2}), \dots, (x_{n}, α_{n}) .$ (7)

To illustrate the process of determining the uncertainty distribution, we suppose that the social benefit is imprecise and a domain expert is invited to provide the experimental data. Then the consultation process can be as follows:

Q1: What do you think is the minimal value of the social benefit of the company?
A1: 0.5 billion dollars. (an expert’s experimental data (0.5, 0) is obtained)
Q2: What do you think is the maximal value?
A2: 1.2 billion dollars. (an expert’s experimental data (1.2, 1) is obtained)
Q3: What do you think is a likely value?
A3: 0.8 billion dollars.
Q4: To what degree do you think that the real value of the social benefit is less than 0.8 billion dollars?
A4: 40%. (an expert’s experimental data (0.8, 0.4) is obtained)
Q5: Is there another value the social benefit may be?
A5: 1 billion dollars.
Q6: To what degree do you think that the real value is less than 1 billion dollars?
A6: 80%. (an expert’s experimental data (1, 0.8) is obtained)

Hence four expert’s experimental data of the imprecise social benefit of the company are obtained from the domain expert, i.e. $(0.5, 0), (0.8, 0.4), (1, 0.8), (1.2, 1) .$

Take (0.5, 0) as (x₁, α₁), (0.8, 0.4) as (x₂, α₂), (1, 0.8) as (x₃, α₃) and (1.2, 1) as (x₄, α₄). Then, following the empirical uncertainty distribution suggested by Liu [15], the uncertainty distribution of the imprecise social benefit can be determined by $Φ (x) = \begin{cases} 0, & \text{if } x < x_{1} \\ α_{i} + \frac{(α_{i + 1} - α_{i}) (x - x_{i})}{x_{i + 1} - x_{i}}, & \text{if } x_{i} \leq x \leq x_{i + 1}, 1 \leq i < 4 \\ 1, & \text{if } x > x_{4} . \end{cases}$

Essentially, it is a type of linear interpolation method.
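The interpolation can be sketched in a few lines. The function name `empirical_distribution` is hypothetical, and the code assumes the expert's data have strictly increasing x values; the data are the four pairs obtained in the consultation above.

```python
def empirical_distribution(data):
    """Return Phi(x) built by linear interpolation between expert data points.

    data: list of (x_i, alpha_i) pairs with strictly increasing x_i.
    """
    pts = sorted(data)

    def phi(x):
        if x < pts[0][0]:
            return 0.0
        if x > pts[-1][0]:
            return 1.0
        # Find the segment [x_i, x_{i+1}] containing x and interpolate.
        for (x0, a0), (x1, a1) in zip(pts, pts[1:]):
            if x0 <= x <= x1:
                return a0 + (a1 - a0) * (x - x0) / (x1 - x0)
        return pts[-1][1]  # only reached when a single data point is supplied

    return phi

phi = empirical_distribution([(0.5, 0.0), (0.8, 0.4), (1.0, 0.8), (1.2, 1.0)])
for x in (0.4, 0.65, 0.9, 1.3):
    print(x, phi(x))
```

For example, a social benefit of 0.9 billion dollars lies halfway between the data points (0.8, 0.4) and (1, 0.8), so the interpolated belief degree is 0.6.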

3 Uncertain regression models

Let (x₁, x₂, ⋯, x_p) be a vector of explanatory variables, and let y be a response variable. Assume the relationship between (x₁, x₂, ⋯, x_p) and y can be expressed by a function f, and the model is generally given as $y = f (x_{1}, x_{2}, \dots, x_{p} | \boldsymbol{β}) + ∊$ (8) where $\boldsymbol{β}$ is a vector of unknown parameters, and ∊ is a disturbance term.

Suppose that there is a set of imprecisely observed data, $({\tilde{x}}_{i 1}, {\tilde{x}}_{i 2}, \dots, {\tilde{x}}_{i p}, {\tilde{y}}_{i}), i = 1, 2, \dots, n$ (9) where ${\tilde{x}}_{i 1}, {\tilde{x}}_{i 2}, \dots, {\tilde{x}}_{i p}, {\tilde{y}}_{i}$ are uncertain variables with uncertainty distributions Φ_i1, Φ_i2, ⋯, Φ_ip, Ψ_i, i = 1, 2, ⋯, n, respectively. We are interested in obtaining the vector of unknown parameters, $\boldsymbol{β}$. Therefore, an estimate, $\boldsymbol{β}^{*}$, of $\boldsymbol{β}$ should be given based on the imprecisely observed data (9).

To obtain $\boldsymbol{β}^{*}$, Yao and Liu [28] suggested that the least squares estimate of $\boldsymbol{β}$ in the regression model (8) is the solution of the minimization problem $\min_{\boldsymbol{β}} \sum_{i = 1}^{n} E [{({\tilde{y}}_{i} - f ({\tilde{x}}_{i 1}, {\tilde{x}}_{i 2}, \dots, {\tilde{x}}_{i p} | \boldsymbol{β}))}^{2}] .$ (10)

Assume the optimal solution of the minimization problem (10) is $\boldsymbol{β}^{*}$. Then the fitted regression model can be denoted by $y = f (x_{1}, x_{2}, \dots, x_{p} | \boldsymbol{β}^{*}) .$ (11)

Theorem 3.1. Suppose that $({\tilde{x}}_{i 1}, {\tilde{x}}_{i 2}, \dots, {\tilde{x}}_{i p}, {\tilde{y}}_{i})$, i = 1, 2, ⋯, n are a set of imprecisely observed data, where ${\tilde{x}}_{i 1}, {\tilde{x}}_{i 2}, \dots, {\tilde{x}}_{i p}, {\tilde{y}}_{i}$ are independent uncertain variables with regular uncertainty distributions Φ_i1, Φ_i2, ⋯, Φ_ip, Ψ_i, i = 1, 2, ⋯, n, respectively. Then the least squares estimate of β₀, β₁, ⋯, β_p in the linear regression model $y = β_{0} + \sum_{j = 1}^{p} β_{j} x_{j} + ∊$ (12) is the optimal solution of the following problem:

$\min_{β_{0}, β_{1}, \dots, β_{p}} \sum_{i = 1}^{n} \int_{0}^{1} {(Ψ_{i}^{- 1} (α) - β_{0} - \sum_{j = 1}^{p} β_{j} ϒ_{i j}^{- 1} (α, β_{j}))}^{2} d α$ (13) where $ϒ_{i j}^{- 1} (α, β_{j}) = \begin{cases} Φ_{i j}^{- 1} (1 - α), & \text{if } β_{j} \geq 0 \\ Φ_{i j}^{- 1} (α), & \text{if } β_{j} < 0 \end{cases}$ (14) for i = 1, 2, ⋯, n and j = 1, 2, ⋯, p.

Proof. The least squares estimate of β₀, β₁, ⋯, β_p in the linear regression model is actually the optimal solution of the minimization problem, $\min_{β_{0}, β_{1}, \dots, β_{p}} \sum_{i = 1}^{n} E [{({\tilde{y}}_{i} - β_{0} - \sum_{j = 1}^{p} β_{j} {\tilde{x}}_{i j})}^{2}] .$ (15)

For each i, it follows from Theorem 2.1 that the inverse uncertainty distribution of ${\tilde{y}}_{i} - β_{0} - \sum_{j = 1}^{p} β_{j} {\tilde{x}}_{i j}$ is $F_{i}^{- 1} (α) = Ψ_{i}^{- 1} (α) - β_{0} - \sum_{j = 1}^{p} β_{j} ϒ_{i j}^{- 1} (α, β_{j}) .$

Then from Equation (4), we obtain $E [{({\tilde{y}}_{i} - β_{0} - \sum_{j = 1}^{p} β_{j} {\tilde{x}}_{i j})}^{2}] = \int_{0}^{1} {(Ψ_{i}^{- 1} (α) - β_{0} - \sum_{j = 1}^{p} β_{j} ϒ_{i j}^{- 1} (α, β_{j}))}^{2} d α .$

Thus the minimization problem (15) is equivalent to $\min_{β_{0}, β_{1}, \dots, β_{p}} \sum_{i = 1}^{n} \int_{0}^{1} {(Ψ_{i}^{- 1} (α) - β_{0} - \sum_{j = 1}^{p} β_{j} ϒ_{i j}^{- 1} (α, β_{j}))}^{2} d α .$

The theorem is verified.
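To make the objective in Theorem 3.1 concrete, the sketch below evaluates it for data given as inverse uncertainty distributions. It is only an illustration under stated assumptions: the integral is approximated with a midpoint rule, the observations in the example are linear uncertain variables, and the names `upsilon_inv` and `objective` are ours rather than the paper's.

```python
def linear_inv(a, b, alpha):
    return (1 - alpha) * a + alpha * b

def integrate01(g, steps=2000):
    """Midpoint-rule approximation of the integral of g over (0, 1)."""
    h = 1.0 / steps
    return sum(g((k + 0.5) * h) for k in range(steps)) * h

def upsilon_inv(phi_inv, alpha, beta_j):
    """Evaluate Phi_ij^{-1} at 1 - alpha if beta_j >= 0, else at alpha."""
    return phi_inv(1 - alpha) if beta_j >= 0 else phi_inv(alpha)

def objective(beta, data):
    """Sum over observations of the integrated squared residual inverse.

    beta: [beta_0, beta_1, ..., beta_p]
    data: list of (psi_inv, [phi_inv_1, ..., phi_inv_p]) per observation.
    """
    total = 0.0
    for psi_inv, phi_invs in data:
        def resid_inv(alpha):
            fit = beta[0] + sum(
                b_j * upsilon_inv(p_inv, alpha, b_j)
                for b_j, p_inv in zip(beta[1:], phi_invs)
            )
            return psi_inv(alpha) - fit
        total += integrate01(lambda a: resid_inv(a) ** 2)
    return total

# One observation: y ~ L(2, 4), x ~ L(0, 2)
data = [(lambda a: linear_inv(2.0, 4.0, a), [lambda a: linear_inv(0.0, 2.0, a)])]
print(objective([0.0, 1.0], data))
print(objective([0.0, -1.0], data))
```

With β = (0, 1) the residual inverse is 4α, so the objective is 16/3; with β = (0, −1) it is 2 + 4α, giving 52/3. Passing `objective` to a general-purpose minimizer would yield a numerical least squares estimate; which minimizer to use is left open here.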

Theorem 3.2. Suppose that $({\tilde{x}}_{i}, {\tilde{y}}_{i})$, i = 1, 2, ⋯, n are a set of imprecisely observed data, where ${\tilde{x}}_{i}, {\tilde{y}}_{i}$ are independent uncertain variables with regular uncertainty distributions Φ_i, Ψ_i, i = 1, 2, ⋯, n, respectively. Then the least squares estimate of β₀, β₁ and β₂ in the asymptotic regression model $y = β_{0} - β_{1} \exp (- β_{2} x) + ∊, β_{1} > 0, β_{2} > 0$ (16) is the optimal solution of the following problem: $\min_{β_{0}, β_{1}, β_{2}} \sum_{i = 1}^{n} \int_{0}^{1} {(Ψ_{i}^{- 1} (α) - β_{0} + β_{1} \exp (- β_{2} Φ_{i}^{- 1} (1 - α)))}^{2} d α .$

Proof. The least squares estimate of β₀, β₁ and β₂ in the asymptotic regression model is actually the optimal solution of the minimization problem, $\min_{β_{0}, β_{1}, β_{2}} \sum_{i = 1}^{n} E [{({\tilde{y}}_{i} - β_{0} + β_{1} \exp (- β_{2} {\tilde{x}}_{i}))}^{2}] .$ (17)

Since the function ${\tilde{y}}_{i} - β_{0} + β_{1} \exp (- β_{2} {\tilde{x}}_{i})$ (18) is strictly increasing with respect to ${\tilde{y}}_{i}$ and strictly decreasing with respect to ${\tilde{x}}_{i}$ for each i, it follows from Theorem 2.1 that the inverse uncertainty distribution of function (18) is $G_{i}^{- 1} (α) = Ψ_{i}^{- 1} (α) - β_{0} + β_{1} \exp (- β_{2} Φ_{i}^{- 1} (1 - α)) .$ (19)

Then from Equation (4), we obtain $E [{({\tilde{y}}_{i} - β_{0} + β_{1} \exp (- β_{2} {\tilde{x}}_{i}))}^{2}] = \int_{0}^{1} {(Ψ_{i}^{- 1} (α) - β_{0} + β_{1} \exp (- β_{2} Φ_{i}^{- 1} (1 - α)))}^{2} d α .$

Thus the minimization problem (17) is equivalent to $\min_{β_{0}, β_{1}, β_{2}} \sum_{i = 1}^{n} \int_{0}^{1} {(Ψ_{i}^{- 1} (α) - β_{0} + β_{1} \exp (- β_{2} Φ_{i}^{- 1} (1 - α)))}^{2} d α .$

The theorem is verified.

Theorem 3.3. Suppose that $({\tilde{x}}_{i}, {\tilde{y}}_{i})$, i = 1, 2, ⋯, n are a set of imprecisely observed data, where ${\tilde{x}}_{i}, {\tilde{y}}_{i}$ are independent uncertain variables with regular uncertainty distributions Φ_i, Ψ_i, i = 1, 2, ⋯, n, respectively. Then the least squares estimate of β₁ and β₂ in the Michaelis-Menten regression model $y = \frac{β_{1} x}{β_{2} + x} + ∊, β_{1} > 0, β_{2} > 0$ (20) is the optimal solution of the following problem: $\min_{β_{1}, β_{2}} \sum_{i = 1}^{n} \int_{0}^{1} {(Ψ_{i}^{- 1} (α) - \frac{β_{1} Φ_{i}^{- 1} (1 - α)}{β_{2} + Φ_{i}^{- 1} (1 - α)})}^{2} d α .$ (21)

Proof. The least squares estimate of β₁ and β₂ in the Michaelis-Menten regression model is actually the optimal solution of the minimization problem, $\min_{β_{1}, β_{2}} \sum_{i = 1}^{n} E [{({\tilde{y}}_{i} - \frac{β_{1} {\tilde{x}}_{i}}{β_{2} + {\tilde{x}}_{i}})}^{2}] .$ (22)

Since the function ${\tilde{y}}_{i} - \frac{β_{1} {\tilde{x}}_{i}}{β_{2} + {\tilde{x}}_{i}}$ (23) is strictly increasing with respect to ${\tilde{y}}_{i}$ and strictly decreasing with respect to ${\tilde{x}}_{i}$ for each i, it follows from Theorem 2.1 that the inverse uncertainty distribution of function (23) is $H_{i}^{- 1} (α) = Ψ_{i}^{- 1} (α) - \frac{β_{1} Φ_{i}^{- 1} (1 - α)}{β_{2} + Φ_{i}^{- 1} (1 - α)} .$ (24)

Then from Equation (4), we obtain $E [{({\tilde{y}}_{i} - \frac{β_{1} {\tilde{x}}_{i}}{β_{2} + {\tilde{x}}_{i}})}^{2}] = \int_{0}^{1} {(Ψ_{i}^{- 1} (α) - \frac{β_{1} Φ_{i}^{- 1} (1 - α)}{β_{2} + Φ_{i}^{- 1} (1 - α)})}^{2} d α .$

Thus the minimization problem (22) is equivalent to $\min_{β_{1}, β_{2}} \sum_{i = 1}^{n} \int_{0}^{1} {(Ψ_{i}^{- 1} (α) - \frac{β_{1} Φ_{i}^{- 1} (1 - α)}{β_{2} + Φ_{i}^{- 1} (1 - α)})}^{2} d α .$

The theorem is verified.
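The asymptotic and Michaelis-Menten models share one pattern: with a single explanatory variable and a model function that is strictly increasing in x, the residual has inverse uncertainty distribution Ψ_i⁻¹(α) − f(Φ_i⁻¹(1 − α) | β). The following sketch exploits that pattern. The helper names are hypothetical, the integral uses a midpoint rule of our choosing, and the Michaelis-Menten case additionally assumes positive observations so that the model is increasing.

```python
import math

def linear_inv(a, b, alpha):
    return (1 - alpha) * a + alpha * b

def integrate01(g, steps=2000):
    """Midpoint-rule approximation of the integral of g over (0, 1)."""
    h = 1.0 / steps
    return sum(g((k + 0.5) * h) for k in range(steps)) * h

def asymptotic(x, beta):
    b0, b1, b2 = beta            # assumes b1 > 0 and b2 > 0
    return b0 - b1 * math.exp(-b2 * x)

def michaelis_menten(x, beta):
    b1, b2 = beta                # assumes b1 > 0, b2 > 0 and x > 0
    return b1 * x / (b2 + x)

def residual_inv(model, beta, psi_inv, phi_inv):
    """alpha -> Psi^{-1}(alpha) - model(Phi^{-1}(1 - alpha) | beta)."""
    return lambda alpha: psi_inv(alpha) - model(phi_inv(1 - alpha), beta)

def objective(model, beta, data):
    """Least squares objective for data = [(psi_inv, phi_inv), ...]."""
    total = 0.0
    for psi_inv, phi_inv in data:
        r = residual_inv(model, beta, psi_inv, phi_inv)
        total += integrate01(lambda a, r=r: r(a) ** 2)
    return total

# One observation: y ~ L(1, 3), x ~ L(1, 3)
psi = lambda a: linear_inv(1.0, 3.0, a)
phi = lambda a: linear_inv(1.0, 3.0, a)
print(residual_inv(asymptotic, (2.0, 1.0, 1.0), psi, phi)(0.5))
print(residual_inv(michaelis_menten, (4.0, 2.0), psi, phi)(0.5))
print(objective(michaelis_menten, (4.0, 2.0), [(psi, phi)]))
```

At α = 0.5 both inverse distributions return 2, so the asymptotic residual equals exp(−2) and the Michaelis-Menten residual equals 2 − 4·2/(2 + 2) = 0.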

4 Residual analysis

In the regression model (8), there is a disturbance term, ∊, an increment by which the response variable y may fall off the regression. Similar to $\boldsymbol{β}$, ∊ is also unknown, and in fact it cannot be discovered exactly since the term changes for each observation. We are therefore interested in finding an estimate of ∊ from the given imprecisely observed data, $({\tilde{x}}_{i 1}, {\tilde{x}}_{i 2}, \dots, {\tilde{x}}_{i p}, {\tilde{y}}_{i}), i = 1, 2, \dots, n .$ (25)

For each i, the difference between ${\tilde{y}}_{i}$ and $f ({\tilde{x}}_{i 1}, {\tilde{x}}_{i 2}, \dots, {\tilde{x}}_{i p} | \boldsymbol{β}^{*})$ represents the deviation of the response variable ${\tilde{y}}_{i}$ from the forecast variable $f ({\tilde{x}}_{i 1}, {\tilde{x}}_{i 2}, \dots, {\tilde{x}}_{i p} | \boldsymbol{β}^{*})$. Thus we propose a definition as follows:

Definition 4.1. Let $({\tilde{x}}_{i 1}, {\tilde{x}}_{i 2}, \dots, {\tilde{x}}_{i p}, {\tilde{y}}_{i})$, i = 1, 2, ⋯, n be a set of imprecisely observed data, and suppose the fitted regression model is $y = f (x_{1}, x_{2}, \dots, x_{p} | \boldsymbol{β}^{*}) .$ (26)

Then for each i (i = 1, 2, ⋯, n), the term ${\hat{∊}}_{i} = {\tilde{y}}_{i} - f ({\tilde{x}}_{i 1}, {\tilde{x}}_{i 2}, \dots, {\tilde{x}}_{i p} | \boldsymbol{β}^{*})$ (27) is called the i-th residual.

Now assume that the disturbance term ∊ is an uncertain variable. Then we use the average of the expected values of residuals, i.e., $\hat{e} = \frac{1}{n} \sum_{i = 1}^{n} E [{\hat{∊}}_{i}]$ (28) to estimate the expected value of the disturbance term ∊, and ${\hat{σ}}^{2} = \frac{1}{n} \sum_{i = 1}^{n} E [{({\hat{∊}}_{i} - \hat{e})}^{2}]$ (29) to estimate the variance, where ${\hat{∊}}_{i}$ are the i-th residuals, i = 1, 2, ⋯, n, respectively.

Theorem 4.1. Let $({\tilde{x}}_{i 1}, {\tilde{x}}_{i 2}, \dots, {\tilde{x}}_{i p}, {\tilde{y}}_{i})$, i = 1, 2, ⋯, n be a set of imprecisely observed data, where ${\tilde{x}}_{i 1}, {\tilde{x}}_{i 2}, \dots, {\tilde{x}}_{i p}, {\tilde{y}}_{i}$ are independent uncertain variables with regular uncertainty distributions Φ_i1, Φ_i2, ⋯, Φ_ip, Ψ_i, i = 1, 2, ⋯, n, respectively, and let the fitted linear regression model be $y = β_{0}^{*} + \sum_{j = 1}^{p} β_{j}^{*} x_{j} .$ (30)

Then the estimated expected value of the disturbance term ∊ is

$\hat{e} = \frac{1}{n} \sum_{i = 1}^{n} \int_{0}^{1} (Ψ_{i}^{- 1} (α) - β_{0}^{*} - \sum_{j = 1}^{p} β_{j}^{*} ϒ_{i j}^{- 1} (α, β_{j}^{*})) d α$ (31) and the estimated variance is

${\hat{σ}}^{2} = \frac{1}{n} \sum_{i = 1}^{n} \int_{0}^{1} {(Ψ_{i}^{- 1} (α) - β_{0}^{*} - \sum_{j = 1}^{p} β_{j}^{*} ϒ_{i j}^{- 1} (α, β_{j}^{*}) - \hat{e})}^{2} d α$ (32) where $ϒ_{i j}^{- 1} (α, β_{j}^{*}) = \begin{cases} Φ_{i j}^{- 1} (1 - α), & \text{if } β_{j}^{*} \geq 0 \\ Φ_{i j}^{- 1} (α), & \text{if } β_{j}^{*} < 0 \end{cases}$ (33) for i = 1, 2, ⋯, n and j = 1, 2, ⋯, p.

Proof. For each i, it follows from Theorem 2.1 that the inverse uncertainty distribution of ${\tilde{y}}_{i} - β_{0}^{*} - \sum_{j = 1}^{p} β_{j}^{*} {\tilde{x}}_{i j}$ (34) is $F_{i}^{- 1} (α) = Ψ_{i}^{- 1} (α) - β_{0}^{*} - \sum_{j = 1}^{p} β_{j}^{*} ϒ_{i j}^{- 1} (α, β_{j}^{*}) .$ (35)

The theorem follows from Equations (3) and (4) immediately.
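The two estimates in Theorem 4.1 can be computed with the same numerical ingredients as the least squares objective. The sketch below is illustrative only: `residual_stats` is a hypothetical name, the integrals use a midpoint rule, and the two observations are made-up linear uncertain variables chosen so that the result can be checked by hand.

```python
def linear_inv(a, b, alpha):
    return (1 - alpha) * a + alpha * b

def integrate01(g, steps=2000):
    """Midpoint-rule approximation of the integral of g over (0, 1)."""
    h = 1.0 / steps
    return sum(g((k + 0.5) * h) for k in range(steps)) * h

def residual_stats(beta, data):
    """Return (e_hat, var_hat) for a fitted linear model.

    beta: fitted [beta_0*, beta_1*, ..., beta_p*]
    data: list of (psi_inv, [phi_inv_1, ..., phi_inv_p]) per observation.
    """
    resid_invs = []
    for psi_inv, phi_invs in data:
        def f_inv(alpha, psi_inv=psi_inv, phi_invs=phi_invs):
            fit = beta[0]
            for b_j, phi_inv in zip(beta[1:], phi_invs):
                # Phi^{-1}(1 - alpha) when the coefficient is nonnegative
                fit += b_j * (phi_inv(1 - alpha) if b_j >= 0 else phi_inv(alpha))
            return psi_inv(alpha) - fit
        resid_invs.append(f_inv)
    n = len(resid_invs)
    e_hat = sum(integrate01(f) for f in resid_invs) / n
    var_hat = sum(
        integrate01(lambda a, f=f: (f(a) - e_hat) ** 2) for f in resid_invs
    ) / n
    return e_hat, var_hat

# Two made-up observations with one explanatory variable, beta* = (0, 1)
data = [
    (lambda a: linear_inv(2.0, 4.0, a), [lambda a: linear_inv(0.0, 2.0, a)]),
    (lambda a: linear_inv(0.0, 2.0, a), [lambda a: linear_inv(0.0, 2.0, a)]),
]
print(residual_stats([0.0, 1.0], data))
```

Here the residual inverse distributions are 4α and 4α − 2, with expected values 2 and 0, so the estimated expected value is 1 and the estimated variance is 7/3.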

Theorem 4.2. Let $({\tilde{x}}_{i}, {\tilde{y}}_{i})$, i = 1, 2, ⋯, n be a set of imprecisely observed data, where ${\tilde{x}}_{i}, {\tilde{y}}_{i}$ are independent uncertain variables with regular uncertainty distributions Φ_i, Ψ_i, i = 1, 2, ⋯, n, respectively, and let the fitted asymptotic regression model be $y = β_{0}^{*} - β_{1}^{*} \exp (- β_{2}^{*} x), β_{1}^{*} > 0, β_{2}^{*} > 0 .$ (36)

Then the estimated expected value of the disturbance term ∊ is $\hat{e} = \frac{1}{n} \sum_{i = 1}^{n} \int_{0}^{1} (Ψ_{i}^{- 1} (α) - β_{0}^{*} + β_{1}^{*} \exp (- β_{2}^{*} Φ_{i}^{- 1} (1 - α))) d α$ and the estimated variance is ${\hat{σ}}^{2} = \frac{1}{n} \sum_{i = 1}^{n} \int_{0}^{1} {(Ψ_{i}^{- 1} (α) - β_{0}^{*} + β_{1}^{*} \exp (- β_{2}^{*} Φ_{i}^{- 1} (1 - α)) - \hat{e})}^{2} d α .$

Proof. Since the function ${\tilde{y}}_{i} - β_{0}^{*} + β_{1}^{*} \exp (- β_{2}^{*} {\tilde{x}}_{i}), β_{1}^{*} > 0, β_{2}^{*} > 0$ is strictly increasing with respect to ${\tilde{y}}_{i}$ and strictly decreasing with respect to ${\tilde{x}}_{i}$ for each i, it follows from Theorem 2.1 that its inverse uncertainty distribution is $G_{i}^{- 1} (α) = Ψ_{i}^{- 1} (α) - β_{0}^{*} + β_{1}^{*} \exp (- β_{2}^{*} Φ_{i}^{- 1} (1 - α)) .$

The theorem follows from Equations (3) and (4) immediately.

Theorem 4.3. Let $({\tilde{x}}_{i}, {\tilde{y}}_{i})$, i = 1, 2, ⋯, n be a set of imprecisely observed data, where ${\tilde{x}}_{i}, {\tilde{y}}_{i}$ are independent uncertain variables with regular uncertainty distributions Φ_i, Ψ_i, i = 1, 2, ⋯, n, respectively, and let the fitted Michaelis-Menten regression model be $y = \frac{β_{1}^{*} x}{β_{2}^{*} + x}, β_{1}^{*} > 0, β_{2}^{*} > 0 .$ (37)

Then the estimated expected value of the disturbance term ∊ is $\hat{e} = \frac{1}{n} \sum_{i = 1}^{n} \int_{0}^{1} (Ψ_{i}^{- 1} (α) - \frac{β_{1}^{*} Φ_{i}^{- 1} (1 - α)}{β_{2}^{*} + Φ_{i}^{- 1} (1 - α)}) d α$ (38) and the estimated variance is ${\hat{σ}}^{2} = \frac{1}{n} \sum_{i = 1}^{n} \int_{0}^{1} {(Ψ_{i}^{- 1} (α) - \frac{β_{1}^{*} Φ_{i}^{- 1} (1 - α)}{β_{2}^{*} + Φ_{i}^{- 1} (1 - α)} - \hat{e})}^{2} d α .$ (39)

Proof. Since the function ${\tilde{y}}_{i} - \frac{β_{1}^{*} {\tilde{x}}_{i}}{β_{2}^{*} + {\tilde{x}}_{i}}, β_{1}^{*} > 0, β_{2}^{*} > 0$ (40) is strictly increasing with respect to ${\tilde{y}}_{i}$ and strictly decreasing with respect to ${\tilde{x}}_{i}$ for each i, it follows from Theorem 2.1 that its inverse uncertainty distribution is $H_{i}^{- 1} (α) = Ψ_{i}^{- 1} (α) - \frac{β_{1}^{*} Φ_{i}^{- 1} (1 - α)}{β_{2}^{*} + Φ_{i}^{- 1} (1 - α)} .$ (41)

The theorem follows from Equations (3) and (4) immediately.

5 Forecast value and confidence interval

Suppose $({\tilde{x}}_{1}, {\tilde{x}}_{2}, \dots, {\tilde{x}}_{p})$ is a vector of new explanatory variables, where ${\tilde{x}}_{1}, {\tilde{x}}_{2}, \dots, {\tilde{x}}_{p}$ are uncertain variables with regular uncertainty distributions Φ₁, Φ₂, ⋯, Φ_p, respectively. It is useful to forecast the response variable for the new explanatory vector from the given imprecisely observed data $({\tilde{x}}_{i 1}, {\tilde{x}}_{i 2}, \dots, {\tilde{x}}_{i p}, {\tilde{y}}_{i})$, i = 1, 2, ⋯, n. For example, suppose a new factory is founded and its social benefit must be forecast. Taking social benefit as the response variable, and average quality of the production, monthly salary of employees and carbon emission as explanatory variables, an uncertain regression model can be built from the data of existing factories. According to the model, the social benefit of the new factory can be forecast from the information on production, salary and carbon emission, and used to judge whether setting up the new factory is reasonable.

Although the relationship between uncertain explanatory variables and the uncertain response variable may be complicated, it is still valuable to apply a linear regression model to the data. Now suppose the fitted linear regression model is $y = β_{0}^{*} + \sum_{j = 1}^{p} β_{j}^{*} x_{j},$ (42) and the disturbance term ∊ has estimated expected value $\hat{e}$ and variance ${\hat{σ}}^{2}$, and is independent of ${\tilde{x}}_{1}, {\tilde{x}}_{2}, \dots, {\tilde{x}}_{p}$. Then the forecast uncertain variable of y with respect to ${\tilde{x}}_{1}, {\tilde{x}}_{2}, \dots, {\tilde{x}}_{p}$ is determined by $\hat{y} = β_{0}^{*} + \sum_{j = 1}^{p} β_{j}^{*} {\tilde{x}}_{j} + ∊ .$ (43)

A single value of y should be estimated from the forecast uncertain variable, and it is natural to define the forecast value of y as $μ = β_{0}^{*} + \sum_{j = 1}^{p} β_{j}^{*} E [{\tilde{x}}_{j}] + \hat{e},$ (44) that is, the expected value of the forecast uncertain variable $\hat{y}$ . Furthermore, if the disturbance term ∊ is assumed to have a normal uncertainty distribution $N (\hat{e}, \hat{σ})$ , then the inverse uncertainty distribution of $\hat{y}$ is determined by ${\hat{Ψ}}^{- 1} (α) = β_{0}^{*} + \sum_{j = 1}^{p} β_{j}^{*} ϒ_{j}^{- 1} (α, β_{j}^{*}) + Φ^{- 1} (α)$ (45) where $ϒ_{j}^{- 1} (α, β_{j}^{*}) = \begin{cases} Φ_{j}^{- 1} (α), & \text{if } β_{j}^{*} \geq 0 \\ Φ_{j}^{- 1} (1 - α), & \text{if } β_{j}^{*} < 0 \end{cases}$ (46) for j = 1, 2, ⋯, p, and $Φ^{- 1} (α)$ is the inverse uncertainty distribution of ∊, i.e., $Φ^{- 1} (α) = \hat{e} + \frac{\hat{σ} \sqrt{3}}{π} \ln \frac{α}{1 - α} .$ (47)

Then the uncertainty distribution, $\hat{Ψ}$ , of $\hat{y}$ can be obtained by ${\hat{Ψ}}^{- 1}$ .

The forecast value, μ, is a point estimation of y. However, it is not convincing to claim that the value of y is always a precise value. Hence the confidence interval is proposed to estimate y. Although some precision is given up when applying confidence interval, we can gain some confidence and assurance that our inference must be correct. Taking α (e.g., 95%) as a confidence level, we are interested in finding the minimum value b such that $\hat{Ψ} (μ + b) - \hat{Ψ} (μ - b) \geq α .$ (48) Since $ℳ {μ - b \leq \hat{y} \leq μ + b} \geq \hat{Ψ} (μ + b) - \hat{Ψ} (μ - b),$ (49) it follows that $ℳ {μ - b \leq \hat{y} \leq μ + b} \geq α$ . Thus the α confidence interval of y is suggested as [μ - b, μ + b], which can be abbreviated as $μ \pm b$ (50) and we have a chance of α to cover y with our confidence interval.
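The search for the minimum b can be sketched numerically when only the inverse uncertainty distribution of the forecast variable is available. The code below is a hedged illustration, not the paper's procedure: it recovers the distribution by bisection on α, then bisects on b, and assumes that the coverage is nondecreasing in b and that the supplied upper bound `b_max` already reaches the confidence level. All function names are our own.

```python
import math

def distribution_from_inverse(psi_inv, x, tol=1e-12):
    """Approximate Psi(x) by bisection, given a strictly increasing inverse."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if psi_inv(mid) <= x:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def half_width(psi_inv, mu, level, b_max):
    """Smallest b in [0, b_max] with Psi(mu + b) - Psi(mu - b) >= level."""
    def coverage(b):
        return (distribution_from_inverse(psi_inv, mu + b)
                - distribution_from_inverse(psi_inv, mu - b))
    lo, hi = 0.0, b_max
    for _ in range(60):
        mid = (lo + hi) / 2
        if coverage(mid) >= level:
            hi = mid
        else:
            lo = mid
    return hi

def normal_inv(e, sigma, alpha):
    """Inverse uncertainty distribution of a normal uncertain variable."""
    return e + sigma * math.sqrt(3) / math.pi * math.log(alpha / (1 - alpha))

# Forecast variable assumed to be N(0, 1) purely for demonstration.
b = half_width(lambda a: normal_inv(0.0, 1.0, a), mu=0.0, level=0.95, b_max=50.0)
print(b)
```

For a distribution that is symmetric about μ, such as the normal uncertainty distribution used here, the result agrees with the closed form b = Ψ̂⁻¹((1 + α)/2) − μ, which in this example equals (√3/π) ln 39.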

6 Numerical example

In this section, we consider an example to show how the regression model can be applied to forecast the response for a new explanatory vector with imprecise observations. The calculation of the 95% confidence interval is also given.

Suppose $({\tilde{x}}_{i 1}, {\tilde{x}}_{i 2}, {\tilde{x}}_{i 3}, {\tilde{y}}_{i})$, i = 1, 2, ⋯, 24 are a set of imprecisely observed data, where ${\tilde{x}}_{i 1}, {\tilde{x}}_{i 2}, {\tilde{x}}_{i 3}, {\tilde{y}}_{i}$ are independent uncertain variables with linear uncertainty distributions, Φ_i1, Φ_i2, Φ_i3, Ψ_i, respectively. The data are provided in Table 1.

Table 1
Imprecisely Observed Data where Ł (a, b) Represents Linear Uncertain Variable

i	${\tilde{y}}_{i}$	${\tilde{x}}_{i 1}$	${\tilde{x}}_{i 2}$	${\tilde{x}}_{i 3}$
1	Ł(33,36)	Ł(3,4)	Ł(9,10)	Ł(6,7)
2	Ł(46,49)	Ł(5,6)	Ł(33,36)	Ł(6,7)
3	Ł(38,41)	Ł(5,6)	Ł(18,20)	Ł(7,8)
4	Ł(41,44)	Ł(4,5)	Ł(31,34)	Ł(7,8)
5	Ł(40,43)	Ł(5,6)	Ł(20,22)	Ł(6,7)
6	Ł(37,40)	Ł(6,7)	Ł(13,15)	Ł(5,6)
7	Ł(52,55)	Ł(7,8)	Ł(47,50)	Ł(8,9)
8	Ł(30,33)	Ł(3,4)	Ł(5,6)	Ł(5,6)
9	Ł(39,42)	Ł(6,7)	Ł(25,28)	Ł(6,7)
10	Ł(40,43)	Ł(5,6)	Ł(30,33)	Ł(4,5)
11	Ł(40,43)	Ł(5,6)	Ł(33,36)	Ł(4,5)
12	Ł(38,41)	Ł(4,5)	Ł(25,28)	Ł(5,6)
13	Ł(48,51)	Ł(7,8)	Ł(40,43)	Ł(7,8)
14	Ł(44,47)	Ł(6,7)	Ł(35,38)	Ł(7,8)
15	Ł(33,36)	Ł(3,4)	Ł(21,24)	Ł(4,5)
16	Ł(45,48)	Ł(4,5)	Ł(34,37)	Ł(8,9)
17	Ł(34,37)	Ł(6,7)	Ł(7,8)	Ł(5,6)
18	Ł(43,46)	Ł(8,9)	Ł(23,26)	Ł(7,8)
19	Ł(35,38)	Ł(3,4)	Ł(15,17)	Ł(5,6)
20	Ł(35,38)	Ł(4,5)	Ł(23,26)	Ł(3,4)
21	Ł(36,39)	Ł(5,6)	Ł(27,30)	Ł(4,5)
22	Ł(38,41)	Ł(4,5)	Ł(35,38)	Ł(6,7)
23	Ł(42,45)	Ł(6,7)	Ł(39,42)	Ł(5,6)
24	Ł(31,34)	Ł(4,5)	Ł(11,13)	Ł(6,7)

To forecast the response for a new explanatory vector, we employ the linear regression model $y = β_{0} + β_{1} x_{1} + β_{2} x_{2} + β_{3} x_{3} + ∊,$ (51) and solve the minimization problem (10), i.e.,

$\min_{β_{0}, β_{1}, β_{2}, β_{3}} \sum_{i = 1}^{n} E [{({\tilde{y}}_{i} - (β_{0} + β_{1} {\tilde{x}}_{i 1} + β_{2} {\tilde{x}}_{i 2} + β_{3} {\tilde{x}}_{i 3}))}^{2}] .$ (52)

From Theorem 3.1, the minimization problem (52) can be changed to an equivalent form, i.e., $\min_{β_{0}, β_{1}, β_{2}, β_{3}} \sum_{i = 1}^{24} \int_{0}^{1} {(Ψ_{i}^{- 1} (α) - β_{0} - β_{1} ϒ_{i 1}^{- 1} (α, β_{1}) - β_{2} ϒ_{i 2}^{- 1} (α, β_{2}) - β_{3} ϒ_{i 3}^{- 1} (α, β_{3}))}^{2} d α$ where $ϒ_{i j}^{- 1} (α, β_{j}) = \begin{cases} Φ_{i j}^{- 1} (1 - α), & \text{if } β_{j} \geq 0 \\ Φ_{i j}^{- 1} (α), & \text{if } β_{j} < 0 \end{cases}$ (53) for i = 1, 2, ⋯, 24 and j = 1, 2, 3. Then we can obtain the least squares estimate

$\begin{matrix} (β_{0}^{*}, β_{1}^{*}, β_{2}^{*}, β_{3}^{*}) \\ = (21.5196, 0.8678, 0.3110, 1.0053) . \end{matrix}$ (54)

Hence the fitted linear regression model is $y = 21.5196 + 0.8678 x_{1} + 0.3110 x_{2} + 1.0053 x_{3} .$ (55)
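The estimation step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that all three slope coefficients are non-negative, so that $ϒ_{ij}^{- 1} (α, β_{j}) = Φ_{ij}^{- 1} (1 - α)$ and the objective in (53) is quadratic in the coefficients. The integral over α is evaluated with Simpson's rule, which is exact here because the inverse distributions of linear uncertain variables are linear in α, and the resulting normal equations are solved by Gaussian elimination. The names `DATA`, `inv_linear`, `design_rows`, `solve`, `objective` and `fit` are illustrative and do not appear in the paper.

```python
# Observations from the table: (lower, upper) bounds of the linear uncertain
# variables, in the order y_i, x_i1, x_i2, x_i3 for i = 1, ..., 24.
DATA = [
    ((33, 36), (3, 4), (9, 10), (6, 7)),
    ((46, 49), (5, 6), (33, 36), (6, 7)),
    ((38, 41), (5, 6), (18, 20), (7, 8)),
    ((41, 44), (4, 5), (31, 34), (7, 8)),
    ((40, 43), (5, 6), (20, 22), (6, 7)),
    ((37, 40), (6, 7), (13, 15), (5, 6)),
    ((52, 55), (7, 8), (47, 50), (8, 9)),
    ((30, 33), (3, 4), (5, 6), (5, 6)),
    ((39, 42), (6, 7), (25, 28), (6, 7)),
    ((40, 43), (5, 6), (30, 33), (4, 5)),
    ((40, 43), (5, 6), (33, 36), (4, 5)),
    ((38, 41), (4, 5), (25, 28), (5, 6)),
    ((48, 51), (7, 8), (40, 43), (7, 8)),
    ((44, 47), (6, 7), (35, 38), (7, 8)),
    ((33, 36), (3, 4), (21, 24), (4, 5)),
    ((45, 48), (4, 5), (34, 37), (8, 9)),
    ((34, 37), (6, 7), (7, 8), (5, 6)),
    ((43, 46), (8, 9), (23, 26), (7, 8)),
    ((35, 38), (3, 4), (15, 17), (5, 6)),
    ((35, 38), (4, 5), (23, 26), (3, 4)),
    ((36, 39), (5, 6), (27, 30), (4, 5)),
    ((38, 41), (4, 5), (35, 38), (6, 7)),
    ((42, 45), (6, 7), (39, 42), (5, 6)),
    ((31, 34), (4, 5), (11, 13), (6, 7)),
]

# Simpson's rule on [0, 1] as (node, weight) pairs; exact for the quadratic integrand.
NODES = [(0.0, 1 / 6), (0.5, 4 / 6), (1.0, 1 / 6)]


def inv_linear(bounds, alpha):
    """Inverse uncertainty distribution of the linear variable L(a, b) at alpha."""
    a, b = bounds
    return (1 - alpha) * a + alpha * b


def design_rows():
    """Yield (weight, regressors, target) per observation and node, assuming beta_j >= 0."""
    for y, *xs in DATA:
        for alpha, w in NODES:
            yield w, [1.0] + [inv_linear(x, 1 - alpha) for x in xs], inv_linear(y, alpha)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x


def objective(beta):
    """Objective of (53) under the sign assumption beta_1, beta_2, beta_3 >= 0."""
    return sum(w * (t - sum(bj * zj for bj, zj in zip(beta, z))) ** 2
               for w, z, t in design_rows())


def fit():
    """Build and solve the normal equations of the quadratic objective."""
    rows = list(design_rows())
    A = [[sum(w * z[r] * z[c] for w, z, _ in rows) for c in range(4)] for r in range(4)]
    b = [sum(w * z[r] * t for w, z, t in rows) for r in range(4)]
    return solve(A, b)


beta = fit()
print([round(v, 4) for v in beta])
```

If the table is transcribed faithfully, the printed coefficients should be close to the estimate (54). A negative slope in the output would mean that the sign assumption behind $ϒ_{ij}^{- 1}$ has to be revisited for that coefficient.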

By applying Equations (26) and (27), i.e., $\hat{e} = \frac{1}{24} \sum_{i = 1}^{24} \int_{0}^{1} \left( Ψ_{i}^{- 1} (α) - β_{0}^{*} - β_{1}^{*} Φ_{i 1}^{- 1} (1 - α) - β_{2}^{*} Φ_{i 2}^{- 1} (1 - α) - β_{3}^{*} Φ_{i 3}^{- 1} (1 - α) \right) \mathrm{d} α$ and ${\hat{σ}}^{2} = \frac{1}{24} \sum_{i = 1}^{24} \int_{0}^{1} \left( Ψ_{i}^{- 1} (α) - β_{0}^{*} - β_{1}^{*} Φ_{i 1}^{- 1} (1 - α) - β_{2}^{*} Φ_{i 2}^{- 1} (1 - α) - β_{3}^{*} Φ_{i 3}^{- 1} (1 - α) - \hat{e} \right)^{2} \mathrm{d} α,$ we find that the estimated expected value and variance of the disturbance term ∊ are $\hat{e} = 0.0000, \quad {\hat{σ}}^{2} = 5.6305,$ (56) respectively. Now suppose $({\tilde{x}}_{1}, {\tilde{x}}_{2}, {\tilde{x}}_{3}) \sim (\mathcal{L} (5, 6), \mathcal{L} (28, 30), \mathcal{L} (6, 7))$ (57) is a new uncertain explanatory vector. When ${\tilde{x}}_{1}$, ${\tilde{x}}_{2}$, ${\tilde{x}}_{3}$ and ∊ are independent, the forecast uncertain variable of the response variable y is

$\hat{y} = 21.5196 + 0.8678 {\tilde{x}}_{1} + 0.3110 {\tilde{x}}_{2} + 1.0053 {\tilde{x}}_{3} + ∊,$ (58) and the forecast value of y is 41.8460 by Equation (33), i.e., $μ = E [\hat{y}] = β_{0}^{*} + β_{1}^{*} E [{\tilde{x}}_{1}] + β_{2}^{*} E [{\tilde{x}}_{2}] + β_{3}^{*} E [{\tilde{x}}_{3}] + \hat{e} .$
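As a quick check of the forecast value, recall that a linear uncertain variable $\mathcal{L} (a, b)$ has expected value $(a + b) / 2$, so the new explanatory vector (57) has expected values 5.5, 29 and 6.5. Substituting these, the rounded estimates from (54) and $\hat{e} = 0$ gives $μ \approx 21.5196 + 0.8678 \times 5.5 + 0.3110 \times 29 + 1.0053 \times 6.5 = 21.5196 + 4.7729 + 9.0190 + 6.5345 \approx 41.846,$ which agrees with the reported 41.8460 up to the rounding of the coefficients.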

For the confidence level α = 95%, if we suppose further that the disturbance term ∊ is a normal uncertain variable, then $b = 5.9780$ (59) is the minimum value for which inequality (37) holds, i.e., $\hat{Ψ} (μ + b) - \hat{Ψ} (μ - b) \geq 95 \% .$ (60)

Here $\hat{Ψ}$ is the uncertainty distribution of $\hat{y}$, determined by ${\hat{Ψ}}^{- 1} (α) = β_{0}^{*} + β_{1}^{*} (5 (1 - α) + 6 α) + β_{2}^{*} (28 (1 - α) + 30 α) + β_{3}^{*} (6 (1 - α) + 7 α) + Φ^{- 1} (α),$ where $Φ^{- 1} (α)$ is the inverse uncertainty distribution of the normal uncertain variable $\mathcal{N} (\hat{e}, \hat{σ})$. Thus the 95% confidence interval of the response variable y is $41.8460 \pm 5.9780 .$ (61)
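The half-width b can be recomputed numerically from the quantities reported above. The sketch below is an illustration under stated assumptions rather than the authors' procedure. It takes the rounded estimates (54) and (56) as inputs, uses the standard inverse distribution of a normal uncertain variable, $Φ^{- 1} (α) = \hat{e} + (\hat{σ} \sqrt{3} / π) \ln (α / (1 - α))$, inverts ${\hat{Ψ}}^{- 1}$ by bisection, and then searches for the smallest b that satisfies (60). The function names are hypothetical.

```python
import math

BETA = (21.5196, 0.8678, 0.3110, 1.0053)  # least squares estimate (54)
E_HAT, VAR_HAT = 0.0, 5.6305              # estimated mean and variance (56)
NEW_X = ((5, 6), (28, 30), (6, 7))        # bounds of the new linear variables (57)


def inv_linear(bounds, alpha):
    """Inverse uncertainty distribution of the linear variable L(a, b)."""
    a, b = bounds
    return (1 - alpha) * a + alpha * b


def inv_normal(alpha, e, sigma):
    """Inverse uncertainty distribution of the normal uncertain variable N(e, sigma)."""
    return e + sigma * math.sqrt(3) / math.pi * math.log(alpha / (1 - alpha))


def inv_forecast(alpha):
    """Inverse distribution of the forecast variable (all slopes are positive)."""
    linear_part = BETA[0] + sum(b * inv_linear(x, alpha) for b, x in zip(BETA[1:], NEW_X))
    return linear_part + inv_normal(alpha, E_HAT, math.sqrt(VAR_HAT))


def dist_forecast(value, iterations=50):
    """Distribution of the forecast variable, by bisection on its increasing inverse."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if inv_forecast(mid) < value:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def forecast_value():
    """Expected value of the forecast variable, as in Equation (33)."""
    return BETA[0] + sum(b * (x[0] + x[1]) / 2 for b, x in zip(BETA[1:], NEW_X)) + E_HAT


def half_width(mu, level=0.95, iterations=60):
    """Smallest b whose interval [mu - b, mu + b] reaches the given confidence level."""
    def coverage(b):
        return dist_forecast(mu + b) - dist_forecast(mu - b)

    hi = 1.0
    while coverage(hi) < level:  # enlarge the bracket until it covers enough
        hi *= 2
    lo = 0.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if coverage(mid) >= level:
            hi = mid
        else:
            lo = mid
    return hi


mu = forecast_value()
b = half_width(mu)
print(f"forecast value {mu:.4f}, half-width {b:.4f}")
```

Because both the linear and the normal parts are symmetric about their expected values, the same half-width follows directly from ${\hat{Ψ}}^{- 1} (0.975) - μ$. The bisection version is kept because it follows the definition in (60) and does not rely on that symmetry.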

7 Conclusion

Since the observed data are often collected in an imprecise way, this paper introduced some uncertain regression models for handling such data reasonably. In order to study the disturbance term in the models, the concept of the i-th residual and the residual analysis of the models were proposed. Furthermore, an estimate is needed when a vector of new explanatory variables is given. Hence the forecast value and the confidence interval of the response variable with respect to the new explanatory variables were presented, and a numerical example was provided to illustrate the calculation of the unknown parameters, the estimated expected value and variance of the disturbance term, the forecast value, and the confidence interval from the given observed data.

In future work, hypothesis testing for the unknown parameters in the uncertain regression model will be studied, and the concept of a multiple correlation coefficient will be proposed for assessing the regression fit with imprecise observations.

Acknowledgments

This work was supported by National Natural Science Foundation of China Grant No. 61573210.

References

1. Casals M.R., Gil M.A. and Gil, On the use of Zadeh’s probabilistic definition for testing statistical hypotheses from fuzzy information, Fuzzy Sets and Systems 20 (1986), 175–190.

2. Casals M.R., Gil M.A. and Gil, The fuzzy decision problem: An approach to the problem of testing statistical hypotheses with fuzzy information, European Journal of Operational Research 27 (1986), 371–382.

3. Corral and Gil M.A., The minimum inaccuracy fuzzy estimation: An extension of the maximum likelihood principle, Stochastica 8 (1984), 63–81.

4. Corral and Gil M.A., A note on interval estimation with fuzzy data, Fuzzy Sets and Systems 28 (1988), 209–215.

5. Diamond, Fuzzy least squares, Information Sciences 46 (1988), 141–157.

6. Edgeworth F.Y., On observations relating to several quantities, Hermathena 6 (1887), 279–285.

7. Edgeworth F.Y., On a new method of reducing observations relating to several quantities, Philosophical Magazine 25 (1888), 184–191.

8. Fisher R.A., Statistical Methods for Research Workers, Oliver and Boyd, Edinburgh, 1925.

9. Galton, Regression towards mediocrity in hereditary stature, Journal of the Anthropological Institute 15 (1885), 246–263.

10. Gauss C.F., Theory of the Motion of the Heavenly Bodies Moving about the Sun in Conic Sections, Sumtibus Frid Perthes, Hamburg, 1809.

11. Legendre A.M., New Methods for the Determination of the Orbits of Comets, Firmin Didot, Paris, 1805.

12. Lio and Liu, Uncertain data envelopment analysis with imprecisely observed inputs and outputs, Fuzzy Optimization and Decision Making (2017). DOI: 10.1007/s10700-017-9276-x.

13. Liu, Uncertainty Theory, 2nd edn., Springer-Verlag, Berlin, 2007.

14. Liu, Some research problems in uncertainty theory, Journal of Uncertain Systems 3 (2009), 3–10.

15. Liu, Uncertainty Theory: A Branch of Mathematics for Modeling Human Uncertainty, Springer-Verlag, Berlin, 2010.

16. Liu, Why is there a need for uncertainty theory, Journal of Uncertain Systems 6 (2012), 3–10.

17. Liu, Uncertainty Theory, 4th edn., Springer-Verlag, Berlin, 2015.

18. Nejad Z.M. and Ghaffari-Hadigheh, A novel DEA model based on uncertainty theory, Annals of Operations Research. DOI: 10.1007/s10479-017-2652-7.

19. Neyman and Pearson E.S., On the problem of the most efficient tests of statistical hypotheses, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 231 (1933), 289–337.

20. Neyman, Outline of a theory of statistical estimation based on the classical theory of probability, Philosophical Transactions of the Royal Society A 236 (1937), 333–380.

21. Sakawa and Yano, Multiobjective fuzzy linear regression analysis for fuzzy input-output data, Fuzzy Sets and Systems 63 (1992), 191–206.

22. Student, The probable error of a mean, Biometrika 6 (1908), 1–25.

23. Tanaka, Uejima and Asai, Linear regression analysis with fuzzy model, IEEE Transactions on Systems, Man, and Cybernetics 12 (1982), 903–907.

24. Wen M.L., Zhang Q.Y., Kang and Yang, Some new ranking criteria in data envelopment analysis under uncertain environment, Computers and Industrial Engineering 110 (2017), 498–504.

25. Wilks S.S., The large-sample distribution of the likelihood ratio for testing composite hypotheses, Annals of Mathematical Statistics 9 (1938), 60–62.

26. Yang X.F. and Liu, Uncertain time series analysis with imprecise observations, Technical Report, 2017.

27. Yao, Uncertain statistical inference models with imprecise observations, IEEE Transactions on Fuzzy Systems. DOI: 10.1109/TFUZZ.2017.2666846.

28. Yao and Liu, Uncertain regression analysis: An approach for imprecise observations, Soft Computing. DOI: 10.1007/s00500-017-2521-y.

29. Yule G.U., On the theory of correlation, Journal of the Royal Statistical Society 60 (1897), 812–854.