Abstract
Least squares estimation takes all of the given data into account and minimizes the sum of squared residuals, and it can effectively solve the linear regression equation of imprecisely observed data. Based on least squares estimation and uncertainty theory, we first propose the slope mean model, which calculates the slope between the expected value and each given datum, takes the average of these slopes as the slope of the linear regression equation, and then substitutes the expected value coordinates to obtain the linear regression equation. Then, we propose the deviation slope mean model, which is the focus of this paper. The idea of the deviation slope mean model is to calculate the slopes by which each given datum deviates from the regression equation and to take the average of these slopes as the slope of the regression equation. Substituting the expected value coordinates then gives the linear regression equation. The deviation slope mean model can also be extended to multiple linear regression: we transform the established equations into a matrix equation and use the inverse matrix to solve for the unknown parameters. Finally, we put forward the hybrid model, a simplified model that combines least squares estimation with the deviation slope mean model. To illustrate the efficiency of the proposed models, we provide numerical examples and solve the linear regression equations of imprecisely observed data and of precisely observed data, respectively. Analysis and comparison show that the deviation slope mean model has the best fitting effect. The discussion section interprets and summarizes the results.
Introduction
Regression analysis is one of the main methods for processing and predicting data in probability theory and mathematical statistics. The absolute error and the residual sum of squares are the main indexes for judging how well a regression equation fits. At present, the least squares method is the best way to solve for the unknown parameters of a linear regression equation because it takes all of the given data into account and minimizes the sum of squared errors. As research on regression models has progressed, many effective methods for solving regression equations have been obtained. For example, in order to solve linear and nonlinear regression equations with high accuracy, Tkachenko and Izonin [1–3] proposed a non-iterative method. However, in some practical problems only imprecise information, or an approximate range of information, can be obtained. In such cases, traditional regression analysis models encounter difficulties. In order to find the relationship between imprecisely observed data, Liu [6] introduced uncertainty theory into regression analysis.
Liu [4] founded uncertainty theory in 2007 and gradually refined it [5–9]. Uncertainty theory handles uncertainty problems on the basis of the normality axiom, duality axiom, subadditivity axiom and product axiom. As research has deepened, uncertainty theory has been further improved and applied to many practical fields [17–21]. The study of uncertain statistics was started by Liu [6] in 2010, who solved the problem of how to construct an uncertainty distribution. In 2012, Chen and Ralescu [12] estimated the distance between Tianjin and Beijing, and the questionnaire method of uncertain statistics proved to be very effective. In order to estimate the unknown parameters of an uncertainty distribution, Liu [6] proposed the least squares principle in 2010. Many researchers have since studied uncertain statistics systematically. For example, Wang and Peng [13] proposed a method of moments for estimating unknown parameters. Guo, Wang and Gao [14] proposed an uncertain linear regression model in 2014, and Wang, Li and Guo [15] put forward a new uncertain regression model in 2020. In 2018, Yao and Liu [10] proposed the least squares estimate for the unknown parameters of an uncertain regression equation, and Song and Fu [11] put forward a least squares method for uncertain multiple linear regression. In 2019, Fang, Liu and Huang [23] presented an uncertain Johnson-Schumacher growth model with imprecise observations and a k-fold cross-validation test.
When the observed data are imprecise, uncertain regression analysis is a very effective tool for studying the relationship between variables. The linear uncertain regression model is an important part of uncertain regression analysis and a common model for dealing with imprecisely observed data. When there are many variables, solving for the unknown parameters of the linear regression equation is a difficult problem. The least squares estimation proposed by Yao and Liu [10] is currently the best method for solving the linear regression. The least squares estimate takes all of the given data into account and minimizes the residual sum of squares of the regression equation. It can also predict data effectively and yield a confidence interval. On the basis of previous studies [24], we propose the slope mean model, the deviation slope mean model and the hybrid model. These three models are interrelated but can be used separately. The slope mean model and the hybrid model are used only to solve the one-dimensional linear regression equation, whereas the deviation slope mean model is extended to multiple linear regression. The deviation slope mean model, which is the focus of this paper, can solve the linear regression equation not only of imprecisely observed data but also of precisely observed data. The article is organized as follows. In Section 2, we introduce uncertainty theory and give formulas for solving the unknown parameters of the linear regression equation. In Section 3, we propose three models for solving the unknown parameters of the uncertain linear regression equation, namely the slope mean model, the deviation slope mean model and the hybrid model. In Section 4, we give two examples and compare the results with the least squares estimation.
Numerical examples show that the slope mean model, the deviation slope mean model and the hybrid model can solve the linear regression equation effectively, and that the deviation slope mean model is superior to the least squares estimation in absolute error. Finally, we summarize the work and point out future research directions.
Uncertain regression model
In 2007, Liu [5] founded uncertainty theory based on three axioms: the normality axiom, the duality axiom and the subadditivity axiom. In 2009, Liu [7] completed the theory by adding the product axiom. Uncertainty theory defines uncertain variables and the uncertainty distribution, and the inverse uncertainty distribution is used to calculate the expected value. Readers interested in uncertainty theory can consult Reference [7]. In this section, we introduce the uncertain regression equation and the least squares estimation in detail, and use the expected value to give the solution formulas for the unknown parameters of the one-dimensional linear regression equation.
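As a small illustration of how the inverse uncertainty distribution yields the expected value, the sketch below uses the standard result that, for a regular uncertainty distribution, the expected value is the integral of the inverse distribution over (0, 1), and that a linear uncertain variable L(a, b) has inverse distribution (1 − α)a + αb and expected value (a + b)/2. The function names `linear_inverse` and `expected_value` are hypothetical and chosen only for this example.

```python
def linear_inverse(a, b):
    """Return the inverse uncertainty distribution of a linear uncertain variable L(a, b)."""
    def phi_inv(alpha):
        return (1.0 - alpha) * a + alpha * b
    return phi_inv


def expected_value(phi_inv, steps=1000):
    """Approximate E[xi] as the integral of phi_inv over (0, 1) using the midpoint rule."""
    h = 1.0 / steps
    return sum(phi_inv((k + 0.5) * h) for k in range(steps)) * h


if __name__ == "__main__":
    # For L(2, 6) the closed form (a + b) / 2 gives 4.
    print(expected_value(linear_inverse(2.0, 6.0)))
```

The numerical integral is only a check on the closed form; for linear uncertain variables one would normally use (a + b)/2 directly.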
Assume that (x1, x2, ⋯, xn) is a vector of explanatory variables and y is a response variable. If the functional relationship between (x1, x2, ⋯, xn) and y can be expressed by
In particular, Liu [9] called
In the traditional regression model, the variables (x1, x2, ⋯, xn, y) are always assumed to be precise data. However, in many cases the observed data are imprecise and have the characteristics of uncertain variables.
Now suppose that we have a set of imprecisely observed data,
Based on the imprecisely observed data (3), Yao and Liu [10] proposed the least squares estimate of the unknown parameter
If the minimization solution is
If the disturbance term ɛ is an uncertain variable, its expected value and variance can be estimated as
and
When the imprecisely observed data are (
and
Denote
and
In the uncertainty theory [9], the calculation formulas of the expected value of the uncertain variable are
and
According to Equations (12) and (13), the specific calculation forms of Equations (10) and (11) are
and
After we get the estimated values
In this section, we always assume that
The slope mean model
The idea of the slope mean model is to first calculate the expected values of the given imprecisely observed data, then calculate the average of the slopes between each imprecise datum and the expected value, and take this average as the slope of the regression equation. Finally, substituting the expected value coordinates gives the fitted regression equation.
The main steps of the slope mean model are as follows.
Step 1. We calculate the expected values of the given imprecisely observed data
Step 2. We calculate the slope between the expected value and each imprecise datum
When the denominator in Equation (17) is 0, the fraction is undefined; the datum is discarded and m is reduced accordingly. We call such data singular data.
Step 3. We take the average value of Equation (17) as the estimated value of β1 and then calculate the estimated value of β0. The formulas are
and
Step 4. According to Equation (12), Equations (18) and (19) are converted to
and
After we get the estimated values
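The four steps of the slope mean model can be sketched in code for the simplest case, in which each observation has already been reduced to a real number (precisely observed data, or the expected values of imprecise data). This is an assumption of the sketch, not the paper's full treatment of uncertain variables; the function name `slope_mean_fit` and the tolerance `tol` are hypothetical. Singular data, whose denominator is 0, are skipped and the count used for the average is reduced accordingly; whether the mean point should also be recomputed after discarding is not specified here, so the sketch keeps it fixed.

```python
def slope_mean_fit(xs, ys, tol=1e-12):
    """Slope mean model for real-valued data (illustrative sketch).

    Returns (beta0, beta1) for the line y = beta0 + beta1 * x.
    """
    m = len(xs)
    # Step 1: the mean point (x_bar, y_bar) of the data.
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m

    # Step 2: slope between the mean point and each datum,
    # discarding singular data whose denominator is zero.
    slopes = [
        (y - y_bar) / (x - x_bar)
        for x, y in zip(xs, ys)
        if abs(x - x_bar) > tol
    ]
    if not slopes:
        raise ValueError("all data are singular; the slope is undefined")

    # Step 3: the average slope is the estimate of beta1, and the line
    # is forced through the mean point to obtain beta0.
    beta1 = sum(slopes) / len(slopes)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


if __name__ == "__main__":
    # Points on y = 1 + 2x; the datum at x = 2 coincides with x_bar and is skipped.
    print(slope_mean_fit([1.0, 2.0, 3.0], [3.0, 5.0, 7.0]))
```

Because each slope divides by the distance from the mean, data close to the mean point can dominate the average, which is one plausible reason for the weaker fit this model shows in the numerical examples.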
The idea of the deviation slope mean model is to calculate the average of the slopes by which each imprecisely observed datum deviates from the regression equation and to take this average as the slope of the regression equation. After substituting the expected value coordinates, we obtain the regression equation. The deviation slope mean model can also be understood as follows: for each imprecisely observed datum above or below the regression line, take the slope between the intercept point and that datum and subtract the slope of the regression equation; the sum of these differences is zero.
We suppose that (
If
According to Equation (12), Equation (13) and the expected value theorem [22], we can take the expected values of both sides of the Equation (24), turn it into a real coefficient equation, and we get
Adding the m equations in Equation (25), we get
Equation (26) is transformed into
Substituting Equation (27) into Equation (23), we get
The calculation results of
In the slope mean model, the denominator is sometimes 0, and the same imprecisely observed data as
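The following sketch gives one possible reading of the deviation slope mean model for real-valued data. It assumes two conditions taken from the verbal description above: the differences between the slope from the intercept point (0, β0) to each datum and the regression slope β1 sum to zero, and the line passes through the mean point. Solving these two conditions together gives the closed form in the code. Because the paper's own equations are not reproduced here, this form should be treated as a reconstruction rather than the authors' formula; the function name `deviation_slope_mean_fit` is hypothetical, and the sketch requires every x value to be non-zero.

```python
def deviation_slope_mean_fit(xs, ys, tol=1e-12):
    """Deviation slope mean model for real-valued data (reconstructed sketch).

    Conditions assumed:
      (1) sum_i [ (y_i - beta0) / x_i - beta1 ] = 0
      (2) beta0 = y_bar - beta1 * x_bar
    """
    m = len(xs)
    if any(abs(x) <= tol for x in xs):
        raise ValueError("this sketch requires every x_i to be non-zero")

    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    a = sum(y / x for x, y in zip(xs, ys)) / m  # mean of y_i / x_i
    b = sum(1.0 / x for x in xs) / m            # mean of 1 / x_i

    # Substituting (2) into (1) gives beta1 * (1 - x_bar * b) = a - y_bar * b.
    denom = 1.0 - x_bar * b
    if abs(denom) <= tol:
        raise ValueError("degenerate data: all x_i are equal")
    beta1 = (a - y_bar * b) / denom
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


if __name__ == "__main__":
    # Points on y = 1 + 2x should be recovered up to rounding.
    print(deviation_slope_mean_fit([1.0, 2.0, 4.0], [3.0, 5.0, 9.0]))
```

Under this reading, the denominators are the x values themselves rather than distances from the mean, so a datum that coincides with the mean point does not have to be discarded.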
The basic idea and calculation process of the deviation slope mean model for multiple linear regression are the same as for one-dimensional linear regression. However, because of the number of independent variables, the calculation is laborious, and in practical applications it is usually carried out with software.
We assume that
We assume that there is a linear functional relationship between uncertain variables
Suppose that the multiple linear regression equation with m equations and n + 1 unknown parameters satisfies the following equations,
We calculate the expected values of both sides of Equation (30), and then Equation (30) can be converted to
According to the matrix theory of linear algebra [16], the matrix form of the Equation (31) is expressed as
Denote
Then Equation (32) can be expressed as
Left-multiplying both sides of Equation (36) by
Here,
The solutions of Equation (38) are the estimated values of the unknown parameters β1, β2, ⋯, βn.
Adding up the m equations of Equation (31), we get
Equation (39) can be transformed into
Denote
Then Equation (40) can be expressed as
Then we can substitute Equation (38) into Equation (43) and we get the solution of
This is the deviation slope mean model of multiple linear regression, which shows that the deviation slope mean model is also applicable to multiple linear regression. The calculated result of
The hybrid model first finds the intercept by least squares estimation and then substitutes it into the deviation slope mean model to solve the regression equation. In some cases, the hybrid model gives a better residual sum of squares.
Suppose the intercept obtained by the least squares estimation is b. Substituting b into Equations (24) and (25), we have
Adding the m formulas in Equation (45), we get
We can take the expected values of both sides of the Equation (46), turn it into a real coefficient equation, and we get
The calculated result of
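A minimal sketch of the hybrid model for real-valued data follows. It assumes that the intercept b comes from ordinary least squares and that the slope is then the average of the slopes from the intercept point (0, b) to each datum, which is how the deviation condition reads once b is fixed. This is a reconstruction from the verbal description, not the paper's exact equations; the function names `least_squares_fit` and `hybrid_fit` are hypothetical, and every x value must be non-zero.

```python
def least_squares_fit(xs, ys):
    """Ordinary least squares for y = beta0 + beta1 * x on real-valued data."""
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x_i are equal; the slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


def hybrid_fit(xs, ys):
    """Hybrid model sketch: least squares intercept, deviation-slope-mean slope."""
    if any(x == 0 for x in xs):
        raise ValueError("this sketch requires every x_i to be non-zero")
    b, _ = least_squares_fit(xs, ys)  # keep only the intercept
    # With b fixed, sum_i [ (y_i - b) / x_i - beta1 ] = 0 gives beta1 as a mean.
    beta1 = sum((y - b) / x for x, y in zip(xs, ys)) / len(xs)
    return b, beta1


if __name__ == "__main__":
    print(hybrid_fit([1.0, 2.0, 4.0], [3.0, 5.0, 9.0]))
```

Note that, unlike the other two models, the line produced this way is not forced through the mean point, since the intercept and the slope come from different criteria.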
In this part, two examples are used to verify the feasibility of the proposed models. One is an example of imprecisely observed data and the other is an example of precisely observed data, since precisely observed data can be considered a special case of imprecisely observed data.
Example of imprecisely observed data
In order to verify the feasibility of the models proposed in this paper, we provide an example of imprecisely observed data. Furthermore, we numerically analyze the estimated expected values and variances of the disturbance terms using the methods of References [25] and [26], and calculate the forecast values and confidence intervals for a new imprecisely observed datum.
Assuming that (
Imprecisely observed data (Linear uncertainty distribution)
The fitted linear regression equations obtained by the least squares estimation, the slope mean model, the deviation slope mean model and the hybrid model are shown in Table 2.
The linear regression equations
As can be seen from Table 2, the fitted equations of the least squares estimation, the deviation slope mean model and the hybrid model are not significantly different, so their fitting effects should be similar. However, the fitted equation of the slope mean model is quite different from those of the other three, and its fitting effect is relatively poor.
The estimated expected values
The estimated expected values of the disturbance terms
It can be seen from Table 3 that the estimated expected values of the disturbance terms for the least squares estimation, the slope mean model, the deviation slope mean model and the hybrid model are all 0.0000, which indicates that all four models fit well.
The estimated variances
The estimated variances of the disturbance terms
It can be seen from Table 4 that the estimated variances of the least squares estimation, the deviation slope mean model and the hybrid model are not significantly different; their dispersion is similar and the estimation effect is good. The slope mean model has the largest variance, hence the largest dispersion and the worst estimation effect.
Now, we assume that
The forecast values
It can be seen from Table 5 that the forecast values of the least squares estimation, the deviation slope mean model and the hybrid model are not significantly different, and the forecasts are close to one another.
If we take the confidence level α = 95%, and the disturbance term is subject to an uncertain normal distribution
The confidence intervals
It can be seen from Table 6 that the confidence intervals of the least squares estimation and the hybrid model are not significantly different and their lengths are close. The confidence interval of the deviation slope mean model is the shortest, so its effect is better.
In general, the three models we proposed can solve the regression equation of imprecisely observed data, and the least squares estimation, the deviation slope mean model and the hybrid model have similar fitting effects.
In this part, we give an example of precisely observed data and verify the feasibility of the models proposed in this paper. In terms of absolute error and residual sum of squares, the models perform well.
Assume that (xi, yi), i = 1, 2, ⋯, 10 are the precisely observed data provided in Table 7.
Precisely observed data
The linear regression equations obtained by the least squares estimation and the three models proposed in this paper are shown in Table 8.
The linear regression equations
Judging from the regression equations, those of the least squares estimation, the deviation slope mean model and the hybrid model are similar, so their fitting effects should be basically the same.
Table 9 shows that the slope mean model has the largest absolute error and the worst fitting effect. The least squares estimation is similar to the hybrid model in absolute error. The deviation slope mean model has the smallest absolute error and is superior to the other three models.
The absolute errors
Table 10 shows that the residual sums of squares of the least squares estimation, the deviation slope mean model and the hybrid model are the same.
The residual sum of squares
In general, the least squares estimation and the hybrid model have similar fitting effect. In terms of absolute error, the deviation slope mean model is the best.
Interpretation of linear regression models
A regression equation describes the population as a whole and summarizes its general characteristics. In practical studies, the population distribution is often impossible to obtain, and the population parameters can only be estimated from sample data. The slope mean model and the hybrid model can solve the one-dimensional linear regression model, and the deviation slope mean model can also solve the multiple linear regression equation.
In the regression equation, the difference between the estimated value and the observed value is called the residual. The smaller the residual sum of squares, the better the model fits. The absolute error is the absolute value of the difference between the estimated value and the observed value, and it measures how close the two are. Therefore, the absolute error is an important index for evaluating a regression model.
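The two indexes described above can be computed directly once a fitted line is available. The sketch below assumes real-valued observations and a fitted line y = β0 + β1x; the function name `fit_metrics` is hypothetical, and returning the per-point absolute errors (rather than a single aggregate) is a choice made for this example.

```python
def fit_metrics(xs, ys, beta0, beta1):
    """Return (absolute_errors, residual_sum_of_squares) for y = beta0 + beta1 * x."""
    # Residual: observed value minus the value estimated by the fitted line.
    residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
    absolute_errors = [abs(r) for r in residuals]
    rss = sum(r * r for r in residuals)
    return absolute_errors, rss


if __name__ == "__main__":
    abs_errs, rss = fit_metrics([1, 2, 3], [3, 5, 8], beta0=1, beta1=2)
    print(abs_errs, sum(abs_errs), rss)
```

Summing the absolute errors and comparing that total with the residual sum of squares shows how two fitted lines can rank differently under the two indexes, because squaring weights large residuals more heavily.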
Conclusion
Based on uncertainty theory, we proposed the slope mean model, the deviation slope mean model and the hybrid model to solve the uncertain linear regression equation. Numerical examples show that the three models can solve for the unknown parameters of the linear regression equation well, and that the deviation slope mean model performs better in terms of absolute error. However, when there are many variables, the calculation becomes difficult. Therefore, in future work we hope to use computer programs to solve for the unknown parameters of the multiple regression equation.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 11701338 and in part by the Natural Science Foundation of Shandong Province under Grant ZR2014GL002.
