Analysis of debt-paying ability of real estate enterprises based on fuzzy mathematics and K-means algorithm

Abstract

The real estate industry is currently subject to considerable fluctuation, so it is particularly important to analyze the solvency of real estate enterprises. To find a reliable model suitable for studying differences in house prices, this study collects research data and, as a basic step, uses the K-means clustering method from machine learning to construct the corresponding model. At the same time, this paper compares the analysis effects of several common machine learning models and identifies their advantages and disadvantages through mathematical statistics. In addition, combined with practice, this paper constructs a nonlinear generalized additive model and, based on machine learning technology, validates the model using the collected predictors. With regard to improving the solvency of real estate enterprises, diversified operation can maintain reasonable cash flow and make up for the poor liquidity of real estate. Furthermore, this paper uses the stability method to find the optimal model. The generalized additive model effectively reveals the complex nonlinear relationship between continuous predictors and house prices. The research shows that the nonlinear generalized additive model based on machine learning can play an important role in real estate industry forecasting and has theoretical reference significance for subsequent related research.

Keywords

Real estate; generalized additive model; machine learning; K-means algorithm

1 Introduction

With the development of China’s real estate market, transactions such as buying and selling, mortgages and other activities are increasingly widespread, so the real estate evaluation work that market transactions necessarily involve has received more general attention. At present, the real estate industry faces the problem of how to explore more scientific and effective evaluation methods. In actual assessment, three traditional evaluation methods are commonly used in China: the market comparison method, the cost method and the income method, all of which rely heavily on the evaluator’s experience. At the same time, the evaluation of each property requires a lot of time and effort from professional evaluators, and the cost of application is relatively high. In the small number of studies that use model-based valuation, domestic scholars have mostly adopted multiple linear regression and have rarely introduced other emerging real estate assessment techniques and methods [1].

Real estate assessment can be divided into single assessment and batch assessment according to the number of assessment targets. Single assessment evaluates one property, while batch assessment evaluates a range of properties at the same time. At present, China’s real estate appraisal mainly uses traditional methods such as the market comparison method, cost method and income method to conduct single assessments. With increasingly frequent real estate transactions, evaluation problems such as mortgage value assessment, house demolition price assessment, property tax base assessment and real estate asset verification have gradually become matters of concern [2]. If the traditional single assessment method is still used, these evaluation needs cannot be met. There is an urgent need for an efficient, scientific, objective and accurate method of evaluating real estate. A batch evaluation method that combines mathematical statistics, makes full use of computer technology, and can simultaneously evaluate multiple properties with similar characteristics can effectively improve on the traditional real estate evaluation methods [3].

In recent years, foreign scholars have begun to apply machine learning methods to real estate appraisal. When assessing real estate prices, they interpret the price from the perspective of home buyers’ demand and use this as the theoretical basis for evaluation [4]. In applying characteristic price theory, previous studies mostly used multiple linear regression for regression prediction. However, the simple linear functional form used in multiple linear regression has a large impact on the evaluation. On the one hand, it relies on an artificial linearity assumption. On the other hand, choosing the functional form before building the model makes the model more dependent on artificial assumptions, which causes larger errors [5].

This paper applies recent research results in machine learning to build a regression prediction of prices from characteristic variables and to establish a real estate price evaluation model. It also aims to establish a real estate valuation model that relies less on human assumptions, is simpler to apply and has higher precision.

2 Related work

Scholars around the world began studying real estate appraisal early and have introduced various new appraisal techniques and methods, such as GIS (Geographic Information System), neural networks, support vector machines and random forests. Scholars at home and abroad have also carried out extensive research on housing price forecasting [6]. Because foreign research on housing prices started earlier, data collection is more convenient and the series cover a longer period; foreign research on housing prices began as early as the 1990s. It can be traced back to DiPasquale and Wheaton (1994), who explored the dynamic mechanism of house prices in the United States in the 1980s, used macroeconomic variables to estimate housing prices for the first time, and found that macroeconomic variables can improve the accuracy of house price forecasts [7]. Brown et al. used a time-varying parameter model to predict quarterly housing prices in the United Kingdom from 1968 to 1992, and the prediction improved on the traditional time series model [8]. Crawford and Fratantoni used ARIMA, GARCH and regime-switching methods to predict quarterly housing prices for five US states from 1979 to 2001. They found that the regime-switching method, which allows parameters to vary, performs relatively well [9], but for out-of-sample prediction the traditional ARIMA method works well, so it is difficult to find a model that leads in overall performance. Hadavandi et al. used a panel data fixed-effects model to study annual housing prices in 20 regions of Iran since 2001 and found that it predicts better than the ordinary OLS method [10].
Rapach and Strauss compiled quarterly housing prices for the top 20 cities in the United States for 1995-2006 and compared the forecasts of AR models with those of other time series models. The results show that for cities with faster-rising housing prices, no model's predictions meet the requirements [11]. In their book, Ghysels et al. review the recent housing price forecasting literature and summarize the shortcomings and problems of traditional forecasting models [12]. It is worth mentioning that Bork and Møller recently analyzed quarterly data from the 50 US states for 1976-2012 with a dynamic approach and found that its prediction accuracy is more than 30% higher than that of traditional time series methods. Bork and Møller also applied factor analysis to house price forecasts using data for 122 cities in the United States. The study found that a simple three-factor model achieved better results for the first forecast period, and that its prediction performance remains strong three to twelve periods ahead. In out-of-sample prediction, the three-factor model performs better than common models such as autoregression [13]. In their prediction of Greek housing prices, Panagiotidis and Printzis mainly use the VECM model and determine that the proportion of mortgages and commercial housing is the main macroeconomic variable affecting the growth of housing prices in Greece [14]. Christou et al. attempt to predict quarterly housing prices in 10 OECD countries through economic policy uncertainty (EPU). Combining VAR and BMA models with static and dynamic dependencies, they find that EPU plays an important role in predicting house prices in every model. They also confirm that the BMA model contains more predictive information and outperforms the traditional autoregressive model [15].

There is currently less literature on housing price research in China. Lu Jinzhu first analyzed the trend of housing prices in Wuhan from 2004 to 2005. Combined with the analytic hierarchy process, the Kalman filtering method was used to improve the traditional OLS method, and the post-test results showed a prediction error of less than 10% [16]. Yan et al. used a gray prediction model with wavelet neural network error correction, together with a traditional time series model, to predict the overall growth trend of national house prices from 2006 to 2007 and to estimate the trend for the following quarter. The study predicted that national commercial housing sales prices would increase by 6.88% in the fourth quarter of 2006 and by 6.64% in the first quarter of 2007 [17]. Li Daying established a national monthly house price model based on rough sets and wavelet neural networks. The method is similar to that of Yan et al. and predicts monthly house prices for 2005-2009. The results show that it is more accurate than linear regression [18]. Hu Liuxing and Wu Jiefei focused on forecasting housing prices in Shanghai Pudong New Area, with grey system theory as the model basis. They found that the fitting precision and prediction accuracy of the GM (1,1) model are better than those of the traditional univariate regression model, and that GM (1,1) performs better when the number of samples is small [19]. Lian Xiaoli studied the national commodity housing price index and used models such as exponential smoothing and ARIMA to compare the merits of each method [20]. Shen Ruina used principal component analysis to predict the annual house price of Shanghai from 1998 to 2012 and found that this method is more accurate than ARIMA [21].
Hou Puguang and Qiao Zequn studied housing prices in Taiyuan City, Shanxi Province, from 2001 to 2012 and, based on a hybrid of wavelet theory and the ARIMA method, decomposed the original data for modeling and prediction [22–24].

Taken together, domestic research on housing prices targets either the national level or individual cities and lacks comprehensive coverage of most cities. In addition, the sample periods are short and the data frequency is low [25, 17]. Although a variety of models are used to modify and improve the explanatory variables, the final house price is still predicted with traditional methods such as OLS; the prediction model allows only the variable coefficients to change over the time interval, while the variable settings are fixed [28]. This clearly does not suit a unified monthly housing price forecasting system for large and medium-sized cities across the country [29].

3 Theoretical research

3.1 K-means clustering

The basic idea of the K-means algorithm is as follows. First, k initial points are randomly chosen as centroids, and each point is assigned to the cluster of its nearest centroid. The mean of all the points in each cluster is then taken as the new centroid of that cluster, and the two steps are repeated until the centroids converge.

The process is as follows:

First, k cluster centroids randomly selected from the n samples are recorded as (u₁, u₂, ⋯, u_k). Then, the following process is repeated until convergence [23]:

For each sample i, the class to which it belongs is calculated:

$Y_{i} = \underset{j}{\arg\min}\, L_{2} (X_{i}, u_{j})$, where $L_{2} (X_{i}, X_{j}) = {(\sum_{l = 1}^{n} {| X_{i l} - X_{j l} |}^{2})}^{\frac{1}{2}}$

Each class j is recalculated for the centroid of the class:

$u_{j} = \frac{\sum_{i = 1}^{n} I (Y_{i} = j) X_{i}}{\sum_{i = 1}^{n} I (Y_{i} = j)}$

Among them, I (·) is the indicator function, and the L₂ (·) function has the same meaning as the Euclidean distance in the KNN regression to be discussed later.
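
As an illustration, the two steps above (assignment to the nearest centroid, then centroid update) can be sketched in plain Python. This is a minimal sketch rather than the implementation used in this paper; the function names `squared_distance` and `kmeans`, the fixed random seed and the toy data are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean (L2) distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means; returns (labels, centroids). Hypothetical helper."""
    rng = random.Random(seed)
    # Initialization: pick k distinct samples as the starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: Y_i = argmin_j L2(X_i, u_j).
        # Minimizing the squared distance gives the same argmin.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        # Update step: u_j = mean of the points currently assigned to cluster j.
        for j in range(k):
            members = [p for p, y in zip(points, labels) if y == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Toy data (made up): two well-separated groups in the plane.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(data, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster receives label 0 depends on the random initialization.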

The K-means clustering algorithm belongs to unsupervised learning in machine learning. The biggest difference between clustering and classification is that in classification the target categories are known in advance, whereas in clustering the categories are not predefined. The clustering algorithm can be applied to almost all objects. The more similar the objects within a cluster and the larger the gap between clusters, the better the clustering effect. The number of clusters k is specified by the user, and each cluster is described by its centroid, that is, the mean of all points in the cluster.

3.2 OLS linear model approach

The standard linear model is assumed to be [24]:

$\begin{matrix} Y_{i} = X_{i}^{T} β + ɛ_{i}; ɛ_{i} \sim iid; E (ɛ_{i}) = 0; \\ var (ɛ_{i}) = σ^{2}; i = 1 \dots n \end{matrix}$ (1)

Among them, ɛ_i is the error term, X_i ∈ R^{p+1} is the observed predictor vector, and β ∈ R^{p+1} is the (p + 1)-dimensional coefficient vector.

The OLS model will be introduced in detail from the parameter estimation of the linear model, the model assumptions, the model selection and the meaning of the parameters involved in the modeling process.

First, the most common way to fit a linear model is OLS:

$\hat{β} = \underset{β}{\arg\min}\, {(Y - X β)}^{T} (Y - X β)$ (2)

The estimated value of the parameter vector β can be obtained from Equation (2): ${\hat{β}}^{ols} = {(X^{T} X)}^{- 1} X^{T} Y$.
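
The closed-form estimate can be illustrated by solving the normal equations $(X^{T} X) β = X^{T} Y$ directly. The following is a minimal sketch in plain Python under stated assumptions: the helper names `solve_linear_system` and `ols_fit` are hypothetical, an intercept column of ones is prepended to the predictors, and Gauss-Jordan elimination is used only to keep the example self-contained (a numerical library would normally be preferred).

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.

    Assumes a is square and non-singular (hypothetical helper).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_fit(predictors, y):
    """Return the OLS coefficients (intercept first) from the normal equations."""
    X = [[1.0] + list(row) for row in predictors]  # add the intercept column
    d = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(d)] for a in range(d)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(d)]
    return solve_linear_system(xtx, xty)


# Toy data (made up) generated exactly from y = 1 + 2x.
beta_hat = ols_fit([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0])
```

Because the toy responses lie exactly on a line, the estimate recovers an intercept of 1 and a slope of 2 up to rounding error.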

The least squares estimate has many excellent properties:

First,

$E ({\hat{β}}^{ols}) = β$

$cov ({\hat{β}}^{ols}) = σ^{2} {(X^{T} X)}^{- 1}$

Second, e = Y - Xβ is the error vector. Replacing β with the least squares estimate ${\hat{β}}^{ols}$ gives the residual vector $\hat{e} = Y - X {\hat{β}}^{ols}$. In statistics, $RSS = {\hat{e}}^{T} \hat{e}$ is used to measure the size of σ². Here, RSS (Residual Sum of Squares) is the sum of squared residuals, and its size reflects the degree of deviation, or degree of fit, between the actual data and the theoretical model (1). It has the following properties:

$RSS = Y^{T} (I - X {(X^{T} X)}^{- 1} X^{T}) Y$

${\hat{σ}}^{2} = \frac{RSS}{n - p - 1}$ is an unbiased estimate of σ²

Third, for the linear regression model (1), assuming e ∼ N (0, σ²I), then

$\hat{β} \sim N (β, σ^{2} {(X^{T} X)}^{- 1})$

$\frac{RSS}{σ^{2}} \sim χ^{2} (n - p - 1)$

$\hat{β}$ and RSS are independent of each other.

Once the parameter estimates are obtained, the next step is to quantify the goodness of fit of the model, that is, to determine to what extent the linear model of the predictors explains the response variable. The coefficient of determination is such a measure:

$R^{2} = 1 - \frac{RSS}{SST}$ (3)

Among them, $SST = \sum_{i = 1}^{n} (Y_{i} - \bar{Y})^{2}$ is the total sum of squares and $\bar{Y} = \frac{1}{n} \sum_{i = 1}^{n} Y_{i}$.

R² represents the proportion of the total variation in the response that is explained by the model. It is generally believed that the larger R² is, the better the model fits. However, as the number of variables increases, R² never decreases, so it rises even when meaningless variables are added. To solve this problem, the adjusted coefficient of determination (adjusted R-square) is generally used:

$R_{adj}^{2} = 1 - \frac{n - 1}{n - p - 1} \left(\frac{RSS}{SST}\right)$ (4)
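
Equations (3) and (4) can be checked on a small example. The sketch below assumes that the fitted values are already available; the function name `r_squared` and the toy numbers are illustrative and do not come from the data of this paper.

```python
def r_squared(y, y_hat, p):
    """Return (R^2, adjusted R^2) for observations y, fitted values y_hat
    and p predictors (intercept not counted)."""
    n = len(y)
    y_bar = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    r2 = 1.0 - rss / sst
    r2_adj = 1.0 - (n - 1) / (n - p - 1) * (rss / sst)
    return r2, r2_adj


# Toy example (made up): RSS = 0.1, SST = 10, n = 5, p = 1.
r2, r2_adj = r_squared([1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 1.9, 3.2, 3.8, 5.0], p=1)
```

Here R² is 0.99 while the adjusted value is slightly lower (about 0.987), reflecting the penalty for the number of predictors.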

However, merely obtaining model fitting results is not sufficient because they do not reflect the statistical significance of the coefficient estimates for these predictors. If the coefficient of a predictor variable is negative by the OLS method and the value is small, there are two possibilities for this phenomenon. First, the predictor does have a negative effect on the response variable, but the effect is relatively small. Second, it is possible that the predictor has no effect on the response variable, and the obtained estimate is completely caused by the random error, so it is necessary to strictly verify the significance of the model and each variable.

The test of the significance of the model checks whether at least one predictor has an important explanatory effect on the response variable. The F test can be used for this purpose. Its null hypothesis and alternative hypothesis are as follows:

H₀ : β_i = 0 ∀ i ; H₁ : There is at least one β_i ≠ 0

The test statistics for the F test are:

$F = \frac{(SST - RSS) / p}{RSS / (n - p - 1)}$ (5)

If the null hypothesis is correct, the statistic follows an F distribution with (p, n - p - 1) degrees of freedom. Given a significance level α, if $F > F_{1 - α} (p, n - p - 1)$, the null hypothesis is rejected and the alternative hypothesis is accepted; otherwise, the null hypothesis is not rejected.
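
As a worked illustration of Equation (5), the statistic can be computed directly from SST, RSS, n and p. The function name `f_statistic` and the numbers are assumptions for the example; the critical value $F_{1 - α} (p, n - p - 1)$ would still have to be taken from an F table or a statistics library, which this sketch does not do.

```python
def f_statistic(sst, rss, n, p):
    """Overall F statistic: ((SST - RSS) / p) / (RSS / (n - p - 1))."""
    explained_mean_square = (sst - rss) / p
    residual_mean_square = rss / (n - p - 1)
    return explained_mean_square / residual_mean_square


# Toy values (made up): SST = 10, RSS = 0.1, n = 5 observations, p = 1 predictor.
f_value = f_statistic(sst=10.0, rss=0.1, n=5, p=1)  # (9.9 / 1) / (0.1 / 3) = 297
```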

However, the F test does not tell us which predictors have an effect on the response variable. Therefore, we need to check the significance of the predictors one by one, using the t test. For a given predictor X_j, the null hypothesis and the alternative hypothesis are as follows: H₀ : β_j = 0 ; H₁ : β_j ≠ 0. The test statistic of the t test is:

$t = \frac{{\hat{β}}_{j}}{\hat{σ} \sqrt{v_{j} (X)}}$ (6)

Among them, v_j (X) is the j-th diagonal element of $(X^{T} X)^{- 1}$. If the null hypothesis is correct, the statistic follows the t distribution with (n - p - 1) degrees of freedom. Given a significance level α, if $|t| > t_{1 - α / 2} (n - p - 1)$, the null hypothesis is rejected and the alternative hypothesis is accepted; otherwise, the null hypothesis is not rejected.
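
For the special case of one predictor plus an intercept (p = 1), the t statistic of the slope can be computed by hand, because the slope's diagonal element of $(X^{T} X)^{- 1}$ equals $1 / \sum_{i} (x_{i} - \bar{x})^{2}$. The sketch below is restricted to that case; the function name `slope_t_statistic` and the toy data are assumptions for illustration.

```python
import math


def slope_t_statistic(x, y):
    """t statistic of the slope in the simple regression y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # OLS slope
    b0 = y_bar - b1 * x_bar        # OLS intercept
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(rss / (n - 2))  # n - p - 1 degrees of freedom with p = 1
    v_slope = 1.0 / sxx            # slope's diagonal element of (X^T X)^{-1}
    return b1 / (sigma_hat * math.sqrt(v_slope))


# Toy data (made up).
t_slope = slope_t_statistic([1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 1.9, 3.2, 3.8, 5.0])
```

The resulting value would then be compared with the critical value of the t distribution with n - 2 = 3 degrees of freedom.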

Second, when discussing the OLS linear model, we need to make some statistical assumptions about the model:

Independence: the response variables of different samples are independent of each other

Linearity: the relationship between the response variable and the predictors is linear

Equal variance: the variance of the response variable does not change with the predictors

These assumptions are the premise of the OLS model. If they do not hold, the results of the model will be strongly biased and its generalization ability will be poor.

When the response variable Y is not very linearly related to the predictor in the OLS model, and the error does not obey the normal distribution, we can consider carrying out logarithmic transformation for the response variable Y:

$\tilde{Y} = \ln Y$ (7)

In turn, $\tilde{Y}$ satisfies $\tilde{Y} = X β + ɛ, ɛ \sim N (0, σ^{2} I)$. That is, ln Y is treated as a separate variable, and the ordinary least squares estimate can then be used to estimate the parameters. Unlike in the linear regression model, a coefficient of the log-linear model represents the approximate percentage change in the response variable caused by a one-unit change in the predictor; that is, the estimated coefficient should be interpreted as a “growth rate”.
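
The “growth rate” reading is an approximation. Under the log-linear model, a one-unit increase in a predictor multiplies Y by exp(β), so the exact relative change is exp(β) - 1, which is close to β only when β is small. The short sketch below compares the two; the function name `percent_change` and the coefficient values are hypothetical and are not estimates from this paper.

```python
import math


def percent_change(beta):
    """Exact percentage change in Y for a one-unit increase in a predictor
    whose coefficient in the model ln(Y) = X beta + error is `beta`."""
    return 100.0 * (math.exp(beta) - 1.0)


# Hypothetical coefficients: the approximation 100 * beta is accurate for
# small coefficients and drifts for larger ones.
for beta in (0.01, 0.10, 0.30):
    approx = 100.0 * beta
    exact = percent_change(beta)
    print(f"beta={beta:.2f}  approx={approx:.2f}%  exact={exact:.2f}%")
```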

Finally, variables are selected using the AIC criterion, a model selection criterion that the Japanese statistician Akaike proposed on the basis of the principle of maximum likelihood estimation:

$AIC = n \left\{ \log \left(\frac{RSS}{n}\right) + 1 + \log (2 π) \right\} + 2 (q + 1)$ (8)

Among them, q is the number of predictors selected into the model. When more predictors are selected, the residual sum of squares RSS in Equation (8) decreases. Since the natural logarithm is a monotonically increasing function, the first term decreases as well. However, the second term increases with the number of selected predictors. As long as the reduction in variance from adding a predictor outweighs the penalty for adding it, the value of AIC decreases. Once the number of predictors reaches a certain point, the penalty for adding a predictor exceeds the variance reduction it brings, and the value of AIC increases. Therefore, the principle of using AIC to select variables is that the model that minimizes AIC is the “optimal” model.
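
The trade-off described above can be illustrated with a few candidate models. The sketch below evaluates Equation (8), reading the factor n as multiplying the whole bracket, for hypothetical (q, RSS) pairs and picks the minimizer; the function name `aic` and all numbers are made up for the example.

```python
import math


def aic(rss, n, q):
    """AIC = n * (log(RSS / n) + 1 + log(2 * pi)) + 2 * (q + 1)."""
    return n * (math.log(rss / n) + 1.0 + math.log(2.0 * math.pi)) + 2.0 * (q + 1)


n = 100
# Hypothetical residual sums of squares for models with q = 1..4 predictors.
# RSS always falls as q grows, but by less and less.
candidates = {1: 250.0, 2: 180.0, 3: 178.0, 4: 177.5}
scores = {q: aic(rss, n, q) for q, rss in candidates.items()}
best_q = min(scores, key=scores.get)
```

In this made-up example the second predictor lowers RSS enough to pay for its penalty, while the third and fourth do not, so the model with q = 2 has the smallest AIC.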

3.3 Nonlinear model method for generalized additive models

The generalized additive model is suitable for dealing with complex nonlinear relationships between the response variable and many predictors. It assumes that the effects of the predictors are additive and allows each covariate to enter through an unrestricted smooth function rather than only through a parametric function. The model is built by applying a smooth function to some or all of the predictors. In general, the generalized additive model is assumed to be:

$g (E (Y | X_{1}, . . ., X_{p})) = β_{0} + \sum_{j = 1}^{p} f_{j} (X_{j}), \quad E f_{j} = 0 \ \forall j$ (9)

Among them, g is the link function, β₀ is the intercept term, and f_j (j = 1, . . . , p) are unknown smooth functions. Since the response variable Y is continuous, we can use the identity link function g (u) = u to obtain the additive model:

$Y_{i} = β_{0} + \sum_{j = 1}^{p} f_{j} (X_{ij}) + ɛ_{i}, i = 1, . . ., n$ (10)

The choices for f_j include smoothing splines, B-splines, local polynomial regression, natural cubic splines, and more. For convenience, each predictor X_j is normalized to the [0, 1] interval. Assuming that each f_j is twice differentiable with a square-integrable second derivative, the f_j can be obtained by minimizing the penalized sum of squared residuals:

$\sum_{i = 1}^{n} {[Y_{i} - β_{0} - \sum_{j = 1}^{p} f_{j} (X_{ij})]}^{2} + \sum_{j = 1}^{p} λ_{j} \int_{0}^{1} {[f_{j}'' (t_{j})]}^{2} {dt}_{j}$ (11)

Among them, λ_j is the smoothing parameter, which balances the goodness of fit to the data against the smoothness of the function to be estimated. In this paper, the functions are estimated in R by fitting the generalized additive model with the mgcv package. The degree of smoothing is chosen by minimizing the generalized cross-validation (GCV) score, defined as:

$GCV = \frac{RSS}{\frac{1}{n} {[n - tr (A)]}^{2}}$ (12)

Among them, $RSS = \sum_{i = 1}^{n} {[Y_{i} - \sum_{j = 1}^{p} {\hat{f}}_{j} (X_{ij})]}^{2}$ and A is the smoothing matrix that satisfies $\hat{Y} = AY$. In practical applications, the model can be further extended to the following semi-parametric model:

$Y_{i} = \sum_{j = 1}^{q} f_{j} (X_{ij}) + \sum_{j = q + 1}^{p} β_{j} X_{ij} + ɛ_{i}$ (13)

The advantage of the generalized additive model is that it can handle highly nonlinear and non-monotonic relationships between the response variable and the predictors, and that it is a data-driven model. That is, the relationship between the response variable and the predictors is determined by the data rather than subjectively. At the same time, the generalized additive model can treat different types of predictors differently. For example, Equation (13) fits some predictors linearly while the other variables are fitted through smooth functions. For this reason the generalized additive model is sometimes called a semi-parametric model, and it is highly flexible.
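
The role of the GCV score in choosing the smoothing parameter can be sketched as follows. The example uses the standard definition n·RSS / [n - tr(A)]², which is an assumption about the intended form of Equation (12); the function name `gcv_score` and the candidate values of λ, RSS and tr(A) are hypothetical rather than output from mgcv.

```python
def gcv_score(rss, n, trace_a):
    """Generalized cross-validation score n * RSS / (n - tr(A))^2,
    where tr(A) is the effective degrees of freedom of the smoother."""
    return n * rss / (n - trace_a) ** 2


n = 100
# Hypothetical fits: a small lambda gives a wiggly fit (low RSS, large tr(A)),
# a large lambda gives a smooth fit (high RSS, small tr(A)).
fits = [
    (0.1, 40.0, 30.0),   # (lambda, RSS, tr(A))
    (1.0, 45.0, 12.0),
    (10.0, 70.0, 4.0),
]
scores = {lam: gcv_score(rss, n, tr) for lam, rss, tr in fits}
best_lambda = min(scores, key=scores.get)
```

With these made-up numbers the intermediate value λ = 1.0 gives the smallest score: it avoids both the large effective degrees of freedom of the wiggly fit and the large residuals of the over-smoothed fit.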

4 Preliminary analysis of the data

The data source of this article is the second-hand housing data published by a real estate software application. The data were collected in May 2018 and comprise 16210 units in total, involving 2956 communities and 173 areas. The data include 8 predictors, of which 4 are discrete and 4 are continuous (the distance from the city center is calculated from latitude and longitude). The response variable is the price per unit area. The specific variables are as follows:

Response variable: price per unit area. Among the 16210 listings, the lowest price is 18,300 yuan/m² and the highest price is 149,800 yuan/m².

Predictors:

House area (AREA): the smallest area is only 30.06 m² and the largest is 299 m².

Number of bedrooms: at least one bedroom and at most 5 bedrooms.

Number of living rooms (halls): some houses have no living room, and the maximum is three living rooms.

Floor level (floor): a nominal variable with three levels, high, middle and low. In the later modeling, the high level is used as the benchmark.

District: a nominal variable covering six urban districts, namely Chaoyang District, Dongcheng District, Fengtai District, Haidian District, Shijingshan District and Xicheng District. In the later modeling, Chaoyang District is used as the benchmark.

Subway (subway): whether the property is close to the subway. If it is, it is a subway house; otherwise it is a non-subway house. In the data processing, 1 represents a property near the subway and 0 represents a property farther from the subway. 82.78% of the collected listings are subway houses.

School district: whether the property is near a school. If it is, it is a school district house; otherwise it is a non-school district house. In the data processing, 1 represents a school district house and 0 a non-school district house. 30.31% of the collected houses are school district houses.

Distance from the city center (distance, DS): the closest distance to the city center is 0.41 km and the farthest is 85.79 km.

The data sample includes 16210 houses, an ample sample size. The first issue we consider is how to classify the listings. This paper uses the simple and fast K-means clustering algorithm for this purpose. The basic principle of clustering is that the differences within a group should be as small as possible and the differences between groups as large as possible. We cluster on the area, the distance from the city center, and the house price. Fig. 1 shows the within-group sum of squared deviations for different numbers of clusters:

Fig.1

Within-group sum of squared deviations for different numbers of clusters.

As can be seen from the figure, when the whole sample is treated as one class, the within-group sum of squared deviations reaches its maximum. As the number of clusters increases, the within-group sum of squared deviations gradually decreases. In the extreme case where the number of clusters equals the sample size, each sample forms its own class and the within-group sum of squared deviations is 0. The figure shows that once the number of clusters exceeds 5, the within-group sum of squared deviations decreases very slowly, so k = 5 is selected. Table 1 shows the sample divided into five categories, of which the second category is the smallest, the fourth category is less than half the size of the remaining categories, and the third category is the largest:

Table 1

Sample clustering

Category	Sample size
1	5711
2	893
3	5922
4	3311
5	5236
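
The quantity used above to choose k, the within-group sum of squared deviations, can be sketched for a given assignment of samples to clusters. The function name `within_group_ss` and the four toy points are assumptions for illustration; the two extreme cases in the example mirror the behavior described for the figure (maximum with a single class, zero when every sample is its own class).

```python
def within_group_ss(points, labels):
    """Sum over clusters of squared Euclidean deviations from the cluster mean."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(point, centroid))
            for point in members
        )
    return total


# Toy data (made up): two tight pairs that are far apart.
pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
ss_one_class = within_group_ss(pts, [0, 0, 0, 0])      # whole sample as one class
ss_two_classes = within_group_ss(pts, [0, 0, 1, 1])    # the natural grouping
ss_all_separate = within_group_ss(pts, [0, 1, 2, 3])   # each sample its own class
```

On these points the values are 104, 4 and 0: most of the reduction is achieved by the first split, which is the pattern an elbow plot is meant to reveal.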

Table 2 shows the results of K-means clustering, comparing the average area, distance from the city center and house price of each category. As can be seen from the table, the first category of housing averages about 79 square meters, the smallest area of the five categories; it is closest to the city center and the most expensive. The second category is relatively large, but it is the farthest from the city center and the cheapest. The third category is similar in area to the first, lies at a moderate distance from the city center, and is somewhat expensive. The fourth category has the largest area, also lies at a moderate distance from the city center, and is slightly more expensive than the third. The fifth category is slightly larger than the third, is farther from the city center than the third, and is not very expensive compared with the other categories. Tables 3 and 4 show the contribution rate of each predictor to the goodness of fit of the model.

Table 2

Mean values of the studied variables by K-means cluster

Category	AREA (m²)	Distance (km)	Price (10,000 yuan/m²)
1	79.00	6.54	10.03
2	105.82	29.90	4.40
3	83.99	9.38	6.05
4	193.35	10.95	6.94
5	95.77	17.09	4.90

Table 3

Contribution rate of each predictor to the goodness of fit of the model

	Proportion	Accumulated percentage
City Proper	53.58%	53.58%
School	21.16%	74.74%
DS	19.94%	94.67%
Subway	3.95%	98.62%
AREA	0.68%	99.30%
Halls	0.30%	99.60%
Bedroom	0.14%	99.74%
Low	0.13%	99.87%
Middle	0.13%	100.00%

Table 4

Contribution rate of each predictor to the goodness of fit of the model

	Proportion	Accumulated percentage
City Proper	54.72%	54.72%
DS	20.67%	75.39%
School	18.47%	93.86%
Subway	4.85%	98.71%
AREA	0.52%	99.22%
Halls	0.38%	99.60%
Low	0.15%	99.75%
Middle	0.14%	99.89%
Bedroom	0.11%	100.00%

According to the above clustering results, the listings can be characterized as follows: a. The first category is the prime-location type: although the area is small, the price is very high; b. The second category is the suburban type: the price is not very high and the area is quite large, but the distance is very far; c. The third and fifth categories are the popular type: the house is not large, the price is moderate, and the distance from the city center is moderate; d. The fourth category is the large-unit type: the average house area is more than 170 square meters, the numbers of halls and rooms are relatively large, the price is slightly expensive, and the distance from the city center is moderate.

5 Empirical evidence of model data

The predictors are ranked by importance, and Fig. 2 shows graphically how much each predictor contributes to R². A single predictor explains up to 53.58% of R². It can be seen from Fig. 2 that the urban district explains most of R² and is the most important predictor. Next, the school district explains 21.16% of R² and ranks second in importance. The remaining predictors, in order, are the distance from the city center (19.94%), the subway (3.95%), the area (0.68%) and so on.

Fig.2

Pareto chart of the contribution rate of each predictor to the goodness of fit of the model.

Independence: In this study, many houses inevitably come from the same community, which makes it difficult to guarantee the independence of house prices. Under market competition, the price of one house more or less affects the price of another. This is an objective fact, so independence is not discussed further.

Linear test: It can be seen from the residuals-versus-fitted plot in Fig. 3 that the green line is clearly not a straight line, so the linearity assumption of the model is not satisfied.

Fig.3

Linear test.

Homogeneity test of variance: As can be seen in Fig. 4, the points are not randomly distributed around the best-fit green curve. This indicates a serious heteroscedasticity problem; empirically, a logarithmic transformation can alleviate severe heteroscedasticity.

Fig.4

Homogeneity test of variance.
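The text does not name the test behind this diagnosis. As an illustration only, the sketch below computes a Breusch-Pagan style statistic (in its studentized form, n times the R² of an auxiliary regression of squared residuals on the predictor) for a single predictor. The synthetic data and all function names are assumptions made for the example and do not come from the study.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of a one-predictor least-squares fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(x, y):
    """Coefficient of determination of a one-predictor fit (0 if y is constant)."""
    a, b = ols_fit(x, y)
    my = sum(y) / len(y)
    sst = sum((yi - my) ** 2 for yi in y)
    if sst == 0.0:
        return 0.0
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - sse / sst


def breusch_pagan_lm(x, y):
    """LM statistic: n * R² from regressing squared residuals on x."""
    a, b = ols_fit(x, y)
    sq_resid = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    return len(x) * r_squared(x, sq_resid)


# Synthetic example (not the study data): the error spread grows with x.
x = [float(i) for i in range(1, 101)]
y = [2.0 * xi + (1 if i % 2 == 0 else -1) * 0.05 * xi
     for i, xi in enumerate(x, start=1)]

lm = breusch_pagan_lm(x, y)
# With one predictor the statistic is compared with a chi-square
# distribution with 1 degree of freedom (5% critical value about 3.84).
print(lm > 3.84)
```

A large statistic relative to the chi-square critical value points to non-constant variance, which is the situation Fig. 4 describes.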

The linear model therefore does not pass the assumption tests well, so we take the logarithm of the response variable for further analysis.

After the logarithmic transformation of the response variable, a logarithmic model is established to see how the various predictors explain the variance of the model and to diagnose the model assumptions. First, the model coefficients are interpreted. The coefficient table shows that the model is still significant (the F test is passed). The t test for the number of rooms is significant only at the 0.05 level, while the t tests for the other predictors are significant at the 0.001 level. At the same time, the goodness of fit has improved. One point must be noted when interpreting the model: the coefficient estimates of the log-linear model should be read as growth rates. That is, when a predictor changes by one unit, the response variable changes by a percentage. Therefore, holding the other variables constant, the following conclusions are drawn:

(1) For the urban area predictor, Fengtai District has the lowest price per unit area and Xicheng District has the highest, 29% higher per square meter than Chaoyang District;

(2) For the floor predictor, the highest floor level has the lowest price per unit area, and the floor level with the highest price per unit area is on average 3.4% more expensive than the upper level;

(3) For the subway predictor, the unit price of a subway house is 9.7% higher than that of a non-subway house;

(4) For the school district predictor, school district housing is 17% more expensive per unit area than non-school-district housing;

(5) For the number of rooms, the price per unit area increases by 0.68% for each additional room;

(6) For the number of living rooms, the price per unit area increases by 8.2% for each additional living room;

(7) An increase in the area of the house results in a decrease in the price per unit area;

(8) For every kilometer closer to the city center, the price per square meter increases by 1.5%.
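The growth-rate reading of the coefficients above can be made concrete. When the response is log-transformed, a coefficient β corresponds to an exact percentage change of 100·(e^β − 1) per unit change in the predictor, which is close to 100·β when β is small. The text does not say which of the two conversions the reported percentages use, and the coefficient values below are hypothetical, chosen only to show the size of the gap.

```python
import math


def percent_effect(beta):
    """Exact percentage change in the response for a one-unit change in a
    predictor when the response has been log-transformed."""
    return 100.0 * (math.exp(beta) - 1.0)


# Hypothetical coefficients, not estimates from the study.
for beta in (0.01, 0.10, 0.30):
    approx = 100.0 * beta
    exact = percent_effect(beta)
    print(f"beta={beta:.2f}  approx={approx:.2f}%  exact={exact:.2f}%")
```

For small coefficients the two readings nearly coincide; for larger ones the exact conversion gives a noticeably bigger percentage.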

Second, the degree to which each predictor explains the variance of the model is examined. Together, the predictors explain 64.8% of the variance in the response. As Fig. 5 shows, the urban area accounts for the largest share of R² and is the most important predictor. The distance from the city center ranks second, accounting for 20.67% of R², followed by the school district (18.47%), the subway (3.95%), the area (0.68%), and so on.

Fig.5

Pareto chart of the contribution of each predictor to the goodness of fit of the model.

Finally, the assumptions of the logarithmic model are diagnosed. Linearity: The residual fit plot in Fig. 6 shows that the green line is approximately straight, which supports the linearity assumption, so the log-linear model is appropriate for the data set.

Fig.6

Linear test.

Variance homogeneity: The test is still significant (p = 0.0007), but the p-value is much higher than that of the linear model. Fig. 7 shows that the points are still not randomly distributed around the best-fit green curve. Heteroscedasticity therefore remains, but it is alleviated, and the power transformation suggested by the code is 0.95, which is close to 1 and implies that no further transformation is needed. The logarithmic model is therefore adequate.

Fig.7

Homogeneity test of variance.
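The text does not say which routine produced the suggested power of 0.95. One common way to obtain such a value is a spread-level calculation: regress the log of the absolute residuals on the log of the fitted values and take one minus the slope. The sketch below assumes that approach, uses raw rather than studentized residuals for brevity, and runs on synthetic values; the function name is hypothetical.

```python
import math


def suggested_power(fitted, residuals):
    """Spread-level style estimate: 1 - slope of log|residual| on log(fitted).

    A value near 1 suggests no further transformation of the response;
    a value near 0 suggests a log transformation.
    """
    pts = [(math.log(f), math.log(abs(r)))
           for f, r in zip(fitted, residuals) if f > 0 and r != 0]
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts)
    return 1.0 - sxy / sxx


# Synthetic fitted values and residuals, not the study data.
fitted = [float(i) for i in range(1, 51)]
constant_spread = [(-1) ** i * 0.1 for i in range(50)]
growing_spread = [(-1) ** i * 0.1 * f for i, f in enumerate(fitted)]

print(round(suggested_power(fitted, constant_spread), 3))
print(round(suggested_power(fitted, growing_spread), 3))
```

Under this reading, a value such as 0.95 sits close to the constant-spread case, which matches the conclusion that the logarithmic model needs no additional transformation.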

The analysis shows that the generalized additive model works better than the linear model. Using the full data set, the generalized additive model fitted with the mgcv package has the form:

$Y = β_{0} + \sum_{j = 1}^{9} β_{j} X_{j} + \sum_{j = 10}^{13} f_{j} (X_{j}) + ε$ (14)

The t tests for the first nine predictors of the generalized additive model are highly significant at the 0.001 level, and the F tests for the last four continuous predictors are also highly significant at the 0.001 level. The model has a GCV score of 1.7451, and its degree of fit is higher than that of the linear and logarithmic models (R² = 20.56). Therefore, the expected house price obtained from the generalized additive model is closer to the actual house price than that obtained from the OLS model.
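For orientation, the generalized cross-validation (GCV) score used for Gaussian additive models is usually defined as n·RSS / (n − edf)², where RSS is the residual sum of squares and edf is the effective degrees of freedom of the fit. The sketch below assumes that standard definition; the input numbers are hypothetical and are not taken from the study.

```python
def gcv_score(rss, n, edf):
    """Generalized cross-validation score: n * RSS / (n - edf)^2."""
    if edf >= n:
        raise ValueError("effective degrees of freedom must be below n")
    return n * rss / (n - edf) ** 2


# Hypothetical values: the same residual sum of squares is penalized
# more heavily when the model spends more effective degrees of freedom.
print(gcv_score(rss=100.0, n=10, edf=0))
print(gcv_score(rss=100.0, n=10, edf=5))
```

A lower score is better, so a more flexible smooth is preferred only when it reduces the residual sum of squares by more than the penalty in the denominator costs.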

6 Improving the solvency of real estate enterprises

The prediction error results show that the logarithmic model has a smaller prediction error than the non-logarithmic model. Under either model, there is a significant interaction between certain variables. Both the KNN regression model and the generalized additive model perform better than the OLS model, which indicates that nonlinear models are better suited to house price data.

In terms of goodness of fit, the generalized additive model represents how house prices change with the predictors more faithfully than the OLS linear model, because the relationship between most continuous variables and house prices is not linear. Based on the overall results, the prediction errors and the interpretation of the full model, the logarithmic generalized additive model is the best of the models considered.
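The comparison of prediction errors described above can be illustrated with a minimal sketch. It fits a one-predictor OLS line and a KNN regression to a synthetic nonlinear relationship and compares the root mean squared error on held-out points. The data, the choice k = 2 and the function names are assumptions made for the example; they are not the study's data or settings.

```python
import math


def ols_fit(x, y):
    """Return (intercept, slope) of a one-predictor least-squares fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def knn_predict(train_x, train_y, x0, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def rmse(actual, predicted):
    """Root mean squared prediction error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


# Synthetic nonlinear relationship (y = x^2), not the study data.
train_x = [float(i) for i in range(11)]
train_y = [x ** 2 for x in train_x]
test_x = [i + 0.5 for i in range(10)]
test_y = [x ** 2 for x in test_x]

intercept, slope = ols_fit(train_x, train_y)
ols_rmse = rmse(test_y, [intercept + slope * x for x in test_x])
knn_rmse = rmse(test_y, [knn_predict(train_x, train_y, x, k=2) for x in test_x])

print(f"OLS RMSE: {ols_rmse:.3f}")
print(f"KNN RMSE: {knn_rmse:.3f}")
```

When the underlying relationship is curved, the local averaging of KNN follows it more closely than a single straight line, which is the pattern the prediction error comparison reports.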

At present, real estate enterprises face great financial pressure, so they should take appropriate measures to broaden their financing channels and alleviate capital shortages. First, they should increase equity capital and reduce the asset-liability ratio, which helps to strengthen their ability to withstand various risks and supports their sustainable development. Second, they should strengthen cooperation with construction contractors. Finally, they should make full use of investment trusts as a financing method.

Real estate enterprises should adjust their debt structure by reducing the proportion of short-term loans and advance receipts and increasing the proportion of long-term loans. Long-term liabilities are relatively stable and are repaid only after several accounting years, so they put little pressure on short-term debt repayment. Real estate enterprises should therefore strengthen long-term debt financing, and the funds raised through long-term liabilities can be used to increase fixed assets. Expanding the scale of operation requires long-term capital and is conducive to the healthy and sustainable development of real estate enterprises. Diversified operation can maintain a reasonable cash flow, make up for the poor liquidity of real estate, and help to reduce the asset-liability ratio. Relatively diversified income, such as property rental income, property sales income, hotel operating income and other business income, helps to avoid the financial risks brought by government restraint on the excessive growth of commodity housing prices.

7 Conclusion

In this paper, the generalized additive model, a non-parametric method, is used to study real estate data from a nonlinear perspective. The relevant variables are analyzed on the basis of collected data, and the data source is the second-hand housing data published by a real estate software platform. The data include 8 predictors, of which 4 are discrete variables and 4 are continuous variables (the distance from the city center is calculated from latitude and longitude). The response variable is the price per unit area. Preliminary analysis shows that the linear model does not pass the assumption tests well. Therefore, the response variable is log-transformed for further analysis, and a logarithmic model is established to see how the various predictors explain the variance of the model and to diagnose the model assumptions. The prediction error results show that the logarithmic model has a smaller prediction error than the non-logarithmic model, and that there is a significant interaction between some variables under either model. In terms of goodness of fit, the generalized additive model reflects how house prices change with the predictors better than the OLS linear model, and the relationship between most continuous variables and house prices is not linear. Based on the prediction errors, the interpretation of the full model and the overall results, the logarithmic generalized additive model is the best of the models considered.

Acknowledgments

This paper was supported by the Key Projects for Outstanding Young and Middle-aged Key Talents of Universities and Colleges Visiting at Home and Abroad in 2016 (No. gxfxZD2016131).

References

1. Patel, Shah and Thakkar, et al., Predicting stock and stock price index movement using Trend Deterministic Data Preparation and machine learning techniques, Expert Systems with Applications An International Journal 42(1) (2015), 259–268.

2. Park and Bae J.K., Using machine learning algorithms for housing price prediction: The case of Fairfax County, Virginia housing data, Expert Systems with Applications 42(6) (2015), 2928–2934.

3. Leon M.C. and Jukka, Searching for big data: How incumbents explore a possible adoption of big data technologies, Scandinavian Journal of Management 34(2) (2018), 129–140.

4. Samiya, Xiufeng, Kashish A.S. and Mansaf, A survey on scholarly data: From big data perspective, Information Processing & Management 53(4) (2017), 923–944.

5. Xiao, Dong and Xu, et al., Rational and self-adaptive evolutionary extreme learning machine for electricity price forecast, Memetic Computing 8(3) (2016), 223–233.

6. Dai and Tang, A novel decomposition ensemble model with extended extreme learning machine for crude oil price forecasting, Engineering Applications of Artificial Intelligence 47 (2016), 110–121.

7. Chen, Liu J.H. and Wang, et al., Design and Implement of Operational Rule Base Based on Machine Learning and Association Rule Mining, Applied Mechanics and Materials 734 (2015), 6.

8. Chunming, Xi, Zhikang and Fei, Big data issues in smart grid-A review, Renewable and Sustainable Energy Reviews 79(2) (2017), 1099–1107.

9. Subbu Kalyan P. and Athanasios, Big Data for Context Aware Computing-Perspectives and Challenges, Big Data Research 10(7) (2017), 33–43.

10. Johnson, Price and Khalifa, et al., A method to combine target volume data from 3D and 4D planned thoracic radiotherapy patient cohorts for machine learning applications, Radiotherapy & Oncology Journal of the European Society 126 (2018), 355–360.

11. Pyo, Lee and Cha, et al., Predictability of machine learning techniques to forecast the trends of market index prices: Hypothesis testing for the Korean stock markets, Plos One 12(11) (2017), e0188107.

12. Souillard-Mandar, Davis and Rudin, et al., Learning classification models of cognitive conditions from subtle behaviors in the digital Clock Drawing Test, Machine Learning 102(3) (2016), 393–441.

13. Kitsikoudis, Sidiropoulos and Hrissanthou, Machine Learning Utilization for Bed Load Transport in Gravel-Bed Rivers, Water Resources Management 28(11) (2014), 3727–3743.

14. Gerlein E.A., Mcginnity and Belatreche, et al., Evaluating machine learning classification for financial trading: an empirical approach, Expert Systems with Applications (2016), S0957417416000282.

15. Chandwani and Saluja M.S., Stock Direction Forecasting Techniques: An Empirical Study Combining Machine Learning System with Market Indicators in the Indian Context, International Journal of Computer Applications 92(11) (2014), 8–17.

16. Zhu, Zhang and Huang, A sparse embedding and least variance encoding approach to hashing, IEEE Trans. Image Process 23(9) (2014), 3737–3750.

17. Zhu, Cheng and He, Graph self-representation method for unsupervised feature selection, Neurocomputing 220(7) (2017), 130–137.

18. Aytac and Guran M.C., The relationship between electricity consumption, electricity price and economic growth in Turkey: 1984–2007, Argum Oeconomica 27(2) (2011), 101–123.

19. Zhu, Li and Zhang, Block-row sparse multiview multilabel learning for image classification, IEEE Trans. Cybern 46(2) (2016), 450–461.

20. Chang R.M., Kauffman R.J. and Kwon, Understanding the paradigm shift to computational social science in the presence of big data, Decision Support Systems 63(14) (2014), 67–80.

21. Shi J.T., Liu H.L. and Xu, et al., Chinese Sentiment Classifier Machine Learning Based on Optimized Information Gain Feature Selection, Advanced Materials Research 988 (2014), 511–516.

22. Xiong H.Y. and Zhao, An Image Retrieval Method Based on Machine Learning and SVM, Applied Mechanics and Materials 631–632 (2014), 4.

23. Esteva, Kuprel, Novoa, Ko and Swetter, Dermatologist-level classification of skin cancer with deep neural networks, Nature 542 (2017), 115–118.

24. Fan and Xiao, A framework for knowledge discovery in massive building automation data and its application in building diagnostics, Automation in Construction 50 (2015), 81–90.

25. Liang, Hong and Shen, Occupancy data analytics and prediction: A case study, Build Environ 102(2) (2016), 179–192.

26. Bai J.R., Mu S.G. and Zou G.Z., The Application of Machine Learning to Study Malware Evolution, Applied Mechanics and Materials 530–531 (2014), 875–878.

27. Hierons, Machine learning. Mitchell Tom M. Published by McGraw-Hill, Maidenhead, U.K., International Student Edition, 1997, Software Testing Verification & Reliability 9(3) (2015), 191–193.

28. Azamathulla H.M., Ghani A.A. and Chang C.K., et al., Machine Learning Approach to Predict Sediment Load – A Case Study, CLEAN – Soil Air Water 38(10) (2010), 969–976.

29. Gao and Lei, A new approach for crude oil price prediction based on stream learning, Geoscience Frontiers 8(1) (2017), 183–187.