Abstract
Distributional structured additive regression provides a flexible framework for modelling each parameter of a potentially complex response distribution in dependence of covariates. Structured additive predictors allow for an additive decomposition of covariate effects with non-linear effects and time trends, unit- or cluster-specific heterogeneity, spatial heterogeneity and complex interactions between covariates of different type. Within this framework, we present a simultaneous estimation approach for multiplicative random effects that allow for cluster-specific heterogeneity with respect to the scaling of a covariate′s effect. More specifically, a possibly non-linear function f(z) of a covariate z may be scaled by a multiplicative and possibly spatially correlated cluster-specific random effect (1+αc). Inference is fully Bayesian and is based on highly efficient Markov Chain Monte Carlo (MCMC) algorithms.
We investigate the statistical properties of our approach within extensive simulation experiments for different response distributions. Furthermore, we apply the methodology to German real estate data where we identify significant district-specific scaling factors. According to the deviance information criterion, the models incorporating these factors perform significantly better than standard models without (spatially correlated) random scaling factors.
Keywords
Introduction
Classical regression models, such as generalized linear models (GLMs, see McCullagh and Nelder, 1989), generalized additive models (GAMs, see Hastie and Tibshirani, 1990; Wood, 2006) or structured additive regression models (STAR models, see Brezger and Lang, 2006, or Fahrmeir et al., 2013), assume that the distribution of a response variable y belongs to an exponential family and relate its mean to a number of covariates. A potential dependence of other moments of the response distribution, however, is neglected.
Generalized additive models for location, scale and shape (GAMLSS; see Rigby and Stasinopoulos, 2005 and Stasinopoulos et al., 2017) and its Bayesian version of distributional regression (see Klein et al., 2015) provide a more flexible framework. On the one hand, it is no longer restricted to the exponential family and on the other hand, it allows for modelling each parameter of the response distribution—and not only the mean—in dependence of a set of covariates. An alternative to distributional regression is quantile regression. A comparison between the two approaches with particular focus on real estate data is given in Razen et al. (2017). Here, distributional regression turned out to be favourable compared to quantile regression.
In many applications, the data consists of a number of different clusters. In general, it is not guaranteed that a covariate′s effect on a parameter of the response distribution—be it its mean or another parameter—is homogeneous across these clusters. In real estate data, for example, a frequently observed phenomenon is that the price effects of covariates vary from one spatial unit to another. However, completely different functional forms in each cluster are not common.
In order to deal with this challenge, Wechselberger et al. (2008) suggest the use of cluster-specific random scaling factors. In doing so, one still assumes homogeneity for the functional form of the response function but allows for heterogeneity with respect to its scaling. Lang et al. (2015) and Weber et al. (2015) have successfully applied this approach to store sales models, considerably improving the predictive validity of the models. In a real estate context, Brunauer et al. (2010) introduced spatial scaling factors to account for district-specific heterogeneity in rent prices.
Nevertheless, the present method has several drawbacks. First, the analyses so far are restricted to modelling the mean in Gaussian response distributions, ignoring both alternative response distributions and covariate effects on other moments. Second, due to identifiability reasons, one currently has to assume monotonicity for the response functions of the covariates, which in many applications is not justified a priori. Third, the present method only works for independent scaling factors, ignoring feasible correlations that particularly can be observed in applications with spatial clustering.
The aim of this article is to provide a general framework that allows the use of random scaling factors for arbitrary response functions in applications that go beyond the modelling of the response distribution′s mean. For this purpose, we extend the idea of random scaling factors to distributional regression models and develop identifiability constraints that no longer restrict the response functions to be monotone. Furthermore, we provide an alternative estimation technique for spatially correlated scaling factors based on Markov random fields, see, for example, Besag et al. (1991), Fahrmeir and Lang (2001) or Rue and Held (2005).
We investigate the properties of our approach in comprehensive simulation scenarios for different response distributions. We then apply the method to a German real estate dataset with almost 100 000 observations. We consider and compare different distributional regression models where house-specific attributes flexibly are estimated using P-splines that at the same time are scaled according to correlated, district-specific random scaling factors. The results are compared to standard models without scaling factors.
The remainder of the article is structured as follows: In Section 2, we present the methodology that subsequently is tested in different simulation scenarios in Section 3. Section 4 attends to the real estate data and the model specification before we present the results in Section 5. We conclude in Section 6 with an outlook on future research perspectives.
Methodology
Distributional regression models
Suppose we are given data on n observations in the form
where the functions
Using known basis functions
where
Defining the
In a Bayesian framework, overfitting of a particular function
where
We apply, for example, a Bayesian version of P-splines while modelling a smooth function
The amount of smoothness is governed by the variance parameter
As outlined in the introduction, in many applications, the data is clustered. Real estate data, for example, typically is clustered in spatial units (e.g., districts, states, etc.). Usually, there is no economic reason to assume homogeneous covariate effects across these units. In contrast, different consumer price sensitivities originating from varying levels of income, diverse value of land or different ways of construction suggest spatial heterogeneity in price response. Indeed, it is reasonable to assume the effects to have the same functional form but to vary with respect to the scaling of the function. Thus, in order to account for this kind of heterogeneity, we allow for cluster-specific random scaling factors for some or all of the non-linear functions
where
In the most basic case of independent scaling factors, the random effects
For the variance parameters
A priori, the parameters are not identifiable since there is an arbitrary multiplicative constant for the functions
Typically, the constant
Instead of making assumptions about the coefficients
This assumption is preferable for at least two reasons: First, we no longer need monotonicity constraints for the response functions
Then, the prior for the random effects is of the form (2.2) too.
Spatially correlated scaling factors
If we assume the scaling factors to be spatially correlated, we instead propose the use of a Markov random field prior for the random effects
where
with the precision matrix
For the variance parameter
Then, the prior for the correlated random effects again is a smoothing prior of the form (2.2).
For the sake of illustration, we first consider a Gaussian model with a single predictor for the mean parameter of the form (2.3).
The description of posterior inference is facilitated by rewriting the model equation in matrix notation. We obtain
where
Each of the
An alternative formulation in terms of the scaling parameter vectors
where
The varying coefficient representation (2.6) suggests the following (heuristic) two-stage estimation procedure:
Assume the covariate effects to be homogeneous over all clusters, that is, set the random effects Treat the estimated functions
Conceptually, this two-stage procedure works fine. However, simulations show that ignoring the random effects in stage
The full conditionals of the regression parameters
where
The full conditionals of the scaling factors are derived from (2.6) with
Here, the penalty matrix
In contrast to ‘usual’ varying coefficients terms, the diagonal matrices
Finally, the full conditionals of the variance parameters are inverse-gamma and are given by
We finally arrive at the following:
Update an unconstrained version of Update Update the variance parameters
For most updates of regression parameters
The basic idea of IWLS proposals is to determine an approximation of the full conditional that leads to a Gaussian proposal density with expectation and covariance matrix corresponding to the mode and the curvature at the mode at the current state of the Markov chain.
The resulting Metropolis–Hastings sampler has the advantage that the IWLS proposal automatically adapts to the form of the full conditional and thereby avoids manual tuning. Let
Taylor expansion leads to a Newton′s method type approximation and we can deduce a Gaussian working model, see Klein et al. (2015) for more details. We finally obtain the Gaussian proposal densities
where
Finally, we replace the Gibbs steps of the previous algorithm for updating a particular function
Simulation design
An extensive simulation study for four distributions (Binomial, Gaussian, gamma and Student′s-t) has been carried out. To keep the article in reasonable length, we present exemplary results for models with gamma responses.
The parameters
via the exponential link function. Here,
We additionally scale the random effects
Model specifications
Model specifications
For each model, we generate
Figures 1 and 2 show the average estimates of the effects
For both parameters, the scaled effects almost perfectly can be estimated in the correlated settings even with only
The average estimates of the functions
as well as of the smallest and largest district-specific effects
(solid) and the respective true effects (dashed) for the gamma models
,
,
and
. The first column shows the effects in the uncorrelated setting and the second column those in the correlated setting
The average estimates of the functions
as well as of the smallest and largest district-specific effects
(solid) and the respective true effects (dashed) for the gamma models
,
,
and
. The first column shows the effects in the uncorrelated setting and the second column those in the correlated setting
The average estimates of the functions
as well as of the smallest and largest district-specific effects
(solid) and the respective true effects (dashed) for the gamma models
,
,
and
. The first column shows the effects in the uncorrelated setting and the second column those in the correlated setting
For illustration, Figure 3 shows the geographical maps of the estimated random effects
Maps of the random effects
of the gamma model 2 for the uncorrelated (first row) and the correlated setting (second row). The first column shows the estimated effects, the second column shows the true effects and the third column shows the biases
In order to evaluate the performance of our simultaneous estimation approach, we refer to mean squared errors (MSEs). In each replication of our models, we calculate the MSE for the
We then reestimate all replications using the two-stage estimation procedure sketched in Section 2.3 and again calculate the corresponding MSEs.
For illustration, Figure 4 shows boxplots of the MSEs of the 250 replications in the Gamma models 1–3 in the correlated setting. As we can see, the simultaneous estimation approach (1) yields lower MSEs than the two-stage procedure (2) for all these models. Especially for the
MSEs for the
- and
-parameter of the gamma models in the correlated setting with
from the simultaneous estimation approach (‘1’) and the two-stage procedure (‘2’)
With respect to the variance of the random scaling factors, we find that the performance of the simultaneous estimation approach compared to the two-stage procedure substantially gets better the higher the variance is (results not shown).
The same findings hold for the gamma models in the uncorrelated seeting. The main reason is that ignoring the scaling factors in the first step of the two-stage procedure may lead to peculiar results for the average effect, in particular if the scaling factors have a large variance. Then, the scaled effects estimated in the second step obviously are biased as well.
We apply our methodology to a dataset of almost
Since the raw data initially was supply data, prices obviously were upward biased. Thus, F+B adjusted these prices by a transaction discount estimated from a regression model in order to provide realistic purchase prices. Dividing these adjusted prices by the floor area of the houses leads to the prices per square meter (
Continuous covariates: The floor area of the building (
Categorical covariates: The equipment of the house (
The standard model for explaining house prices is a log-Gaussian model with location parameter
In both models, the predictors
Since we assume the random scaling factors to be spatially correlated, we apply the respective estimation procedure based on Markov random field priors described in Section 2.
The estimation results for the models described in the previous section are based on a final MCMC run with
When analysing the results, one has to be aware that the parameters of the distributions do not exactly correspond to each other. For example, the mean of the gamma model immediately is given by the respective
Effect estimates
For the sake of illustration, we restrict the presentation of the results to the covariates being multiplied by random scaling factors, that is, the floor area, the plot area and the year of construction.
Mean effects
As expected, the mean effect of the floor area on house prices per square meter, averaged over all districts, is monotonically decreasing in both models (see first row of Figure 5, Panel [a]). Here, in order to get an impression of the magnitude of the effects and to make the results comparable, the other continuous covariates are held constant at mean level of attributes and the categorical variables are held at their mode level (which we will call the mean level). As we can see, the results of the two models almost coincide.
With respect to the scaling, we find that the effect of the floor area considerably differs between the districts. Panels (b) and (c) in the first row of Figure 5 show the scaled effects of the different districts for the two models together with the respective average effect. In the gamma model, for example, the effect of the floor area accounts for a variation between
Regarding the geographic distribution, the scaling factors of the mean parameter tend to be higher in Western Germany, while they are considerably lower in Eastern Germany (Panel [a] of Figure 6). In order to verify the significance of these differences, we refer to the posterior probabilities based on a nominal level of
For the plot area, the average mean effect over all districts is monotonically increasing up to an area of
According to these scaling factors, the magnitude of the mean effect of the plot area is much more pronounced in the Southwestern states than in the remaining parts of Germany (Panel [b] of Figure 6). This coincides with the regions that have a higher population density and where land therefore is a scarcer resource. The deviation from the average effect is significant in almost half of the districts, see Panel (e) of Figure 6.
In general, the results for the year of construction show an increasing effect on the mean of the house prices per square meter (third row of Figure 5, Panel[a]). However, there are three periods where the effect is negative: During and shortly after the First World War, construction naturally was cheaper. The same holds for the time of the Second world war. The third period showing a negative effect is the time after the latest financial crisis when the real estate market in Germany was in trouble. Indeed, a recovery started in recent years, but this is not yet covered by our data ending in
With the exception of the surrounding of Berlin, the mean effect of the year of construction is more pronounced in Eastern Germany than in most parts of Western Germany, see Panel (c) of Figure 6. The main reason is that the construction industry in Eastern Germany was much more affected by the reunification of
Mean effect of the floor area, plot area and year of construction evaluated at the mean level. [a]: Average effect over all districts for the two different models. [b], [c]: Scaled effects of the individual districts for the two models together with the respective average effect
Mean effect of the floor area, plot area and year of construction evaluated at the mean level. [a]: Average effect over all districts for the two different models. [b], [c]: Scaled effects of the individual districts for the two models together with the respective average effect
Random scaling factors and
% posterior probabilities of the floor area
(Panels [a] and [d]), plot area (Panels [b] and [e]) and year of construction (Panels [c] and [f]) for the mean parameter in the gamma model
For the shape parameter
For the shape parameter of the gamma model, the average marginal effect of the plot area is slightly decreasing, see Panel (b) of Figure 7. Again, there are considerable differences in the scaling of the effect over the districts with the corresponding scaling factors ranging from
Except for the last few years, the effect of the year of construction for the shape parameter in the gamma model looks similar to the one for the mean parameter (Panel [c] of Figure 7). The corresponding scaling factors range from
The effect is more pronounced in Central and Eastern Germany than in the Western parts, see Panels (c) and (e) of Figure 8, with significant differences from the average effect in about half of the districts.
Average marginal effect of the floor area, plot area and year of construction for the shape parameter in the gamma model together with simultaneous
% confidence bands
Average marginal effect of the floor area, plot area and year of construction for the shape parameter in the gamma model together with simultaneous
% confidence bands
Random scaling factors and corresponding posterior probabilities of the floor area (Panels [a] and [d]), plot area (Panels [b] and [e]) and year of construction (Panels [c] and [f]) for the shape parameter in the gamma model
In order to evaluate the performance of the models, we refer to the well known deviance information criterion (DIC) of Spiegelhalter et al. (2002), see also Klein et al. (2015), who have shown that the DIC also can be used to discriminate between different types of response distributions in distributional regression. As we can see from Table 2, the DIC of the models with scaling factors are by far lower than the DIC of models without scaling factors, and the gamma distribution clearly outperforms the log-Gaussian model. The findings based on the DIC are further confirmed by proper scoring rules as proposed originally by Gneiting and Raftery (2007) and transferred to distributional regression in Klein et al. (2015). We used the logarithmic score, quadratic score, spherical score and the continuous ranked probability score which again clearly favour the gamma model (results not shown).
DIC of the standard models and the extended models with random scaling factors
DIC of the standard models and the extended models with random scaling factors
This article presents a simultaneous estimation approach for Bayesian distributional regression models with either independent or spatially correlated random scaling factors and provides a comprehensive simulation study showing the accuracy of this approach. A comparison to an ordinary two-stage estimation procedure reveals considerable improvement particularly for models where the variance of the random scaling factors is large.
We apply our methodology to real estate data from Germany and identify district-specific random scaling factors that are significant in up to three-fourth of the districts. The results confirm expected spatial heterogeneity in the covariates’ effects and provide new insights into the valuation of house prices. Furthermore, allowing for district-specific random scaling factors significantly improves the performance of the models with respect to their DIC.
Several directions for further research are conceivable. First, we are already working on providing other spatial smoothing techniques (e.g. Kriging or tensor product splines) for estimating correlated scaling factors. Second, we aim to implement structured additive predictors for the scaling factors themselves, which would allow us to explain them, for example, by cluster-specific covariates. Third, in the application, we have obtained negative scaling factors for some districts. Notwithstanding that they have occurred only occasionally, negative scaling factors seem rather unlikely in most applications from an economic perspective. Thus, we intend to provide alternative scaling factors that are positive by default, for example by defining them by
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
This work was supported by funds of the Oesterreichische Nationalbank (Oesterreichische Nationalbank, Anniversary Fund, project number: 15309).
