Abstract
Residential choice behaviour is a complex process underpinned by both housing market restrictions and individual preferences, which are partly conscious and partly tacit knowledge. Due to several limitations, common survey methods cannot sufficiently tap into such tacit knowledge. Thus, this paper introduces an advanced knowledge elicitation process called SilverKnETs and combines it with data mining using random forests to elicit and operationalize this type of knowledge. For the application case of the city of Leipzig, Germany, our findings indicate that rent, location and type of housing form the three predictors strongly influencing the decision making in residential choices. Other explanatory variables appear to have a much lower influence. Random forests have proven to be a promising tool for the prediction of residential choices, although the design and scope of the study govern the explanatory power of these models.
Introduction
Managing complex human–environmental interactions requires understanding and knowledge about the dynamics of these systems. To do so, increasingly huge amounts of data are used, creating numerous opportunities and challenges for analysis (Einav and Levin, 2014). Data mining is becoming a focal point of research to cope with such large data sets. Data mining can be understood as: (i) a process of data-driven identification of novel patterns in data, i.e., knowledge discovery; and (ii) the elicitation of predictive models, i.e., knowledge formalization (Rokach and Maimon, 2015; Witten et al., 2011).
The use of big data has considerable potential for studying human–environmental interactions. However, analysing human behaviour with big data – according to Bharwani et al. (2015) a challenging task in complex, high- and multi-dimensional settings – has the drawback that a multitude of factors could potentially influence an actor’s decision, whilst information on the drivers underlying this decision may not be included in the data. Thus, it is hard to identify the reasoning behind a decision, e.g., in terms of individual preferences, choices and their underpinnings. These underpinnings are commonly referred to as tacit knowledge, which is unconscious, abstract, tied to personal experiences and often unvoiced (Raymond et al., 2010). Being implicit to individuals, tacit knowledge requires formalization to be put into operation.
Experimental settings have been used in the past in various fields such as psychology and economics to control for attributes and confounding factors. Methods such as choice experiments or conjoint analysis have also gained popularity in the study of human–environmental systems (Bartkowski et al., 2015). They do not explicitly ask for reasons underlying a decision, but rather infer them from choices made in hypothetical situations. Choice experiments thus derive stated preferences (SP) by asking participants to choose their preferred option from a given set of alternatives that are described with certain attributes (Gustafsson et al., 2001). This is in stark contrast to more common survey methods such as (semi-)structured interviews, surveys or online voting tools, which ask respondents for their motivation or preferences directly, e.g., based on Likert scale evaluation, thus requiring participants to verbally describe or quantify their preferences, implying that respondents are acutely aware of their motivations and attitudes that guide their decisions. In the case of unarticulated tacit knowledge, these requirements are typically not met (Raymond et al., 2010).
SP approaches are also widely used for modelling residential choice (Timmermans et al., 1992). McFadden (1978) described the residential choice issue as one of maximizing utility, where each alternative is described by attributes such as accessibility to work, shopping, and schools, neighbourhood quality, availability of public services, costs, number of rooms or provided appliances and where individuals’ perceptions of dwelling units impact their decision. Kim et al. (2003) highlight the importance of accessibility and transport-related attributes for residential location choice. Bhat and Guo (2004) assessed the impact of residential zones’ characteristics (size, population, density, etc.). Walker and Li (2007) studied the role of lifestyles in choice behaviour by using latent class choice modelling, and Tu et al. (2016) estimated the impact of urban green space availability. Stokenberga (2019) quantified the importance of informal support networks, whereas Ibraimovic and Masiero (2012) studied the effects of ethnic neighbourhood composition and ethnic preferences. Yu et al. (2017) analysed residential mobility as intertemporal location choices, i.e., changes in location over time due to dependencies of manifold time-invariant and time-variant factors.
Clearly, residential choice interacts in a complex way with various (spatial) determinants. It can be considered as a decision with long-term consequences that draws on a complex set of criteria including stated and tacit preferences and constraints. Furthermore, residential choice affects, directly and indirectly, market developments, the built environment, networks, urban form and land-use (Sener et al., 2011). Consequently, for stakeholders, residential choice modelling is an important tool for urban, transport and land-use planning (Kim et al., 2003), first and foremost for steering actual and future housing demand and supply (Stokenberga, 2019). However, the complexity of interactions in residential choice is difficult to conceptualize, especially as preferences may undergo rapid alterations.
Combining SP choice experiments with methods of statistical and/or machine learning allows patterns of tacit knowledge to be explored with the help of rigorous statistical techniques. Combining these two approaches is the core element of a computer-aided knowledge elicitation approach (KnETs) described by Bharwani (2006). Conceptually, the KnETs process presents a participatory knowledge elicitation approach, complementary to more conventional conjoint analysis that relies on choice experiments to identify the criteria motivating human decisions, and to uncover unvoiced prioritizations or unconscious evaluations (Bharwani, 2006).
The KnETs process includes the four principal stages following Wood and Ford (1993) and Wooten and Rowley (1995) as described in Figure S1 that combine qualitative and quantitative methods and result in a formal knowledge representation (Bharwani, 2006; Bharwani et al., 2015). KnETs uses a controlled, structured, interactive and iterative interviewing method that results in a ‘game’ being played, so that the factors that can potentially influence a decision are constrained and large datasets can be gathered. The KnETs approach has been used in a variety of case studies including crop choice modelling in South African communities (Bharwani et al., 2005) and Cameroon (Bharwani et al., 2015) and the EU FP6 NeWater project on adaptive river basin management (Kemp-Benedict et al., 2010).
The study presented in this paper advances the KnETs approach in two ways. Technically, it proposes SilverKnETs, a new software tool for conducting computer-aided, iterative interviews that builds upon experiences with KnETs. Methodologically, it uses random forests (RFs) as an alternative machine-learning approach for the statistical analysis of interview outcomes. Both advances are showcased in a case study of residential choice behaviour. As described above, residential housing choice, particularly in urban space, is a prime example of dynamic and complex decision-making underpinned by personal preferences intertwined with social, economic, environmental and spatial aspects and interactions. Consequently, to enable planners to make informed decisions, the relevant preferences, motivations and criteria underpinning the residential choices of market participants need to be identified and formalized. In this regard, data mining, and RFs in particular, are considered a major methodological opportunity.
Advancing KnETs in the form of SilverKnETs
Looking at Figure S1, the KnETs process is strongly dependent on software tool support in two contexts, i.e., the interview context and the data mining context. Originally, a Java-based survey tool developed by Michael Fischer, University of Kent, UK, provided this tool support (sourceforge.net/projects/knets/). Several limitations can be identified for this original version of KnETs, including: (i) missing capabilities for the generation of random values; (ii) lack of internal means of scenario validation; (iii) lack of data models to distinguish between views – i.e., what is shown to the participant on screen, also in light of localization aspects – and export, i.e., how data are encoded internally for the facilitation of data mining; (iv) lack of conjoint interviews; and (v) lack of web-based capabilities to conduct interviews remotely and in an unsupervised manner. Commercial products exist that overcome these limitations, but at potentially prohibitive licensing costs, which motivated a novel adaptation of KnETs.
SilverKnETs is developed to be freely available and to address most of KnETs’ limitations. It features conjoint interviews and an internal scenario validation engine, which allows setting certain restrictions on the combinations of predictor values during scenario generation to include presumptions, limitations or hypotheses on the problem domain at hand. SilverKnETs also separates the view model from the data model, so that the way information is presented on the computer screen is independent from the way values are recorded for the subsequent application of data mining techniques, thereby enabling user-centric, localized surveys. SilverKnETs follows a loose-coupling approach, thus enabling the export of data into a standardized file format so that knowledge discovery and formalization can be conducted using any statistical software.
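The separation of view model and data model described above can be illustrated with a minimal sketch. It is not taken from the SilverKnETs code base: the class `PredictorLevel`, the function `export_row`, the locale keys and the German labels are hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class PredictorLevel:
    """One level of a predictor (hypothetical structure, for illustration)."""
    code: str     # data model: the value written to the export file
    labels: dict  # view model: locale -> text shown to the participant

    def display(self, locale, fallback="en"):
        # Fall back to a default locale if no translation is available.
        return self.labels.get(locale, self.labels[fallback])


# Renovation states use the abbreviations NR/PR/FR from the paper;
# the on-screen labels are assumptions.
RENOVATION = [
    PredictorLevel("NR", {"en": "not renovated", "de": "unsaniert"}),
    PredictorLevel("PR", {"en": "partially renovated", "de": "teilsaniert"}),
    PredictorLevel("FR", {"en": "fully renovated", "de": "vollsaniert"}),
]


def export_row(scenario, accepted):
    """Encode a scenario for data mining, independent of the display language."""
    row = {name: level.code for name, level in scenario.items()}
    row["accepted"] = "yes" if accepted else "no"
    return row


scenario = {"renovation": RENOVATION[2]}
shown = scenario["renovation"].display("de")   # text presented on screen
row = export_row(scenario, accepted=False)     # record kept for analysis
```

Because only the codes reach the export, the same analysis script could be reused for surveys conducted in different languages.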
Application of SilverKnETs to elicit residential housing choice behaviour in the city of Leipzig
In this case study, SilverKnETs is used to elicit the (tacit) knowledge driving residential choices within the city of Leipzig, Germany. Leipzig represents a highly dynamic housing market with continuously high residential mobility (Welz et al., 2014; Wolff et al., 2016). A phase of considerable shrinkage from the 1960s until about the end of the 1990s was followed by a period of stabilization, succeeded by the onset of dynamic growth of 2% p.a. from 2010 onwards (Haase and Rink, 2015; Wolff and Haase, 2015). These rapidly shifting population dynamics make it difficult for planners to estimate future residential demand in the city. Various questionnaire surveys and qualitative interviews were used repeatedly to gain knowledge on housing choices and migration behaviour in Leipzig (Grossmann et al., 2015; Haase et al., 2012a; Stadt Leipzig et al., 2016). However, as outlined above, these common survey methods may not sufficiently elicit underlying (tacit) preferences. Analysing the outcomes of this decision-making process in the form of patterns and changes in the housing market or net-migration flows using census data and municipal statistics may also omit factors of relevance. Consequently, the advanced KnETs process in the form of the SilverKnETs tool is a reasonable choice for this case.
Preparatory domain exploration
Table 1. Housing, neighbourhood and household attributes (predictors) with corresponding domain values.
GDR: German Democratic Republic; NR: not renovated; PR: partially renovated; FR: fully renovated.
Denotes class limits.
As shown in previous studies, the aforementioned attributes tend to interact with household characteristics, i.e., socio-demographic factors (Ettema, 2010). Consequently, predictor importance and individual preferences may vary considerably, e.g., between different groups of income or age (Angelini and Laferrère, 2012; Kim et al., 2003; Park and Kim, 2016), due to internal household dynamics (López-Ospina et al., 2016), or over time based on experiences, life plans, life events and (external) shocks (Bajari et al., 2013; Clark and Huang, 2003; Yu et al., 2017). To reflect on this heterogeneity in preferences, the household attributes income, employment status, qualification and age were also included in the case study (Table 1).
Data sampling using interactive interviews
The data sampling was carried out in the form of interactive interviews within a two-week period in March 2015 at various inner-city locations in Leipzig including the city centre, the university campus, residential areas of different type and large shopping malls. Additionally, a sample consisting of residents who recently changed place of residence was collected in the various local city offices of Leipzig where inhabitants register. The selection of sampling locations is based on former interview experiences with the aim of receiving a certain rate of response to allow for further analysis (Welz et al., 2014, 2017). In the interviews, scenarios were generated iteratively, and respondents had to accept or decline each alternative (Figure S2). Each scenario represents a potential apartment similar to an advert, created by randomly drawing factors from the set of domain-specific values for each predictor (Table 1), and controlled for by SilverKnETs to eliminate non-representative options.
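The scenario generation described above, i.e., random draws from each predictor's domain combined with a validation step, could look roughly as follows. This is a sketch under stated assumptions: the domain values, the two plausibility rules and the function names are hypothetical and do not reproduce Table 1 or the actual SilverKnETs validation engine.

```python
import random

# Hypothetical domain values; the real ones are listed in Table 1 of the paper.
DOMAIN = {
    "rent": ["low", "medium", "high"],
    "location": ["Centre", "South", "Grünau"],
    "house_type": ["Wilhelminian-style", "GDR housing estate", "detached"],
    "renovation": ["NR", "PR", "FR"],
}


def is_plausible(scenario):
    """Hypothetical validation rules that reject unrealistic combinations."""
    if scenario["house_type"] == "detached" and scenario["location"] == "Centre":
        return False
    if scenario["renovation"] == "NR" and scenario["rent"] == "high":
        return False
    return True


def generate_scenario(rng, max_tries=1000):
    """Draw one value per predictor at random; redraw until the rules are met."""
    for _ in range(max_tries):
        scenario = {name: rng.choice(values) for name, values in DOMAIN.items()}
        if is_plausible(scenario):
            return scenario
    raise RuntimeError("no valid scenario found; the rules may be too strict")


rng = random.Random(42)  # fixed seed so the sketch is reproducible
scenarios = [generate_scenario(rng) for _ in range(30)]
```

Rejection sampling of this kind keeps the draws uniform over the set of admissible combinations, at the cost of redrawing whenever a rule is violated.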
Knowledge formalization and evaluation
Predictive data mining models for classification and/or prediction tasks include neural networks, Bayesian networks, support vector machines, single classification and regression trees (CART) as well as RFs, i.e., ensembles of CART (Breiman, 2001; Lausch et al., 2015; Wright et al., 2016). These methods are typically ascribed to supervised learning methods, i.e., they rely on a pre-specified target attribute that should be predicted by a set of independent predictors (Hastie et al., 2009; Rokach and Maimon, 2015). RF classifiers are widely used in research, such as in remote sensing, and were found to outperform CART and common regression methods (Antipov and Pokryshevskaya, 2012; Belgiu and Drăguţ, 2016; Rodriguez-Galiano et al., 2012). Thus, RFs promise to be a competitive and efficient machine-learning approach. Other advantages of RF are commonly seen as: (i) their model-agnostic nature; (ii) handling of large numbers of mixed – qualitative (categorical) and quantitative – predictors; (iii) their robustness to outliers and (iv) their capability of effectively dealing with very large data sets. Further benefits include their internal error estimate (out-of-bag error, OOB), and their internal estimate of variable importance (Breiman, 2001; Rodriguez-Galiano et al., 2012).
Due to this perceived superiority, this case study uses RF to evaluate their performance in residential choice prediction. In this regard, the presented case study seeks to improve KnETs methodically, with the latter relying on CART (Bharwani et al., 2015). To the knowledge of the authors, RFs have rarely been used in the residential choice context. Instead, regression methods – e.g., multinomial logit regression and nested logit regression models – are more commonly used (Yates and Mackay, 2006). Antipov and Pokryshevskaya (2012) have used RF to determine the importance of mixed predictors on housing prices in St. Petersburg, Russia, where RF outperformed other methods such as CART, neural networks, or multiple regression analysis.
Results
RF models
In the following, the performance of RF models that include only housing and neighbourhood attributes is compared with models that additionally account for heterogeneity in residential choice by including the household attributes listed in Table 1. To build the former models, a total of 7712 scenarios were used that were sampled from 199 individual respondents; the median number of scenarios per respondent is 30. To build the latter models, only 7450 scenarios were used, excluding 262 cases from 21 unique respondents due to entirely missing household attributes.
RF generation was carried out in the R statistics software using packages ‘randomForest’ (Liaw and Wiener, 2002) and ‘randomForestSRC’ (Ishwaran et al., 2008). For RF training, cases were randomly split into a training data set (80%) and a test data set (20%). The majority class, i.e., no cases, indicating the rejection of a scenario, outnumbers the minority class, i.e., the class of interest, in a ratio of approximately 12:1, with 7144 no cases to 586 yes cases. To deal with this imbalance, as suggested by Chen et al. (2004), a downsampling approach has been used on the majority class in the training set. All RFs were grown using 600 trees, a number deemed sufficient to let the OOB error converge towards the estimated true prediction error.
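The data preparation described above can be sketched in a few lines. The sketch is written in Python rather than R and assumes that the majority class is downsampled once, to the size of the minority class, before the forest is grown; the paper does not state whether downsampling was done once or per tree. The function name and the placeholder cases are hypothetical; only the class counts (7144 no, 586 yes) come from the text.

```python
import random


def split_and_downsample(cases, label_of, train_share=0.8, seed=1):
    """Random 80/20 split, then downsample the majority class in the training set."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    train, test = shuffled[:cut], shuffled[cut:]

    yes = [c for c in train if label_of(c) == "yes"]
    no = [c for c in train if label_of(c) == "no"]
    minority, majority = (yes, no) if len(yes) <= len(no) else (no, yes)

    # Keep all minority cases and an equally sized random subset of the majority.
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced, test  # the test set keeps its original class imbalance


# Placeholder cases carrying only an id and the recorded choice.
cases = [{"id": i, "choice": "no"} for i in range(7144)]
cases += [{"id": 7144 + i, "choice": "yes"} for i in range(586)]

train, test = split_and_downsample(cases, lambda c: c["choice"])
n_yes = sum(1 for c in train if c["choice"] == "yes")
n_no = sum(1 for c in train if c["choice"] == "no")
```

Leaving the test set untouched matters here: accuracy measures computed on it then reflect the imbalance that a model would face in practice.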
Variable importance
Table 2. Comparison of predictor importance.
RandomForest model, importance given as relative score in relation to most important predictor.
RandomForestSRC model.
Numbers in square brackets indicate the rank of the variable, with 1 being the most important.
Predictor interactions
In the following, using randomForestSRC, the probability for the prediction of a specific outcome (class) is investigated more closely for the three most-important predictors, i.e., rent, location and house type. For this, we employ the ensemble class probability p. Here, ensemble class probability refers to the predicted probability for a given class for a covariate of interest
Figure 1 shows this variance of p for the prediction of the minority class.
Figure 1. Variance of predicted ensemble class probabilities for the minority class for the predictors rent (left), location (centre), and house type (right), and corresponding uncertainty, indicated by the proportions of TN and FP (negative reference outcomes), and FN and TP (positive reference outcomes). (a) Ensemble class probability of the randomForestSRC model excluding household attributes. (b) Uncertainty of randomForestSRC model excluding household attributes. (c) Ensemble class probability of randomForestSRC model including household attributes. (d) Uncertainty of randomForestSRC model including household attributes. (e) Ensemble class probability of randomForestSRC model including household attributes, overlaid with the ensemble probabilities broken down by income group as the most-important sociodemographic predictor. TN: true negative; FP: false positive; FN: false negative; TP: true positive; NR: not renovated; PR: partially renovated; FR: fully renovated.
Apparently, the interquartile range (IQR) differs substantially across factors. A higher IQR indicates a greater variance of
Here, possible interaction effects, i.e., the effect of a given predictor level on the preference of another, become apparent, e.g., (un-)preferred house types being associated with certain (un-)preferred locations. Grünau, e.g., is a large German Democratic Republic (GDR) housing estate built during the 1970s and 1980s, with a negative image (Grossmann et al., 2015).
High IQR could also indicate heterogeneous decision-making, where predictors have a strong influence on the individual decision outcomes, and where not only housing and neighbourhood attributes interact with each other, but where residential choice is further underpinned by household attributes. Hence, including household attributes as additional predictors may improve the predictive performance by decreasing the variance of p. Figure 1(c) visualizes the respective ensemble class probabilities for the corresponding randomForestSRC model. For most factors, IQR has been reduced, and the median probabilities
Model accuracy and uncertainty analysis
Table 3. Comparison of random forest (RF) accuracy measures.
OOB: out-of-bag error.
Looking at Table 3, the OOB error is similar across all RF models. This is also true for the success rate and specificity. Recall is similar to the success rate for models excluding household attributes, but slightly higher for models including them. Precision is comparatively low for all models, which indicates that TP are contrasted by a considerable number of FP, giving rise to uncertainty regarding positive predictions being made. Similar to recall, including household attributes slightly improves the model performance.
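For reference, the accuracy measures discussed above can be derived from the four cells of a confusion matrix as sketched below. The sketch assumes that the success rate denotes overall accuracy, which this section does not define explicitly; the counts in the usage example are made up and are not results from Table 3.

```python
def accuracy_measures(tp, fp, tn, fn):
    """Derive common accuracy measures from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Avoid division by zero when a class is never observed or predicted.
        return numerator / denominator if denominator else float("nan")

    return {
        # Share of all cases predicted correctly (assumed meaning of success rate).
        "success_rate": ratio(tp + tn, tp + fp + tn + fn),
        # Share of actual acceptances that the model finds.
        "recall": ratio(tp, tp + fn),
        # Share of actual rejections that the model finds.
        "specificity": ratio(tn, tn + fp),
        # Share of predicted acceptances that are correct.
        "precision": ratio(tp, tp + fp),
    }


# Made-up counts: many false positives pull precision below recall.
measures = accuracy_measures(tp=20, fp=60, tn=100, fn=20)
```

With these illustrative counts, precision is 20 / 80 = 0.25 while recall is 20 / 40 = 0.5, the same qualitative pattern of low precision caused by FP that the text describes.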
The previously introduced accuracy measures are subsequently also determined on a per-factor basis to uncover factor levels for which uncertainty is particularly high. This is done by recording TP, FP, TN and FN for each covariate of interest, using 100 iteratively grown randomForestSRC models. Figure 1 visualizes the resulting mean fractions of TP and FN – both corresponding to true positive reference outcomes in the test sets – as well as TN and FP, both representing true negative reference outcomes, for the most-important predictors.
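The per-factor bookkeeping described above could be implemented along the following lines. This is a sketch for a single set of test predictions; in the study, the fractions are averaged over 100 models. The record layout (scenario, reference outcome, predicted outcome) and the function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def outcome(reference, predicted):
    """Classify one prediction as TP, FN, FP or TN ('yes' is the positive class)."""
    if reference == "yes":
        return "TP" if predicted == "yes" else "FN"
    return "FP" if predicted == "yes" else "TN"


def per_level_fractions(records, predictor):
    """Fractions of TP, FN, TN and FP for each level of one predictor."""
    counts = defaultdict(Counter)
    for scenario, reference, predicted in records:
        counts[scenario[predictor]][outcome(reference, predicted)] += 1

    fractions = {}
    for level, tally in counts.items():
        total = sum(tally.values())
        fractions[level] = {k: tally[k] / total for k in ("TP", "FN", "TN", "FP")}
    return fractions


# Made-up test records: (scenario, reference outcome, predicted outcome).
records = [
    ({"rent": "low"}, "yes", "yes"),
    ({"rent": "low"}, "no", "yes"),
    ({"rent": "low"}, "no", "no"),
    ({"rent": "low"}, "yes", "no"),
    ({"rent": "high"}, "no", "no"),
    ({"rent": "high"}, "no", "no"),
]
by_rent = per_level_fractions(records, "rent")
```

Levels with a large FP share, such as the low rent class in this made-up example, are the ones for which positive predictions are least reliable.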
As discussed above, uncertainty results from false predictions, i.e., FP and FN. In Figure 1, FPs are shown in light red, and FN in light green. Looking at Figure 1(b) and 1(d), the share of FN is comparatively low across all factors and models, thus rendering FN the minor contributor to uncertainty. The number of FP is clearly higher, especially for lower rents, locations in the Centre and the South of Leipzig, and fully-renovated detached as well as fully or partially-renovated Wilhelminian-style houses. This is particularly true for RF models not considering household attributes as shown in Figure 1(b). In comparison, looking at Figure 1(d), models including household attributes seem to perform better especially for lower rents and central location. For these factors, the number of FP has clearly decreased.
Discussion and conclusions
Technical-methodical KnETs advancement
This paper has showcased an advanced KnETs approach called SilverKnETs for residential choice modelling in the city of Leipzig, Germany. The SilverKnETs tool as a technical KnETs advancement has proven to be a robust data collection instrument. Using RF as a methodological KnETs advancement, we were able to predict residential choice and assess predictor importance based on either MDA or the mean minimal depth. Using ensemble class probabilities, we were further able to uncover factors that likely lead to positive (‘pull’) or negative (‘push’) residential choices as well as identifying likely interactions between these factors. Hence, the approach allows dependencies to be detected between variables, e.g., under which conditions factors such as high rents or disadvantaged locations become more acceptable.
However, we believe that not only conditions perceived as attractive are of relevance to planners; knowledge about conditions that make residential locations undesirable, thus reinforcing their unattractiveness, is also valuable information for stakeholders (Stadt Leipzig et al., 2016). The identification of factors deemed (un-)preferred and (un-)important could provide valuable insights into how to develop the housing market to cater to public demand, e.g., in the context of neighbourhood developments and housing stock planning.
Clearly, the proposed KnETs adaptation is not a completely new methodology per se, but is meant to complement established methodologies such as choice-based conjoint analysis (CBC), an increasingly popular method for the elicitation of preferences in environmental and conservation issues (Alriksson and Öberg, 2008). Consequently, there are many similarities. Both methods can reveal the preference structure of individuals or groups. Likewise, CBC is also able to reveal the (relative) importance of attributes and predictors (Alriksson and Öberg, 2008; Brett Hauber et al., 2016). However, a clear and considerable advantage of tree-based methods is their ability to visualize the hierarchical structure of decision-making, i.e., the rules that a decision follows on a step-by-step basis (Figure S3).
Revisiting predictors and preference heterogeneity
This case study has been based on a limited, expert-elicited set of explanatory variables. From this set of variables, rent, location and house type were consistently identified as the most-important predictors across all RF models. This finding confirms questionnaire surveys undertaken in Leipzig by Welz et al. (2014, 2017), which concluded that housing and neighbourhood attributes appear to be more important than household attributes. Regarding the remaining predictors, there is no clear-cut order of importance. Following Strobl et al. (2008), this may also be due to interactions and complex correlations, which may affect the estimation of variable importance (Hothorn et al., 2006).
Regarding the re-evaluation of predictors, transport-related criteria, e.g., distance to the city centre, specific institutions or well-known landmarks, could possibly be included. Heterogeneity in preferences in residential choice should additionally be considered (Ettema, 2010; Walker and Li, 2007). It could be shown that by including household attributes, the variance of predicted ensemble class probabilities could be reduced, consequently increasing the predictive performance of the RF models. However, looking at Table 3, the observed increase in precision is small and possibly lower than anticipated. Recall in particular seems to have benefitted from including household attributes, i.e., the models were more successful in identifying TP.
In the context of the presented case study, particularly if there is a focus on predicting rather than explaining residential choice, the elicitation of household attributes might thus be re-considered. Instead, predictors could be elicited that reflect on past experiences and choices as well as shocks or changes during the household-aging process. As discussed by Bajari et al. (2013), Clark and Huang (2003) or Yu et al. (2017), these aspects may have significant influences on present and future preferences and behaviour.
Nevertheless, household attributes allow for a further differentiation of predicted class probabilities. As exemplified in Figure 1(e), ensemble class probabilities vary – for some cases considerably – per category of income. However, overall, income is not a very important predictor, and consequently, it does not seem to dominate residential choice. We attribute this to the fact that the housing market in Leipzig at the time of the survey was less tense and polarized and thus more accessible to most households compared to, e.g., London or Paris. We expect that a stronger polarization of the housing market would clearly be reflected in the importance ranking of the predictors, which could be a suitable avenue for further research. As discussed by Walker and Li (2007), lifestyles might be another crucial factor for the explanation of some of the observed patterns. For example, looking at the median ensemble class probabilities shown in Figure 1(e), left, it becomes clear that the highest rent class is ‘preferred’ to almost the same extent in both the lowest and the highest category of income. For the former income category, this might be explained by lifestyle choices such as flat sharing.
Reflections on the performance and limitations of RF
From the assessment of the RF models, it became obvious that neither model is superior. We have shown that the precision of predicting the minority class is, depending on the model, rather low, at 20–26%. This is mainly due to a high number of FP. RFs appear to overestimate the impact of favourable conditions such as low rent, resulting in overly ‘optimistic’ predictions and thus rather high uncertainty. To decrease this uncertainty, the number of FP could be reduced by (i) penalizing false predictions; (ii) re-evaluating predictor selection; and (iii) increasing sample size.
Comparing the elicited RF models to a binary logistic regression (Table S4), it can be concluded that the precision of both types of models is of comparable magnitude, but that recall is higher for RF (Table S5). This also applies to a comparison with CART, where the precision of RF is only slightly higher, but recall considerably so (Table S6). It is noteworthy that there is a trade-off between recall and precision, where maximizing either measure degrades the other (Rokach and Maimon, 2015). Hence, highly precise models tend to miss positive decisions, whereas precision decreases with increasing recall. This limiting trade-off requires optimization of a model on a case-by-case basis.
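The trade-off between recall and precision can be made tangible with a small sketch that varies the probability threshold above which a scenario is predicted as accepted. It assumes that class predictions are obtained by thresholding a predicted class probability; the paper does not report any threshold tuning, and the scored cases below are made up.

```python
def precision_recall_at(threshold, scored):
    """Precision and recall when predicting 'yes' for probabilities >= threshold."""
    tp = sum(1 for p, ref in scored if p >= threshold and ref == "yes")
    fp = sum(1 for p, ref in scored if p >= threshold and ref == "no")
    fn = sum(1 for p, ref in scored if p < threshold and ref == "yes")
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


# Made-up (predicted probability of 'yes', reference outcome) pairs.
scored = [
    (0.9, "yes"), (0.8, "yes"), (0.7, "no"), (0.6, "yes"), (0.5, "no"),
    (0.4, "no"), (0.3, "yes"), (0.2, "no"), (0.1, "no"),
]

# A stricter threshold raises precision but lowers recall.
lenient = precision_recall_at(0.25, scored)
moderate = precision_recall_at(0.55, scored)
strict = precision_recall_at(0.75, scored)
```

In this constructed example, precision rises from 4/7 to 0.75 to 1.0 as the threshold becomes stricter, while recall falls from 1.0 to 0.75 to 0.5. With real data the relationship is usually not strictly monotone, which is one reason the optimization has to be done case by case.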
Transferability and re-use of SilverKnETs
The generic nature of SilverKnETs allows for a flexible transfer and re-use of the tool across problem domains and use cases. The approach presented here can generally be used within any domain where stakeholder preferences in a decision-making context need to be elicited. This includes, e.g., resource management (Price et al., 2016) or land-use planning (Vollmer et al., 2016). The different knowledge elicitation games included in SilverKnETs, the integration of a dedicated view model that determines how information is presented to the participants on-screen – e.g., regarding language or level of detail – as well as the flexibility offered by the tool to conduct interviews offline or online are seen to facilitate this reusability and transferability.
Supplemental Material
Supplemental material for Combining tacit knowledge elicitation with the SilverKnETs tool and random forests – The example of residential housing choices in Leipzig by Sebastian Scheuer, Dagmar Haase, Annegret Haase, Nadja Kabisch, Manuel Wolff, Nina Schwarz and Katrin Großmann in Environment and Planning B: Urban Analytics and City Science
Footnotes
Declaration of conflicting interests
The author(s) declare no potential conflict of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Sebastian, Dagmar and Manuel benefited from funding through the FP7 collaborative project GREENSURGE (FP7-ENV.2013.6.2-5-603567), the Horizon 2020 innovation action CONNECTING (COproductioN with NaturE for City Transitioning, Innovation and Governance; No 730222-2) and the AXA research award ‘Models, Metrics and Typologies of Resilient Cities’. Nadja’s work was supported by Green-EquityHEALTH, funded by the German Federal Ministry of Education and Research (BMBF; no. 01LN1705A). This research was carried out as part of the project ENABLE, funded through the 2015-2016 BiodivERsA COFUND call for research proposals, with the national funders The Swedish Research Council for Environment, Agricultural Sciences, and Spatial Planning, Swedish Environmental Protection Agency, German aeronautics and space research centre, National Science Centre (Poland), The Research Council of Norway and the Spanish Ministry of Economy and Competitiveness.
References
