Abstract
Recent years have seen increased interest in the use of alternative data sources in the definition and production of official statistics and indicators for the UN Sustainable Development Goals. In this paper, we consider the application of data science to the production of official statistics, illustrating our perspective through the use of poverty targeting as an application. We show that machine learning can play a central role in the generation of official statistics, combining a variety of types of data (survey, administrative and alternative). We focus on the problem of poverty targeting using the Proxy Means Test in Indonesia, comparing a number of existing statistical and machine learning methods, then introducing new approaches in the spirit of small area estimation that utilize area-level features and data augmentation at the subdistrict level to develop more refined models at the district level, evaluating the methods on three districts in Indonesia on the problem of estimating 2020 per capita household expenditure using data from 2016–2019. The best performing method, XGBoost, is able to reduce inclusion/exclusion errors on the problem of identifying the poorest 40% of the population in comparison to the commonly used Ridge Regression method by between 4.5% and 13.9% in the districts studied.
Introduction
Recent years have seen a push for the use of alternative “big” data sources in the definition and production of statistics and indicators for the UN Sustainable Development Goals. However, being official statistics, textitasis is placed on the reliability and reproducibility of the process to ensure consistent production of the statistics. A variety of data sources, variously named “non-traditional”, “alternative”, “novel” or “big” data, including earth observation/satellite image data, mobile phone/positioning data, supermarket scanner data, social media data, and more recently AIS shipping data (all of which we call “alternative” data sources in this paper) have been employed in efforts to improve the accuracy and reliability of such statistics, and the efficiency and timeliness with which the statistics can be produced.
In early papers such as Florescu et al. [1], Landefeld [2], Tam and Clarke [3], Kitchin [4], the focus is generally on the data sources themselves, and concerns such as access, availability, commercial sensitivity, and fitness for purpose, whilst downplaying the difficulty of adapting existing statistical methodology, despite the cautions expressed by Florescu et al. [1] in their table of features of big data, that statistical methods ‘may not always be applicable’ to big data, and by Struijs et al. [5] that ‘it is difficult to apply traditional statistical methods, based on sampling theory’. Instead of addressing this issue, Kitchin [4] focuses on the more technical (though still important) questions of transferring, storing, cleaning, checking and linking big data, conjoining big data with official statistical datasets, and presents this work as if belonging to a separate data gathering and integration phase undertaken prior to any statistical analysis.
In this paper, we consider the application of data science to the production of official statistics. We adopt the broad perspective that data science (understood as the application of machine learning to practical problems) is not narrowly construed as the development of models largely in an automated fashion, assuming data collection and interpretation are defined independently of analysis and interpretation, but rather constitutes an iterative process of identifying suitable data and data sources, applying statistical and/or machine learning methods to develop models, and understanding, interpreting and validating those models as being fit for purpose. In determining whether some statistic is “fit for purpose”, all aspects of this process are important, from data collection (the sources of the data, and the reliability of the data and of the sources), the methods used in the process (such as preprocessing by cleaning the data, manipulating the data, running statistical/machine learning methods, internal validation), and the reporting of statistics (including interpretation, external validation, explanation and implications for policy). Unduly focusing on one or other of these aspects risks missing the importance of considering the whole process in determining whether data is “fit for purpose” as this, obviously, depends on the purpose. Similarly, asking whether a data source is “reliable” depends on the purpose, for example a collection of supermarket scanner data might be a reliable source of the price and of quantity of sales of various types of goods, but not of the quality of the products sold.
Our thinking builds on the work of Marchetti et al. [6], who propose three ways to utilize “big data” in the context of small area estimation: (i) use big data sources to create local indicators and compare them to those obtained with small area estimation models; (ii) use big data sources to create new covariates for small area models; (iii) use survey data to check and remove the self-selection bias of the values of the indicators obtained using big data. The first approach posits that indicators derived from alternative data sources can be used to replace statistical methods by defining proxies for area-level statistics, and highlights the need for validation of the process. The second approach posits that alternative data sources can be used to augment conventional statistical methods by providing additional features (any machine learning being used only to define such features). The third approach (though not studied in the paper) suggests the need for methodological improvements to ensure that statistics produced via proxies are reliable.
In this paper, we propose another way to use alternative data sources in a context broader than small area estimation, but nevertheless attacking a similar problem (i.e. lack of data in “small” areas). Specifically, we use machine learning methods to replace statistical methods (like the first approach of Marchetti et al. [6]) but at the unit (household) level, both without and with (like the second approach of Marchetti et al. [6]) the use of area-level features and data augmentation from “similar” districts as inputs to the machine learning methods. We consider area-level features derived from both administrative and alternative data sources, and maintain a strong focus on validation of the approach to ensure reliability, which is of necessity empirical, as is standard in machine learning.
We illustrate our perspective through the use of poverty targeting (identification of “poor” households) as an application, showing that machine learning can play a central role in the generation of official statistics, making use of all types of data (survey, administrative and “alternative”). While there are also uses of alternative data sources for developing measures of specific types of poverty, such as energy poverty [7], poverty targeting in general is an especially interesting application of data science because traditional statistical methods perform poorly [8, 9], with typical inclusion and exclusion errors around 30–40%. However, preliminary to our discussion, it is not clear whether the reason for these high error rates is whether the statistical methods used are not capable of representing an accurate model based on the supplied data, or whether the data itself is of insufficient quality that it is difficult for any method (statistical or machine learning) to perform better. We address this question empirically through comparing a variety of standard statistical and machine learning methods.
More specifically, we focus on poverty targeting using the Proxy Means Test (PMT) in Indonesia [10, 11, 12, 13, 14, 15], though, given that the PMT is widely used and in Indonesia follows World Bank guidelines [16, 17, 18, 19, 20], the analysis and results should also be applicable in other countries where this approach has been applied, such as Uganda [21], Bangladesh, Rwanda and Sri Lanka [8, 22], and Honduras and Peru [23]. In Indonesia, implementing the PMT is underpinned by SUSENAS household surveys conducted every 6 months to collate information at the household level about demographics, levels of education, housing and employment status and sector (agriculture, industry, service), and level of food and non-food consumption. The basic problem is to predict, from this data, the levels of household expenditure for the following year, so as to determine those eligible for subsidies in that year. It is important to note that the usual theoretical guarantee of Ordinary Least Squares (OLS) Regression to minimize the mean squared error on the sample/training set, provides no assurance that the OLS model will work well for this problem because what is required is good predictive performance on out of sample/test set instances. Thus we compare two commonly used statistical methods (OLS and Ridge Regression) and a number of more recently developed statistical and machine learning methods with a variety of encodings of the variables/features representing the survey data.
We also consider the question of data quality by evaluating the best statistical and machine learning methods with data augmentation, both “horizontally” and ”vertically”. Here we imagine the data laid out in a table with one row per household and one column per variable/feature, so use the term horizontal data augmentation to mean adding columns (i.e. variables/features) to each row (household), based on alternative data sources, and use the term vertical data augmentation to mean adding rows to the the table (i.e. more households), making the sample/training set larger but with the same variables/features. We also consider both horizontal and vertical data augmentation. The development of new statistical or machine learning methods is left for future work.
The organization of this paper is as follows. We begin with a general discussion of the Proxy Means Test and a precise definition of various prediction problems that we analyse in the paper. In the following section, we compare standard statistical and machine learning methods on these problems, and in the next section, analyse and discuss improvements to the best performing methods using data augmentation. Following this is a survey of related work on poverty mapping.
Poverty targeting and the proxy means test
The problem of poverty targeting is to identify, from a given collection of households, those meeting certain conditions related to household consumption, such as being below an “eligibility threshold”, being below a “poverty line”, or being “very poor” (the bottom 5%) or in the “extreme poor” (bottom 2%). Alternatively, the problem may require computing a ranking of all households in order of wealth/consumption, or an estimation of the wealth/consumption of all households. Typically the approach is to use the Proxy Means Test (PMT) [17], in which income is not measured directly (this is assumed to be unreliable due to misreporting), but rather household consumption expenditure is observed and per capita household expenditure used as the target for estimation. The results of the PMT can be used for poverty mapping, i.e. to determine aggregated area-level statistics such as the Foster-Greer-Thorbecke (FGT) measures [24] of head count ratio (proportion of people below a poverty line), poverty gap (average additional expenditure required for all people in poverty to reach the poverty line), and poverty severity (a measure of the distribution of expenditure amongst the poor) [16, 10], though these aggregated statistics are not of concern in this paper, and we recognize that there are direct methods of estimating these measures using samples weighted inversely to their probability of being chosen in the survey, going back to Fay and Herriot [25], and techniques based on small area estimation [26, 27, 28] and the related World Bank ELL method [29].
Models typically are evaluated using the metrics of inclusion error (the proportion of non-poor households amongst those predicted poor) and exclusion error (the proportion of poor households not predicted poor as a fraction of all poor households). Where the prediction problem is to identify a target set defined as a fixed proportion of households below a given cut off threshold of per capita consumption expenditure (i.e. the bottom N% of households for given N), the inclusion and exclusion errors are the same, because for every household incorrectly included in the target set, there is another household excluded from the target set that should in fact be included (assuming the model predicts a target set of exactly the right size, i.e. N%, which in turn requires that the total number of households is known). The target set is typically the set of households eligible to receive benefits. Thus inclusion error is an indication of households erroneously receiving benefits they presumably do not need, while exclusion error captures those households erroneously predicted to not require benefits, so represent those households that miss out on benefits for which they should be eligible. In Indonesia, this eligibility threshold is set at 40%, which is also approximately the poverty line in the poorest district.
In the Indonesian context, there are 514 models produced, a different model for each kabupaten/kota, called here “districts” for simplicity, though note that Indonesia also has smaller administrative units at the “subdistrict” and village level (which we use below for data augmentation). The same variables are measured in each district for consistency (though as will be discussed below, the usefulness of variables varies across districts). The same type of model is constructed for each district, partly for ease of interpretability and explainability of the results.
Data for constructing the models is collected from a sample of users in each district. Observed variables include demographic information such as (i) the number of adults, older adults, children, preschoolers and infants in the household, the marital status of the head of the household, and the numbers of each gender in the household; (ii) educational attainment of household members such as primary school, secondary school, etc.; (iii) employment status and sector (agriculture, industry, service) of each household member; and other attributes such as (i) household expenditures on food, rent, education, transport, healthcare, electricity, etc.; (ii) ownership of assets such as refrigerators, air conditioners, cars, motorcycles, motorboats, computers, televisions, water heaters, jewellery, and home ownership; (iii) housing quality measured as per capita floor space, the materials used in the roof, wall and floor, and the type of sanitation; (iv) access to services such as the Internet; and (v) the type of drinking water (bottled or untreated) and cooking fuel (gas, kerosene or charcoal). We use categorical variables rather than conversion of categorical variables to binary dummy variables. This results in fewer variables (60 categorical compared to 91 binary variables), and, though statistical methods anecdotally are supposed to perform better with binary variables, that is not the case for this problem.
The specific task is develop a model to estimate the log of the per capita consumption expenditure based on the other variables (thus underlying the statistical methods is the assumption that the log of per capita consumption expenditure is a linear function of the input variables). This estimate can then be used to identify the households in the various target sets. As usual, the use of the logarithm is to remove the “long tail” of high wealth households to make the distribution of expenditure by household more like a normal distribution. Another assumption adopted in Indonesia is that per capita expenditure is a useful target variable, that is, that household expenditure varies linearly with the size of the household. In Europe, in contrast, adults and children are weighted differently in attributing individual consumption to household members [18]. We comment further on this assumption below.
Before proceeding to data analysis and the study of methods, we note some aspects of the way this task is conceptualized which affects the performance of any method on this task. One key idea is that to enable commensurability between households of different characteristics with respect to expenditure, the “ground truth” consumption is adjusted. First, regarding housing, for those households that own their house, an estimate of the cost to rent the same dwelling is imputed based on district-level rents. Second, for transport, for those who own a motorcycle or car, transport costs are estimated with reference to those without those modes of transport. This imputation of data distorts the ground truth when the estimated costs differ from the actual cost of the asset (perhaps amortized over several years). Similarly, regarding healthcare, what is observed in surveys is how much was spent on care, which varies according to individual (hence household) circumstances, i.e. households in good health will have lower healthcare costs than those which do not. When this information is fed into a predictive model, this will produce estimations that incorporate the typical amount spent on healthcare rather than an estimate based on individual circumstances, as no questions in the SUSENAS survey pertain to health, resulting overall in higher variance between similar households. Moreover, healthy households are likely to be estimated wealthier than they actually are, due to imputed expenditure on healthcare that was not incurred, while unhealthy households are likely deemed poorer than they are because their actual healthcare expenditure is likely underestimated. As long as the adjustments are consistent between training and test sets, this is acceptable as far as model building, however policy makers should be aware of these distortions, especially how they contribute to error. This is also perhaps where a multidimensional poverty index would be appropriate, so that healthcare in particular could be removed from this calculation and treated under separate policy. An added complication is that the observed household expenditures are themselves affected by cash subsidies (usually given just before the survey takes place), so to the extent that some households are erroneously receiving, or not receiving, the subsidies, the reliability of the ground truth data will be compromised.
There are further issues with the level of granularity of the imputed values. At bottom, the variables derived from the survey questions need to differentiate households based on expenditure, but when rental costs are considered, it is obvious that rental costs vary according to the location within a district. Put crudely, poorer people live in poorer areas, though cause and effect is not simple, areas are poor because poor people live there, and in turn, poor people live in poor areas because that is what they can afford. In the absence of further subdistrict level features (a current data limitation), predictive models rely on correlations with other variables to capture the poverty level of a subdistrict. But it is not intuitively obvious what those variables are: even where the survey asks questions about asset holding (such as home ownership), the questions make reference to the quality of the asset (such the materials used in the construction) but not the state of repair (e.g. whether the walls are crumbling or there is a leaky roof). Thus in Indonesia, distinguishing poorer areas is reliant on such variables as the type of drinking water consumed (bottled or untreated) or the type of cooking fuel (gas, kerosene or charcoal), variables which are not always reliable, or indicative of poverty.
In addition, asset ownership (such as jewellery, cars, motorbikes, computers) tend to mark a household as in the top 60%, hence these features do not help distinguish between different levels of poverty in the bottom 40%. Sometimes this is incorrect, e.g. if jewellery is inherited. Then again, sometimes asset ownership is a useful predictor, but only in some districts, e.g. owning a boat might be very useful in a fishing village, but boat ownership is unlikely to be a useful indicator in an urban area.
Finally, note that determining a poverty line is complex and beyond the scope of this paper. Ravallion [16] uses an estimate of the costs of basic needs for food and non-food separately. First, the cost of basic food needs is estimated from a standard diet in the district. Then the cost of basic non-food needs is estimated based on a ratio of food to total expenditure, calculated empirically for those households that make a small reduction in food intake in order to meet basic non-food needs; this is presumed to reflect the minimum non-food need. This ratio is then applied to the full cost of basic food needs to determine the overall expenditure required to meet all basic needs. This definition thus avoids defining what the basic non-food needs are, perhaps as this varies over time, as assets assume greater, or lesser, importance. An example is fixed line broadband Internet, which a decade ago might have been essential for access to financial services, however now is superseded by mobile phones.
This definition, in turn, means that poverty is relative to the district, not absolute (based solely on a single expenditure amount, independent of district). But poverty is not relative within a district in the sense that there is no a priori cut off that determines the poverty line. As a result, the prevalence of poverty varies across districts. This type of definition is preferred, even though it has the consequence that a person can move from one region to another (typically from a rural to an urban district) with an increase in income and expenditure, and yet be worse off in relative terms within the second district due to the higher cost of living, as pointed out by Ravallion [16]. In general, the overall income and expenditure of a (sub)population can rise, but the level of poverty can at the same time increase.
Data analysis and baseline methods
In this section, we compare conventional statistical methods with machine learning methods on the problem of poverty targeting, focusing on the task of identifying the bottom 40% of households by per capita expenditure, and using inclusion/exclusion error (IE/EE) and Root Mean Squared Error (RMSE) as the metrics for evaluation. In all experiments reported in this and the next section, the test set is the observed per capita consumption expenditure from 2020, and the training data comes from the period 2016–2019, calibrated to 2020 values using the following procedure: first, the households in each year are ranked according to per capita expenditure; second, for each percentile, the mean per capita household expenditure is calculated; third, all expenditure values from each percentile are adjusted by (multiplied by the inverse of) the ratio of this mean to the corresponding 2020 mean. The effect is to remove fluctuations in the value of the Rupiah from year to year, and, put simply, convert all 2016–2019 consumption data to a common basis of 2020 Rupiah values.
Statistical and machine learning methods (bottom 40% of households, train: calibrated 2016–2019, test: 2020)
Statistical and machine learning methods (bottom 40% of households, train: calibrated 2016–2019, test: 2020)
Distributions of 2020 per capita household expenditure (Brebes and Surabaya).
We focus on three districts of Indonesia: Brebes, Sumba Tengah and Surabaya. Brebes (population 1.8 million) and Sumba Tengah (population 73,000) are rural areas with high levels of poverty (Brebes 19%, Sumba Tengah 36%); Brebes is dependent on crop yields which are susceptible to weather conditions, resulting in high variation in expenditure from year to year; Sumba Tengah has limited access to services such as health and education, so fewer variables are available to discriminate between households. Surabaya (population 2.9 million) is a large urban area with a low poverty rate (5%) but a large number of people below the poverty line. The distributions of per capita household expenditure for Brebes and Surabaya are shown in Fig. 1 (Sumba Tengah is similar to Brebes). These distributions show the difficulty of the task: the 40% cut off is close to the mean value, and there are a large number of households with expenditure around that value and hence with similar values. The graphs also show the expected log-normal distribution.
We briefly comment on the “per capita” assumption that underpins the use of the Proxy Means Test in Indonesia, to determine if the formulation of the task can be refined or the problem broken up into the estimation of food and non-food components. The “per capita” assumption is that consumption expenditure varies linearly with household size; at the same time, we tested the assumption separately on food and non-food expenditure based on the intuition that economies of scale should eventually take effect as households become larger. We looked at the mean value of food, non-food and total consumption for each household size in each of the three districts. In short, the “per capita” assumption does hold for food consumption, and for non-food consumption for households up to size 5, consistently across districts. However non-food consumption flattens, so total consumption increases, but at a slower rate, for households larger than 5. However, there are comparatively few such large households in the datasets, so these households, while difficult to predict, contribute comparatively little to the error of predictive models.
We thus tried to develop models to predict food and non-food separately (since one is linear and one is non-linear, at least for large households), however this does not work as well as predicting total per capita consumption expenditure. This is presumably because there are hidden interactions between food and non-food expenditure; from a conditional probability perspective, knowing something about food expenditure enables better prediction of non-food expenditure, and vice versa.
The results for conventional statistical and machine learning methods are shown in Table 1 (see below for definitions of the hyperparameter settings for the machine learning methods). Consistent with the report by AusAID [8], inclusion/exclusion errors are around 30%, though this is for the two rural areas, with inclusion/exclusion error around 25% in Surabaya (the AusAID figure is nationwide). What is most noticeable, however, is that there is almost no difference between the statistical and the machine learning methods (XGBoost performs slightly better on IE/EE for Sumba Tengah and Surabaya, but not for Brebes).
Distributions of true vs predicted values by decile (Brebes and Surabaya).
Figure 2 shows the distribution of errors for Brebes and Surabaya for XGBoost with categorical features. The
Decision tree branch (Brebes).
Decision tree branch (Sumba Tengah).
We now delve into some of the reasons for the high error rates, motivating the approach taken in the next section on data augmentation. Given that the performance of all the methods is very similar, we conclude that there is a problem with the data itself, and we see (broadly) two basic related problems: quantity and quality, leading to data “gaps”. The problem of data quantity can be illustrated using the Decision Trees in Fig. 3 for Brebes and Fig. 4 for Sumba Tengah. These Decision Trees are trained on the 2016–2019 data and tested on 2020 data the same as the methods in Table 1, and the full Decision Tree (more precisely, Regression Tree) is computed without stopping or pruning.
The figures show one branch in each such tree. Both branches represent data “gaps” due to lack of data quantity, resulting in high IE/EE for households on these branches because the estimated values are close to deciles 4 and 5 in both cases. For Brebes, the branch represents households with 3 or more females, very small house area, no car/motorbike, no water heater, no electronic devices, and no members who have completed school or work in the service sector. The Sumba Tengah branch represents households with a small house area (but not the smallest), some form of lighting, tiled walls and a non-thatched roof, no electronic devices, and no members who have completed school. The Brebes branch typically catches large households (4 or larger) but there is insufficient data to capture the wide variation between expenditures of such households, and (because as noted above, large households have economies of scale), their expenditure tends to be over estimated. The opposite is the case for the Sumba Tengah branch. In this case, the branch captures small households (of which there are very few), so their expenditure tends to be based on that of larger (3–4 person) households, thus is underestimated, as single person households generally need to pay more on rent and utilities per person than larger households. Fundamentally, the Decision Trees do not have enough data, or features, to split the leaf node any further, pointing to a lack of data quantity. This motivates the addition of extra data from “similar” subdistricts.
Concerning data quantity, we first note that the small sample sizes, for Brebes (1017 households sampled in 2020 of around 460,000 households), Sumba Tengah (536 sampled of around 14,000 households) and Surabaya (1084 sampled of around 815,000 households), mean that the training set cannot cover every possible combination of the variables. To investigate the quality of the variables, we looked at the Spearman rank correlations of all the variables individually with per capita household expenditure, and, as expected, correlations are very weak, amongst the best being employment sector (services sector is better) in the rural areas, and type of drinking water (bottled is better) in Surabaya. This points to the more serious data quality problem that there are insufficient features to distinguish between households, especially in the middle deciles where IE/EE is high. This can also be seen in the Decision Trees (so this also applies to Random Forest Regression, XGBoost and Categorical Boosting), where the same variables (per capita floor space, asset ownership, education) tend to be chosen repeatedly, even along the same branch, and the trees themselves end up highly fragmented, even with stopping criteria or pruning. This motivates the addition of new variables/features from alternative data sources.
One evident limitation of the existing framework is the lack of variables/features to enable more fine-grained models to be developed, as analysis shows only very weak correlations between individual features and consumption expenditure, with even the full Decision Tree treating most of the existing features as uninformative. This problem could also be alleviated through more data, as data availability is limited because of the small survey sample sizes. In this section, we consider statistical and machine learning methods applied to poverty targeting that make use of (i) area-level features at the subdistrict (kecamatan) level, and (ii) additional data from “similar” subdistricts, in order to fill “data gaps”. Both ideas are related to small area estimation insofar as that approach aims to develop overarching models that cover multiple “small” areas using area-level features to differentiate the small areas, and combining the data from multiple areas in order to increase the overall sample size. A difference is that we develop distinct models for each of the districts under consideration; differences between districts make it unlikely that a single model will work well across multiple districts. Thus we focus on minimally augmenting the data in a district by adding only the most relevant data (that from “similar” subdistricts) to improve the performance of the models.
Horizontal data augmentation – New variables/features
For the purposes of poverty targeting in Indonesia, there are many variables/features that could be utilized, however we choose the following three types of variables, due to their reliability and availability, correlation to area-level wealth, and capacity to consistently differentiate subdistricts: (i) earth observation data, (ii) access to healthcare services, and (iii) access to education, explained further below. Note that of these data sources, only the first may be considered an “alternative” data source; the other two are administrative data. In summary, we have 24 additional variables/features, three from earth observation data, four relating to healthcare services and 17 to education. We also evaluate the methods when adding only some of these types of feature, e.g. the 7 features from earth observation data and access to healthcare services.
Additional environmental variables/features are the vegetation indexes NDVI, NDWI and NDBI calculated from earth observation data. These features are related to the degree of urbanization, and may be useful because rural areas are generally less wealthy than urban areas, however we aim to apply use indexes at fine levels of granularity. NDVI (Normalized Difference Vegetation Index) [30] indicates the density of plant growth and higher values indicate a higher density of green vegetation; NDWI (Normalized Difference Water Index) [31] is sensitive to changes in liquid water content of vegetation canopies, and is complementary to, but not a substitute for, NDVI; NDBI (Normalized Buit-Up Index) [32] is designed to differentiate built-up urban areas from woodland or farming areas.
All three indexes are calculated at the village (kelurahan/desa) level, i.e. finer than the subdistrict level, then aggregated to the subdistrict level for use in the models. The features are calculated from Landsat 8 Surface Reflectance (L8 SR) images taken from the Google Earth Engine for the period 2015–2020. To avoid anomalous values and bias due to cloud contaminated pixels, the L8 SR images are preprocessed using a cloud masking technique.1 that aims to remove cloud and cloud shadow contaminated pixels through Level 1 Band pixel quality assessment.2 Then the median value of each index from the preprocessed L8 SR images is calculated for each month based on each village’s polygon. These median values are then averaged over the 12-month period from March to February in the following year (this period being chosen because SUSENAS surveys are done in March). The values are aggregated to provide environmental indexes for each year at the subdistrict level. Note that with heavy cloud cover in some areas of Indonesia for much of the year, some data is missing at the village level, but there is typically enough data to provide meaningful values at the subdistrict level.
Access to services is a differentiator of wealth at a subdistrict level, as availability of services makes areas more desirable, increasing prices (such as housing) and therefore the average wealth of residents who can afford to live there. We consider variables/features for access to four types of healthcare service and 17 types of education at the subdistrict level, each a numeric value of the number of those services in the subdistrict. The four types of healthcare service are hospitals, public health centres (puskesmas), secondary public health centres (puskesmas pembantu) and maternity and child health centres (posyandu). The 17 education variables/features range over public and private and from early childhood through elementary, junior and senior high school, to vocational and community education.
Vertical data augmentation – More data
If small area estimation aims to "borrow strength" from similar areas, the aim here is to “borrow data” from similar subdistricts (that is, similar to the subdistricts in a given district) to augment data for use in a model defined at the district level. More precisely, for each subdistrict (the target subdistrict) of a district, we choose one most “similar” subdistrict from another district in the same province (the source subdistrict) from over the years 2016–2019, and add the data from that subdistrict/year, suitably calibrated, to the original dataset. The idea is that additional data from “similar” subdistricts (kecamatan), in the same province but not the same district, perhaps from an earlier year, can augment the training data for the models, to fill the “data gap” in terms of quantity, especially for household types with very few examples in the original dataset, allowing the models to better fit that data (e.g. allowing methods based on Decision Trees to make more splits, as discussed above).
All 2016–2019 data is taken from the source subdistrict and calibrated to 2020 data for the source district, at the district level, following the calibration procedure described above, before adding it to the original dataset containing the target subdistrict data. We experimented with calibration of the additional data to the target subdistrict (which involves mapping percentiles of consumption expenditure in the source district to percentiles in the target district), but calibration to the target district results in too much data distortion. The problem can best be seen with some of the poorer subdistricts in the urban area Surabaya, which are most similar (in that province) to well off subdistricts in towns in nearby rural districts. However, those towns are amongst the wealthiest areas within their district, so when the district expenditures are mapped to Surabaya’s, the resultant values are too high, introducing error to the predictions. A similar effect occurs even for two rural subdistricts.
The determination of “similarity” is based only on 2019 data. We have investigated numerous methods for defining the similarity of subdistricts. First, “geographical” similarity (i.e. closeness), as might be used in small area estimation, does not work very well because of large differences between subdistricts at district boundaries. Thus it is necessary to use household characteristics and/or area-level features to define similarity. A major problem is that the large number of variables (60 demographic, 24 augmented) introduce much noise, and the fact that some variables that are intuitively similar have too much influence, e.g. there are 17 variables for the numbers of various types of schools but only three variables for earth observation data, so using a naive metric such as Euclidean distance over all variables does not produce meaningful results.
Instead we use a more theoretically motivated approach for defining similarity of subdistricts by choosing four “high-level” variables, two relating to employment sector (agriculture and industry), plus drinking water type and sanitation type, based on analysis of the pairwise correlations between variables and consumption expenditure. Each high-level variable generates a probability distribution over its possible values within each subdistrict, as follows: for the agriculture sector and industry sector, the probability that a (working) household member works in self employment, self employment with unpaid labour, self employment with paid labour, as an employee, as casual labour, or as unpaid labour; for drinking water, the probability that a household uses branded bottled water, bottled or tap water, protected wells or streams, unprotected wells or streams, or rainwater; and for sanitation, the probability that a household uses septic tank, pit/latrine, ground/bush, other, or a public toilet. Then, given two probability distributions for the same variable, i.e. for any two subdistricts to be compared, the similarity of the distributions is calculated using the standard Jensen-Shannon divergence metric (a symmetric version of the more well known, but asymmetric, Kullback-Liebler divergence). The overall similarity between two subdistricts is found by adding the four values for the Jensen-Shannon divergences. Because for each target subdistrict, one similar subdistrict is chosen from amongst 4 years data, the additional data is approximately 20–25% of the size of the original dataset. Actual dataset sizes are shown in Table 2.
Dataset sizes (original, added, augmented training: 2016–2019, test: 2020)
Dataset sizes (original, added, augmented training: 2016–2019, test: 2020)
Another question is whether the instances in the new dataset (i.e. after data augmentation) should be reweighted to preserve properties of the original dataset. There are conflicting intuitions here, as if the added data is supposed to fill “gaps” in the original dataset, the new data should not be weighted any differently from the original data. On the other hand, adding new data changes the characteristics of the original dataset, more so if a large amount of new data is added. In the context of small area estimation, Schirm and Zaslavsky [33] used census data to reweight samples to preserve distributions of household characteristics of the population of US states on certain key dimensions, important because US states vary widely in population, and Tanton et al. [34] reweight samples to preserve various totals of household types obtained from prior census data in the context of poverty mapping in Australia.
Given these conflicting intuitions, we investigate the question empirically, studying two conditions, with results given below: (i) no weighting, and (ii) weighting to preserve the total weight of the instances in each subdistrict, and thus the relative contribution of each subdistrict to the overall model for the district. This is to combat the problem that the quantity of the added data is much greater than the original data or is much larger for some subdistricts but not others. More precisely, under the weighting scheme, if there are
We now present the results, again focusing on inclusion/exclusion error (IE/EE) and RMSE, for three of the methods described above, Ridge Regression (commonly used in practice), Random Forest Regression (a data-driven statistical method that often performs well in removing the effect of outliers), and XGBoost (the best performing machine learning method, that is similar in flavour to Random Forest Regression, but typically fits the training data better, whilst also generalizing to unseen test data). However, as commonly known, the performance of these machine learning methods depends heavily on suitable hyperparameter settings (see Weerts, Müller and Vanschoren [35] for a systematic analysis of this issue and Isabona, Imoize and Kim [36] for insights into tuning for Random Forest Regression).
Often hyperparameters are optimized using a grid search over various possible settings run over a portion (typically 70%) of the training set, and evaluated using the remaining portion of the training set as a validation set. However this automated approach does not work well in our context due to the small training set sizes (especially for Sumba Tengah, see above). Instead, after evaluating various hyperparameter settings using a 70/30 training/validation split, we manually chose hyperparameter settings that work consistently across all the districts. The same settings are then used for all feature and data augmentation experiments so as to isolate the effects of data and feature augmentation from those due to the choice of hyperparameters. Specifically, hyperparameters for the machine learning methods are those standard in the Python package sklearn, except that for Random Forest Regression, the number of estimators is 500, the maximum tree depth is 5, the minimum sample split is 3 and the minimum number of samples at a leaf is 3, while for XGBoost,3 the number of estimators is 700, the maximum tree depth is 5, the minimum sample split is 3, the regularization term
Table 3 shows the results of adding new variables/ features to the original datasets for the three districts, i.e. with no data augmentation. Comparison with Table 1 shows that IE/EE is actually worse in numerous cases, especially for Sumba Tengah. Results for RMSE are consistently worse for Ridge Regression and Random Forest Regression, and better for XGBoost, in Brebes and Sumba Tengah, however not for Surabaya. Moreover, these best results (for this scenario) are with different sets of variables, Ridge Regression with education (Brebes and Surabaya) and healthcare (Sumba Tengah), Random Forest Regression with all features (Brebes and Surabaya) and healthcare (Sumba Tengah), while XGBoost uses all features (Brebes, Surabaya) and healthcare (Sumba Tengah).
Feature augmentation (bottom 40% of households, train: calibrated 2016–2019, test: 2020)
Feature augmentation (bottom 40% of households, train: calibrated 2016–2019, test: 2020)
Data and feature augmentation (bottom 40% of households, train: calibrated 2016–2019 plus unweighted data from similar subdistricts, test: 2020)
Data and feature augmentation (bottom 40% of households, train: calibrated 2016–2019 plus weighted data from similar subdistricts, test: 2020)
In short, the results show that feature augmentation without data augmentation provides almost no benefit, with only XGBoost able to take any advantage of the new features, and results for Sumba Tengah seemingly worse than with the original features alone. A possible explanation for this is that for the added features to be useful, they need to enable further discrimination amongst the data relevant to expenditure estimation, however only a small amount of additional variation in the original data results for Brebes and Surabaya (using the number of schools), while healthcare features for Sumba Tengah seem only to add noise.
Table 4 shows the results when, for each subdistrict in the district, data from the most similar subdistrict in the same province over 2016–2019 is added to the original dataset, calibrated to 2020 values in the source district, and not weighted. In this situation, Ridge Regression (though not for Surabaya) and XGBoost for all districts improve over Table 1, but surprisingly, Random Forest Regression does not. In percentage terms, IE/EE for XGBoost reduces by around 8% for Brebes, 4% for Sumba Tengah and 3% for Surabaya, with small reductions in RMSE for the rural districts. The fact that additional data does not help much for Surabaya underlines the distinctiveness of Surabaya within its province, being the only large city, and thus is highly dissimilar to the other rural areas in the province: perhaps adding similar data from large cities in other provinces could improve these results.
Again, in this table, the best results for the methods are for different additional features: for Ridge Regression with earth observation (Brebes) and healthcare (Surabaya), for Random Forest Regression, education and earth observation (Brebes) and education (Surabaya), and for XGBoost, all features (Brebes) and education (Surabaya), however with no additional features for each method in Sumba Tengah. Again, for the added features to be useful, their ranges of values need to be similar to those in the original district (at least for some similar subdistricts), and also help discriminate between data in the original training set to improve expenditure estimation, and it appears that in Sumba Tengah, none of these features provide any additional information compared to the original features.
Table 5 gives the corresponding results when the dataset is weighted according to the scheme described above. Results are very similar to Table 4, showing that the weighting is making virtually no difference, and if anything results are slightly worse than those without weighting. With weighting, methods in Sumba Tengah are able to make use of earth observation (Brebes) and education features (Sumba Tengah, Surabaya).
A major motivation for this paper was the intuition that the approach to implementing the Proxy Means Test in Indonesia with 514 different models, a different model for each district, could be simplified by combining information from different districts, analogous to the way that small area estimation techniques combine information from different areas. Our approach uses data augmentation to fill hypothesized “data gaps” in each district using data from “similar” subdistricts, and additional features to improve the estimations.
The machine learning method XGBoost consistently performs best amongst the methods studied. Compared to Ridge Regression, the method commonly used in practice (Table 1), XGBoost with additional features and unweighted data augmentation from similar subdistricts (Table 4) provides an improvement in inclusion and exclusion error of 4.5% for Brebes, 13.9% for Sumba Tengah and 6.8% for Surabaya over households in the 2020 test set when trained on data from 2016–2019. Extrapolation from the survey design gives estimates of the reductions in the number of households misclassified as either in the bottom 40% or top 60% of all households at around 4.6% out of 120,000 households for Brebes, 17.4% out of 3700 households for Sumba Tengah, and 6.7% out of 154,000 households for Surabaya. Extended across Indonesia, this would result in considerably fewer erroneous classifications, and can be achieved with low cost, using only readily available administrative data and alternative data sources to provide features to complement the existing survey data. The improvements in IE/EE, however, come without any reduction in RMSE. We believe this is because, while data augmentation helps to fill data gaps, especially for households estimated to be in the middle deciles that contribute most to IE/EE, this comes at the cost of increasing RMSE for the bottom and top deciles (which however does not affect IE/EE as those households remain in the bottom 40% or top 60% of all households, respectively).
Our results show that reductions in inclusion and exclusion error can be achieved using data augmentation with additional features, but the districts we have examined are very different, and are still very different when combined with their “most similar” subdistricts in the same province. In particular, for Sumba Tengah (population 73,000), the sample size is so small that the problem is mainly lack of data, so data augmentation alone provides a good reduction in error, and moreover, feature augmentation does not help because the particular feature values are typically the same across the district (perhaps other features would give an improvement). For Brebes (population 1.8 million) and Surabaya (population 2.9 million), though, the problem is both a lack of data and that the SUSENAS survey features cannot sufficiently discriminate between similar households with differing expenditures; in these districts, data and feature augmentation combine to produce the reductions in error. Because of these differences between districts, we believe that developing better district-specific models is preferable to developing fewer though larger models, perhaps by combining districts.
The differences between the three districts also raise issues when applying our approach generally. First, it is important to choose appropriate features for augmentation, as “similar” subdistricts (as determined using household profiles) can have very different ranges of values for the augmented features, e.g. the number of different types of school and health clinics. This may add to the variation in the augmented dataset, therefore be useful in constructing more fine-grained models, however provide little improvement in estimation for the test set, which includes only examples from the original district. As an example, Sumba Tengah being a small rural district, earth observation and healthcare features tend to be the same in all subdistricts, however these features may vary more in the similar subdistricts. Next, a potential problem with weighting the augmented dataset is that sometimes the same “most similar” subdistrict is chosen for several subdistricts in the original dataset: the added households therefore end up being weighted much more highly than the examples in the original dataset, distorting the model. A similar problem that could arise in the unweighted scenario, though did not in our experiments, is that the similar subdistricts might be much larger than the original subdistricts, potentially affecting results in a similar way, in effect overweighting the added data.
As to potential improvements to the techniques introduced in this paper, we note that similar subdistricts have been defined using only 2019 data, however on closer inspection, the yearly data exhibits a high degree of concept drift, and in fact 2020 is more similar to 2018 (as far as household expenditure distributions are concerned) than 2019. Thus these results could be improved by using different ways of determining subdistrict similarity. It is also possible that alternative additional features would improve the methods, however we believe this is unlikely because of the problems in reliably obtaining values for such features, and the features obtainable from alternative data sources are at best proxies. Instead, what we believe is most needed are more selective ways of performing data augmentation.
Related work
Prior to this paper, there have been two broad approaches to using alternative data sources for official statistics (c.f. Marchetti et al. [6]): (i) the use of alternative data sources to define new features/covariates to be used as inputs to existing statistical methods, and (ii) the use of alternative data sources as proxies for statistics or indicators, typically at an aggregated level. By far the majority of the work related to poverty has been on the development of area-level proxies, i.e. poverty mapping, and that work has mainly focused on two types of data source, earth observations (including night-time lights) and call detail records.
An example where alternative data sources were used for feature construction is the work by Jean et al. [37], who examined night-time light intensities in Nigeria, Tanzania, Uganda, Malawi and Rwanda, using CNNs to extract features to predict the night-time light intensities from daytime satellite imagery; some features seem to correspond to urban areas, non-urban areas, water and roads. A collection of all derived features were then used in a Ridge Regression model with mean cluster-level values from survey data for estimating per capita consumption at the cluster level. Evaluation of the final model was done by comparing estimates calculated using the CNN features with those using night-time lights, with respect to a baseline calculated from surveys, with results derived using 10-fold cross validation. However, the
Work done by the Asian Development Bank assessed whether that method generalizes to other countries: the approach was adapted to measure poverty in the Philippines (at the municipality/city level) and Thailand (at the tambon level) [38]. As in Jean et al. [37], CNNs applied to satellite images were used to predict the mean night-time light intensity from daytime image features; while the authors present several such features, they do not attempt to interpret their meaning. The features were fed into a Ridge Regression model representing areas of size 4km
A number of papers investigate the use of intensity of night-time lights as a proxy for various poverty indicators defined at an area level. Mellander et al. [39] motivates this work by noting that night-time light intensity correlates with economic activity, however adds that such predictions are overestimates in urban areas and underestimates in rural areas. This is partly due to reflective light in cities and near city boundaries in urban areas, and the generally low values of light intensity in rural areas. Their study was conducted in Sweden. In tropical countries such as Indonesia, there are poor provinces such as West Papua with dense forests that filter night-time lights, so it is difficult to distinguish between different levels of poverty in such areas, as all have similar nightlight intensity. In addition, night-time lights data includes gas flares from mines that operate at night, which have no direct relationship to population centres. The work of Yu et al. [40] similarly aims to classify counties in China around Chongqing into poverty quintiles based on an “integrated” poverty index (a weighted sum of economic indicators). Proville et al. [41] take a closer look at correlations between night-time light intensities and a range of socio-economic indicators at national levels, showing that correlations over time are highest for electricity consumption (perhaps unsurprisingly), CO
More recently, night-time light data has been used in conjunction with other earth observation data in the context of poverty mapping, particularly vegetation, land use and crop indexes. Watmough et al. [42] used multiple data sources to predict poverty quintile in the Indian state of Assam. A similar study was undertaken by Zhao et al. [43], who combined land use indexes along with night-time lights in Bangladesh. Shi et al. [44] used night-time lights and vegetation indexes in conjunction with “point of interest” data capturing availability of services such as car services, hotels, medical services, shopping centres, train stations, restaurants, banks, etc., to predict a “comprehensive” poverty index in the counties around Chongqing (extending the work cited above). Similarly, Putri et al. [45, 46] combined night-time lights and vegetation indexes with Sulphur Dioxide (SO
Another strand of work utilizes call detail records (CDR) or mobile phone/positioning data (MPD), either for poverty mapping, or to extract information about mobility. Smith-Clarke et al. [47] shows that some high-level features extracted at the level of the cell tower, such as call volumes and durations, and various network properties, are moderately correlated with poverty, however develop statistics using only OLS regression with no validation. Blumenstock et al. [48] uses a large number of features from call detail records in Rwanda validated with a comparatively small number of individual users to develop a model of poverty at the cell tower level, which is then aggregated to the district level and compared to official statistics derived from household surveys. The aggregated results seem similar to those derived from surveys, though visually the model tends to overestimate wealth [48, Fig. 3)(C), p. 1075]. As in the above work, there is no test set on which the model is evaluated (5-fold cross-validation
Njuguna and McSharry [50] combine CDR and night-time lights data using a small number of intuitively understandable features to estimate a multi-dimensional poverty index at the sector level in Rwanda. One of the features, mobile phone ownership, is however likely to be less useful over time as more people acquire phones, so this model may need regular updating; a related concern pointed out by the authors is that mobile phone ownership is systematically lower in rural areas, potentially biasing the model. The results are given as correlations between estimated and actual values at the sector level derived using cross validation (the details are omitted from the paper), however the application of the method to other countries remains untested. Similarly, Steele et al. [51] combine mobile phone and earth observation data, such as vegetation indexes, night-time lights, climatic conditions and distance to roads or major urban areas, for estimation of poverty levels in Bangladesh using hierarchical Bayesian geostatistical models, with ground truth from aggregated survey data at the level of cell tower. In this work, evaluation was done on a separate 20% (out of sample) test set,
Finally, the work of Pokhriyala and Jacques [52] compares CDR and “environmental data” (economic activity such as production of various types of food, access to services such as major roads and water towers measured by distance) to predict a multi-dimensional poverty index (MPI) for communes in Senegal, showing that the multisource data is slightly more correlated with the MPI than CDR alone, with results derived using 10-fold cross-validation.
As can be seen from the above modest results, it is not reasonable to expect that earth observation data or call detail records, either singly or in combination, can be used as accurate proxies for poverty mapping at area levels, however, as we have shown, such data can be combined with survey and administrative data to reduce errors, even at the household level. Additional variables may help, for example as satellite image resolution increases, but what we believe will be more beneficial are more sophisticated models of poverty and judicious selection of causally relevant features in local models, such as we have recently argued to be essential for mapping food security [53].
Conclusion
Estimating household consumption expenditure via the Proxy Means Test has long been subject to high inclusion and exclusion errors. In this paper, we asked the question whether data science approaches and machine learning methods could improve on the standard commonly applied statistical methods, or whether the problem was the more fundamental one of (lack of) data quality due to limited sampling and inherent limitations to the predictive power of features used in development of models. We compared a number of machine learning methods, and introduced a number of approaches in the spirit of small area estimation that utilize area-level features and data augmentation at the subdistrict level (from “similar” subdistricts) to develop more refined models at the district level, evaluating the methods on three districts in Indonesia on the problem of estimating 2020 per capita household expenditure using data from 2016–2019.
The best performing method, XGBoost, is able to reduce inclusion/exclusion errors on the problem of identifying the poorest 40% of the population in comparison to the commonly used Ridge Regression method by 4.5% for Brebes, 13.9% for Sumba Tengah, and 6.8% for Surabaya, but without any reduction in the RMSE for estimating per capita household expenditure. We believe that the additional data from similar subdistricts is effective in filling “data gaps” and can be exploited using area-level features to produce models that better identify households in the middle deciles (thus reducing IE/EE and RMSE), however the overall RMSE is not reduced due to an increase in RMSE in other deciles.
Footnotes
Acknowledgments
This research was supported by the Australian Government through the Australian Research Council’s Discovery Projects funding scheme (project DP1801009 03). Thanks to BPS (Badan Pusat Statistik – Statistics Indonesia) for providing SUSENAS survey data and for numerous discussions, and TNP2K (Tim Nasional Percepatan Penanggulangan Kemiskinan – National Team for the Acceleration of Poverty Reduction) and Universitas Indonesia for hosting seminars on the topic of this paper.
