Abstract
Background:
The generalized propensity score (GPS) addresses selection bias due to observed confounding variables and provides a means to demonstrate causality of continuous treatment doses with propensity score analyses. Estimating the GPS with parametric models obliges researchers to meet improbable conditions such as correct model specification, normal distribution of variables, and large sample sizes.
Objectives:
The purpose of this Monte Carlo simulation study is to examine the performance of neural networks as compared to full factorial regression models to estimate GPS in the presence of Gaussian and skewed treatment doses and small to moderate sample sizes.
Research design:
A detailed conceptual introduction of neural networks is provided, as well as an illustration of selection of hyperparameters to estimate GPS. An example from public health and nutrition literature uses residential distance as a treatment variable to illustrate how neural networks can be used in a propensity score analysis to estimate a dose–response function of grocery spending behaviors.
Results:
We found substantially higher correlations and lower mean squared error values after comparing true GPS with the scores estimated by neural networks. The implication is that more selection bias was removed using GPS estimated with neural networks than using GPS estimated with classical regression.
Conclusions:
This study proposes a new methodological procedure, neural networks, to estimate GPS. Neural networks are not sensitive to the assumptions of linear regression and other parametric models and have been shown to be a contender against parametric approaches to estimate propensity scores for continuous treatments.
In secondary data analyses, researchers often adjust for the observed covariates and assume that unobservables are balanced. While the assumption is usually untestable in practice (Imbens & Rubin, 2015), neural networks (NN) may have advantages compared with conventional approaches such as ordinary least squares regression. This article provides a tutorial on the use of NN to estimate generalized propensity scores (GPS) for continuous treatments. Accurate GPS estimation is essential to subsequent steps in propensity score analyses (e.g., weighting). Rosembaum and Rubin (1983, 1984) presented proof that if the propensity score model is correctly specified, it will balance the distribution of observed covariates and result in an unbiased estimate of the treatment effect.
NN have advantages over conventional methods for estimating GPS being that they can approximate any polynomial function, regardless of the number of interaction terms (Westreich et al., 2010). Many social scientists are unfamiliar with NN and training machine learning models. Those scientists could use this article as an introductory guide for estimating GPS with NN. Given this primary aim, first, we present a Monte Carlo simulation study to compare NN with well-established propensity score estimation methods. Then, we evaluated grocery spending habits in U.S. Department of Agriculture Economic Research Service (USDA, 2015) designated “low access to supermarket” communities. Our evaluation applied the NN from the Monte Carlo simulation.
Our findings suggest that NN may be an alternative method for GPS estimation for continuous treatments, not a replacement for existing methods. Furthermore, it is common practice in propensity score analysis to apply a set of alternative methods for propensity score estimation and compare results (e.g., Collier & Leite, 2020; Setoguchi et al., 2008).
Theoretical Framework
Neyman’s (1923) notation for potential outcomes for randomized experiments was repurposed by Rubin (1973) for observational studies. The potential outcomes
GPS
Hirano and Imbens (2004) expanded Imbens’s (2000) study on categorical multivalued treatments by demonstrating that GPS shared qualities of Rosenbaum and Rubin’s (1983) propensity score for binary treatments. Such qualities included removing all biases associated with differences in the covariates and balancing properties that can be used to assess the adequacy of particular specifications of the score. The propensity score was generalized to treatments
where the GPS is
Hirano and Imbens’s (2004) procedures for Gaussian distributed treatment doses are as follows:
(1) Model the continuous treatment indicator as a function of covariates:
(2) Estimate the GPS as the density of the normal distribution given covariates:
(3) Model the outcome as a function of the Z and the GPS:
where
(4) Evaluate the covariate balance, and
(5) Estimate the average potential outcome at each treatment dose of interest:
Also in 2004, Imai and van Dyk expanded Imbens’s (2000) and Joffe and Rosenbaum’s (1999) studies to establish causal effects in observational studies where the treatment is categorical, ordinal, continuous, semicontinuous, or multivariate with a large number of covariates. Although Imai and van Dyk’s (2004) methods closely related to the works of the previously mentioned authors, they primarily focused on subclassification rather than matching. They also demonstrated how their methods can improve balance in a randomized design.
Inverse probability weighting can also be used to remove selection bias with continuous treatments (Robins et al., 2000). Inverse probability weights (IPW) for continuous treatment dose have the GPS as the denominator,
(Robins et al., 2000). The IPW remove bias by creating a pseudo-population within which a set of individuals assigned the same dose of treatment and another set receiving a different dose have similarly distributed covariates (Robins et al., 2000). Under the assumption that GPS are correctly estimated, the treatment dose will be uncorrelated with the covariates in the pseudo-population generated by weighting observations by the IPW (Leite, 2016). Since these foundational studies on propensity scores for continuous treatments, researchers across an array of disciplines have conducted propensity score methods using the GPS (e.g., Bia & Mattei, 2008; Foster, 2003; Fryges, 2009; Kluve, 2012; Moodie & Stevens, 2012).
GPS Estimation
Parametric estimation
The most common approach to estimate GPS, demonstrated in Hirano and Imbens (2004), is with ordinary least squares using a parametric model of the relationship between treatment dose and covariates:
where
Algorithmic estimation
Data mining encompasses a wide range of research techniques that include more traditional options such as database queries and more recent developments in machine learning and language technology. The drawbacks of ordinary least squares regression to estimate GPS can be overcome by algorithmic data mining procedures, which do not rely on parametric model assumptions. The application of data mining to propensity score methods is not new to educational research (Beane et al., 2017; d’Agostino, 1998; Rubin & Donald, 1997; Stukel et al., 2007), but more literature is needed on the estimation of GPS using data mining techniques.
NN is a machine learning technique that is inspired by biological NN in the brain (Zhu, 2017). Like nerves in the brain, processing elements known as perceptrons connect to other processing elements (Walczak, 2018). Perceptrons are arranged in a layer or vector, with the output of one layer serving as the input to the next layer (see Figure 1). Many algorithms used in machine learning are based on the notion of the perceptron (Rosenblatt, 1962). For example, deep learning is a NN that uses multiple layers between input and output layers. If
It has shown to yield superior results with numerous data distributions (Witten et al., 2016).

Artificial neural network.
The weights in NN are computed by a mathematical function, such as backpropagation, that activates the neurons, which are the circles within layers shown in Figure 1 (Hagan & Menhaj, 1994). Weights are applied to each link between neurons and provides NN with flexibility for detecting interactions. Each arrow has an associated w that is multiplied by inputs (Gupta, 2013). If the weight is high, it will make the associated input strong (Gupta, 2013).
Optimizing a NN for a particular propensity score application and standardizing general procedures whereby NN can be routinely used to estimate propensity scores take substantial investigation and effort (Westreich et al., 2010). NN should be trained while avoiding overfitting. This is when the error on the training set is driven to a very small value, but when new data are presented to the network the error is large (Srivastava et al., 2014). In such a case, the network has memorized the training examples, but it has not learned to generalize to new situations.
Deep learning is derived from the juncture of research on NNs, artificial intelligence, graphical modeling, optimization, pattern recognition, and signal processing (Deng & Yu, 2014). As with most concepts composed of various disciplines, deep learning has a number of definitions relative to its use and the aim of the researcher. Commonalities across definitions of deep learning are (1) models with multiple layers of nonlinear information processing and (2) supervised and unsupervised learning procedures with more, abstract layers (Deng & Yu, 2014). Figure 1 is an example of a deep learning NN because it has multiple hidden layers, but all NN have an input, a hidden, and an output layer.
Few studies have investigated algorithmic data mining procedures to estimate propensity scores (e.g., Keller et al., 2013; McCaffrey et al., 2004; Setoguchi et al., 2008; Westreich et al., 2010). Data mining methods that have been studied include decision trees, generalized boosted regression, NN, and support vector machines. However, most of these applications were specific to estimating propensity scores for categorical treatments. Within this limited body of research, there are even fewer studies on propensity score analysis with normal, lognormal, and gamma distributed treatment doses. Prior to the current study, no research has expanded deep learning NN to continuous treatments nor compared NN with GLM.
Research Questions
Hypotheses
Because the first research question is descriptive, the hypotheses below refer to the second research question:
Method
The research questions are addressed through a Monte Carlo simulation study. We manipulated the following conditions in this Monte Carlo simulation study: (1) two levels of covariates (eight and 16), (2) three treatment distributions (Gaussian, moderately skewed, and highly skewed), (3) three sample sizes (500, 1,000, and 5,000), and (4) two methods to estimate GPS (GLM and NN). Then, we trained and tested six NN to estimate GPS for eight and 16 covariates by three distributions (2 × 3).
In this “Method” section, we begin by discussing the simulation procedures, then explain procedures used to estimate GPS with NN and GLM, and last explain methods to analyze the results.
Monte Carlo Simulation Study
All simulations were performed in R Version 3.2.3, while analyses with the GLM were performed in R, NN were implemented using Python Version 3.6.2, running on a Linux operating system. For each combination of simulated conditions, we generated 1,000 data sets. The steps to build the model for treatment assignment in this simulation study were as follows: Step 1. Covariates were simulated from a normal distribution with a mean of zero and standard deviation of one. All covariates were uncorrelated. Applications of propensity score estimation in social science data generally range between six and 32 covariates (Ansong et al., 2015; Cho et al., 2012; Monlezun et al., 2015; Morgan et al., 2010; Sullivan & Field, 2013). We simulated eight and 16 covariates to mimic the general range of applied studies. Step 2. We sampled half of the covariates and split them into categorical variables and kept the other half continuous. Half of the categorical variables were binary, and the remaining half were transformed to equal group sizes by randomly selecting either tertiles, quartiles, or quintiles. We varied the effects and number of categories of the covariates in order to prevent the models from overtraining on a single scale. Step 3. The population for the continuous treatment dose was varied randomly with each iteration. Here is a full model for a case of eight main effects:
where
We created models with main effects, quadratic effects, two-way interactions, and three-way interactions for the relationship between treatment assignment and the covariates in each of the number of covariate conditions. The number of main effects was equal to the number of covariates simulated, but the number of remaining effects was randomly chosen. We randomly selected regression coefficients from a uniform distribution with a range of 0 to .5. The uniform distribution ensured a constant probability for all coefficients within our sampling interval. Thus, the uniform distribution removed potential effects of coefficient size on estimation performance. This study seeks to accurately predict propensity scores rather than estimate the coefficients of the population model. So, there is no advantage of having a fixed population model. Also, varying the population model allows the challenge of data mining methods to detect a different set of interactions and terms.
Gaussian and gamma-distributed treatment doses
Treatment doses are not always normally distributed in applied research studies. We studied different treatment distributions to make this simulation study more informative for both applied and methodological researchers. The conditional normal density of the Gaussian distributed treatment dose
where
For each gamma distributed treatment, the independent variables were
Sample size
We kept our sample sizes close to other Monte Carlo simulation studies that evaluated propensity score estimation methods (e.g., Arpino & Mealli, 2011; Drake, 1993; Kim & Seltzer, 2007; Leite et al., 2015; Rosenbum & Rubin, 1983). We simulated sample sizes of 500, 1,000, and 5,000.
Analysis of Simulated Data Sets
Training and testing the neural networks
Hyperparameters, such as hidden units and layers, are often tuned heuristically by hand (Witten et al., 2016, p. 432). In this study, we hand-tuned each NN’s hyperparameters (Bergstra & Bengio, 2012). We determined the computation for a network to estimate GPS with L hidden layers
where the ReLU activation function is applied to the output of preceding hidden layers hl (Witten et al., 2016, p. 423). Each NN had four hidden layers.
We trained the NN to diverse input patterns to increase their generalizability to real-world applications. Eighty percent of each simulated data set was used for training, while the remaining 20% tested the models with mean squared error (MSE) as a performance measure. The optimal splitting of data for training sets typically range from 40% to 80% (Dobbin & Simon, 2011). NN were implemented in Python using the scikit-learn 0.19 library (Pedregosa et al., 2011).
Because our NNs only had a few parameters, it was not hard to select reasonable values by hand (Dahl, Sainath, & Hinton, 2013). Using backpropagation, we trained our NN with a ReLU activation function, an adaptive learning rate beginning at .01, a stochastic gradient-based optimizer, and four hidden layers (Hagan & Menhai, 1994). One hundred neurons and 1,000 iterations and momentum of 0.9 trained the NN to estimate continuous treatments. The solver for weights of the NN was Scikit learn’s default stochastic gradient decent optimizer. We chose “solver” for weight optimization. The specific stochastic gradient-based optimizer solver was proposed by Kingma and Ba (2014).
The MSE, defined as
where
GLM
The GLM were fitted using a full factorial specified model in order to include all possible interactions of confounding variables in the test data sets. We fitted the GLM using the whole data set. Models to estimate GPS for Gaussian distributed treatments were estimated with linear regression, while moderately and highly skewed treatments were estimated with the gamma regression model
(Stacy, 1962) and the probability density presented in Equation 10.
For skewed treatments, we used the gamma.shape function within the MASS package version 7.3-53 in R (R Core Team, 2017). The function estimates the shape parameter and then adjusts coefficient estimations and predictions in the generalized linear model.
Analysis of Monte Carlo Simulation Results
We compared NN and linear regression (LR) with MSE and Pearson’s correlation coefficient. To assist in the interpretation of the results, we ran split plot analysis of variance (ANOVAs) where Pearson’s correlation coefficient and the log of MSE served as the outcome. MSE combines bias and variance (Roberts & Vandenplas, 2017). We took the log of MSE because the raw values had extreme nonnormality because ANOVA is only robust to moderate non-normality. Sample size, number of covariates, and proportion treated were included as between-subject factors. The GPS estimation method was a within-subject factor. We also included all possible interactions of factors. Last, the generalized eta squared (
Results
Following the procedures discussed above, we simulated Gaussian, moderately skewed, and highly skewed treatments, eight and 16 confounding variables, and sample sizes equal to 500, 1,000, and 5,000. After estimating GPS with GLM and NN, we compared the performance of each data mining technique with Pearson’s correlation coefficient and MSE.
Pearson’s Correlation Coefficient
In general, the correlation for estimates from GLM and NN was most similar for Gaussian distributed treatments with 16 covariates. For example, when treatments were Gaussian distributed, there were 16 covariates, sample sizes were 1,000, and the average correlations were equal between GLM and NN. The results indicate that with a few exceptions, GPS estimated with NN averaged higher correlation coefficients compared with GLM. Table 1 illustrates each of these differences.
Average Pearson Correlation Coefficients for Estimated Generalized Propensity Scores.
In Table 1, the average Pearson correlation coefficient between actual and estimated GPS is reported for conditions in which the sample sizes were 500, 1,000, and 5,000, the distributions of treatments were Gaussian, moderately skewed, and strongly skewed, and the number of covariates were equal to eight and 16.
Some conditions in which Pearson’s correlation coefficient was higher for GLM compared with NN are also displayed. These cases tended to occur with Gaussian distributed treatments, eight covariates, as well as when sample sizes were 5,000.
Manipulated conditions with substantial effects
This section describes which manipulated conditions impacted the average Pearson correlation coefficients. Table 2 displays the
Table of Generalized Eta Squared (
Note. The colon in between conditions indicates an interaction term. DMT = data mining technique; GES = generalized eta squared.
Main effects
The particular data mining technique employed, GLM or NN, had the highest effect (
Interaction effects
Each possible two-way interaction of data mining techniques with other manipulated conditions (distribution of treatments, sample sizes, and number of covariates) notably affected
There was a three-way interaction between treatment distribution, sample size, and data mining technique
Mean Squared Error
A visual examination of the MSE of the manipulated conditions and interactions reveal noticeably different means between GLM and NN when treatments were skewed. In Table 3, MSE ranged between 0 and 206,728,676 for GPS estimated with GLM and between 0 and 195.61 with NN.
Summary Statistics of Mean Squared Errors.
For conditions of Gaussian distributed treatments, the average MSE values were negligible (
Manipulated conditions with substantial effects
The manipulated conditions that affected the MSE values are presented in Table 4. All manipulated factors and interactions had a negligible effect on MSE. MSE values reported in Table 3 and Table 4 are raw.
Average Mean Squared Errors for Estimated Generalized Propensity Scores.
However, the log transformation on the raw values was employed to make the highly skewed distribution of MSE less skewed to conduct an ANOVA and calculate generalized
Table of Generalized eta Squared (
Note. The colon in between conditions indicates an interaction term. DMT = data mining technique.
Main effects
There were substantial main effects of all manipulated conditions in this study: distribution of treatments (
Interaction effects
There were substantial two-way effects of treatment and sample size (
The interaction between the distribution of the treatment and number of covariates can also be seen in Table 4. Of note, there was no change between eight and 16 covariates when the treatments were Gaussian distributed. However, MSE averaged higher values for 16 covariates compared with eight covariates when the treatments were highly skewed. Under most simulated conditions when the treatments were Gaussian distributed, both data mining techniques performed nearly equivalently. However, average MSE values were lower for NN compared with GLM when treatments were skewed. MSE values varied across methods and sample size more so for GPS estimated with GLM compared with NN. When treatments were Gaussian distributed, NN estimated GPS with slightly higher MSE when the number of covariates was eight compared with 16. For gamma-distributed treatments, NN estimated GPS with higher MSE when the number of covariates was 16 compared with eight. The MSE were not as clean-cut for GPS estimated with GLM but are explained by the significant three-way effect of treatment, sample size, and data mining technique (
Under these circumstances, when sample sizes increased, MSE decreased. However, under similar conditions when the number of covariates were eight, MSE values increased with sample size.
Empirical Study
We sought to quantify the effects of distance (i.e., miles between permanent residence and main grocery store) on consumer shopping behavior. Specifically, we analyzed data from the Food in our Neighborhood Study (FIONS; Karpyn et al., 2020) to estimate GPS and to predict dose–response functions (d.r.f), that is, the response function of grocery store expenses for each level of distance between grocery stores and permanent residences.
Prior to implementing propensity score analyses, we hypothesized that people spend more money at the grocery store when they live further away from it. We estimated GPS with ordinary least squares regression and contrasted GPS with estimates from a newly proposed methodology in the context propensity score analysis, NN. Estimated GPS from both methods were employed in propensity score analyses to measure the impact of the treatment variable, distance, on the outcome variable, monthly grocery store expenses. This demonstration aims to introduce applied researchers and methodologists to the promise of NN in propensity score analysis.
Sample
Door-to-door surveys of residents in food deserts, areas with limited or no access to grocery stores as defined by the USDA (2015), were administered to collect data on primary outcome measures and covariates (e.g., distance, marital status, general health, race/ethnicity, income, type of home, number of residents in the home, body mass index, education, and allocated government assistance). The sample includes 796 men and women aged 18+ who identify themselves as the primary food shopper in their household. The analytic sample (N = 743) was constrained to shoppers with complete case responses to all relevant pretreatment covariates (c = 16). Participants were asked to report the store where they did their primary food shopping, and a Euclidian distance was calculated from their home address to that location, forming a “distance to store” outcome variable. An average per person per household total monthly grocery expenditure was calculated by taking the response value to the question “How much do you spend per month on groceries?” and dividing it by the response to the question “How many people does this feed?”
Analysis
The raw distance variable was skewed and was consequently transformed by taking the common logarithm to approximate the normality assumption of ordinary least squares (OLS) regression. Then, the authors estimated the conditional densities for the distance treatment variable with OLS and NN. The GPS model estimated with OLS included all combinations of two-way and three-way interactions. Table 6 displays the summary statistics of GPS estimated by both methodologies.
Summary of Estimated GPS.
Note. OLS = ordinary least squares; NN = neural networks; GPS = generalized propensity score; DMT = data mining technique.
GPS estimated with OLS had a greater range compared with GPS estimated with NN; medians and means were comparable. Next, we evaluated the covariate balance by stratifying based on the GPS and fitting one regression for each covariate with the treatment dose as the outcome and with GPS strata and the covariates as the predictor. The standardized regression coefficients were used as a measure of the effect size of the covariates on treatment dose. Covariate balance was considered adequate if the coefficient was lower than .25 (Stuart, 2010). The standardized regression coefficients obtained with the code above are shown in Table 7.
Standardized Coefficients of Regressions of Treatment Dose on Covariates.
Note. OLS = ordinary least squares; NN = neural networks; SNAP = The Supplemental Nutrition Assistance Program; WIC = Women, Infants, and Children.
All but two covariates (bold in Table 7) achieved the desired level of balance with OLS, but all covariates achieved balance with NN. Following Hirano and Imben’s (2004) methods, distance, poverty, employment, and GPS obtained with OLS were included in the first outcome model as a function of the treatment. The second outcome model predicted grocery expenses with distance and GPS obtained from NN but without covariates because all of the covariates met balancing criteria.
In the final step, predicted values from the outcome models were plotted to display a d.r.f. Figures 2 and 3 show the treatment dose effects on the monthly grocery expenses with confidence intervals indicated by dashed lines. We calculated the confidence intervals by first estimating individual treatment effects at the 1%–100% percentiles of treatment dose. Then, we calculated upper and lower limits by adding and subtracting 1.96 times the standard errors from the mean grocery expenses. Figure 2 is derived from the first outcome model with GPS estimated with OLS, and Figure 3 derived from GPS estimated with NN.

Estimated dose–response function post estimation of generalized propensity score with ordinary least squares.

Estimated dose–response function post estimation of generalized propensity score with neural networks.
There were little to no differences between the d.r.f. in Figures 2 and 3. The confidence intervals are narrower in the left side of the distribution of log distance to the supermarket because more shoppers were available to estimate the treatment dose effects at the left region of the distribution. It is also noticeable that grocery expenses average increases with increases in log distance to the supermarket. This positive effect of distance from the grocery store supports the hypothesis that people spend more money at the grocery store when they live further away from it. Despite differences in the range and distribution of GPS estimated with OLS and NN, following Hirano and Imben’s (2004) methods to determine a d.r.f. yielded nearly equivalent findings.
Discussion
Accurate estimation of GPS is critical to the application of propensity score methods to estimate the effect of continuous treatments. GLM have model assumptions, require complex manual specification of interactions and quadratic terms, and are susceptible to overfitting compared with nonparametric data mining techniques (Setoguchi et al., 2008; Westreich et al., 2010). We investigated the accuracy of GPS estimated with GLM and NN from simulated data with medium to low sample sizes and complex models without changing the specification of interactions. Additionally, we performed propensity score analyses on an empirical data set using traditional and newly developed approaches.
This study is the first to establish literature on fitting NN for GPS estimation for continuous treatment doses. All studied approaches were trained and tested with MSE. Although the loss was low with our NN architectures, there exists other hyperparameters that may yield even better performance. Future research should compare NN trained using quasi-Newton methods with different training algorithms (e.g., gradient descent, conjugate gradient, and Levenburg–Marquardt; Fazayeli & Banerjee, 2016; Kingma, 2014; Le et al., 2015). We fitted GLM to estimate GPS with ordinary least squares for Gaussian distributed treatments and maximum likelihood estimation for gamma-distributed treatments. Other estimators are also available for GLM (Amiguet et al., 2017).
Our study extends the approach to estimate GPS proposed by Hirano and Imbens (2004). Previous research on estimating propensity scores for categorical treatments has favored nonparametric data mining techniques over traditional methods (Collier & Leite, 2020; McCaffrey et al., 2004; Pirracchio et al., 2015; Setoguchi et al., 2008). This study aligns with previous research, having found higher correlations and lower MSE values after comparing the true GPS with the GPS estimated by NN. Kreif et al. (2015) applied and proposed the “Super-Learner” as a new technique to estimate GPS and the d.r.f. for continuous treatments. Similar to this article, they found nearly equivalent GPS and outcomes after comparing the results of the machine learning technique to parametric implementations of the GPS. Kreif et al. (2015) also used MSE as the loss function for the “Super Learner.”
GLM
We demonstrated how GLM can estimate GPS of skewed treatments that are moderately correlated with actual GPS. However, MSE values were large. By looking at the data, we learned that the correlations and MSE were consistent in that when the correlation was very low, the MSE was very high. Furthermore, these cases only occurred when GPS were estimated with GLM. The minimum correlation between actual and estimated GPS for GLM was 0, while the minimum correlation for NN was 0.54. The implication is that if the correlation between true GPS and estimated GPS is low, the selection bias would not be removed.
Distributions in GLM
Theoretically, the gamma distribution was the right choice in our simulation because the treatments ranged from 0 to infinity, the variance of the treatments was proportional to the square of the mean response, and the treatments were generated at each covariate from a gamma distribution with the mean as the exponential of the linear function of
Estimating Treatment Effects
Dose–response functions
The GPS estimated with NN and GLM can be used in plotting the d.r.f., as demonstrated in our empirical example. Graham et al. (2015) extended the binary d.r.f e augmented regression approach of Scharfstein et al. (1999) to derive a Taylor approximation for continuous d.r.f. In this extension, an outcome regression model is augmented with a set of inverse GPS covariates to correct for potential misspecification bias. Users of the R statistical software can estimate the average dose–response functions with the causaldrf package version 0.3 (Schafer & Galagate, 2015). The package offers a range of existing and new estimators such as Schafer and Galagate (2015), Bia et al. (2014), Flores et al. (2012), Imai and Van Dyk (2005), and Hirano and Imbens (2004).
IPW
We previously stated that IPW can also be used to remove selection bias with continuous treatments (Robins et al., 2000). Weights can be implemented as survey weights when estimating the ATE after calculating GPS with GLM. This study demonstrated how to calculate the denominator of IPW with NN. Had we trained the NN to predict treatment dosage instead, perhaps the same NN could have estimated the numerator with history covariates as a predictor of treatments. Gruber et al. (2015) calculated the numerator of IPW with a NN with one hidden layer containing two nodes. However, the authors did not outline the training techniques and details on their NN hyperparameters. Future works should compare IPW and alternative weighting techniques to reduce the bias and estimate ATE.
Limitations and Future Research
We did not test multiple NN architectures to determine the most optimal approach to estimate propensity scores. Zhu et al. (2015) proposed a boosting algorithm to estimate GPS and provided a method to determine optimization. There are several established methods to optimize NN. It is unclear what performance would have been yielded by other NN architectures and should be noted for future work. Another limitation is that the confounding variables were all uncorrelated. Varying the simulated relationships would have added more complexity to each iteration and ultimately, the study.
As noted above, continuous treatments are common in behavioral and educational research. The conditions of our simulation study suggest that researchers implementing NN can estimate GPS that are approximately unbiased at small and moderate sample sizes for skewed treatments. On average, the MSE values decreased with increases in skewedness. Researchers who are exploring model specification can use NN as an alternative to ordinary least squared models because it is nonparametric and less prone to errors (Drake, 1993). Variable selection methods, such as stepwise deletion, have produced unstable estimates in previous research and yielded highly variable treatment effect estimates (Breiman, 1996; McCaffrey et al., 2004). Despite the limited research on implementing NN in the context of propensity score analysis, we believe the current study provided improvements for estimation with continuous treatments. The difference in GPS estimation may not result in differences in treatment effect estimates for the simulated conditions, and additional research is needed to understand when one of the methods would be preferable. This article serves solely as an introduction. We hope you agree that a gentle introduction is much more engaging, particularly for nonmachine learning expert readers.
With regard to the data used for the empirical study, we acknowledge that the distance to the store variable was limited to a sample of shoppers living in a food desert. This may in turn limit interpretation of findings to similar, food desert circumstances. However, given the importance of food retail to community well-being and health, such information can support future efforts to understand the consequences of food deserts.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
