Abstract
Researchers have become increasingly interested in programs’ main and interaction effects of two variables (A and B, e.g., two treatment variables or one treatment variable and one moderator) on outcomes. A challenge for estimating main and interaction effects is to eliminate selection bias across A-by-B groups. I introduce Rubin’s causal model to approximate factorial experimental designs for studies with partial randomization and nonrandomization. I apply a Monte Carlo simulation to evaluate several propensity score applications. The findings suggest the following two applications for reducing bias and mean square error of parameter estimates when analyzing the relationship of two variables and an outcome: (a) inverse of propensity score weighting based on one multinomial propensity score model and (b) subclassification based on two binary propensity score models. As a demonstration, I examine whether the effects of the Head Start program, compared to other center-based care, for improving children’s reading achievement vary by child care quality.
Introduction
Researchers and policy makers have become increasingly interested in both the main and interaction effects of multiple variables in program evaluations. One might want to know not only the main effects of a new math curriculum (A), for example, or a teacher professional training program (B) on students’ math achievement (Y) but also whether the effects of the new math curriculum taught by teachers who also received the teacher professional training are better than the effects of the curriculum taught by teachers who did not receive the training. As another example, it might be valuable to know not only the effects of Head Start (A) on child outcomes in general but also the influence of child care quality (B).
These two examples illustrate scenarios that concern the interaction of two independent treatment variables, (A × B); that is, the effect of one treatment variable also depends on the level of another treatment variable. In the first case, teacher professional development (B) might influence the effect of the math curriculum (A); in the second case, it is quality (B) that might influence the effect of Head Start (A). Or, stated another way, in the first case, the effect of teacher professional development (B) might vary by the curriculum condition (A, new curriculum or not); and, in the second case, the effect of child care quality (B) might differ by the types of care (A, Head Start or not). These are situations in which two independent treatment variables are malleable. One can imagine changing how much professional development teachers receive or trying to improve a program’s quality. There is also a scenario in which these kinds of variables, that is, ones that influence treatment effects, cannot be manipulated, but we might be interested in their influences, nevertheless. For example, in a study that examines how the effects of Head Start (A) vary by race/ethnicity (B), exposure to Head Start can be manipulated, while race/ethnicity cannot. In these situations, we consider these nonmalleable variables, for example, race/ethnicity, to be moderators. A moderator affects the direction and/or magnitude of the relationship between the treatment variable and the outcome variable, and the moderation effects are usually tested through an examination of the interaction term (A × B) of the treatment variable and the moderator (Baron & Kenny, 1986). Note that some researchers (e.g., Kraemer, Kiernan, Essex, & Kupfer, 2008; Kraemer, Stice, Kazdin, Offord, & Kupfer, 2001) suggest that the moderators should precede the treatment variables, however, the Baron and Kenny’s approach (e.g., Baron & Kenny, 1986; Frazier, Tix, & Barron, 2004) does not require this criterion for moderation analysis, which means that the moderators could be posttreatment variables, for example, the program-related subgroup variables (e.g., Hill, Brooks-Gunn, & Waldfogel, 2003; Lochman, Boxmeyer, Powell, Roth, & Windle, 2006; Peck, 2003; Schochet & Burghardt, 2007). This article adopts the Baron and Kenny’s approach for moderator analysis.
The above examples illustrate two commonly seen situations in which two variables, that is, two independent treatment variables, and the main treatment indicator and a moderator of some sort, work together to affect an outcome. Another commonly seen situation is mediation, whereby the treatment variable, which occurs earlier, changes the outcome through changing the mediator, which occurs later (Baron & Kenny, 1986; Kraemer et al., 2001). Detecting a mediation effect is of interest in evaluation research, and researchers have proposed various approaches for causal mediation analysis (Hong & Nomi, 2012; Imai, Keele, & Tingley, 2010). Nevertheless, this article does not concern causal mediation analysis but focuses on the analysis of main and interaction effects of two independent treatment variables and the moderation effect under partial randomization or nonrandomization.
The purpose of this study, based on the counterfactual model (Holland, 1986; Rosenbaum, 2002; Rubin, 1974), is to evaluate several propensity score methods by using a Monte Carlo simulation to compare several propensity score methods and to identify appropriate approaches to reduce bias and mean square error (MSE) of parameter estimates of the main and interaction effects of two variables. In the rest of the first section, I demonstrate the challenge in analyzing the effects of two treatment variables and moderation effects.
The biggest challenge in analyzing two treatment variables and moderation effects is the ability to produce unbiased estimates of both the main and interaction effects of two variables. In the situation in which the two variables/factors (A and B) are both dichotomous and malleable, ideally, we can use a 2 × 2 factorial experimental design to estimate the main and interaction effects of A and B. Figure 1 (Shadish, Cook, & Campbell, 2002, p. 264) demonstrates a 2 × 2 factorial experimental design, where subjects are randomly assigned to four groups: (1) Level 1 of Factor A and Level 1 of Factor B (A1B1), (2) Level 1 of Factor A and Level 2 of Factor B (A1B2), (3) Level 2 of Factor A and Level 1 of Factor B (A2B1), and (4) Level 2 of Factor A and Level 2 of Factor B (A2B2). Due to randomization, the main and interaction effects of two factors can be unbiasedly estimated. However, when random assignment of both variables is not feasible, which includes situations for which only one or none of the two variables concerns random assignment, the students in the four cells (groups) in Figure 1 may systematically differ from each other and produce biased estimates of both the main and interaction effects.

2 × 2 Factorial design. Adapted from figure 8.1 by Shadish, Cook, and Campbell (2002, p. 264).
To illustrate that partial randomization (only one variable is randomly assigned) may produce a biased estimate of the interaction effect, consider a hypothetical example (Table 1). The purpose of the study in the example is to examine the effects of two variables (treatment status and race/ethnicity) on math achievement. This example involves partial randomization, in which the subjects were randomly assigned to the treatment condition (e.g., new math curriculum), while race/ethnicity (White vs. non-White) cannot be manipulated or randomly assigned. The total sample included 16 participants (8 White and 8 non-White). The White sample had higher socioeconomic status (SES) than did the non-White sample, and the average SES for the total sample is 0.
Data for a Hypothetical Example of a Partial Randomization.
Note. SES = socioeconomic status. aMatched sample is indicated by *. Matching on SES across four groups (Treatment × White).
Eight subjects were randomly assigned to the treatment condition. Math achievement was collected after a 1-year intervention. The true productivity function of the math outcome was assumed to be the following: Math = 100 + 10 × White + 2 × SES + 5 × Treatment + 1 × SES × Treatment + 0 × White × Treatment. In this equation, the average treatment effect is assumed to be 5 points and the moderation effect of SES is assumed to be 1 point, but there is no moderation effect of race/ethnicity, that is, the treatment effect did not differ between the White and the non-White participants. One advantage of randomized experiments is that the treatment effect estimates do not rely on statistical models and can be directly estimated by the treatment control mean difference.
Table 2 presents the means and treatment effects using the full sample for Table 1. Note that the overall treatment effect is 5 points, which is an unbiased estimate. For subgroup analysis in regard to race/ethnicity, the treatment effects for non-White and White are 4 and 6 points, respectively. Due to the random assignment of participants to the treatment condition, the treatment effects for subgroups are unbiased estimates, as well. However, the correct interpretation is that the treatment effect for the non-White (lower SES) group is 4 points and, for the White (higher SES) people, 6 points. It would not be accurate, however, if one claimed that the moderation effect of race is 2 points or that the treatment effect is 2 points higher for Whites than for non-Whites. The 2-point treatment effect difference is not due to race/ethnicity itself but, rather, due to SES. It is because the SES of Whites is 2 points higher than that of non-Whites that the treatment effect for Whites appeared 2 points higher than that for non-Whites. If race/ethnicity could be manipulated and participants randomly assigned to White and non-White groups, then there would be no SES difference between the race/ethnicity subgroups, and the treatment effect difference between Whites and non-Whites (i.e., moderation effect) could be estimated unbiasedly. This example illustrates that, in analyzing moderation effects under partial randomization, it is possible to get biased moderation effect estimates when there are critical differences between the subgroups (SES difference between Whites and non-Whites in this example), even if the participants were randomly assigned to treatment conditions. 1
Means and Treatment Effects for the Full Sample in Table 1.
Note. N = 16.
For the above simple example, we can make the baseline variable (SES) equivalent across four groups (non-White control, non-White treatment, White control, and White treatment) by matching on SES. The participants denoted by the asterisk (*) in Table 1 consist of the matched sample, in which each of the four groups includes two participants with SES equal to 0 and 1. The means and treatment effects for the matched sample (N = 8) are presented in Table 3. The overall treatment effect for this sample is 5.5 points, which is an unbiased estimate, and the treatment effects for two subgroups are both 5.5 points, which also is an unbiased estimate. The moderation effect, that is, the treatment effect difference between the Whites and non-Whites is 0, which is an unbiased estimate. This simple matching analysis illustrates that it is essential to keep the baseline variables equivalent across all four groups to estimate the interaction effect under partial randomization.
Means and Treatment Effects for the Matched Sample in Table 1.
Note. N = 8.
The above example also indicates that it is possible to achieve unbiased estimates of the interaction effects under partial randomization when analyzing the effects of two variables. Within the past decade or so, statistical methods for analyzing the effects of two variables have been developed. In experimental subgroup analysis, researchers have varyingly extended propensity score methods (Rosenbaum & Rubin, 1983), in an attempt to minimize selection bias, and have identified well-matched subsets of treatment and control groups within which to estimate subgroup treatment effects (Hill et al., 2003; Lochman et al., 2006; Peck, 2003; Schochet & Burghardt, 2007). When the subgroup variable is binary, the basic practice is to match participants between the treatment and control groups within each of the two subgroups. These separate matches may make the baselines equivalent for the treatment and control groups for each subgroup; thus, they may produce unbiased estimates of treatment effects for that subgroup (for further discussion about matching within subgroups, see Peck, 2013). However, without making an effort to make the baselines equivalent among all four groups (Figure 1), we cannot produce an unbiased estimate of the interaction effect (i.e., whether the difference in treatment effects between two subgroups is due solely to the intervention and subgroup factors), as illustrated in Tables 1 and 2.
Other researchers, for example, Imai and van Dyk (2004), have generalized propensity score methods to study the main effects of two treatment variables using subclassification. However, we still lack a systematic study to examine all of the methods noted previously for analyzing the main and interaction effects of two variables on the outcome.
In the second section, I present Rubin’s causal model to approximate factorial design. In the third section, I review and identify applicable propensity score methods. Then, in the fourth section, I present the Monte Carlo simulation used to evaluate these methods. I use a case study, in the fifth section, to demonstrate how to use the suggested propensity score methods to conduct causal analysis. Finally, the sixth section concludes the article by providing a discussion of applications.
Causal Framework for Analyzing the Effects of Two Treatment Variables or Moderator Effects on the Outcome
Using the counterfactual model (Holland, 1986; Rosenbaum, 2002; Rubin, 1974), I present the potential outcomes for individuals under the situation that both treatment A and treatment B (or moderator) are binary variables (Figure 2). This is analogous to a 2 × 2 factorial experimental design (Figure 1). The main effect of treatment A is the marginal mean difference Y(1,.) − Y(0,.); the main effect of treatment B or moderator is the marginal mean difference, Y(.,1) − Y(.,0). The interaction effect of treatment A and treatment B, or moderator effect (i.e., the difference in treatment effect between two levels of moderator), is [Y(1,1) − Y(0,1)] − [Y(1,0) − Y(0,0)].

Potential outcomes of two variables.
Alternatively, this counterfactual model can be illustrated in Figure 3. Note that, in the figure, this path diagram looks similar to the one that Raudenbush (2011, p. 22) used to illustrate the potential outcome of mediation analysis (Case 2: Treatment → Mediating Treatment → Y), but there are notable differences. In Raudenbush’s diagram, there is a causal link (arrow) from treatment A to treatment B, which, in turn, causes the outcome. In this diagram, however, no causal relationship between treatment A and treatment B (or moderator) is specified, that is, treatment A and treatment B (or moderator) are independent, there is no arrow from treatment A to treatment B (or moderator), and it is treatment A and treatment B (or moderator) together that cause the outcome (Y). Although the mechanism is different, we can use a similar counterfactual model to make a causal inference.

Potential outcomes and paths of two variables.
Applicable Propensity Score Methods to Approximate Factorial Experimental Designs
In this section, I extend Imai and van Dyk’s (2004) approach to analyzing the effects of two binary variables on the outcome, using a stratification method based on the two propensity score functions. Then, I introduce other potential propensity score methods for analyzing two binary variables.
Imai and van Dyk (2004) used Rubin’s causal model to generalize a propensity score application for the analysis of the effects of two continuous treatment variables on the outcome. To do so, they relied on two standard assumptions (the stable unit treatment value assumption; Rubin, 1980, 1990) and the strong ignorability of the treatment assignment assumption (Rosenbaum & Rubin, 1983) as well as a third assumption of a uniquely parameterized propensity function. The third assumption implies that the propensity score function can be effectively summarized by the parameter that can be calculated from Gaussian regression using the covariates (see Imai & van Dyk, 2004, for more details about this assumption). The two propensity score functions for the two continuous treatment variables are estimated based on two independent Gaussian linear regression models. The two estimated propensity score functions are used to subclassify the data into several groups. Figure 4 illustrates the 3 × 3 subclassification based on two propensity score functions (Imai & van Dyk, 2004).

3 × 3 Subclassification based on two propensity score functions. Adapted from figure 4 by Imai and van Dyk (2004, p. 861). Data are subclassified into three groups (lower third, middle third, and upper third) based on each of the two propensity score functions, respectively. Each cell of the 3 × 3 table represents a group based on two propensity score functions jointly.
Data are classified into three groups (lower third, middle third, and upper third) based on each of the two propensity score functions. Each cell of the 3 × 3 table represents a group based on two propensity score functions jointly. The overall treatment effect point estimate and its standard error estimate are a weighted average across nine groups (Imbens & Rubin, 2009). In addition to a 3 × 3 subclassification, Imai and van Dyk (2004) presented simulation results for a 2 × 2 and a 4 × 4 subclassification. In general, more groups produce less bias, as in the analysis of a single treatment variable (Lunceford & Davidian, 2004). 2 Note that the subclassification method based on two propensity score functions could be used for analyses in which both two variables are binary or one variable is binary and another variable is continuous.
Other Propensity Score Methods for Analyzing Multiple Groups
Imai and van Dyk’s (2004) subclassification requires the estimation of two independent propensity score functions based on treatment A and treatment B (or moderator), respectively. There are, however, alternative approaches to using propensity scores for analyzing multiple groups. First, we can collapse two dimensions (2 × 2) of design to one dimension (4 × 1) of design and use the propensity score methods that are appropriate for one-dimensional multiple groups analysis to analyze the effects of two variables. We convert a 2 × 2 design to a 4 × 1 design, that is, a design that has one treatment variable with four levels. For example, Figure 1 presents a 2 × 2 factorial design in which there are two factors (Factors A and B), and each factor has two levels. The conventional outcome analysis involves the use of the two factors and their interaction term as predictors in the statistical model. We can combine these two factors, however, to create one new variable with four levels (A1B1, A1B2, A2B1, and A2B2) and use this new variable as the predictor in the statistical model. By doing so, we simplify our analysis of two predictors to the analysis of one predictor. This is particularly helpful when we need to estimate the propensity scores of the predictors.
Second, the propensity scores can be estimated using a multinomial logistic regression model, for example, Group A1B1 in Figure 1 serves as the reference group, and all the other groups are compared with this reference group. The propensity scores for a participant are the probabilities of being in a particular group, given the covariates.
Third, various propensity score methods can be applied to estimate the group differences, which can be easily converted to the estimates of the main and interaction effects of two variables. These propensity score applications include subclassification (Rosenbaum & Rubin, 1984), optimal nonbipartite matching (Lu, Greevy, Xu, & Beck, 2011; Lu & Rosenbaum, 2004), inverse of propensity score weighting (Imbens, 2000), and propensity score as covariates.
Monte Carlo Simulation
The previous sections provided a discussion of various applications of propensity score methods in analyzing the effects of two variables on the outcome. These applications have multiple dimensions: (a) estimates of propensity scores (using one multinomial propensity score model or two binary propensity score models); (b) application of propensity scores (subclassification, weighting, matching, or as a covariate); and (c) covariate adjustment (using the covariates that are included in the model to estimate propensity scores to estimate the effects of two variables on the outcome). I evaluate these applications of propensity score methods by conducting a Monte Carlo simulation for two scenarios: (a) partial randomization, which concerns random assignment of treatment A, while treatment B (or moderator) is not randomly assigned; and (b) nonrandomization, that is, neither treatment A nor treatment B (or moderator) concerns random assignment, and treatment A and treatment B (or moderator) are artificially correlated through the other variables.
One example of a partial randomization can be seen in a design to study the main and interaction effects of Head Start (treatment A) and child care quality (treatment B). Students are randomly assigned to a treatment group (Head Start) or control group; however, they are not randomly assigned to child care quality groups. In nonrandomization, the students are neither randomly assigned to a treatment group (Head Start) or control group nor randomly assigned to child care quality groups.
In the Monte Carlo simulation, I create the data with known effects of the two variables under different assumptions. I then use different propensity score methods to estimate the main and interaction effects of two variables through an analysis of the simulated data. By comparing the effect estimates with the true effects, we can determine which propensity score applications work better to approximate factorial experiments to produce less-biased effect estimates. To save space, I put the models that produce data for the simulation in Appendix A. The following sections focus on the approaches to analysis and results of the simulation.
Approaches to Analyzing the Main and Interaction Effects of Two Variables on the Outcome
Partial randomization
I first analyze the full sample using an ordinary least square (OLS) model (Model 1), where W 1 and W 2 are two covariates, A is an indicator of treatment A, B is an indicator of treatment B (or moderator), and β A , β B , and β A × B are the main and interaction effects of variables A and B. Model 1 is a misspecified model because it omits the cubic term of a covariate (W 1) as compared to the model used to produce the data (Model A1 in Appendix A), which simulates the reality that researchers might not know the correct model. These results serve as a reference for comparison with propensity score methods.
I then apply propensity score methods to analyze the data. Table 4 presents various propensity score applications. First, the multinomial logistic regression model is run to estimate the generalized propensity score (Imbens, 2000) using covariates W
1 and W
2. The generalized propensity score is the conditional probability of receiving treatment t, given pretreatment covariate W, that is,
Propensity Score Applications in a Monte Carlo Simulation Under Partial Randomization.
Note. ANOVA = analysis of variance; OLS = ordinary least square; ANCOVA = analysis of covariance. aCovariate adjustment is to adjust for covariates W 1 and W 2. bSubclassification, matching, and covariate are all based on the probabilities of being in the reference group. cWeighting is by the inverse of the probabilities of being the group of treatment that participants actually received. dSubclassification and matching are based on two propensity score functions.
The generalized propensity score is estimated using the multinomial logistic regression as given subsequently, where the reference group (e.g., Y(0,0) in Figure 2) is coded as 4:
The distributions of the generalized propensity scores of being in the reference group (e.g., the probabilities of being in Group Y(0,0) in Figure 2) for the four groups are examined for overlap. A sample with common support can be obtained by excluding participants whose probabilities of being in the reference group are larger than the minimum value of four maximum probabilities or smaller than the maximum value of the four minimum probabilities (called “common support sample,” Item 2 under Data in Table 4). The formal set notation of the common support sample (C) can be expressed as follows:
where
This common support sample is further used for the nearest neighbor one-to-one matching by matching the reference group with all the other three groups, respectively, based on the probabilities of being in the reference group. I then have three matched samples, and each matched sample contains one sample of the reference group. I choose the common sample of being in the reference group among the three matched samples. The final matching sample includes the common sample of being in the reference group and its matching sample in all other three groups (Item 3 under Data in Table 4).
For the full sample and common support sample, to make results comparable with a 3 × 3 subclassification based on two propensity score functions, I subclassify data into nine groups based on the probabilities of being in the reference group and obtain the weighted average parameter estimates across nine groups. I also use the inverse of the generalized propensity score as weight (Imbens, 2000) to analyze data using the full sample and common supported sample, respectively. The weight is
Finally, two independent binary logistic regression models, based on covariates W 1 and W 2, are used to estimate the propensity scores for treatment A and treatment B (or moderator), respectively. The common support sample (Item 4 under Data in Table 4) is obtained by including the overlap sample among the four groups based on propensity scores for variables A and B through the following steps: (a) the first common support sample is the overlap sample between two levels of treatment A based on the propensity scores of treatment A, (b) the second common support sample is the overlap sample between two levels of treatment B based on the propensity scores of treatment B, and (c) the final common support sample is the overlap sample between the first common support sample and second common support sample. As illustrated in Figure 4, I apply a 3 × 3 subclassification based on two propensity score functions and estimate the weighted treatment effect across nine groups (Imai & van Dyk, 2004). Both the OLS (Model 1) and ANOVA models are used for data analysis.
Non-randomization
I conduct analyses similar to those in the partial experiment scenario. I first analyze the full sample using an OLS model (Model 4), where β A , β B , and β A × B are the main and interaction effects of variables A and B. Similar to Model 1, Model 4 is a misspecified model because it omits a cubic term of a covariate and two interaction terms of covariates and variables A and B (see Expression A2 in Appendix A), which simulates the reality that researchers might not know the correct model. These results serve as a reference for a comparison with propensity score methods.
I then apply propensity score methods to analyze the data. Table 5 presents various propensity score applications. Both the OLS (Model 4) and ANOVA models are used for data analysis.
Propensity Score Applications in a Monte Carlo Simulation Under Nonrandomization.
Note. ANOVA = analysis of variance; OLS = ordinary least square. aCovariate adjustment is to adjust for covariates W 1, W 2, and W 3. bSubclassification, matching, and covariates are all based on the probabilities of being in the reference group. cWeighting is by the inverse of the probabilities of being the group of treatment that participants actually received. dSubclassification and matching are based on two propensity score functions.
I replicate 1,000 times to produce and analyze data. The estimates of bias, percentage bias, and MSE of the parameters (θ) can be calculated as follows:
Percentage bias is defined as:
Bias is a measure of the difference between the effect estimate and the true effect. A positive sign of bias indicates an overestimate of the effect, and a negative sign of bias indicates an underestimate of the effect. The percentage bias indicates the percentage of bias in terms of the true effect. MSE is a measure of the accuracy of the effect estimate. We would hope to have small bias, percentage bias, and MSE for our estimates. I estimate bias, percentage bias, and MSE for β A , β B , and β A × B , respectively. I evaluate the performance of various methods using the percentage bias for the assessment of bias and MSE for the assessment of accuracy (Burton, Altman, Royston, & Holder, 2006).
Simulation Results
Tables 6 and 7 present the average percentages of the final analysis samples in the full sample for partial randomization and nonrandomization simulations, respectively. The common support sample for both propensity score models is quite large. Specifically, it is larger than 85% in the partial randomization simulation and larger than 68% in the nonrandomization simulation. It is not surprising, however, that the matching samples are small because I used one-to-one matching and allowed the sample allocation between two levels of treatment A and treatment B (or moderator) to vary (i.e., the proportion of Level 1 of treatment A ranges from 0.2 to 0.8 to simulate both balanced and nonbalanced experimental designs in the partial randomization simulation). It is smaller than 30% in the partial randomization simulation and smaller than 20% in the nonrandomization simulation.
Average Sample Sizes of Various Analyses in a Partial Randomization Simulation.
Note. Entries are the average percentages of the final analysis samples in the full sample (N = 8,000). A total of 1,000 replications.
Average Sample Sizes of Various Analyses in a Nonrandomization Simulation.
Note. Entries are the average percentages of the final analysis samples in the full sample (N = 8,000). A total of 1,000 replications.
In a partial randomization simulation, the results of seven types of sample allocation to the treatment group (Table 6) have very small variation. Thus, I report their average results. Table 8 presents the results for percentage bias and MSE by three parameters separately in a partial randomization simulation. The conventional OLS estimates, by including both variables and their interaction term using the full sample, produce large bias (46.1%, −23.0%, and −49.7%) and MSE (21.86, 5.62, and 25.89) for the coefficients of the interaction term, treatment A and treatment B (or moderator). This implies that, even in an experiment in which one treatment variable is randomly assigned, when the treatment effect is heterogeneous (i.e., the treatment effect varies by the covariate, W 1), the conventional OLS estimate could produce biased estimates of the effects of two variables and their interaction (the average treatment effect estimate that uses OLS without the interaction term, however, would be unbiased).
Percentage Bias and MSE in a Partial Randomization Simulation.
Note. ANOVA = analysis of variance; OLS = ordinary least square; ANCOVA = analysis of covariance; MSE = mean square error. A total of 1,000 replications. %Bias is the percentage bias
For one multinomial propensity score model, the application of the common support sample has smaller percentage bias and MSE than does the application of the full sample, and the OLS estimate has smaller percentage bias and MSE than does an ANOVA. These results are consistent with previous propensity scoring studies of the use of the common support sample and the inclusion of covariates in the outcome model for double robustness (e.g., D’Agostino, 1998; Lunceford & Davidian, 2004). In addition, including the propensity score as a covariate in the analysis has smaller effects on bias and MSE reduction than do the other propensity score applications (e.g., subclassification and weighting). For two binary propensity score models, subclassification has similar good performance in MSE reduction as do subclassification, weighing, and matching for one multinomial propensity score model when OLS regression is used. However, subclassification for two binary propensity score models produces slightly larger percentage bias than do the other methods.
In a nonrandomization simulation, the results of five combinations of sample allocations have some variations. Thus, I report their individual results and average absolute values. I rely on the average results, however, to evaluate the performance of various methods. Tables 9 through 11 present the results for percentage bias and MSE for the coefficients of the interaction term, treatment A, and treatment B (or moderator), respectively. The two approaches that have smaller average percentage bias and MSE across three parameter estimates than do the conventional OLS analysis are (a) inverse of propensity score weighting based on one multinomial propensity score model and (b) subclassification based on two binary propensity score models.
Percentage Bias and MSE for
Note. ANOVA = analysis of variance; OLS = ordinary least square; MSE = mean square error. aEntries are the means of the absolute value of percentage bias among five combinations of sample allocations. bEntries are the means of the MSE among five combinations of sample allocations. A total of 1,000 replications for each parameter combination. %Bias is the percentage bias
In sum, the simulation results from partial randomization and nonrandomization scenarios suggest two good propensity score applications to reduce bias and MSE of parameter estimates in analyzing two variables: (a) inverse of propensity score weighting based on one multinomial propensity score model and (b) subclassification based on two binary propensity score models. Further, the use of a common support sample and covariate adjustment in outcome analysis will reduce bias and MSE.
Case Study: The Main and Differential Effects of Head Start
The Monte Carlo simulation suggested two appropriate applications of propensity score methods for analyzing the effects of two binary variables on the outcome. In this case study, I focus on these two approaches. Here, the inverse of propensity score weighting method (Imbens, 2000) can be used to analyze two multicategorical variables, while Imai and van Dyk’s (2004) method can be used for an analysis that involves continuous variables.
I demonstrate the applications of Imbens’ (2000) and Imai and van Dyk’s (2004) methods for the examination of the effects of the Head Start program as compared with other center-based care in improving children’s reading achievement, as differing by child care quality (as a binary variable and continuous variable, respectively). Because the Head Start program is a particular center-based care, this case study is also used to determine whether the care in the Head Start program is more effective than the care in the other center-based settings.
Sample and Measures
I use data from a nonrandomization study in which neither of the two variables (child care type and child care quality) concerned random assignment and their effect estimates suffer from selection bias. The data set used for this case study is the Early Childhood Longitudinal Study–Birth cohort (ECLS-B). ECLS-B began as a randomly selected nationally representative sample of 14,000 children born in 2001 in the United States. ECLS-B collects information on a rich array of individual-, household-, teacher-, and child care–level measures. I use the Item Response Theory (IRT) scale score for reading skill measured at kindergarten as the outcome variable, and I focus on observed global classroom/care quality. The quality of Head Start’s and other center-based care was assessed through the Early Childhood Environmental Rating Scale–Revised (ECERS-R; Harms, Clifford, & Cryer, 1998).
The ECLS-B administered 37 of the 43 items included in the original ECERS-R, which covered the following six subscales: (a) furnishing and display, (b) personal care routines, (c) listening and talking, (d) learning activities, (e) interaction, and (f) program structure. Each of the 37 items was rated on a 7-point Likert-type scale. I use the overall ECERS-R score in the analysis. The covariates used in this study include the mental IRT scale score measured at 2 years old, reading and math IRT scales measured at prekindergarten, gender, race/ethnicity (White, Black, Hispanic, and other race/ethnicity), birth weight, age at outcome assessment, English spoken at home, disability status, special education status, health problem, family’s SES, mother’s education, income, poverty level, welfare receipt, home violence, household structure, and one dummy variable for the year that the child was in kindergarten (2007 vs. 2006).
Analytic Procedure
One of the main variables of interest, the child care quality measure (the ECERS-R overall score) is a continuous variable. I dichotomize it with a median split to indicate high quality and low quality. Both the continuous and dichotomous ECERS-R scores are used for demonstration purposes. I first estimate the overall effect of Head Start as compared with other center-based care using conventional OLS regression and propensity score methods (subclassification and weighting), without including the care quality measure and its interaction term with care type. When child care quality is treated as a binary variable, the following approaches are used to examine the differential effect of Head Start between two child care quality groups: Conventional OLS regression analysis with interaction of care quality (low quality vs. high quality) and care type (Head Start vs. other center-based care). Inverse of propensity score weighting analysis based on one multinomial propensity score model (Imbens, 2000). Subclassification based on two binary propensity score models (Imai & van Dyk, 2004). I used a 3 × 3 subclassification based on two propensity score functions estimated by two binary logistic regression models.
When care quality is analyzed as a continuous variable, I use Approach 1 with a modification that care quality is a median-centered continuous variable and Approach 3 with a modification in which subclassification is based on one binary propensity score model and one Gaussian linear regression model.
Findings
As presented in this section, I first checked covariate balance for the analyses before and after applying the propensity score methods. In the end, I report the results of examining whether the effect of the Head Start program varies by the care quality (moderator).
Because my main interest is analyzing both the main and moderator effects, I present the results of covariate balance checking for four comparison groups. I present the results of an F-test of a comparison on covariate means among four comparison groups and the maximum standardized mean difference among four groups. Because an F-test is used to test population parameters and is sensitive to sample size, it may not be appropriate for checking the sample characteristics (Austin, 2007; Imai, King, & Stuart, 2008). Thus, I focus on an interpretation of the maximum standardized mean difference among the four groups.
Tables 12 and 13 present covariate balance checking results for child care type by care quality groups before and after applying the inverse of propensity score weighting, respectively. Stuart (2007) suggests that the standardized mean difference should ideally be less than 0.25 and that a value greater than 0.50 is “particularly problematic.” As seen in Table 12, of the 22 covariates, 9 have a maximum standardized mean difference of greater than 0.50 among four groups. After weighting by the inverse of propensity score, 17 of the 22 covariates have a maximum standardized mean difference of less than 0.25 among four groups, and the other 5 covariates have a maximum standardized mean difference of less than 0.50 among four groups (Table 13) 4 . This suggests that the inverse of propensity score weighting can greatly reduce selection bias. Covariance balance also was examined (but not presented) for the subclassification method, which resulted in the covariates being much more balanced than without the use of any propensity score methods.
Percentage Bias and MSE for
Note. MSE = mean square error; OLS = ordinary least square; ANOVA = analysis of variance. aEntries are the means of the absolute value of percentage bias among five combinations of sample allocations. bEntries are the means of the MSE among five combinations of sample allocations. A total of 1,000 replications for each parameter combination. %Bias is the percentage bias
Percentage Bias and MSE for
Note. MSE = mean square error; OLS = ordinary least square; ANOVA = analysis of variance. aEntries are the means of the absolute value of percentage bias among five combinations of sample allocations. bEntries are the means of the MSE among five combinations of sample allocations. A total of 1,000 replications for each parameter combination. %Bias is the percentage bias
Covariate Balance Check Among Child Care Type by Care Quality Groups Before Applying Propensity Score Methods.
Note. ECERS = Early Childhood Environmental Rating Scale; NCES = National Center for Education Statistics. aEntries are the means and standard deviation (in parentheses). bEntries are the F-statistics of comparing the means across four groups (*p < .05. **p < .01. ***p < .001). cEntries are the maximum standardized mean differences among four groups, which were calculated based on the pooled standard deviations across all four groups. dSample size is rounded to the nearest 50 per NCES regulation.
Covariate Balance Check Among Child Care Type by Care Quality Groups Weighted by the Inverse of Propensity Score.
Note. ECERS = Early Childhood Environmental Rating Scale; SES = socioeconomic status. NCES = National Center for Education Statistics. aEntries are the means and standard deviation (in parentheses) weighted by the inverse of propensity score. bEntries are the F-statistics of comparing the weighted means across four groups (*p < .05. **p < .01. ***p < .001). cEntries are the maximum standardized mean differences among four groups, which were calculated based on the same pooled standard deviations as in Table 12 from the unweighted analysis. dSample size is rounded to the nearest 50 per NCES regulation.
Table 14 presents the differential effect of Head Start by a binary care quality measure. Neither the conventional OLS nor propensity score analyses detected a statistically significant difference between high- and low-quality care for the effect of Head Start, as compared to other center-based care.
The Differential Effect of Head Start as Compared With Other Center-Based Care on Kindergarten Reading by Binary Care Quality Measure.
Note. OLS = ordinary least square. ECLS-B = Early Childhood Longitudinal Study–Birth cohort; SES = socioeconomic status; IRT = Item Response Theory. Adapted from ECLS-B. The covariates included in the propensity score models and adjusted in the outcome analysis include the mental IRT scale score measured at 2 years old, reading and math IRT scales measured at prekindergarten, gender, race (White, Black, Hispanic, and other race), birth weight, age at outcome assessment, English speaking at home, disability status, special education status, health problem, family’s SES, mother’s education, income, poverty level, welfare receipt, home violence, household structure, and one dummy variable that indicates the year that the child was in kindergarten (2007 vs. 2006). aSample size is rounded to the nearest 50 per National Center for Education Statistics regulations. bThe number of the original groups was nine, while the final groups used for weighted impact estimate was six. Three groups were excluded from analysis because at least one of the four groups (cells) contained fewer than five children.
Table 15 presents the differential effect of Head Start by a continuous care quality measure. Neither the conventional OLS nor propensity score analysis detected a statistically significant difference on the coefficient (slope) of care quality measure for Head Start as compared with other center-based care.
The Differential Effect of Head Start as Compared With Other Center-Based Care on Kindergarten Reading by Continuous Care Quality Measure.
Note. OLS = ordinary least square; IRT = Item Response Theory; ECLS-B = Early Childhood Longitudinal Study–Birth cohort. Adapted from ECLS-B. The covariates included in the propensity score models and adjusted in the outcome analysis include the mental IRT scale score measured at 2 years old, reading and math IRT scales measured at prekindergarten, gender, race (White, Black, Hispanic, and other race), birth weight, age at outcome assessment, English speaking at home, disability status, special education status, health problem, family’s SES, mother’s education, income, poverty level, welfare receipt, home violence, household structure, and one dummy variable that indicates the year that the child was in kindergarten (2007 vs. 2006). aSample size was rounded to the nearest 50 per NCES regulations. bThe number of the original groups was nine, while the number of the final groups used for weighted impact estimate was six. Three groups were excluded from analysis because each contained fewer than five children in Head Start.
In sum, the reading achievement at kindergarten for children in Head Start was significantly lower than that of their peers in other center-based care. The effect of Head Start as compared with other center-based care on kindergarten reading, however, did not differ by care quality.
Conclusions
The Monte Carlo simulation suggested two appropriate applications of propensity score methods for analyzing two binary treatment variables. The inverse of the propensity score weighting method based on one multinomial propensity score model (Imbens, 2000) and subclassification based on two propensity score models (Imai & van Dyk, 2004) are both effective in reducing bias and MSE and can be easily applied. These two methods can be used for analyzing two binary variables. When two variables are multicategorical, for example, the total numbers of levels for two variables are J and K, Imbens’ (2000) method is easier to apply by converting a two-dimensional J × K design to a one-dimensional design with J × K levels. When one or two variables are continuous, Imai and van Dyk’s (2004) subclassification method is a good choice.
Note that this study assumes that the treatment variables and moderator are available for all sample members. However, in some circumstances, the interest variable is not measured for some particular members (e.g., in some experimental studies, the fidelity of implementation is measured only for the treatment group). The propensity score methods can be applied to this type of situation (e.g., Hill et al., 2003; Peck, 2003, 2013; Peck & Bell, 2012; Schochet & Burghardt, 2007); however, this is not the main interest of this article. Further, although this study did not directly address the issues that pertain to clustering, which is common in education evaluation, the methods discussed in this study can be extended to multilevel analysis, for example, the multilevel logistic regression models can be used to estimate the generalized propensity score and the hierarchical linear model for outcome analysis to take account of clustering effects in educational evaluation.
In sum, it is important to eliminate selection bias across variable-A-by-variable B groups in the analysis of the relationship between two variables and an outcome. This article identified and illustrated appropriate applications of propensity score methods for this purpose. Further, although this article focused on the analysis of two independent variables, the method can be extended to analyze mediation effects in future research.
Footnotes
Appendix A
Acknowledgment
The author thanks Mark Lipsey and Peter Schochet, two anonymous reviewers, and the former associate editor Laura Peck for their valuable comments on previous versions; however, all errors remain the author’s.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
