Abstract
A rapidly growing number of algorithms are available to researchers who apply statistical or machine learning methods to answer social science research questions. The unique advantages and limitations of each algorithm are relatively well known, but it is not possible to know in advance which algorithm is best suited for the particular research question and the data set at hand. Typically, researchers end up choosing, in a largely arbitrary fashion, one or a handful of algorithms. In this article, we present the Super Learner—a powerful new approach to statistical learning that leverages a variety of data-adaptive methods, such as random forests and spline regression, and systematically chooses the one, or a weighted combination of many, that produces the best forecasts. We illustrate the use of the Super Learner by predicting violence among inmates from the 2005 Census of State and Federal Adult Correctional Facilities. Over the past 40 years, mass incarceration has drastically weakened prisons’ capacities to ensure inmate safety, yet we know little about the characteristics of prisons related to inmate victimization. We discuss the value of the Super Learner in social science research and the implications of our findings for understanding prison violence.
Statistical or machine learning algorithms show superior performance compared to approaches that have to make strong assumptions about the underlying data generating process. Through an inductive approach, machine learning enables us to account for nonlinear relationships that even the most finely developed social theory could not postulate explicitly (Berk 2008). A growing variety of learning algorithms are available but rarely, if ever, do we know which algorithm will perform best for a particular research question and data set at hand. Researchers typically choose one or a handful of algorithms, in a largely arbitrary fashion, although there might be other algorithms that perform better. In this article, we present the Super Learner—an ensemble approach to statistical learning that systematically selects the best performing algorithm or a weighted combination of many (van der Laan, Polley, and Hubbard 2007). In doing so, we engage the ongoing debates about the comparative effectiveness of prediction algorithms in criminology (Berk and Bleich 2013; Hamilton et al. 2014; Ngo, Govindu, and Agarwal 2015) and respond to calls to improve forecasting capacity in the social sciences by capitalizing on advances in computing and statistical theory (Berk 2008; Glymour, Osypuk, and Rehkopf 2013).
We illustrate the Super Learner by forecasting violence in a population of prisons in the United States. Over the past 40 years, many prisons have become overcrowded, underresourced, and dangerous (Clear and Frost 2013), making research on prison violence perhaps more important than ever (Bierie 2012; Steiner 2009). Increasing numbers of inmates are being supervised by fewer staff, many serve their sentences in old and poorly equipped facilities, and programming resources have been steadily disappearing (Gottschalk 2014; MacKenzie 2006). We chose this particular application because prison violence is a product of many predictors that act alone, in multiple interactions, or in other nonlinear ways. Statistical learning approaches are increasingly being used in correctional settings, but they have mostly relied on classification and regression trees (CARTs). Taxman and Kitsantas (2009), for instance, used CARTs to identify facility-level predictors of the availability of substance use programs in prisons, whereas Ngo et al. (2015) used CARTs to forecast inmate misconduct. The vast majority of research on prison violence has used parametric models, which introduce strong assumptions and restrict the analysis to mainly linear relationships.
Super Learner
Super Learner is a cross-validation-based approach for combining machine learning algorithms, which produces predictions that are, asymptotically, at least as good as those of the best input algorithm (van der Laan and Dudoit 2003; van der Laan et al. 2007; van der Laan and Rose 2011). In this section, we first discuss the motivation behind the Super Learner, then describe how the Super Learner works, and finally mention briefly some of the theoretical guarantees the Super Learner provides.
Motivation
The problem that the Super Learner solves is quite simple: We never know, a priori, which prediction or machine learning method will work best for any given data set.
Countless prediction methods are available in standard software: parametric regression models, nearest neighbor matching, kernel methods, splines, generalized additive models, partially linear models, ridge regression, deep learning, lasso, CARTs, multivariate adaptive regression splines, neural networks, boosting, random forests, and many others. Even if we decide to use just a single algorithm, there are many other analysis choices to make, for example, which variables to include and how to include them. Further, it is always necessary to select tuning parameter values that indicate how much to smooth the data (i.e., how to trade off bias and variance). For parametric regression, we must decide whether to include higher-order terms and interactions, and for which variables; for nearest neighbor methods, we must decide how many neighbors to match (as well as whether and how this number should change across matches); for kernel methods, we must pick bandwidth values; for splines and generalized additive models, we need degrees of freedom for each term; for lasso, we need to select the tuning parameter controlling how coefficients are penalized; for random forests, we need to decide how many variables to split on, as well as leaf size and number of trees, and so on.
If we select these parameters to yield low-variance predictions, then we will typically also get bias (oversmoothing) while if we select them to yield small bias, then the predictions will typically be highly variable and noisy (undersmoothing). Even though default values of tuning parameters are often provided with standard software, there is no reason to think that these will necessarily give good performance for any new data set. Table 1 lists some popular prediction methods and a few corresponding tuning parameters for each.
Table 1. Popular Prediction Methods and Some Tuning Parameters.
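The bias-variance trade-off described above can be made concrete with a small simulation. The following is a minimal sketch in Python (used here only for illustration; it is not the authors' R code), assuming hypothetical toy data from a sine curve with Gaussian noise and a hand-written k-nearest neighbors predictor. The three values of k are arbitrary choices meant to represent undersmoothing, a moderate setting, and oversmoothing.

```python
import math
import random
from statistics import mean

def knn_predict(train_x, train_y, x, k):
    """Average the outcomes of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return mean(train_y[i] for i in nearest)

def simulate(n, rng):
    """Hypothetical toy data: a sine curve plus Gaussian noise (sd = 0.5)."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.5) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = simulate(200, rng)
test_x, test_y = simulate(200, rng)

# k = 1 undersmooths (noisy), k = 150 oversmooths (biased), k = 10 sits in between.
test_mse = {}
for k in (1, 10, 150):
    errors = [(y - knn_predict(train_x, train_y, x, k)) ** 2
              for x, y in zip(test_x, test_y)]
    test_mse[k] = mean(errors)

for k, value in test_mse.items():
    print(f"k = {k:3d}: out-of-sample MSE = {value:.3f}")
```

In this toy setting the moderate choice of k should give the smallest out-of-sample error, but which value is best depends on the data, which is the point of the passage above.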
So how do we decide among the abundance of algorithms available in the literature? And for a given algorithm, how do we decide on tuning parameter values and other analysis choices? It is simply impossible for human analysts to make these decisions intelligently on their own. Fortunately, these choices do not have to be made in a nonprincipled fashion; the data can tell us how to decide, specifically via the Super Learner and cross-validation.
Implementation
Super Learner is based on the idea that, to obtain the best predictions, we should test different approaches (on new data, to prevent overfitting) and find the best one or best combination. More specifically, the standard implementation of the discrete Super Learner algorithm (which merely picks the best performing input algorithm rather than the best combination) proceeds as follows:
1. Split the data set into B blocks (B = 10 is a common choice).
2. Fit all K methods on blocks 2 through B, excluding block 1.
3. Generate predictions for block 1 and estimate the corresponding mean squared error (MSE) for each method.
4. Repeat steps 2–3 (B − 1) times, leaving out each block j = 2, 3, …, B in turn.
5. Use predictions from the method with the smallest average estimated MSE across blocks.
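The five steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation in the R package "SuperLearner": the three candidate learners (a mean predictor, a straight-line fit, and k-nearest neighbors), the single-predictor toy data, and all function names are assumptions made for the example.

```python
import random
from statistics import mean

# Each hypothetical learner takes training data and returns a prediction function.
def fit_mean(xs, ys):
    m = mean(ys)
    return lambda x: m

def fit_line(xs, ys):
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: ybar + slope * (x - xbar)

def fit_knn(xs, ys, k=5):
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return mean(y for _, y in nearest)
    return predict

def cv_mse(learners, xs, ys, n_blocks=10, seed=0):
    """Steps 1-4: average out-of-sample MSE across blocks for each learner."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    blocks = [idx[b::n_blocks] for b in range(n_blocks)]  # step 1
    scores = {name: [] for name in learners}
    for held_out in blocks:                               # step 4 loops over blocks
        held = set(held_out)
        train = [i for i in idx if i not in held]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        for name, fit in learners.items():
            predict = fit(tx, ty)                         # step 2
            block_mse = mean((ys[i] - predict(xs[i])) ** 2 for i in held_out)
            scores[name].append(block_mse)                # step 3
    return {name: mean(values) for name, values in scores.items()}

def discrete_super_learner(learners, xs, ys, n_blocks=10):
    """Step 5: pick the learner with the smallest cross-validated MSE, refit on all data."""
    risks = cv_mse(learners, xs, ys, n_blocks)
    best = min(risks, key=risks.get)
    return best, risks, learners[best](xs, ys)

# Hypothetical toy data with a nonlinear (quadratic) signal.
rng = random.Random(1)
xs = [rng.uniform(-3, 3) for _ in range(200)]
ys = [x ** 2 + rng.gauss(0, 0.5) for x in xs]

learners = {"mean": fit_mean, "line": fit_line, "knn": fit_knn}
best, risks, predictor = discrete_super_learner(learners, xs, ys)
print(best, {name: round(r, 3) for name, r in risks.items()})
```

Because the toy signal is quadratic, the flexible k-nearest neighbors candidate should be selected here; with a linear signal the straight-line fit would likely win instead, which is why the choice is left to cross-validation.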
This is essentially the typical cross-validation scheme, except it chooses among many different methods (possibly indexed by various tuning parameter choices), rather than only among different tuning parameter choices for a single method. (At the end of this subsection, we give more details on the relationship between the Super Learner and previously developed cross-validation methods.) Note that the Super Learner scheme penalizes overfitting since methods are always trained and tested on different data (i.e., estimation of MSE is based on out-of-sample predictions). For reference, the estimated MSE for method k can be written as:
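$$\widehat{\mathrm{MSE}}_k = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{n_b} \sum_{i \in \text{block } b} \left\{ Y_i - \hat{\mu}_{k,-b}(X_i) \right\}^2,$$
where $n_b$ is the number of observations in block $b$, $Y_i$ and $X_i$ are the outcome and predictors for observation $i$, and $\hat{\mu}_{k,-b}$ denotes method $k$ fit on all blocks except block $b$.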
The algorithm described above is called the discrete Super Learner because it uses predictions from the single best performing method. For the full Super Learner algorithm, which finds the best weighted combination of methods, we perform steps 1–4 above but replace step 5 with:
5′. Use a weighted average of predictions from all K methods, with weights equal to the coefficients from a regression of outcomes on the out-of-sample predictions from step 3.
Note that, although finding the best weighted combination of methods sounds computationally intensive, only one extra regression fit is required after steps 1–4. So, the full Super Learner is barely more computationally intensive than the discrete version. And the benefits of using the full Super Learner can be substantial since, even if none of the individual methods yields decent predictions, it is possible that a weighted average of them performs well (other ensemble methods such as boosting are based on this phenomenon as well).
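The weighting step can be illustrated for the simplest case of two candidate methods. The sketch below is in Python and rests on stated assumptions: the weights are constrained to be nonnegative and to sum to one (a convex combination, which is a choice made for this example rather than something specified above), and the out-of-sample predictions are made-up numbers. With two methods, the least squares weight has a closed form, so no regression software is needed.

```python
from statistics import mean

def convex_weight(y, p1, p2):
    """Weight a in [0, 1] minimizing sum((y - (a*p1 + (1 - a)*p2))**2).

    Writing the residual as (y - p2) - a*(p1 - p2) gives the least squares
    solution a = sum((y - p2)*(p1 - p2)) / sum((p1 - p2)**2), clipped to [0, 1].
    """
    num = sum((yi - b) * (a - b) for yi, a, b in zip(y, p1, p2))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    if den == 0:  # identical predictions: any weight gives the same fit
        return 0.5
    return min(1.0, max(0.0, num / den))

def mse(y, p):
    return mean((yi - pi) ** 2 for yi, pi in zip(y, p))

# Hypothetical out-of-sample predictions: method 1 is always 1 too low,
# method 2 is always 1 too high.
y = [1.0, 2.0, 3.0, 4.0]
p1 = [0.0, 1.0, 2.0, 3.0]
p2 = [2.0, 3.0, 4.0, 5.0]

a = convex_weight(y, p1, p2)
combined = [a * u + (1 - a) * v for u, v in zip(p1, p2)]
print(a, mse(y, p1), mse(y, p2), mse(y, combined))
```

In this contrived case each method has an MSE of 1, while the equally weighted average recovers the outcomes exactly, a small-scale version of the point that a combination can outperform every individual method.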
More details on implementing the Super Learner algorithm, as well as extensive simulation-based evaluations, can be found in van der Laan and Dudoit (2003), van der Laan et al. (2007), and van der Laan and Rose (2011). Programming information is provided in the help files for the R package “SuperLearner.” We also give example R code in the Online Appendix.
In predicting prison violence, we selected the following algorithms: random forests (Breiman 2001), boosting (Freund and Schapire 1997), Bayesian additive regression trees (Chipman, George, and McCulloch 2010), generalized additive models (Hastie and Tibshirani 1986), multivariate adaptive regression splines (Friedman 1991), lasso (Tibshirani 1996), logistic regression, k-nearest neighbors, and a simple mean prediction. We chose to illustrate the Super Learner with these methods because they are diverse, popular, freely available in R software [version 3.3.3], and relatively computationally efficient. Other variations and different methods could of course be added as well. Further, we applied these methods in conjunction with four different variable selection or screening approaches: no screening (so all variables were included), marginal correlation screening (where only variables significantly correlated with the outcome at level α = 0.01 were included), and finally facility-based and inmate-based variables, separately. We did not incorporate screening for random forests, boosting, or Bayesian additive regression trees since these approaches can be viewed as having built-in variable selection. Detailed reviews of the prediction methods we used can be found elsewhere, for example, in Hastie, Tibshirani, and Friedman (2009).
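Marginal correlation screening of the kind mentioned above can be sketched as follows. This Python example is a rough illustration under explicit assumptions: the exact test behind "significantly correlated" is not specified here, so the sketch uses a Pearson correlation with a large-sample normal approximation to the usual t statistic, and the variable names and toy data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean

def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def screen_by_correlation(columns, outcome, alpha=0.01):
    """Keep variables whose correlation with the outcome has p < alpha.

    Assumption: two-sided p-value from a normal approximation to the
    statistic t = r * sqrt((n - 2) / (1 - r^2)), reasonable for large n.
    """
    n = len(outcome)
    kept = []
    for name, values in columns.items():
        r = pearson_r(values, outcome)
        if abs(r) >= 1.0:
            kept.append(name)  # perfectly correlated
            continue
        t = r * sqrt((n - 2) / (1 - r * r))
        p_value = 2 * (1 - NormalDist().cdf(abs(t)))
        if p_value < alpha:
            kept.append(name)
    return kept

# Hypothetical toy data: "signal" tracks the outcome, "noise" is unrelated to it.
outcome = [i % 2 for i in range(100)]
columns = {
    "signal": [o + 0.1 * ((i // 2) % 2) for i, o in enumerate(outcome)],
    "noise": [(i // 2) % 2 for i in range(100)],
}
kept = screen_by_correlation(columns, outcome)
print(kept)
```

In a full Super Learner library, each screening rule would be paired with each prediction method, so that a screened and an unscreened version of the same method compete as separate candidates.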
As hinted at earlier, in some ways, the Super Learner is conceptually very similar to previous cross-validation-based methods, especially the stacking approaches of Wolpert (1992) and Breiman (1996). However, previous proposals and theoretical results were often limited to tuning parameter selection for specific algorithms (e.g., neural networks) in specific settings (e.g., regression). In contrast, van der Laan and Dudoit (2003) gave results on cross-validation in great generality, for use with many methods and in many settings, including, for example, finite sample and asymptotic results for general loss functions that contain unknown nuisance functions. (We will briefly mention some of these results in the next subsection.) A more thorough discussion of how the Super Learner relates to other work on cross-validation and ensemble learning methods is given in chapter 3 of van der Laan and Rose (2011).
Theoretical Guarantees
So far, we have discussed the motivation behind the Super Learner as well as how it is implemented. However, we have yet to give any of the theoretical results justifying why it works, which is our task in this subsection. In short, van der Laan and Dudoit (2003) derived results showing that cross-validation selectors like the Super Learner behave like an “oracle” that knows the true distribution of the data, both asymptotically and in finite samples.
More specifically, let
denote the cross-validation selector described previously, which picks the best performing method k (based on estimates of out-of-sample MSE) from a set
averages the true MSE (rather than estimated MSE) across blocks. The notation
The foundational oracle inequality result from van der Laan and Dudoit (2003) states that, as long as the outcome Y and estimators
where C∊ is a constant depending on ∊ (but not n), and
The inequality (1) bounds how far away the expected risk can be for the cross-validation selector
This ratio compares the optimal parametric rate of convergence (in the numerator) to that of the oracle selector (in the denominator). The numerator represents a parametric benchmark since the MSE for a well-behaved parametric model (e.g., correctly specified logistic regression) scales like 1/n. In the following discussion, we will rely on order notation, where
If the oracle converges at a fast parametric rate (e.g.,
This means the cross-validation selector
This means that, asymptotically, the cross-validation selector
Illustration: Prison Violence
We illustrate the use of the Super Learner by predicting violence among inmates from aggregate information about correctional facilities and their inmates. Over the past 40 years, mass incarceration in the United States has drastically weakened prisons’ capacities to ensure inmate safety, yet we know little about the characteristics of prisons related to inmate victimization.
Despite an emerging political consensus on the need to reduce incarceration, the current conditions in American prisons are still largely a product of the “penal harm” movement popular among politicians in the 1980s and 1990s—the idea that prisons should be no-frills environments focused on retribution rather than rehabilitation (Clear 1994; Finn 1996). But despite the worrisome trends in the quality of correctional facilities, most research on prison violence has focused on the individual-level characteristics of inmates rather than the characteristics of prisons (Steiner, Butler, and Ellison 2014). Yet prison violence also results from the physical and social context of confinement (Bottoms 1999; Steiner et al. 2014). Because of a diversity of methodological approaches and types of data and measures used across studies, there is relatively little agreement about characteristics measured at the prison level that predict violence among inmates. Here, we discuss the most frequently identified predictors that include the characteristics that describe inmates (e.g., proportion white inmates) and the characteristics of correctional facilities (e.g., whether a prison is managed privately).
Evidence from experimental and quasi-experimental studies points to programming, broadly conceived, as a critical predictor of inmate institutional behavior (French and Gendreau 2006; Gaes et al. 1999). As a result of participating in prison programs, inmates become more skilled in avoiding violent confrontation, especially if these programs address the underlying causes of aggressive behavior such as substance abuse and other psychological disorders (Kinlock, O’Grady, and Hanlon 2003; Pearson and Lipton 1999; Landenberger and Lipsey 2005). Further, some reviewers have concluded that a variety of programs, such as vocational training, are successful in preventing prison violence despite not being designed with the primary goal of increasing prison safety (Byrne and Hummer 2008). A more general explanation for these findings is that time spent in structured activities reduces the time inmates are exposed to risky situations.
Research on crowding has produced mixed findings, although studies seem to tip in favor of a positive association with inmate violence (Franklin, Franklin, and Pratt 2006; Gendreau, Goggin, and Law 1997; Lahm 2008; Wooldredge, Griffin, and Pratt 2001). Much of the research has been motivated by the classic deprivation model, in which responsibility for inmate behavior is assigned to environmental influences rather than to inmates’ preexisting characteristics (Sykes 2007). The main mechanisms implicated in the link between crowding and violence are the elevated stress due to dense surroundings and the reduced ability of staff to monitor and prevent inmate misconduct (Gaes 1994; Haney 2006). The studies that found null or negative effects have suggested that prison administrators respond to crowding by instituting measures to increase the safety of staff and inmates (e.g., Camp et al. 2003; McCorkle, Miethe, and Drass 1995; Walters 1998). Despite its contested role in prison life, crowding is a relevant factor in understanding violence among inmates.
A phenomenon occurring in parallel with mass incarceration has been the emergence of private prisons. In the wake of the prison boom, policy makers have encouraged privatization with the hope it will reduce the costs while preserving—or improving—the number and quality of correctional services (Harding 2001; Selman and Leighton 2010). About 10 percent of the inmate population is currently managed by firms contracting with state or federal governments. Concerns about privatization in the criminal justice system echo the larger issues in the ongoing debates about privatizing what have traditionally been public services (e.g., Stuckler and Basu 2013). The cost-conscious private prisons, it has been argued, may select less dangerous inmates who require fewer medical and security resources (Simon 1992). But given the many material constraints, public prisons also have to compromise between the amount and quality of services they provide and the increasing costs of incarceration. As an argument for private prisons, Logan (1992) pointed out that corporations may invest more in inmate safety to assuage public fears about the privatization of punishment.
Our assessment of prior research suggests a relative diversity of findings but also some consistency in identifying predictors important for predicting inmate violence. In addition to programming and crowding, other aspects of facilities have also been linked to prison misconduct and violence though less consistently. These variables include the size of the prison population, prison age, and the security status of the facility (Beijersbergen et al. 2014; Gonçalves et al. 2014). We also include these variables, described in detail in the next section, in our prediction model.
Data
Sample
The administrative data on prisons are drawn from the 2005 Census of State and Federal Adult Correctional Facilities that surveyed prison administrators by mail using a self-administered questionnaire. The survey included facilities housing state or federal prisoners (excluding jails and facilities with specialized functions such as military facilities and immigration detention centers). In 2005, there was a 100 percent response rate except for missing information on all Illinois facilities and staffing data for state-administered facilities in California. All the facilities from those states were removed from the sample. We also removed local facilities, those under joint state and local authority management, and facilities whose main functions are medical or youth centered. To reduce noise, our final sample selection stage included only prisons that had more than 200 inmates. The outcome is not meaningful for smaller prisons where high or low assault rates can occur simply by chance. The final analysis sample included 805 prisons. Data missing on covariates were imputed using the Super Learner.
Measures
The outcome variable is based on the number of inmate-on-inmate assaults assessed using the following question: “Between January 1, 2005, and December 30, 2005, how many inmate-inflicted physical or sexual assaults occurred on other inmates in this facility?” To take into account prison size (since prisons with more inmates have more opportunities for assaults), we defined prisons with more than 15 assaults per 1,000 inmates as having elevated assault rates. This definition yields 248 prisons (31 percent) as having elevated assault rates.
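The outcome definition above amounts to a simple rate calculation. The following Python sketch shows it with hypothetical facility counts; the function name is an assumption made for illustration.

```python
def elevated_assault_rate(assaults, inmates, threshold_per_1000=15.0):
    """True if the facility has more than `threshold_per_1000` assaults per 1,000 inmates."""
    rate = 1000.0 * assaults / inmates
    return rate > threshold_per_1000

# Hypothetical facilities: (assaults, inmates)
print(elevated_assault_rate(12, 600))    # 20 per 1,000
print(elevated_assault_rate(3, 400))     # 7.5 per 1,000
print(elevated_assault_rate(15, 1000))   # exactly 15 per 1,000 is not "more than 15"

# Share of the analysis sample classified as elevated, from the counts reported above.
share_elevated = round(100 * 248 / 805)
print(share_elevated)
```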
The characteristics of the facilities included the following binary variables: whether the prison is managed privately or publicly, whether work is available to inmates (e.g., manufacturing license plates), whether the prison is under court order for conditions of confinement, and whether the prison is under state or federal jurisdiction. We also used binary covariates that indicate what types of programs the prison makes available to its inmates. The three types of counseling programs were focused on psychological, HIV/AIDS, and substance use problems. We also included a measure describing the availability of educational programs. Prison age was measured in years. Crowding was measured as the ratio between the total number of inmates in the facility and its design capacity. Other continuous variables included proportion of full-time staff, number of inmates, and staff-to-inmate ratio (truncated to the 99th percentile). The one categorical measure was the region in which the prison is located (Northeast, Midwest, South, or West).
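Two of the continuous measures above involve small computations: the crowding ratio and truncation at the 99th percentile. The Python sketch below illustrates both with hypothetical values. Percentile definitions differ across software, so the use of the standard library's default `statistics.quantiles` method is an assumption of this example, not a description of how the analysis was carried out.

```python
from statistics import quantiles

def crowding_ratio(inmates, design_capacity):
    """Total number of inmates divided by the facility's design capacity."""
    return inmates / design_capacity

def truncate_at_percentile(values, pct=99):
    """Cap values at the given percentile; returns (truncated values, cap)."""
    cap = quantiles(values, n=100)[pct - 1]  # index 98 holds the 99th percentile
    return [min(v, cap) for v in values], cap

# A hypothetical facility holding 1,200 inmates in space designed for 1,000.
print(crowding_ratio(1200, 1000))

# Hypothetical staff-to-inmate ratios with one extreme value.
ratios = [0.2 + 0.001 * i for i in range(199)] + [3.0]
truncated, cap = truncate_at_percentile(ratios)
print(round(cap, 4), max(truncated) == cap)
```

Truncation of this kind keeps every facility in the sample while limiting the influence of a few extreme ratios.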
Inmate characteristics include binary variables indicating whether there are any juvenile inmates in the facility, inmates on death row, inmates who are not citizens of the United States, and whether there were any reports of inmates assaulting staff over the past 12 months. We also included measures describing the proportion of inmates in the facility who are white and the proportion of veterans. Other predictors included three-level categorical variables indicating whether the prison houses only men, only women, or both sexes as well as a measure describing the security designation of the prison (minimum, medium, and maximum).
Results
Summary statistics for predictors used in the analysis are presented in Table 2. Slightly fewer than 10 percent of prisons are managed entirely by private corporations. The vast majority of prisons offer some type of work to their inmates. About one in five prisons is under court order for conditions of confinement. Most prisons offer health-related counseling, especially programs focused on substance use. Other characteristics of prisons in our sample that we would like to highlight are that prisons tend to be overcrowded, about half are located in the South, and very few have inmates who are on death row. It is important to note here that we have made a number of decisions to facilitate the analysis, which have resulted in a smaller sample. Our analysis sample therefore does not reflect all the prisons available in the Census.
Table 2. Summary Statistics for Predictors of Prison Violence.
Source: 2005 Census of State and Federal Adult Correctional Facilities (N = 805).
Table 3 shows results from our analysis predicting whether prisons had elevated assault rates, based on a variety of different approaches. For each method, Table 3 shows what variables the method was based on, the cross-validated estimate of risk/MSE (and its standard error), the percent improvement (based on risk) of each method relative to a standard logistic regression with only main effects, as well as two other measures of predictive accuracy: area under the receiver operating characteristic (ROC) curve and prediction error. The methods are ranked by estimated risk. Plots of the risk and ROC curves are given in Figures 1 and 2. There are a number of important results to highlight in this analysis; we will focus on four.
Table 3. Methods Used in the Super Learner Ordered by Prediction Performance.
Source: 2005 Census of State and Federal Adult Correctional Facilities (N = 805). Note: Methods: bart = Bayesian additive regression trees; earth = multivariate adaptive regression splines; gam = generalized additive model; gbm = boosting; glm = logistic regression; glmnet = lasso; knn = k-nearest neighbors; mean = average outcome; randomForest = random forests.
Other abbreviations: cor(p < .01) = variables significantly correlated with outcome at level α = .01; AUC = area under ROC curve.
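The three accuracy measures reported in Table 3 can be computed from predicted probabilities and observed binary outcomes. The Python sketch below uses a handful of made-up predictions. The 0.5 classification cutoff for prediction error is an assumption of this example, since the cutoff is not stated here, and the AUC is computed directly as the share of positive-negative pairs ranked correctly (ties count one half).

```python
def brier_score(y, p):
    """Risk for a binary outcome: mean squared error of predicted probabilities."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)

def auc(y, p):
    """Probability that a randomly chosen positive case is ranked above a negative one."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def prediction_error(y, p, cutoff=0.5):
    """Share of cases misclassified when predicting 1 if p >= cutoff (assumed cutoff)."""
    wrong = sum((pi >= cutoff) != (yi == 1) for yi, pi in zip(y, p))
    return wrong / len(y)

# Hypothetical outcomes (1 = elevated assault rate) and predicted probabilities.
y = [1, 0, 1, 0, 0]
p = [0.9, 0.2, 0.4, 0.6, 0.1]

print(round(brier_score(y, p), 3))   # 0.156
print(round(auc(y, p), 3))           # 0.833
print(prediction_error(y, p))        # 0.4
```

The measures can rank methods differently because risk rewards well-calibrated probabilities, AUC depends only on the ordering of predictions, and prediction error depends on the chosen cutoff.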

Figure 1. Plot of risk/MSE/Brier score.

Figure 2. Plot of ROC curves.
First, as expected based on theoretical properties, the Super Learner (both discrete and weighted) performed as well as the best single input algorithm, which in this case was random forests. Estimated risk, area under ROC curve (AUC), and prediction error were statistically indistinguishable among the discrete and weighted Super Learner and random forests and equaled approximately 0.16, 79 percent, and 23 percent, respectively. In this analysis, weighting combinations of learners did not provide an advantage over simply picking the best (as with discrete Super Learner).
Second, there was much predictive accuracy to be gained over logistic regression, which is the most commonly used prediction method in practice. In particular, logistic regression yielded an estimated risk of nearly 0.19, and Super Learner improved upon this by almost 15 percent. Similarly, the AUC for logistic regression was 68.5 percent, a full 10 percentage points less than that of the Super Learner. Logistic regression did perform better than some algorithms, such as k-nearest neighbors and methods based on fewer variables; it was in the middle of the pack in terms of risk and other measures, as can be seen in both Table 3 and Figure 1. However, there was a lot of room in the data for improving upon logistic regression, and the Super Learner was able to recognize this and adapt accordingly.
Third, as might be expected, the best performance was achieved by incorporating all variables in the analysis; the average risk of methods that used all variables was 0.181, and 9 of the top 10 performing methods overall relied on all variables. The methods that screened variables based on correlation tests gave the next best performance, with an average risk of 0.188. On average, the worst performance was achieved by relying on only either facility-based or inmate-based variables, with facility variable–based methods edging out inmate variable–based methods in terms of average risk, 0.198 to 0.201. Thus, it was not clear from our analysis whether inmate or facility variables mattered most; in fact, this depended somewhat on the method that was used (inmate-based variables were most useful for generalized additive models, multivariate adaptive regression splines, and k-nearest neighbors, whereas facility-based variables were most useful for lasso and logistic regression). In the Online Appendix, we present variable importance results based on random forests, the best performing algorithm in our application (as discussed next).
Finally, some algorithms that did well in our analysis perform poorly in other data sets, while some algorithms that did poorly in our analysis perform well in others. For example, random forests does very well for our data, but van der Laan and Rose (2011) show that in another data set, it performs more than ten times worse (in terms of risk) than logistic regression. Boosting often performs well, but in our setting, it lags behind random forests; similarly, k-nearest neighbors performs very poorly in our setting, but van der Laan and Rose (2011) show that for a popular high-dimensional microarray data set, it does nearly as well as Super Learning. All of these results point to the need for the Super Learner in prediction problems: In any given data set, we simply have no idea which of the state-of-the-art machine learning algorithms will perform best, which will perform poorly, or whether a simple model such as logistic regression will be sufficient. Super Learning lets us use the data to combine algorithms in an optimal way, so as to perform asymptotically as well as a mathematical “oracle” that knows the hidden true features of the data.
Discussion
Although machine learning has not yet entered the mainstream of quantitative social science research, it has become increasingly popular in many disciplines for both prediction and causal inference purposes (Glymour et al. 2013; McFarland, Lewis, and Goldberg 2016). New algorithms—or tweaks to existing ones—are appearing almost daily with little principled guidance on which algorithm will perform best for a particular study. In this article, we described the implementation, theoretical foundations, and the motivation behind the Super Learner (van der Laan et al. 2007). Compared to making largely arbitrary decisions in choosing algorithms in the hope they will perform well, the Super Learner uses cross-validation to evaluate multiple algorithms and tuning parameters to identify the one or a combination of many that produces superior forecasts. We illustrated the use of the Super Learner by predicting violence among prison inmates.
Most importantly, the Super Learner was at the top of the list of algorithms used in our study. We also found a strong performance by random forests, in line with prior research in criminal justice settings (Berk, Kriegler, and Baek 2006; Berk and Bleich 2013). Yet, this result might simply be due to the idiosyncrasies of our particular data set and the outcome variable we were predicting. There are no guarantees that the random forest algorithm, or any other individual learner for that matter, will consistently be at the top in any other application of the Super Learner. For instance, in a population-based study predicting mortality from physical activity, random forest was one of the worst performing algorithms, surpassed by lasso, generalized boosted regression, and even parametric main-terms logistic regression (Rose 2013). Within the constraints of available computational capabilities, researchers should consider as many algorithms as possible.
Comparative evaluations and applications of machine learning approaches in sociology and criminology have mainly involved comparing two or a handful of algorithms. Further, some algorithms—such as random forest and CART—have received a great deal of attention while others much less, and comparative evaluations have typically used logistic regression as the basis for illustrating the advantages of machine learning approaches (e.g., Berk and Bleich 2013; Ngo et al. 2015; Hamilton et al. 2014). These studies are useful, but they might leave the reader with a misleading conclusion that the one algorithm that performs best in a particular comparison will perform better than any other algorithm and in any other data setting. As others have suggested, direct comparison between multiple methods is imperative (Bushway 2013). The value of the Super Learner is that the choice of which algorithm will perform best on any given data set is answered empirically in a principled fashion—while considering many different candidates.
For the most part, machine learning methods are being used for prediction purposes, but a growing body of research is using data-adaptive approaches to identify causal effects from observational data. In simulation studies, the Super Learner has been successful in applications that involve accounting for confounding and selection bias by predicting assignment into treatment conditions. In propensity score matching analysis, the Super Learner performed better than logistic regression, the most popular approach for propensity score estimation, both in terms of reducing bias from misspecified models and for improving balance, especially for highly unbalanced covariates (Pirracchio, Petersen, and van der Laan 2015). In longitudinal settings, the Super Learner is exceptionally useful for estimating inverse-probability-of-treatment weights in marginal structural models (Neugebauer, Schmittdiel, and van der Laan 2014; Neugebauer et al. 2013). The methodological applications and extensions of the Super Learner are still a rapidly evolving field with great promise for practical application in causal inference problems.
Substantively, findings from our study may be placed in a theoretical context from multiple perspectives. The sociology of imprisonment has long been informed by many of the foundational ideas in the discipline (Crewe 2013). Some have suggested that prison violence is best explained from a social control perspective (e.g., Steiner 2009). The extent of formal control available, represented with variables such as the proportion of full-time staff in the facility, is related to how well the prison can be monitored and potential violence deterred. Relatedly, it is important to consider the racial and ethnic heterogeneity of inmates. In community studies, racial and ethnic heterogeneity may weaken social capital and introduce social disorganization (Sampson and Groves 1989; Shaw and McKay 1942). Others have argued, based on general strain theory, that the deprivation and frustration associated with incarceration are key drivers of institutional misconduct (e.g., Morris et al. 2012). Our findings cannot adjudicate between these explanations, but they do suggest that characteristics of both inmates and facilities are important in predicting prison violence, rather than one set being more important than the other.
We examined prison violence mainly as an illustration of the Super Learner, but it is important to point out some of the limitations of our analysis. The Census does not have any additional information about the incidents (e.g., whether they involved injuries), and many incidents are not reported to the correctional authorities (Steiner and Wooldredge 2014). Most likely, the Census documents only the most serious offenses, which leave less reporting discretion to administrators (Bierie 2012). Prison administrators may also have wanted to present their facilities in a better light and therefore minimized the number of assaults they reported. Further, while prisons may report that they make different types of programs available, the Census says nothing about their quality or content, or about how many inmates have accessed them. Finally, there are other measures of prison conditions that the Census was not able to capture but that are relevant for predicting prison violence, such as noise levels, lack of privacy, and extent of gang activity in a particular facility, as well as inmate satisfaction with staff (Bierie 2012; Gonçalves et al. 2014; Wolff, Shi, and Siegel 2009). Future research should consider these and other relevant variables that might be more challenging to measure reliably but that may play an important role in understanding the origins of inmate violence.
One limitation of machine learning tools is the sometimes overwhelming number of tuning parameters and available learners, all of which require some level of expert competency. The Super Learner addresses this limitation directly, since it automates the choice of tuning parameters and learning algorithms via cross-validation. Another limitation, one that the Super Learner shares, is computational complexity: depending on the number of covariates and the sample size, as well as which and how many learners are included, cross-validation-based methods can be computationally expensive. However, computing power continues to increase, and with a reasonable choice of algorithms the Super Learner can often be quite fast. For example, our Super Learner analysis took approximately five minutes to run on a standard laptop.
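The selection-by-cross-validation idea, and the reason its cost grows with the number of learners and folds, can be sketched in a few lines. The example below is a simplified "discrete" version that picks the single candidate with the lowest cross-validated mean squared error; it does not estimate the weighted combination of learners, and it is not the code used in our analysis. The three candidate learners, the toy data, and all function names are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean


def fit_mean(xs, ys):
    """Candidate 1: ignore x and always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Candidate 2: simple least-squares line y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    return lambda x: a + b * x


def fit_knn(xs, ys, k=5):
    """Candidate 3: average the outcomes of the k nearest training points."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return mean(y for _, y in nearest)

    return predict


def cv_risk(fit, xs, ys, n_folds=5, seed=0):
    """V-fold cross-validated mean squared error of one candidate learner."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    sq_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Each candidate is refit once per fold, so the total cost scales
        # with (number of learners) x (number of folds).
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_errors += [(ys[i] - model(xs[i])) ** 2 for i in held_out]
    return mean(sq_errors)


def discrete_super_learner(library, xs, ys, n_folds=5, seed=0):
    """Return the lowest-risk candidate's name, all CV risks, and its refit."""
    risks = {name: cv_risk(fit, xs, ys, n_folds, seed) for name, fit in library.items()}
    best = min(risks, key=risks.get)
    return best, risks, library[best](xs, ys)


# Toy data (hypothetical): a nonlinear signal plus noise.
rng = random.Random(1)
xs = [rng.uniform(-3, 3) for _ in range(200)]
ys = [x ** 2 + rng.gauss(0, 0.5) for x in xs]

library = {"mean": fit_mean, "linear": fit_linear, "knn": fit_knn}
best, risks, model = discrete_super_learner(library, xs, ys)
print(best, {name: round(r, 3) for name, r in risks.items()})
```

Because every candidate is evaluated on the same folds, the comparison of cross-validated risks is like for like, and adding another learner to the library only adds its own fitting cost.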
Conclusion
There is now a growing appreciation across the social sciences for the value of data-adaptive methods for discovering knowledge from either big or small data sets. In one of the first social science applications to date, we introduced the Super Learner, a powerful and principled approach to learning from data based on cross-validation. We encourage sociologists and researchers from related disciplines to use the Super Learner when there is little guidance, if any, on what prediction algorithm will perform best for their particular research question. The promise of the Super Learner does not lie only in prediction; it can be seamlessly combined with popular methods for estimating causal effects from observational data such as propensity score matching and marginal structural modeling. The Super Learner may be especially helpful when predicting behaviors of individuals in settings where norm breaking is costly, both in terms of human life and material damage. In this article, we demonstrated how the Super Learner can be applied to modeling prison violence, a pressing issue in times of an unprecedented rate of incarceration.
Supplemental Material
online_Appendix for Principled Machine Learning Using the Super Learner: An Application to Predicting Prison Violence by Valerio Baćak and Edward H. Kennedy in Sociological Methods & Research
Online_Appendix_Figure for Principled Machine Learning Using the Super Learner: An Application to Predicting Prison Violence by Valerio Baćak and Edward H. Kennedy in Sociological Methods & Research
Footnotes
Authors’ Note
A previous version of this manuscript was presented at the Annual Meeting of the American Society of Criminology in 2015.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
References
