Abstract
Contemporary psychological research increasingly involves machine-learning techniques, including random forests, for their capability in analyzing complex, high-dimensional data sets and modeling nonlinear predictive relations. In this article, we provide a comprehensive review of random-forest methods in psychological research. We begin by introducing the fundamental concepts of decision trees, followed by the theoretical framework of random forests as an ensemble method. Next, we review the methodological development and commonly used software tools for random-forest models. We discuss the practical issues and challenges when implementing random forests in psychological studies. Importantly, we then systematically review the empirical psychological research articles published between 2020 and 2022 that used random forests; we summarize the applications of random forests, with a special emphasis on data structure, software implementation, hyperparameter tuning, and approaches for handling missing data. By synthesizing the theoretical foundation and current empirical practices, in this article, we identify significant methodological gaps in applying random forests to psychological data and hope to initiate much needed conversations on how psychologists can effectively use the random-forest method to advance psychological science.
In recent years, there has been a growing interest in using machine-learning techniques to better understand complex biological, neurological, and behavioral data in psychological studies (Dwyer et al., 2018; Orrù et al., 2020; Rosenbusch et al., 2021; Sleek, 2023; Vélez, 2021; Yarkoni & Westfall, 2017). Among the many machine-learning algorithms, random forest (RF; Breiman, 2001) has become a popular analytical tool for classification, prediction, and feature selection. RF is a nonparametric ensemble learning algorithm that combines multiple decision trees to form a powerful committee of predictive models. RFs are known for their great flexibility in handling large high-dimensional data and complex nonlinear relations, which can otherwise be challenging using traditional parametric statistical methods (Malley et al., 2018; Ryo & Rillig, 2017; Touw et al., 2013). RFs are also known for their advantage in reducing overfitting (Gashler et al., 2008; Segal, 2004) and handling a large volume of variables with a relatively modest sample size (Fife & D’Onofrio, 2023; Matsuki et al., 2016). The RF method has been applied broadly across various areas of psychological studies, such as predicting mental-health outcomes, assessing behavioral patterns, and analyzing neuroimaging data. For instance, among the many applications, researchers have used RFs to identify key predictors of depression (e.g., Betz et al., 2022; Qiu et al., 2022), predict student academic performance (e.g., Wang et al., 2023), and detect dementia based on brain activity (e.g., Ye et al., 2023).
However, despite their increasing popularity, the use of RFs in psychological studies is not without challenges. Methodological evaluations of their performance in psychological-research settings remain limited. In particular, it has not been systematically studied whether RFs, which were originally designed for “big data” scenarios, perform well under the constraints that are typical in psychological-study contexts, such as smaller sample sizes, imbalanced data, and the ubiquitous presence of missing data. The practical implementation of RFs also requires careful consideration of hyperparameter tuning and the choice of software tools that have varying capacities. Given these considerations, in this article, we aim to provide a comprehensive review of RF methods in psychological research, including their theoretical framework, methodological development, commonly used software tools, and applications in empirical studies, in the hope of revealing some of the challenges that researchers encounter in practice.
This article is structured as follows. We begin by introducing the basic concepts of decision trees, followed by a discussion of the important practical considerations in applying RFs. Next, we review the available methods and common software tools for implementing RFs and a large collection of empirical psychological studies that used RFs for data analyses. Finally, we discuss the challenges and practical issues associated with using RFs in psychology and propose directions for future research. Most importantly, we hope this article can be the start of many conversations about how RFs can be effectively applied to advance psychological science.
Conceptual Framework
Decision trees
Decision trees are nonparametric statistical-learning methods that automatically take care of complex nonlinear relations and interactions. They are supervised-learning techniques for predicting categorical outcomes (classification trees; e.g., depression status) or continuous outcomes (regression trees; e.g., work burnout score). The tree is constructed by recursive partitioning, a step-by-step procedure in which the predictor space (i.e., feature space)

Figure 1. A hypothetical conceptual example of a classification tree. This tree has a depth of 3: From the root node to
Following the partition, each observation in the data should fall into one and only one terminal node. With a single tree, each terminal node is assigned a particular predicted outcome value,
In the following sections, we mainly focus on one of the most widely adopted decision-tree algorithms, classification and regression trees (CARTs; Breiman et al., 1984), as the base learner of RFs (methodological limitations are briefly discussed in the Discussion section). Other decision-tree algorithms also exist, such as the C4.5 classifier (Quinlan, 2014) and conditional inference trees (CTrees; Hothorn et al., 2006). These algorithms share the fundamental decision-tree framework with CART but differ in certain technical aspects, including the criterion for selecting a splitting variable and the internal procedure for handling missing data. Some of these alternatives were reported to better address the inherent biases of CART (Hothorn et al., 2006; Strobl et al., 2008, 2009). Nonetheless, CART remains the most commonly used base learner in current software implementations of RFs. For a more thorough treatment of tree-based methods and the different algorithms, readers can refer to Mienye and Jere (2024) and Strobl et al. (2009).
Figure 1 provides a simple hypothetical example of a classification tree built with CART. Here, the outcome of predictive interest is depression status (depressed vs. not depressed). The two covariates, sleep quality and stress level, are both measured on a 5-point Likert scale; higher values indicate higher levels of the corresponding measure. The tree starts growing from the entire sample space. At each step, the CART algorithm finds the best variable to serve as the splitting variable by considering every possible covariate and every possible split for each covariate. All these possible splits are evaluated and compared simultaneously. The single split that yields the best fit is then selected for further growing the tree. For a classification tree, one way to evaluate the fit is to compute a node-impurity measure, such as entropy
$$\text{Entropy}(t) = -p \log p - (1 - p) \log (1 - p)$$
or Gini index
$$\text{Gini}(t) = 2p(1 - p),$$
where p = P(Y = depressed) is the observed proportion of depressed observations in node t. The goodness of a possible split can therefore be defined as the reduction in the chosen node-impurity measure after splitting the parent node. The split that leads to the largest reduction in node impurity (i.e., the largest improvement in node purity) is selected.
1
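The split-selection rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the CART implementation itself: the function names, the toy labels, the base-2 logarithm, and the weighting of child impurities by child size are assumptions made for the example.

```python
import math

def entropy(p):
    """Binary entropy of a node in which a proportion p of cases is depressed."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node has zero impurity
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p):
    """Binary Gini index of a node in which a proportion p of cases is depressed."""
    return 2 * p * (1 - p)

def impurity_reduction(parent, left, right, impurity):
    """Impurity of the parent node minus the size-weighted impurity of its children.

    Each node is a list of 0/1 labels (1 = depressed).
    """
    def node_impurity(labels):
        return impurity(sum(labels) / len(labels))

    n = len(parent)
    weighted_children = (
        (len(left) / n) * node_impurity(left)
        + (len(right) / n) * node_impurity(right)
    )
    return node_impurity(parent) - weighted_children

# Hypothetical node with 10 people (5 depressed) and one candidate split
left = [1, 1, 1, 1, 0]   # e.g., sleep quality below 3
right = [1, 0, 0, 0, 0]  # e.g., sleep quality of 3 or above
parent = left + right

gini_gain = impurity_reduction(parent, left, right, gini)
entropy_gain = impurity_reduction(parent, left, right, entropy)
print(f"Gini reduction: {gini_gain:.3f}; entropy reduction: {entropy_gain:.3f}")
```

In a full tree-growing routine, this reduction would be computed for every candidate covariate and cut point, and the split with the largest value would be kept.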
In this example, for instance, sleep quality is chosen as the first splitting variable, and the split occurs at a value of 3 (
Now suppose the tree algorithm goes through the recursive-partitioning process and renders the tree model shown in Figure 1 as the best predictive model. After partitioning, the feature space is divided into a total of five terminal nodes (
In this classification tree, the predicted outcome is the majority depression status observed among the training cases in each terminal node. To predict the depression status for a new individual, we simply follow this decision tree to send them down through the branches, guided by the covariates’ values accordingly. In this example, to predict an individual’s depression status, the first question to ask is whether this person has a sleep quality score below 3. If yes, the person moves down the left branch; if not, the person moves to the right. Once in the left child node, the next question is if the stress level is above 3. An answer of yes sends this individual further down to the left branch, ending up in terminal node
How does one evaluate the prediction accuracy of a decision tree? With decision trees, the prediction accuracy is usually assessed by examining the model’s performance on a separate test set or through cross-validation, which involves partitioning the data set into multiple subsets and repeatedly training the model on some while validating it on the remaining ones (see more in Browne, 2000; Koul et al., 2018; Stone, 1974). For classification trees, a standard measure of predictive accuracy is the misclassification rate, which quantifies the proportion of incorrectly classified observations in the entire test set:
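$$\text{Misclassification rate} = \frac{1}{N} \sum_{i=1}^{N} I(\hat{y}_i \neq y_i),$$
where $N$ is the number of observations in the test set, $y_i$ is the observed class of observation $i$, $\hat{y}_i$ is its predicted class, and $I(\cdot)$ is an indicator function that equals 1 when the prediction is incorrect and 0 otherwise.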
Extending the discussion to regression trees follows a similar logic except that the outcome is continuous and the measures of goodness are naturally different. To grow a tree, rather than using measures such as entropy or the Gini index to evaluate the possible splits, regression trees minimize the variability of the responses in each node by computing the sum of squared errors (SSE):
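$$\text{SSE}(t) = \sum_{i \in t} (y_i - \bar{y}_t)^2,$$
where $y_i$ is the observed response of observation $i$ in node $t$ and $\bar{y}_t$ is the mean response among the observations in node $t$.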
Ensemble methods: bagging and RFs
Individual decision trees usually suffer from instability, which means they are sensitive to even small changes or outliers in the data, especially large, complex trees that are overfit to the training data. That is to say, partitioning based on a slightly different sample can yield an entirely different tree model and thus very different predicted results for the same observations. A single tree is therefore unstable and has high variance. Such instability is generally undesirable for predictive tasks. To address this problem, ensemble methods have been developed, such as bagging (bootstrap aggregating; Breiman, 1996) and RFs (Breiman, 2001). 3 These approaches explicitly leverage the inherent instability of individual trees to construct a robust collective committee of models. By perturbing the training data, typically through bootstrap sampling, multiple predictive models are generated. Their predictions are then aggregated, often through averaging or majority voting, to produce a more stable and accurate ensemble predictor. Ensemble methods are useful because aggregating across multiple trees stabilizes the predictions and reduces the impact of random noise.
Let the training data be denoted as

Figure 2. A hypothetical conceptual example of bagging. (a) The first bootstrap classification tree T*1; (b) the second bootstrap classification tree T*2; (c) the third bootstrap classification tree T*3.
Bagging with regression trees follows a similar logic to that of classification trees. Apart from the difference in evaluation measures, a key distinction between bagging regression trees and classification trees lies in how predictions are synthesized. Whereas classification trees rely on a majority vote to determine the final outcome, regression trees aggregate predictions by computing the average of the outcome values predicted by each of the B individual trees:
$$\hat{y}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} T^{*b}(x),$$
where $T^{*b}(x)$ denotes the outcome value that the bth bootstrap tree predicts for an observation with covariate values $x$.
A methodological extension of bagging is RFs (Breiman, 2001). In bagging, randomness is introduced through the process of bootstrap sampling from the original data. Ideally, the individual trees grown out of these bootstrapped samples should be independent. Thus, bagging can effectively reduce the variance by aggregating over independent predictions while lowering bias by growing large, bushy individual trees. In practice, however, these tree models can be correlated, undermining the variance-reduction effect of bagging. To further de-correlate the individual trees, RFs introduce another layer of randomness on top of bagging. This typically involves randomly selecting a subset of features at each split in the individual trees (Ho, 1998).
The algorithm of RFs closely resembles that of bagging. It begins by drawing B bootstrap samples from the original training set via repeated sampling. But in each bootstrap sample, the way to grow a classification tree or regression tree is slightly different from that in bagging. More specifically, at each candidate split, rather than evaluating all p available covariates, the algorithm randomly selects a subset of size m (m ≤ p; when m = p, RF is equivalent to bagging). Among these m covariates, the RF algorithm identifies the optimal split based on a node-impurity measure, such as entropy or the Gini index for classification tasks or SSE for regression tasks. The process of random feature selection and recursive partitioning continues until a predefined stopping criterion is reached. Repeating this tree-growing process for each of the B bootstrap samples yields an ensemble of B classification or regression trees, which collectively defines an RF. Predictions for new observations follow the same rules as in bagging: majority voting for classification and averaging predictions for regression. In contrast to a single regression tree, which assigns an identical predicted outcome to all observations in the same terminal node, the averaged predictions in bagging and RFs typically vary from one observation to another even when some trees place them in the same leaf. This results in a more finely grained prediction surface.
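The relation between bagging and RFs can be illustrated with a short sketch in Python using scikit-learn's `RandomForestClassifier`, where `n_estimators` plays the role of B and `max_features` plays the role of m. The simulated data and all settings below are assumptions chosen for illustration, not recommendations. Note also that scikit-learn aggregates trees by averaging their predicted class probabilities rather than by a hard majority vote.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Simulated data standing in for a psychological data set (hypothetical)
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

p = X.shape[1]  # total number of covariates
results = {}
for label, m in [("bagging (m = p)", p), ("random forest (m = 3)", 3)]:
    forest = RandomForestClassifier(
        n_estimators=200,  # B: number of bootstrap samples / trees
        max_features=m,    # m: covariates considered at each split
        bootstrap=True,    # each tree is grown on a bootstrap sample
        random_state=0,
    )
    forest.fit(X_train, y_train)
    # Misclassification rate on the held-out test set
    results[label] = 1 - forest.score(X_test, y_test)

for label, error in results.items():
    print(f"{label}: test misclassification rate = {error:.3f}")
```

Which setting predicts better depends on the data; the sketch only shows that the two methods differ in a single argument.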
Assessing the variable importance (VI) is an important and common consideration in implementing RFs. Beyond obtaining prediction results, researchers are often interested in knowing which variables contribute to the classification or which variables are most influential in explaining the response variable. Although tree-based algorithms are mainly used for prediction purpose and are data-driven and exploratory in nature, evaluating the VI can help with feature selection for subsequent analyses and can potentially contribute to future theory development (e.g., Brick et al., 2018). Nonetheless, the evaluation of VI is already challenging in a single-tree model given the nonlinearity and complex interactions; it becomes even less intuitive with ensembles of trees, particularly compared with parametric models that most researchers are more familiar with (e.g., linear regression). One widely used approach is briefly introduced here. For RFs, in each individual tree T*b, there is an out-of-bag (OOB) sample, defined as the collection of observations that are not selected into the bth bootstrap sample:
This process is repeated in each of the B trees for covariate Xj. Finally, aggregating these importance measures across the B trees yields a total importance measure for Xj:
Although it seems computationally cumbersome, this method has been conventionally integrated into the RF algorithm, and importance scores are often automatically computed for all variables in software programs. Figure 3 provides a hypothetical example of the permutation-based VI measures, computed as the increase in misclassification rate, with variables ordered from more important to less important.
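As a rough illustration of permutation-based VI, the sketch below uses scikit-learn's `permutation_importance` function. It is a variant of the procedure described above rather than a replica: it permutes each covariate in a held-out test set instead of in each tree's OOB sample, and it reports the mean drop in accuracy, which equals the mean increase in misclassification rate. The simulated data, in which only the first two covariates carry signal, are an assumption for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# With shuffle=False, the informative covariates come first (columns 0 and 1);
# the remaining four columns are pure noise.
X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

forest = RandomForestClassifier(n_estimators=200, random_state=1)
forest.fit(X_train, y_train)

# Permute one covariate at a time and record the resulting loss in accuracy
result = permutation_importance(forest, X_test, y_test, scoring="accuracy",
                                n_repeats=20, random_state=1)

ranking = result.importances_mean.argsort()[::-1]  # most important first
for j in ranking:
    print(f"X{j}: mean drop in accuracy = {result.importances_mean[j]:.3f}")
```

Noise covariates should show importance values near zero, and small negative values can occur by chance.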

Figure 3. Variable-importance plot from random forests for the hypothetical example predicting depression.
The RF algorithm involves three main tuning parameters (also referred to as “hyperparameters”) that researchers need to set values for growing a forest: the tree depth (i.e., the number of splits along the longest path from the root node to a terminal node), the number of randomly selected covariates at each potential split (m), and the number of bootstrap samples (or the number of trees grown), B. To ensure the effectiveness of RFs, the hyperparameters’ values have to be carefully specified. Some common suggestions recommend
Practical issues in applying RFs
Although decision trees and RFs have gained increasing popularity in psychological research, the effectiveness of RFs, which were originally developed outside psychological-research contexts, has not yet been thoroughly investigated in this setting. Important questions remain about their applicability and how they can be used effectively to address the practical challenges that are specific to psychological studies. In applying RFs, psychologists need to navigate a series of methodological considerations that can substantially influence the effectiveness of the RF models and the interpretation of results.
Goals of implementing RFs
To begin, the purpose of using RFs can vary widely (Probst et al., 2019; Shmueli, 2010; Yarkoni & Westfall, 2017). Many researchers employ RFs as a predictive tool, aiming to enhance accuracy in predicting future outcomes. Some rely on RFs predominantly because of their strength and flexibility in handling high-dimensional data in which higher-order nonlinearity and complex interactions are anticipated. Others include RFs as part of a comparison between various predictive models, ranging from traditional parametric models (e.g., linear regression or logistic regression) to nonparametric models (e.g., bagging, boosting), to empirically determine which method offers the best predictive accuracy. A third application of RFs is preliminary variable selection, in which RFs are used to assess the relative importance of predictors in explaining the outcome variable rather than serving only as a predictive tool. This usually involves using RFs to identify important covariates that will be further examined in inferential statistical modeling (and sometimes predictive modeling as well). Whether the goal is to predict, to explain, to select features, or to achieve several of these at once, the performance and applicability of RFs need to be assessed in light of the specific research purpose (Shmueli, 2010; Yarkoni & Westfall, 2017).
Data dimensions
Given that psychological research often concerns a very specific target population and in-person data collection, the data conditions in psychological studies can be quite different from those in other fields. Some common challenges include small sample size, data dimensionality, and missing data. Psychologists often need to work with small data sets with a limited number of observations, particularly when data collection is costly. With traditional inferential statistics, small sample sizes can be challenging because they can introduce larger bias and reduce estimation precision. Do small samples also affect the performance of RFs as a predictive model or as a variable-selection method? Although RFs are believed to be applicable in “small n large p” conditions (Fife & D’Onofrio, 2023; Matsuki et al., 2016; Strobl et al., 2009), potential challenges may still exist, particularly in terms of model overfitting (e.g., Yarkoni & Westfall, 2017). Recall that RFs fundamentally rely on bootstrap sampling. Each individual tree is trained on a bootstrap sample, which, on average, is expected to contain approximately 63% of the distinct observations in the original data (when sampling with replacement). For small data sets, this can substantially reduce the effective training-set size for individual trees. With a limited training set, individual trees can be overfitted, especially if the trees are allowed to grow deep and large. This overfitting can be further exacerbated if irrelevant predictors are included in the training process without prescreening. For example, consider a small study with 30 participants, 10 of whom are clinically depressed. Suppose that, by pure chance, all of the depressed participants use iPhones to fill out the survey and some of the nondepressed participants use Android phones. With such a limited sample size, an individual tree may capture this random data pattern (e.g., suggesting that using an iPhone is an important predictor of depression), rendering the results less generalizable.
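The 63% figure follows from the fact that, when n cases are drawn with replacement from a sample of size n, each case is left out of the bootstrap sample with probability (1 - 1/n)^n, which approaches e^(-1) ≈ 0.368 as n grows. The short Python sketch below checks this by simulation; the sample sizes, the number of replications, and the seed are arbitrary choices for illustration.

```python
import random

def unique_fraction(n, rng):
    """Share of distinct cases in one bootstrap sample of size n."""
    sample = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
    return len(set(sample)) / n

rng = random.Random(2024)
reps = 500
results = {}
for n in (30, 300, 1000):
    expected = 1 - (1 - 1 / n) ** n  # analytic expectation
    simulated = sum(unique_fraction(n, rng) for _ in range(reps)) / reps
    results[n] = (expected, simulated)
    print(f"n = {n}: expected = {expected:.3f}, simulated = {simulated:.3f}")
```

For the 30-participant example, this means each tree is grown on roughly 19 distinct participants on average.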
Unfortunately, sample size is not typically carefully evaluated and justified when predictive models are employed (Dhiman et al., 2023), and there are no clear guidelines regarding what should be considered a good sample size or how to plan for a sufficient sample size with predictive modeling. Some research has shown that a small sample size can lead to larger bias when certain machine-learning algorithms are applied (e.g., support vector machine; Vabalas et al., 2019), but the methodological implications of small n for RFs in psychological studies remain unclear.
On the other hand, it is also not yet clear whether having a small number of predictors will affect the effectiveness of RFs. With high-dimensional data, the key methodological advantage of RFs over bagging is that they de-correlate the trees by randomly selecting a subset of predictors at each split. When the total number of predictors is limited, this advantage may diminish. The same predictors may frequently appear across multiple trees, making the trees in the ensemble less independent and potentially affecting the overall performance of the forest. To illustrate this consideration, we ran a simple simulation as a proof of concept: From p predictors, we randomly drew two independent sets of m candidate variables (without replacement) and counted the number of shared predictors; this process was repeated 1,000 times. With p = 10, the average overlap across replications was about 0.38, 2.50, and 6.39 when m = 2, 5, and 8, respectively; that is, with m/p = 0.5, roughly half of the five candidate predictors are expected to be shared by two trees, and with m/p = 0.8, about 80% of them are expected to be identical. In contrast, when p = 50, the average overlap between two independent draws dropped to about 0.09, 0.49, and 1.29 for m = 2, 5, and 8, respectively. Thus, the potential overlap between trees is substantially reduced. 5 This example shows that when the total number of available predictors p is small, the m/p ratio can become large with even small changes in m, causing the trees to consider largely the same variables at each split and thus become more correlated, undermining the de-correlation property of RFs.
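A simulation of this kind can be sketched in a few lines of Python. The code below is our own minimal reconstruction under stated assumptions, not the original simulation script: the function name and the seed are hypothetical, so simulated values will differ slightly from those reported above. For reference, the number of shared predictors follows a hypergeometric distribution with expected value m²/p.

```python
import random

def mean_overlap(p, m, reps=1000, seed=1):
    """Average number of predictors shared by two independent draws of size m."""
    rng = random.Random(seed)
    predictors = range(p)
    total = 0
    for _ in range(reps):
        first = set(rng.sample(predictors, m))   # m candidates, no replacement
        second = set(rng.sample(predictors, m))  # an independent second draw
        total += len(first & second)
    return total / reps

overlaps = {}
for p in (10, 50):
    for m in (2, 5, 8):
        overlaps[(p, m)] = mean_overlap(p, m)
        print(f"p = {p}, m = {m}: simulated = {overlaps[(p, m)]:.2f}, "
              f"expected = {m * m / p:.2f}")
```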
Imbalanced data
In psychology studies, imbalanced data sets are common, particularly when the outcome of interest is relatively rare or underrepresented in the population. For example, in mental-health research, the number of children with suicide attempts (e.g., see Harman et al., 2021) is substantially smaller than the number of healthy control subjects. With such class imbalance, the RF algorithm can be biased because it will prioritize the majority class, resulting in poor prediction of the minority class. This can be problematic because the minority class is more often the focal research interest (e.g., identifying at-risk children for potential suicide).
As a consequence, when the outcome is highly imbalanced, the overall misclassification rate, or accuracy, can be a very misleading indicator of model performance because it is dominated by the majority class. For example, if 95% of the training cases are individuals without suicide attempts, a model that predicts all cases as “no suicide attempts” achieves 95% accuracy (a misclassification rate of only 5%), but the model itself is completely useless for identifying individuals at risk (both sensitivity and precision are zero 6 ). In such scenarios, alternative metrics that assess the model performance in each specific outcome class are preferable because they align better with the intended research purposes, such as sensitivity (i.e., recall), specificity, 7 precision, and F1 score. 8 In addition, visual inspections can be informative as well. The precision-recall curve, which is conceptually similar to the receiver operating characteristic curve but more appropriate with imbalanced data, can be used to assess the model performance across varying decision thresholds.
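The 95% example can be verified with a small Python sketch. The function name and the toy labels are hypothetical, and the code adopts the convention, assumed here, that a ratio with a zero denominator (such as precision when no case is predicted positive) is reported as zero.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy and class-specific metrics for a binary outcome."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(numerator, denominator):
        # Convention (an assumption): report 0 when the ratio is undefined
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }

# 95 cases without and 5 cases with a suicide attempt (coded 1)
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a model that always predicts "no suicide attempt"

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Accuracy alone would suggest a strong model here, whereas sensitivity, precision, and F1 all expose that no at-risk individual is identified.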
To address class imbalance, resampling techniques can be applied (Chawla et al., 2002; Japkowicz, 2000; Ling & Li, 1998). One possible approach is to randomly oversample the minority class or randomly undersample the majority class (Japkowicz, 2000; Ling & Li, 1998). Random oversampling balances the data set by duplicating observations from the minority class until the resampled minority class consists of as many data points as the majority class; in contrast, random undersampling balances the data set by sampling only a subset of the majority class so that its size matches that of the minority class. As a more sophisticated alternative, the synthetic minority oversampling technique (SMOTE; Chawla et al., 2002) generates synthetic minority-class samples by interpolating between each minority unit and its nearest neighbors in the feature space, which can be identified using a distance metric such as Euclidean distance, until the minority and majority classes are equal in size.
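The two oversampling ideas can be sketched as follows. This is a simplified illustration, not the reference SMOTE algorithm or any package's implementation: the function names, the toy data, and the choice of k = 2 neighbors are assumptions made for the example.

```python
import math
import random

def random_oversample(minority, majority, rng):
    """Duplicate randomly chosen minority cases until both classes match in size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic minority cases by interpolating toward near neighbors."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base case (Euclidean distance)
        neighbors = sorted(
            (x for x in minority if x is not base),
            key=lambda x: math.dist(base, x),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # random position between the base case and the neighbor
        synthetic.append(
            tuple(b + gap * (nb - b) for b, nb in zip(base, neighbor))
        )
    return synthetic

rng = random.Random(7)
# Hypothetical (sleep quality, stress level) scores
minority = [(1.0, 5.0), (2.0, 4.0), (1.5, 4.5), (2.0, 5.0)]
majority = [(rng.uniform(3, 5), rng.uniform(1, 3)) for _ in range(12)]

oversampled = random_oversample(minority, majority, rng)
synthetic = smote_like(minority, len(majority) - len(minority), k=2, rng=rng)
print(len(oversampled), len(minority) + len(synthetic))
```

Random oversampling only repeats existing cases, whereas the interpolation step produces new points that lie between observed minority cases. In practice, resampling is usually applied to the training data only, so that the test set keeps the original class distribution.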
Missing data
Missing data are also typical in psychological studies involving human subjects because of participant nonresponse, dropouts, or other practical issues. There are different strategies for handling missing data in RFs, each with its own advantages and disadvantages. Listwise deletion is the most widely used approach, which simply discards all observations with incomplete data before fitting RF models. However, listwise deletion can be a huge waste of available information and can be infeasible if the sample size is small to begin with.
Among the more modern missing-data-handling methods, by far the most popular approach in CART is surrogate splits (Hapfelmeier et al., 2014; Hothorn et al., 2006). With this approach, missing data are handled internally in each tree. Surrogates are defined locally at each split; when the best splitting variable is selected, other candidate splitting variables that best mimic the current optimal splitting result are ranked and labeled as the surrogates. When the primary splitting predictor is missing for a given observation, the RF algorithm instead uses the best surrogate available to send this observation further down the tree. The methodological challenges of surrogate splits have been noted, however, given the computational burden, and thus, many other approaches have been proposed (Tang & Ishwaran, 2017). Instead of using surrogate splits, some software (e.g., the
An alternative and common approach for handling missing data is to preimpute the missing data and fit the analytical RF model to the complete preimputed data. In some software implementation (e.g., the
Missing values can also be imputed adaptively during the analytic process (i.e., on the fly) rather than beforehand. For instance, the on-the-fly-imputation (Ishwaran et al., 2008; Tang & Ishwaran, 2017) algorithm uses only complete data to determine the best split at each step. Once a splitting variable is selected, for observations with missing data on this selected splitting variable, a random value is drawn from the nonmissing in-bag data to “impute” for this missing value. This observation is then assigned to a child node according to this temporarily imputed value. These temporary imputed values are discarded after this observation is passed down to a child node; thus, the missingness in this observation is preserved onward.
Although missing data have been studied extensively in conventional inferential-statistical modeling (e.g., Enders, 2022, 2025), it remains unclear which approach is best suited to handling missing data when implementing RFs in psychological studies. The optimal choice likely depends on the underlying missingness mechanism (missing completely at random, missing at random, missing not at random; Rubin, 1976), the missing-data rate, and the specific structure of the data.
Software Implementations
Besides the theoretical considerations noted above, on a more practical level, psychologists must also decide on which software package to use. Different packages come with different modeling engines, base learners, default configurations, and supported functionalities. These choices can directly affect model results. Thus, researchers need to make informed decisions such that the software implementation is best aligned with the research goals.
Software tools
Many common software environments, such as Python, R, SAS, MATLAB, and SPSS, can be used to implement RF algorithms. Among them, R and Python are two of the most popular programming languages for statistical analysis; both are open-source environments that offer various specialized packages for RFs.
In R, multiple stand-alone packages have been developed for different modeling engines. For instance, the classic RF algorithm (Breiman, 2001) that uses CART as the base learner is implemented in both the
Table 1. Commonly Used Software Packages for Random Forests
Note: For cells with two default values, the one denoted with underline is the default for classification tasks, and the other is the default for regression tasks. CART = classification and regression tree; CTree = conditional inference tree.
Besides the differences outlined in Table 1 that are directly relevant to model building, these packages also vary in their approaches to handling data, which can also affect the results. For example, the options available for dealing with imbalanced data can differ substantially from package to package. Typically, data processing and model fitting are separate steps performed sequentially. In the R environment, the two wrapper packages (
Packages also differ substantially in how they handle missing data. In terms of the stand-alone modeling engines, the
When the wrapper R packages are used, researchers have three general options to handle missing data. They can choose to rely on the engine’s native functionalities as described above, use the wrapper’s preprocessing tools to process missing values directly in the pipeline, or impute the missing values using other preferred packages before model fitting. To begin with, in
Finally, in Python, starting from Version 1.40 of the
Hyperparameter tuning
After the data are properly preprocessed and loaded into the software of choice, the next step in implementing the RF algorithm is to determine the values of the hyperparameters. The current development of software usually provides some default values of the hyperparameters. Table 1 provides a summary of the hyperparameter options and default values used in the stand-alone modeling engine packages (
Regarding the number of randomly selected covariates for candidate splitting (m), although the default setting is a convenient option (typically
Tree complexity is another important hyperparameter in decision trees and RFs, which is usually controlled by setting the terminal-node size. Setting a smaller terminal-node size produces a larger tree with more splits, whereas a larger terminal-node size will effectively limit the tree depth. In many software packages, the terminal-node size is set to 1 for classification and 5 for regression by default. In addition to terminal-node size, other hyperparameters can also be used to control the tree complexity. These typically include the smallest node size required for a possible further split
11
(e.g.,
The number of trees is also a key factor to consider in growing an RF. Most software implementations default to 500 trees. Unlike other tuning parameters, however, increasing the number of trees does not generally lead to overfitting. In fact, research suggests that it is preferable to set it to a reasonably large value for the best predictive performance (Probst & Boulesteix, 2018) as long as computational resources allow. Once the prediction performance stabilizes, though, adding more trees yields diminishing returns in predictive accuracy. 12
The performance of RF models is highly dependent on the specification of hyperparameters, and therefore, to achieve the optimal results, they must be carefully tuned to adapt to the data and the modeling context. For example, having a smaller data set (i.e., a smaller n) may necessitate fitting fewer trees in a forest, enforcing a larger size of terminal nodes, and limiting the depth of each tree to avoid overfitting. On the other hand, having a smaller number of predictors may require the users to increase the number of randomly selected variables at each step when growing an RF. In the presence of missing data, it is also crucial to confirm whether the chosen software supports internal missing-data handling, and if so, researchers may need to specify additional tuning parameters (e.g., the maximum number of surrogates). Finally, all tuning parameters can affect the computational time, which can be an important practical consideration for researchers. Employing a smaller m, limiting the tree depth, and growing fewer trees in an RF can reduce the computational burden.
With all that, are there any principled ways to find the optimal values of the hyperparameters? Several different strategies can be used for hyperparameter tuning (Owen, 2022). The most straightforward approach is manual tuning, in which researchers manually change the value of one or more tuning parameters and check which value results in the most accurate prediction, usually via cross-validation or OOB error. An extension of manual search is grid search, which automatically loops through all the possible hyperparameter-value combinations in a predefined search space set by the researcher (e.g., tree depths of 5, 10, 15 combined with 50, 100, 200 trees, yielding nine combinations to test). Essentially, grid search is an automated version of manual hyperparameter tuning that iterates over each combination using nested loops. Although both manual search and grid search are conceptually straightforward and easy to implement, they can be time-consuming and rely on researchers to provide reasonably good candidate values to compare in the first place. As the number of hyperparameters and candidate values increases, both approaches can become computationally expensive very quickly. Extending the grid search, a more efficient option for automated hyperparameter tuning is random search, which randomly picks hyperparameter values from given probability distributions (e.g., a uniform distribution) rather than relying on any specific user-supplied values. At each iteration, a value for each tuned hyperparameter is randomly and independently picked. Random search thus requires less prior knowledge from the users about the hyperparameter values and is more computationally efficient because it does not test all possible combinations of the values. Random search is also reported to frequently outperform a basic grid search when dealing with many hyperparameters or wide ranges of values (Owen, 2022). However, researchers still have to manually define the total number of iterations (or the number of randomly sampled hyperparameter values) when implementing random search.
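The difference between grid search and random search can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any package reviewed here: `toy_cv_score` is a hypothetical stand-in for a cross-validated or OOB accuracy estimate, and the function names and the ranges used for random search are invented for the example. The grid reuses the nine combinations mentioned above.

```python
import itertools
import random


def toy_cv_score(max_depth, n_trees):
    """Hypothetical stand-in for a cross-validated or OOB accuracy estimate.

    In a real analysis this would fit an RF with the given settings and
    return its estimated predictive accuracy.
    """
    return 0.80 - 0.0004 * (max_depth - 10) ** 2 + 0.0002 * min(n_trees, 150)


def grid_search(depths, tree_counts, score_fn):
    """Evaluate every combination in the predefined search space."""
    results = [((d, t), score_fn(d, t))
               for d, t in itertools.product(depths, tree_counts)]
    best_config, best_score = max(results, key=lambda item: item[1])
    return best_config, best_score, len(results)


def random_search(depth_range, tree_range, score_fn, n_iter, seed=0):
    """Draw each hyperparameter independently from a uniform distribution."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_iter):
        d = rng.randint(*depth_range)   # inclusive integer range
        t = rng.randint(*tree_range)
        results.append(((d, t), score_fn(d, t)))
    best_config, best_score = max(results, key=lambda item: item[1])
    return best_config, best_score, len(results)


# Grid from the text: depths 5, 10, 15 crossed with 50, 100, 200 trees.
grid_best, grid_score, grid_evals = grid_search(
    [5, 10, 15], [50, 100, 200], toy_cv_score)

# Random search with the same budget of nine evaluations (ranges are assumed).
rand_best, rand_score, rand_evals = random_search(
    (2, 20), (50, 500), toy_cv_score, n_iter=9)

print(grid_best, round(grid_score, 4), grid_evals)
print(rand_best, round(rand_score, 4), rand_evals)
```

Grid search only ever evaluates the values the researcher supplied, whereas random search can land on values between or outside them; the number of iterations (`n_iter`) is the budget that still has to be chosen by hand.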
Manual search, grid search, and random search can all be considered exhaustive search strategies. In addition to these exhaustive search approaches, another well-established and successful tuning strategy is sequential model-based optimization (SMBO; Jones et al., 1998), also referred to as Bayesian optimization (BO), which employs a more adaptive strategy through iterations such that the next iteration is informed by previous iterations. In its implementation, SMBO starts by drawing several random values from the hyperparameter space and evaluating the RF performance accordingly. A surrogate model (a probabilistic regression model, such as a Gaussian process) is fit to these initial training results, roughly assessing how changes in hyperparameters affect the model’s prediction accuracy. It then proposes the next set of hyperparameter values within the predefined hyperparameter space, where the proposed values have the best expected prediction result under the current surrogate model. The proposed hyperparameter values are empirically evaluated for prediction accuracy on the training data, and this new training result is added to the previous training results to further update the surrogate model. This process repeats iteratively to find the optimal hyperparameter values (for a more detailed and intuitive illustration of SMBO, see Appendix A in the Supplemental Material available online). SMBO is more computationally efficient than the other approaches, making it particularly appealing when there are a large number of possible hyperparameter configurations. It does, however, require researchers to have greater statistical knowledge of the process to implement the procedure properly.
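The SMBO loop described above can be sketched as follows. This is a toy illustration only: it tunes a single hyperparameter (m), replaces the Gaussian-process surrogate with a much simpler kernel-weighted average plus a distance-based uncertainty term, and uses a hypothetical scoring function (`toy_oob_accuracy`) in place of actually fitting an RF. All function names, the bandwidth, and the exploration weight `kappa` are assumptions made for the example.

```python
import math
import random


def toy_oob_accuracy(m):
    """Hypothetical stand-in for fitting an RF with m candidate covariates
    per split and returning its estimated accuracy."""
    return 0.85 - 0.002 * (m - 7) ** 2


def surrogate(m, history, bandwidth=2.0):
    """Simplified surrogate (NOT a Gaussian process): a kernel-weighted mean
    of the observed scores, plus an uncertainty that grows with the distance
    to the nearest value already evaluated."""
    weights = [math.exp(-((m - mi) / bandwidth) ** 2) for mi, _ in history]
    mean = sum(w * s for w, (_, s) in zip(weights, history)) / sum(weights)
    nearest = min(abs(m - mi) for mi, _ in history)
    uncertainty = 1.0 - math.exp(-(nearest / bandwidth) ** 2)
    return mean, uncertainty


def smbo(candidates, score_fn, n_init=3, n_iter=7, kappa=0.05, seed=1):
    rng = random.Random(seed)
    # Step 1: evaluate a few randomly drawn starting values.
    history = [(m, score_fn(m)) for m in rng.sample(candidates, n_init)]
    for _ in range(n_iter):
        tried = {m for m, _ in history}
        untried = [m for m in candidates if m not in tried]
        if not untried:
            break

        # Step 2: score untried values under the current surrogate
        # (predicted accuracy plus a bonus for unexplored regions).
        def acquisition(m):
            mean, uncertainty = surrogate(m, history)
            return mean + kappa * uncertainty

        # Step 3: evaluate the most promising value and update the history,
        # which implicitly updates the surrogate for the next iteration.
        next_m = max(untried, key=acquisition)
        history.append((next_m, score_fn(next_m)))
    best = max(history, key=lambda item: item[1])
    return best, history


best, history = smbo(list(range(1, 21)), toy_oob_accuracy)
print(best, len(history))
```

The key contrast with grid and random search is in Step 2: each new evaluation is chosen using everything learned from earlier evaluations rather than from a fixed list or an independent random draw.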
For a summary of the hyperparameter-tuning options available across the various packages in R and Python, see Table 2. They differ in terms of the available tuning options and the flexibility of automated hyperparameter search. Of all the packages reviewed here, only
Hyperparameter Tuning for Random Forest in Commonly Used Software Packages
Note: x = not internally supported; SMBO = sequential model-based optimization; BO = Bayesian optimization; OOB = out-of-bag.
Review of RFs in Empirical Psychological Studies
In the previous sections, we reviewed the theoretical foundations and key practical considerations for applying RFs. Although RF methods hold great promise for advancing the field of psychological science, their effective implementation requires thoughtful considerations of the specific research contexts, data conditions, and software choices. Our review also underscores a critical gap in the literature: There is a lack of systematic methodological investigations of RFs in psychological-research contexts; the field calls for practical guidelines tailored to the unique challenges inherent to psychological research. To help bridge this gap, we reviewed a large collection of published empirical psychological studies that used RFs as part of their data analyses. By documenting current empirical practices through a systematic review, our goal is to provide applied researchers with not only examples of effective implementation and common pitfalls but also, more importantly, a data-driven reference for methodologists to design future studies that are context-relevant and can directly address the field’s most pressing needs.
The articles reviewed in the current study were selected following the procedure described below. As the first step, an advanced search for research articles was conducted in APA PsycInfo on January 1, 2023, using the keyword “random forest” from years 2020 to 2022. This procedure resulted in 733 published articles. From this pool, articles were dropped if they failed to meet any of the following screening criteria: (a) It must be an empirical study (e.g., meta-analyses and systematic reviews were dropped), (b) it must be psychological research (e.g., research in other fields, such as computer science and medicine, was dropped), and (c) it must have used RF models as part of the data analyses. In the end, a total of 637 published research articles were selected for review, which consisted of 708 studies because multiple studies can be separately reported in one single article. The 708 published empirical studies were defined as the analytical sample for this review. The results of this review are summarized in this section.
In terms of the prediction tasks, the majority of the published studies (75.71%) used RFs for classifying categorical outcomes, whereas 24.29% used RFs for regression tasks predicting continuous outcomes. Regarding software implementation, only 414 studies (58.47%) specified the software used for fitting RF models, and 41.53% of the studies did not provide software information (Fig. 4). Among the 414 studies that reported software details, the vast majority used R or Python, accounting for 50.48% and 44.44%, respectively. A few studies also used MATLAB (e.g., Lohani & Rana, 2023) or SPSS (e.g., Gök et al., 2023).

Software used in applied-psychological studies.
Depending on the nature of the research context, the analytical sample sizes varied drastically from one study to another. They ranged from fewer than 10 (e.g., Abreu et al., 2021; Ranjan et al., 2021) to more than 8 million (e.g., Shiner et al., 2022). For the distribution of the analytical sample sizes across the reviewed studies, see Table 3 and Figure 5. Across all the studies, the median sample size was 585; the 10th, 25th, 75th, and 90th percentile sample sizes were 52, 134, 3,309, and 19,804, respectively. Although most studies employing RFs involved large samples, small sample sizes were not uncommon; overall, 5.51% of the studies used fewer than 30 participants, and 12.57% had sample sizes between 30 and 100. Among the 128 studies that involved a sample size
Analytical Sample Sizes in Applied-Psychological Studies

Analytical sample sizes in applied-psychological studies.
Not surprisingly, depending on the research context, the number of features also varied widely between studies. Some studies had a collection of fewer than 10 features (e.g., Smucny et al., 2021), whereas others had a large number of input features greater than 30,000 (e.g., Dai et al., 2021). But it was most common for empirical studies using RFs to include fewer than 50 input features, as evidenced in 51.28% of the studies. Across all the studies, the median number of features was 20; the 10th, 25th, 75th, and 90th percentiles of the number of features were 6, 10, 45, and 212, respectively. For the total number of input features and the counts of continuous and categorical features separately, see Figure 6 and Tables 4 and 5. Again, a substantial proportion of the studies (30.65%) did not report the number of input features in their RF models.

Total number of features used in applied-psychological studies.
Number of Continuous Features Used in Applied-Psychological Studies
Number of Categorical Features Used in Applied-Psychological Studies
To examine the prevalence of large-n-small-p versus small-n-large-p problems, we computed the n/p ratio (the ratio between the sample size n and the total number of input features p) for each study, which is visually summarized in Figure 7. The ratio varied widely across the studies. For example, Dai et al. (2021) included 31,672 features for predicting depression based on data from 189 subjects, which is small-n-large-p; in contrast, Götz et al. (2020) studied the prediction of personality with only 13 input features and data from 3,387,014 individuals, which is large-n-small-p. Again, 33.33% of the reviewed studies did not provide adequate information regarding n or p. Among the studies that did report the relevant information, a substantial proportion (24.44%) was considered to be dealing with the small-n-large-p scenario (n/p ratio falling below 10; see Matsuki et al., 2016); on the other hand, a large proportion of the studies (42.23%) operated with large-n-small-p (n/p ratio above 10, n >> p), which is, in general, more typical in machine-learning applications. Across all the studies, the median n/p ratio was 21.9; the 10th, 25th, 75th, and 90th percentiles of the n/p ratio were 1.53, 4.61, 109, and 633, respectively.
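As a small worked illustration of the n/p ratio, the two examples cited above can be classified with the threshold of 10. This is only a sketch: the function names are invented for the example, and assigning a ratio of exactly 10 to the large-n-small-p group is an assumption because the text specifies only "below 10" and "above 10".

```python
def np_ratio(n, p):
    """Ratio of sample size n to the number of input features p."""
    return n / p


def classify(n, p, threshold=10):
    """Label a study by its n/p ratio.

    Assumption: a ratio exactly equal to the threshold is treated as
    large-n-small-p; the source text does not specify this boundary case.
    """
    if np_ratio(n, p) < threshold:
        return "small-n-large-p"
    return "large-n-small-p"


# (n, p) values as reported in the text for the two example studies.
examples = {
    "Dai et al. (2021)": (189, 31672),
    "Götz et al. (2020)": (3387014, 13),
}

for label, (n, p) in examples.items():
    print(label, np_ratio(n, p), classify(n, p))
```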

The n/p ratio in applied-psychological studies.
Regarding the hyperparameters, we focused on two key hyperparameters of RF models in our review—number of trees (ntree) and number of randomly selected covariates for candidate splitting (m). Among the reviewed studies, 69.07% of them did not report the number of trees (Fig. 8), and 89.12% of the studies omitted details about the strategy for random feature selection at each split. Even among those that did provide this information, many relied on default software settings: 37.9% of them used the default number of trees (e.g., setting the number of trees to 500 in R and 100 in Scikit-learn), and 58.44% of them used the default setting for m (e.g.,

The number of trees used in applied-psychological studies.
For the missing-data-handling strategies used in the applied studies, see Figure 9. As shown, 87.15% of the studies did not report the missing-data percentage, and 67.80% did not mention how missing data were handled in RF models. Out of the 228 studies that mentioned the missing-data-handling approaches, the largest share (42.98%) used listwise deletion and discarded cases with missing data; a nearly equal share (42.54%) applied some form of imputation to fill in missing data before model fitting (e.g., k-nearest neighbor imputation; Ye et al., 2023). Some studies adopted a mixed strategy such that they discarded variables conditionally on the missing-data percentage and imputed missingness only if the variables had a missing rate below a certain threshold (e.g., variables with more than 25% missing data were excluded from analysis, and the others were imputed; Karabacak & Margetis, 2024).
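The mixed strategy described above (dropping variables with too much missingness and imputing the rest) can be sketched as follows. This is a hedged illustration, not the code of any reviewed study: the 25% threshold comes from the example in the text, but the use of median imputation, the function name, and the toy data are assumptions made here for simplicity.

```python
from statistics import median


def drop_then_impute(columns, max_missing=0.25):
    """Drop variables whose missing rate exceeds max_missing; impute the rest.

    columns: dict mapping a variable name to a list of numeric values,
             with None marking a missing value.
    Imputation uses the observed median (an assumption for this sketch;
    the reviewed studies used a variety of imputation methods).
    """
    cleaned = {}
    for name, values in columns.items():
        missing_rate = sum(v is None for v in values) / len(values)
        if missing_rate > max_missing:
            continue  # variable excluded from the analysis
        observed = [v for v in values if v is not None]
        fill = median(observed)
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned


# Hypothetical toy data.
data = {
    "age": [21, 34, None, 45, 29, 38, 52, 27],            # 12.5% missing: imputed
    "income": [None, None, 40, None, 55, 60, None, 48],   # 50% missing: dropped
    "score": [3, 4, 5, 2, 4, 3, 5, 4],                    # complete: unchanged
}

cleaned = drop_then_impute(data)
print(sorted(cleaned))
print(cleaned["age"])
```

Whatever imputation method is used in practice, the threshold and the method are exactly the kind of details that should be reported alongside the RF results.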

Missing-data-handling approaches.
Discussion
Machine-learning techniques are playing an increasingly important role in transforming psychological science and serve as a great addition to conventional statistical methods and explanatory modeling (Dwyer et al., 2018; Orrù et al., 2020; Rosenbusch et al., 2021; Sleek, 2023; Vélez, 2021; Yarkoni & Westfall, 2017). For instance, they have been shown to be effective in enhancing the diagnostic, prognostic, and treatment decisions in clinical settings, particularly by tailoring personalized intervention strategies to meet the needs of the individual patient (Dwyer et al., 2018).
In this article, we focus on a widely used machine-learning technique, RF, and its applications in empirical psychological studies. This review highlights both the widespread adoption and substantial methodological inconsistencies currently present in the field. Although RF methods hold considerable promise in advancing psychological science, our findings reveal several critical gaps and methodological challenges that warrant attention.
A notable concern identified in this review is the inconsistent reporting and lack of methodological transparency across studies. The quality of technical reporting is poor overall, which was also found in previous work that reviewed the application of predictive modeling in clinical studies (e.g., Bouwmeester et al., 2012; Mallett et al., 2010). Many articles omitted essential methodological details when applying RFs, such as the software used, hyperparameter values, hyperparameter-tuning strategies, and handling of missing data. The omission of this key information can contribute to the replication crisis because future researchers cannot replicate an analysis without these methodological details. The lack of transparency also severely limits the generalizability of findings, makes results difficult to compare across studies, and thus hinders the potential of machine-learning methods to transform psychological science as a field.
Furthermore, our review reveals the diverse data conditions and research contexts under which RF methods are applied in psychological studies, reflecting substantial variability in sample sizes and feature input. Although most RF applications are conducted in large-n scenarios, our literature review shows that a nontrivial proportion of psychological studies operate with small samples. The small-n-large-p condition has long been a common challenge in conventional statistical modeling. Currently, it remains unclear whether RFs are robust under this condition. Given the various purposes RF models possibly serve (including prediction, classification, missing-data imputation, and feature selection), systematically evaluating their performance in small-sample scenarios for different purposes is much needed in future methodological investigations.
Addressing these noted gaps requires establishing clear, psychology-specific practice guidelines. We recommend that future applied research prioritize rigorous reporting standards, including detailed documentation of software choices, the base learner used for growing trees, hyperparameter decisions, data-preprocessing procedures, and other important technical details that are relevant in a study. Most importantly, future research should develop and adopt standardized reporting frameworks for machine-learning applications in psychology. Existing guidelines, such as the TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis; Collins et al., 2015) statement, provide a useful reference. But psychology-specific guidelines that address the unique considerations in psychological-research contexts are still needed. In addition, given the considerable variability across available software tools, psychological researchers should also consider preregistering their analysis plans. In particular, researchers should preregister the key aspects of their RF models, including the intended purpose of using RF (e.g., prediction vs. feature selection), the hyperparameter-tuning strategy, the cross-validation plan, model-performance evaluation criteria, and alternative comparison models if applicable. We also encourage sharing analysis code whenever feasible; importantly, the code should explicitly document the hyperparameter values (whether or not defaults are used) and include the missing-data processing steps when applicable. This will complement, although not replace, the formal reporting of key analytic decisions in the main text. Together, these practices will significantly enhance reproducibility, transparency, and collective scientific advancement in the field.
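One lightweight way to document hyperparameter values explicitly in shared code, including values left at their defaults, is to record them in a single structured object that is saved with the analysis. The sketch below is only an illustration of that idea: the class name, field names, and all values are hypothetical and do not correspond to any particular package or reporting standard.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RFSettings:
    """Hypothetical record of the analytic decisions behind one RF model."""
    software: str            # package name and version
    purpose: str             # e.g., prediction vs. feature selection
    n_trees: int
    m: str                   # covariates sampled per split, or the rule used
    terminal_node_size: int
    tuning_strategy: str
    validation: str
    missing_data: str
    defaults_used: list      # hyperparameters left at software defaults


# All values below are placeholders for illustration.
settings = RFSettings(
    software="<package name and version>",
    purpose="prediction",
    n_trees=500,
    m="sqrt(p)",
    terminal_node_size=1,
    tuning_strategy="random search, 50 iterations",
    validation="10-fold cross-validation",
    missing_data="variables with >25% missing dropped; remainder imputed",
    defaults_used=["n_trees", "terminal_node_size"],
)

# Serialize so the record can be stored next to the results or in a supplement.
report = json.dumps(asdict(settings), indent=2)
print(report)
```

A record like this complements, but does not replace, reporting the same decisions in the main text.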
Furthermore, for future methodological work, systematic investigations into the performance of RFs under varying empirical conditions, such as varying sample sizes, feature dimensionality, and missing-data-handling strategies, are in pressing need. Such methodological investigations will help guide psychological researchers to implement RF methods more effectively, thus maximizing their research contributions. It is our hope that the review of applied studies can provide a useful data-driven reference for designing methodological investigations with greater practical relevance. On the other hand, developing tutorial materials featuring psychological-data examples that clearly demonstrate the workflow (from data preparation to results interpretation) and tutorial materials that introduce the state-of-the-art RF techniques are important tasks for methodological work, too. Particularly, many modern, advanced techniques remain unfamiliar to psychological researchers, who, without such knowledge, often then default to more convenient yet suboptimal practices (e.g., listwise deletion for missing data). Accessible tutorials would therefore bridge the knowledge gap and promote methodological rigor.
Note, however, that although in this article we focus on RFs as a widely adopted analytical tool in psychological research, there are some limitations and inherent biases in the traditional RF algorithms, particularly with CART as the base learner. One major concern is selection bias when growing the trees. CART’s greedy search over all possible splits was found to favor covariates with many possible splits (or categories) and variables with many missing values (Hothorn et al., 2006; Kim & Loh, 2001; Strobl et al., 2007, 2008). This would not only bias the VI measures but also limit the interpretability of the model. In addition, the bootstrap-sampling-with-replacement procedure commonly used in RFs was also found to introduce bias by favoring covariates with more categories, thereby artificially inflating their importance (Strobl et al., 2007). Furthermore, the typical permutation-based VI measures can also yield biases because of the correlations among covariates (Strobl et al., 2008). Specifically, the VI measure reflects not only a variable’s unique association with the outcome but also its association with other correlated predictors that are related to the outcome, thus leading to an overestimation of a covariate’s independent importance. To address these issues, alternative frameworks, such as conditional inference forests, have been developed; these reduce the bias in variable selection by using a statistical-inference-based approach for splitting rather than a greedy search (Hothorn et al., 2006) and are thus recommended in the literature (e.g., Strobl et al., 2007). Methodological investigations further recommend using conditional inference forests with subsampling without replacement to further reduce the bias (Strobl et al., 2007) and employing conditional permutation schemes for VI (Nason et al., 2004; Strobl et al., 2008) to more accurately evaluate the importance of a covariate conditional on other covariates.
Although a comprehensive examination of relevant methods is beyond the scope of this article, researchers should be aware of the potential biases when implementing RF. This again underscores the importance of thoughtful model specification rather than relying solely on software defaults.
On the other hand, besides RFs, other ensemble methods, including boosting algorithms, such as AdaBoost (Freund & Schapire, 1997; Schapire, 2013), Gradient Boosting Machine (Friedman, 2001), XGBoost (Chen & Guestrin, 2016), and LightGBM (Ke et al., 2017), are powerful alternatives that also have great potential for helping advance psychological science. Unlike bagging and RFs, which build trees independently, boosting algorithms build trees sequentially, with each iteration depending on the previous one. Each RF or boosting algorithm offers its own advantages. For example, LightGBM is designed for efficiency that reduces the computational time, particularly with large data sets. Although boosting methods often show excellent prediction accuracy, they do require more careful parameter tuning. In contrast, RFs tend to be more robust with less intensive tuning and are found to perform competitively across a wide range of problems (Bentéjac et al., 2021). Nonetheless, given the important role that boosting methods play in machine learning, we encourage future research to further investigate the application of boosting in psychological science and to develop practical guidelines for more effective use of boosting in advancing psychological science.
We acknowledge that this study is limited by the currently available data in several aspects. These limitations point to important directions for future research that can build on our findings. For starters, with the limited information, we were unable to identify more nuanced relations between key practical aspects of applying RFs. As an example, we were unable to find a clear relation between sample size and missing-data-handling approaches largely because the majority of reviewed studies did not report how missing data were handled. As the reporting of data analyses becomes more transparent and better guided by clearer guidelines, future reviews with more complete data can examine such nuanced relations. Another limitation is that we did not systematically collect data on more detailed application-level characteristics that are important for understanding how RFs are used in practice. Future work could extend the current study by explicitly reviewing and synthesizing additional aspects in applying RFs, including but not limited to the analytic role of RFs (e.g., whether they are used for feature selection, stand-alone prediction, or benchmarking/comparison with other methods), common competitor models being compared with RFs (e.g., parametric models, regularized regressions, boosting), validation strategies (e.g., separate testing set, k-fold cross-validation, OOB error), and the metrics and decision thresholds used to evaluate good model performance. Third, we did not stratify findings by psychological subfields in the current study. Future systematic reviews can address this limitation by conducting subfield-specific analyses to identify potential challenges that are more unique or prominent in certain fields. For example, imbalanced data may be more commonly encountered in clinical psychology, particularly when diagnosis is the goal. Such subfield analyses would provide more useful practical guidance for researchers working in specific areas of psychological science. Finally, in the current study, we chose to focus on reviewing the information formally reported in the main text of published articles. We did not systematically analyze any external code files. Although code sharing is highly encouraged for transparency and reproducibility, it cannot replace the clear in-text reporting of key analytic details. But future work may benefit from extending our approach by systematically reviewing the external syntax files to look for additional implementation details.
In conclusion, RFs offer promising methodological opportunities to advance psychological science; however, fully achieving their potential depends on methodological rigor and reporting transparency in the field. We hope this review motivates meaningful discussions and highlights potential directions for future research to enhance the application of RFs and machine-learning methods, in general, in psychological studies.
Supplemental Material
Supplemental material, sj-pdf-1-amp-10.1177_25152459251404358 for Advancing Psychological Research With Random Forests: A Review of Methods, Tools, and Applications by Yi Feng, Han Du, Jiarui Song, Yina Sun, Yiting Wang and Aedan Joel in Advances in Methods and Practices in Psychological Science
Footnotes
Transparency
Action Editor: Yasemin Kisbu-Sakarya
Editor: David A. Sbarra
