Advancing Psychological Research With Random Forests: A Review of Methods, Tools, and Applications

Abstract

Contemporary psychological research increasingly involves machine-learning techniques, including random forests, for their capability in analyzing complex, high-dimensional data sets and modeling nonlinear predictive relations. In this article, we provide a comprehensive review of random-forest methods in psychological research. We begin by introducing the fundamental concepts of decision trees, followed by the theoretical framework of random forests as an ensemble method. Next, we review the methodological development and commonly used software tools for random-forest models. We discuss the practical issues and challenges when implementing random forests in psychological studies. Importantly, we then systematically review the empirical psychological research articles published between 2020 and 2022 that used random forests; we summarize the applications of random forests, with a special emphasis on data structure, software implementation, hyperparameter tuning, and approaches for handling missing data. By synthesizing the theoretical foundation and current empirical practices, in this article, we identify significant methodological gaps in applying random forests to psychological data and hope to initiate much needed conversations on how psychologists can effectively use the random-forest method to advance psychological science.

Keywords

random forests, machine learning, decision trees, hyperparameter tuning, psychological research

In recent years, there has been a growing interest in using machine-learning techniques to better understand complex biological, neurological, and behavioral data in psychological studies (Dwyer et al., 2018; Orrù et al., 2020; Rosenbusch et al., 2021; Sleek, 2023; Vélez, 2021; Yarkoni & Westfall, 2017). Among the many machine-learning algorithms, random forest (RF; Breiman, 2001) has become a popular analytical tool for classification, prediction, and feature selection. RF is a nonparametric ensemble learning algorithm that combines multiple decision trees to form a powerful committee of predictive models. RFs are known for their great flexibility in handling large high-dimensional data and complex nonlinear relations, which can otherwise be challenging using traditional parametric statistical methods (Malley et al., 2018; Ryo & Rillig, 2017; Touw et al., 2013). RFs are also known for their advantage in reducing overfitting (Gashler et al., 2008; Segal, 2004) and handling a large volume of variables with a relatively modest sample size (Fife & D’Onofrio, 2023; Matsuki et al., 2016). The RF method has been applied broadly across various areas of psychological studies, such as predicting mental-health outcomes, assessing behavioral patterns, and analyzing neuroimaging data. For instance, among the many applications, researchers have used RFs to identify key predictors of depression (e.g., Betz et al., 2022; Qiu et al., 2022), predict student academic performance (e.g., Wang et al., 2023), and detect dementia based on brain activity (e.g., Ye et al., 2023).

However, despite their increasing popularity, the use of RFs in psychological studies is not without challenges. Methodological evaluations of their performance in psychological-research settings remain limited. In particular, it has not been systematically studied whether RFs, which were originally designed for “big data” scenarios, perform well under the constraints that are typical in psychological-study contexts, such as smaller sample sizes, imbalanced data, and the ubiquitous presence of missing data. The practical implementation of RFs also requires careful consideration of hyperparameter tuning and the choice of software tools that have varying capacities. Given these considerations, in this article, we aim to provide a comprehensive review of RF methods in psychological research, including their theoretical framework, methodological development, commonly used software tools, and applications in empirical studies, with the hope of revealing some of the challenges that researchers encounter in practice.

This article is structured as follows. We begin by introducing the basic concepts of decision trees, followed by a discussion of the important practical considerations in applying RFs. Next, we review the available methods and common software tools for implementing RFs and a large collection of empirical psychological studies that used RFs for data analyses. Finally, we discuss the challenges and practical issues associated with using RFs in psychology and propose directions for future research. Most importantly, we hope this article can be the start of many conversations about how RFs can be effectively applied to advance psychological science.

Conceptual Framework

Decision trees

Decision trees are nonparametric statistical-learning methods that automatically take care of complex nonlinear relations and interactions. They are supervised-learning techniques for predicting categorical outcomes (classification trees; e.g., depression status) or continuous outcomes (regression trees; e.g., work burnout score). The tree is constructed by recursive partitioning, a step-by-step procedure in which the predictor space (i.e., feature space) X is repeatedly partitioned into subspaces to form more homogeneous subgroups of observations for the purpose of prediction. Its structure resembles an upside-down tree, featuring a root at the top and branches with leaves growing downward, as illustrated in Figure 1a. This structural property is why they are referred to as “decision trees.” Partitioning starts with the root node (i.e., the entire data set), $ℜ_{0} = X = \{x \in ℝ^{d}\}$; at each subsequent step, a splitting variable X_j is selected along with a splitting criterion θ_j; on this basis, the parent node is divided into two nonoverlapping regions: $ℜ_{{child}_{1}} = \{x \in ℜ_{parent} \mid X_{j} \leq θ_{j}\}$ and $ℜ_{{child}_{2}} = \{x \in ℜ_{parent} \mid X_{j} > θ_{j}\}$, as illustrated in Figure 1b. These subregions, also called “child nodes,” are expected to be more homogeneous in terms of the outcome Y. The most commonly used partitioning algorithm is binary recursive partitioning, in which at each split, the parent node is divided strictly into two child nodes. The splitting stops when it hits a predetermined stopping criterion (e.g., each subregion should contain no fewer than 10 observations). The regions or nodes that are not further split are called “terminal nodes” ($ℜ_{t}$). All terminal nodes altogether define a partition of the data (see the dashed lines in Fig. 1b).

Fig. 1.

A hypothetical conceptual example of a classification tree. This tree has a depth of 3: From the root node to $ℜ_{1}$ or $ℜ_{2}$, the path goes through three splits. (a) A classification tree for predicting depression status. (b) The corresponding partition of the predictor space.

Following the partition, each observation in the data should fall into one and only one terminal node. With a single tree, each terminal node is assigned a particular predicted outcome value, ${\hat{Y}}_{t}$. For categorical outcomes, the predicted value is usually computed as the majority vote in the terminal node; for continuous outcomes, the predicted value is usually computed as the average in each terminal node. Therefore, with a single tree, all observations that fall into the same terminal node will have the same predicted value. On the other hand, it is also possible for different terminal nodes to have identical predicted values, and thus, observations ending up in different terminal nodes can be predicted to have the same outcome.

In the following sections, we mainly focus on one of the most widely adopted decision-tree algorithms—classification and regression trees (CARTs; Breiman et al., 1984)—as the base learner of RFs (methodological limitations are briefly discussed in the Discussion section). Note that other alternative decision-tree algorithms also exist, such as the C4.5 classifier (Quinlan, 2014) and conditional inference trees (CTrees; Hothorn et al., 2006). These algorithms share the fundamental decision-tree framework with CART but differ in certain technical aspects, including the criterion for selecting a splitting variable and the internal procedure for handling missing data. Some of these alternatives were reported to better address the inherent biases of CART (Hothorn et al., 2006; Strobl et al., 2008, 2009). Nonetheless, CART still remains the most commonly used base learner in current software implementations of RFs. For a more thorough treatment of the tree-based methods and different algorithms, readers can refer to Mienye and Jere (2024) and Strobl et al. (2009).

Figure 1 provides a simple hypothetical example of a classification tree built with CART. Here, the outcome of predictive interest is the depression status (depressed vs. not depressed). The two covariates, sleep quality and stress level, are both measured on a 5-point Likert scale; higher values indicate higher levels of the corresponding measure. The tree starts growing from the entire sample space. At each step, the CART algorithm finds the best variable to serve as the splitting variable by considering every possible covariate and every possible split for each covariate. All these possible splits are evaluated and compared simultaneously. The single split that yields the best fit is then selected for further growing the tree. For a classification tree, a possible way to evaluate the fit is to compute a node-impurity measure, such as entropy

E_{t} = - p \log p - (1 - p) \log (1 - p)

(1)

or Gini index

{Gini}_{t} = 2 p (1 - p),

(2)

where p = P(Y = depressed) is the observed proportion of depressed observations in node t. The goodness of a possible split can therefore be defined as the reduction in the chosen node-impurity measure after splitting the parent node. The split that leads to the largest reduction in node impurity (i.e., the largest improvement in node purity) is selected.¹ In this example, for instance, sleep quality is chosen as the first splitting variable, and the split occurs at a value of 3 ($\{x \in ℜ_{0} \mid X_{SleepQuality} < 3\}$ and $\{x \in ℜ_{0} \mid X_{SleepQuality} \geq 3\}$), which means that this particular split at the root node yields the best result overall compared with all other possible splits across all variables and all candidate split values.

Now suppose the tree algorithm goes through the recursive-partitioning process and renders the tree model shown in Figure 1 as the best predictive model. After partitioning, the feature space is divided into a total of five terminal nodes ( $ℜ_{1} - ℜ_{5}$ ; each contains 36%, 8%, 14%, 7%, and 35% of the observations, respectively). Recall that the goal of recursive partitioning is to create more homogeneous subgroups after splitting. In this example, 42% of the observations are diagnosed as “depressed,” and 58% of the observations are “not depressed” at the root node, which translates into a Gini index of 0.487. After the initial split at $X_{SleepQuality} = 3$ , 62% of the observations in the left child node are reported depressed, yielding a reduced Gini index of 0.471; in the right child node, only 14% of the observations are reported depressed, resulting in a greatly reduced Gini index of 0.241. Following further recursive partitioning, the node-purity measure is greatly improved, with 100%, 0%, 0%, 84%, and 0% of observations diagnosed as “depressed” in each terminal node, respectively. This partitioning results in zero variation in the observed outcome across most terminal nodes.²
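The impurity values quoted above can be checked with a few lines of code. The following is a minimal sketch, not the CART implementation itself: the function names are ours, the natural logarithm is assumed for entropy (Equation 1 does not specify a base), and the share of observations sent to the left child is not reported in the text, so it is backed out from the three reported proportions.

```python
import math

def gini(p):
    """Gini index for a binary node (Equation 2)."""
    return 2 * p * (1 - p)

def entropy(p):
    """Entropy for a binary node (Equation 1); natural log assumed, 0 * log(0) treated as 0."""
    return -sum(q * math.log(q) for q in (p, 1 - p) if q > 0)

def impurity_reduction(p_parent, p_left, p_right, w_left, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two child nodes."""
    w_right = 1 - w_left
    return impurity(p_parent) - (w_left * impurity(p_left) + w_right * impurity(p_right))

# Proportions of depressed observations reported for the Figure 1 example.
p_root, p_left, p_right = 0.42, 0.62, 0.14
print(round(gini(p_root), 3), round(gini(p_left), 3), round(gini(p_right), 3))
# prints: 0.487 0.471 0.241

# Assumption: the left-child share is implied by
# p_root = w_left * p_left + (1 - w_left) * p_right.
w_left = (p_root - p_right) / (p_left - p_right)
reduction = impurity_reduction(p_root, p_left, p_right, w_left)
print(round(reduction, 3))
```

A positive reduction indicates that the two child nodes are, on average, purer than the root node.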

In this classification tree, the predicted outcome is the majority depression status observed among the training cases in each terminal node. To predict the depression status for a new individual, we simply send the individual down the branches of the tree, guided by the individual’s covariate values. In this example, the first question to ask is whether this person has a sleep-quality score below 3. If yes, the person moves down the left branch; if not, the person moves to the right. Once in the left child node, the next question is whether the stress level is above 3. An answer of yes sends this individual further down the left branch, ending up in terminal node $ℜ_{1}$; in contrast, an answer of no sends the person to the right branch, reaching terminal node $ℜ_{2}$ instead. If this individual is ultimately sent to terminal node $ℜ_{2}$, for instance, given that all training observations in that node are not depressed (100% not depressed), our prediction for this new individual should be “not depressed.” Consider another individual with a sleep-quality score of 1 and a stress level of 4. Following the same logic, this person will be sent down to terminal node $ℜ_{1}$, where all training observations are depressed. Consequently, the second individual is predicted to be depressed.

How does one evaluate the prediction accuracy of a decision tree? With decision trees, the prediction accuracy is usually assessed by examining the model’s performance on a separate test set or through cross-validation, which involves partitioning the data set into multiple subsets and repeatedly training the model on some while validating it on the remaining ones (see more in Browne, 2000; Koul et al., 2018; Stone, 1974). For classification trees, a standard measure of predictive accuracy is the misclassification rate, which quantifies the proportion of incorrectly classified observations in the entire test set:

\frac{1}{N_{test}} \sum_{i = 1}^{N_{test}} I ({\hat{Y}}_{i} \neq Y_{i}),

(3)

where ${\hat{Y}}_{i}$ is the final predicted class label for individual i and $I ({\hat{Y}}_{i} \neq Y_{i})$ is an indicator function that takes a value of 1 if the prediction is incorrect and 0 otherwise. $N_{test}$ is the total number of observations in the test set. Building increasingly complex trees is expected to effectively reduce misclassification in the training data. However, highly complex predictive models are also known to have large variance because they are sensitive to even small random fluctuations in the sample data. This is famously known as the bias-variance trade-off. Overly complex trees tend to capture not only the meaningful data patterns but also the random noise, leading to poor model performance when presented with new observations. Therefore, with a single tree, to reduce the risk of overfitting, it is important to assess a model’s predictive accuracy on data not used for growing the tree. With a single tree, it is a common strategy to first grow a large (i.e., bushy) tree and then prune off the branches, which helps address the issue of overfitting. A pruned tree is a subtree of the large bushy tree obtained by collapsing its nodes (i.e., undoing the splits) from the bottom up toward the root node. This is usually done through cost-complexity pruning, which aims to balance model fit against complexity by penalizing complex models, combined with cross-validation to select the optimal level of pruning.
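As an illustration of Equation 3, the misclassification rate can be computed directly from observed and predicted labels. This is a minimal sketch; the function name and the five test cases are hypothetical and chosen only to make the arithmetic visible.

```python
def misclassification_rate(y_true, y_pred):
    """Proportion of test cases whose predicted label differs from the observed label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    errors = sum(1 for obs, pred in zip(y_true, y_pred) if pred != obs)
    return errors / len(y_true)

# Hypothetical test set of five cases (labels invented for illustration).
y_true = ["depressed", "not depressed", "not depressed", "depressed", "not depressed"]
y_pred = ["depressed", "not depressed", "depressed", "not depressed", "not depressed"]
rate = misclassification_rate(y_true, y_pred)  # 2 of the 5 cases are misclassified
print(rate)
```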

Extending the discussion to regression trees follows a similar logic except that the outcome is continuous and the measures of goodness are naturally different. To grow a tree, rather than using measures such as entropy or the Gini index to evaluate the possible splits, regression trees minimize the variability of the responses in each node by computing the sum of squared errors (SSE):

{SSE}_{t} = \sum_{i \in ℜ_{t}} {(Y_{i} - {\bar{Y}}_{. t})}^{2},

(4)

where ${\bar{Y}}_{. t} = \frac{1}{n_{t}} \sum_{i \in ℜ_{t}} Y_{i}$ is the average of the observed outcome values within node t. SSE_t is thus the squared difference between the observed outcome and the predicted outcome, summed over all observations in each node t (n_t denotes the total number of observations falling into node t). The split that yields the minimum overall SSE is then selected. To evaluate the prediction accuracy with test data, the test SSE can be similarly computed:

{SSE}_{test} = \sum_{i = 1}^{N_{test}} {(Y_{i} - {\hat{Y}}_{i})}^{2} .

(5)
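The SSE-based split search for a single covariate can be sketched as follows. This is an illustrative, unoptimized version under our own naming; the stress and burnout values are invented, and actual CART software uses more efficient search strategies.

```python
def sse(values):
    """Sum of squared deviations from the node mean (Equation 4)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_sse) for the split x <= threshold that minimizes the
    summed SSE of the two child nodes; (None, parent SSE) if no split improves on the parent."""
    best_threshold, best_total = None, sse(y)
    for threshold in sorted(set(x))[:-1]:  # the largest value would leave the right child empty
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        total = sse(left) + sse(right)
        if total < best_total:
            best_threshold, best_total = threshold, total
    return best_threshold, best_total

# Hypothetical data: stress level (1 to 5) and a work-burnout score.
stress = [1, 2, 2, 3, 4, 4, 5, 5]
burnout = [10, 12, 11, 13, 30, 32, 31, 33]
threshold, total = best_split(stress, burnout)
print(threshold, total)  # the split at stress <= 3 separates the low and high burnout scores
```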

Ensemble methods: bagging and RFs

Individual decision trees usually suffer from instability, which means they are sensitive to even small changes or outliers in data, especially for the large complex models that are overfit to the training data. That is to say, partitioning based on a slightly different sample can yield an entirely different tree model and thus very different predicted results for the same observations. Therefore, a single tree is unstable with high variance. Such instability is generally undesirable for predictive tasks. To address this problem, ensemble methods have been developed, such as bagging (bootstrap aggregating; Breiman, 1996) and RFs (Breiman, 2001).³ These approaches explicitly leverage the inherent instability of individual trees to construct a robust collective committee of models. By perturbing the training data, typically through bootstrap sampling, multiple predictive models are generated. These predictions are then aggregated, often through averaging or majority voting, to produce a more stable and accurate ensemble predictor. Ensemble methods are useful because they take the average across multiple trees, which can stabilize the predictions and reduce the impact of random noise.

Let the training data be denoted as $D_{train} = \{X_{i}, Y_{i}\}$ of size n, i = 1, 2, ..., n. The bagging procedure draws B bootstrap samples $D^{*} = \{X_{i}^{* b}, Y_{i}^{* b} \mid b = 1, 2, \dots, B\}$ from D_train, typically by repeated sampling with replacement.⁴ Each bootstrap sample often contains n observations, thus preserving the structure of the original data. Suppose Y_i is a binary outcome (e.g., depression status). For each bootstrap sample $D^{* b} = \{X_{i}^{* b}, Y_{i}^{* b}\}$, a classification tree T^*b is independently grown, thereby generating B classification trees in total. Unlike the single-tree scenario, each tree is usually grown without pruning. Theoretically, the bootstrapped sample is expected to reflect the same data-generating process as the original data. But the procedure of bootstrapping introduces random variations across data sets, thus leading to variations among the trees grown (Fig. 2). To predict the outcome for a new observation X_test, one needs to pass the observation through each classification tree T^*b, sending it down to a unique terminal node in each tree. Therefore, this new observation will obtain B predicted values regarding outcome Y. For instance, consider a hypothetical new observation with X_{SleepQuality} = 2, X_{PhysicalActivity} = 2, X_{Stress} = 3, and X_{SocialSupport} = 2. With the first bootstrap classification tree T^*1 (Fig. 2a), the prediction is ${\hat{Y}}^{* 1} = \text{depressed}$; the second bootstrap classification tree T^*2 (Fig. 2b) predicts ${\hat{Y}}^{* 2} = \text{not depressed}$, and so on. After collecting the prediction results from all the B classification trees, one simply counts the number of “votes” each outcome category receives. The final predicted class for the test case X_test is determined by the majority vote across the B trees. For instance, if 100 bootstrapped samples are drawn, 100 classification trees are grown, and 73 of them predict ${\hat{Y}}_{test} = \text{depressed}$, then the final prediction outcome from this bagging procedure is “depressed.”
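The two ingredients of bagging described above, bootstrap sampling and majority voting, can be sketched in a few lines. This is an illustration only: the function names and the random seed are our own choices, the fitted trees are omitted, and ties are broken by whichever class is encountered first, which is an assumption rather than a rule stated in the text.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(n)]

def majority_vote(predictions):
    """Return the most frequent predicted class among the B tree predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(1)  # fixed seed, chosen arbitrarily for reproducibility
sample = bootstrap_indices(10, rng)  # some indices typically repeat, others are left out
print(sorted(sample))

# The 73-versus-27 split from the text; the trees producing these votes are not modeled here.
votes = ["depressed"] * 73 + ["not depressed"] * 27
final = majority_vote(votes)
print(final)  # prints: depressed
```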

Fig. 2.

A hypothetical conceptual example of bagging. (a) The first bootstrap classification tree T^*1; (b) The second bootstrap classification tree T^*2; (c) The third bootstrap classification tree T^*3.

Bagging with regression trees follows a similar logic to that of classification trees. Apart from the difference in evaluation measures, a key distinction between bagging regression trees and classification trees lies in how predictions are synthesized. Whereas classification trees rely on a majority vote to determine the final outcome, regression trees aggregate predictions by computing the average of the outcome values predicted by each of the B individual trees:

{\hat{Y}}_{bag} (X_{test}) = \frac{1}{B} \sum_{b = 1}^{B} {\hat{Y}}^{* b} (X_{test}) .

(6)

A methodological extension of bagging is RFs (Breiman, 2001). In bagging, randomness is introduced through the process of bootstrap sampling from the original data. Ideally, the individual trees grown out of these bootstrapped samples should be independent. Thus, bagging can effectively reduce the variance by aggregating over independent predictions while lowering bias by growing large, bushy individual trees. In practice, however, these tree models can be correlated, undermining the variance-reduction effect of bagging. To further de-correlate the individual trees, RFs introduce another layer of randomness on top of bagging. This typically involves randomly selecting a subset of features at each split in the individual trees (Ho, 1998).

The algorithm of RFs closely resembles that of bagging. It begins by drawing B bootstrap samples from the original training set via repeated sampling with replacement. But in each bootstrap sample, the way to grow a classification tree or regression tree is slightly different from that in bagging. More specifically, at each candidate split, rather than evaluating all p available covariates, the algorithm randomly selects a subset of size m (m ≤ p; when m = p, RF is simply equivalent to bagging). Among these m covariates, the RF algorithm identifies the optimal split based on a node-impurity measure, such as entropy or the Gini index for classification or SSE for regression tasks. The process of random feature selection and recursive partitioning continues until a predefined stopping criterion is reached. Repeating this tree-growing process for each of the B bootstrap samples yields an ensemble of B classification or regression trees, which collectively defines an RF. Predictions for new observations follow the same rules as in bagging: majority voting for classification and averaging predictions for regression. In contrast to a single regression tree that assigns an identical predicted outcome to all observations in the same terminal node, the averaged predictions in bagging and RFs typically vary from one observation to another even when some trees place them in the same leaf. This results in a more finely grained prediction surface.
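The step that distinguishes RFs from bagging, restricting each split to a random subset of m covariates, can be sketched for a single node as follows. The function names, the toy data, and the seed are hypothetical, and the exhaustive search shown here is a simplification of what RF software does.

```python
import random

def gini(labels):
    """Binary Gini index 2p(1 - p), where p is the share of label 1 in the node."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split_in_random_subset(X, y, m, rng):
    """Sample m of the p covariates, then return ((feature, threshold, weighted_gini), candidates)
    for the best split x[feature] <= threshold among the sampled covariates only."""
    p = len(X[0])
    candidates = rng.sample(range(p), m)  # m = p would reproduce bagging
    best = (None, None, gini(y))
    for j in candidates:
        for threshold in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= threshold]
            right = [yi for row, yi in zip(X, y) if row[j] > threshold]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if weighted < best[2]:
                best = (j, threshold, weighted)
    return best, candidates

# Hypothetical data: columns are sleep quality, stress, social support; y = 1 means depressed.
X = [[1, 5, 2], [2, 4, 3], [2, 5, 1], [4, 2, 4], [5, 1, 3], [4, 3, 5]]
y = [1, 1, 1, 0, 0, 0]
(feature, threshold, impurity), candidates = best_split_in_random_subset(
    X, y, m=2, rng=random.Random(0)
)
print(candidates, feature, threshold, impurity)
```

Because only the sampled covariates are searched, two trees grown on similar bootstrap samples can still choose different splitting variables, which is the source of the additional de-correlation.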

Assessing the variable importance (VI) is an important and common consideration in implementing RFs. Beyond obtaining prediction results, researchers are often interested in knowing which variables contribute to the classification or which variables are most influential in explaining the response variable. Although tree-based algorithms are mainly used for prediction purposes and are data-driven and exploratory in nature, evaluating the VI can help with feature selection for subsequent analyses and can potentially contribute to future theory development (e.g., Brick et al., 2018). Nonetheless, the evaluation of VI is already challenging in a single-tree model given the nonlinearity and complex interactions; it becomes even less intuitive with ensembles of trees, particularly compared with parametric models that most researchers are more familiar with (e.g., linear regression). One widely used approach is briefly introduced here. For RFs, in each individual tree T^*b, there is an out-of-bag (OOB) sample, defined as the collection of observations that are not selected into the bth bootstrap sample: $D_{OOB}^{* b} = \{X^{*}, Y^{*} \mid X^{*}, Y^{*} \notin D^{* b}\}$. These OOB observations are dropped down the bth tree T^*b to obtain predicted values. The prediction error (e.g., misclassification rate for classification and SSE for regression) can thus be computed for this OOB sample, ${PE}_{OOB}^{* b}$. Next, for a given covariate X_j, its observed values are randomly permuted across all the observations in this OOB sample $D_{OOB}^{* b}$ while the rest of the data remain unchanged. This updated OOB data set, in which X_j has been randomly permuted, is denoted as PD_j^*b; it is then passed through the bth tree again to generate a new set of predictions. The new prediction error is computed as ${PE}_{OOB j}^{* b}$, which is expected to be larger than the original OOB error ${PE}_{OOB}^{* b}$ without permutations. The rationale is that if a variable is important for making correct predictions, randomly permuting its values in the sample will greatly reduce the prediction accuracy. A larger decrease in model accuracy thus indicates a more important predictor. Therefore, the importance measure of covariate X_j in the bth tree can be defined as

{VI}_{j}^{* b} = {PE}_{OOB j}^{* b} - {PE}_{OOB}^{* b} .

(7)

This process is repeated in each of the B trees for covariate X_j. Finally, averaging these importance measures across the B trees yields an overall importance measure for X_j:

{VI}_{j} = \frac{1}{B} \sum_{b = 1}^{B} {VI}_{j}^{* b} .

(8)

Although it seems computationally cumbersome, this method has been conventionally integrated into the RF algorithm, and importance scores are often automatically computed for all variables in software programs. Figure 3 provides a hypothetical example of the permutation-based VI measures, computed as the increase in misclassification rate, with variables ordered from more important to less important.
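The permutation procedure in Equations 7 and 8 can be sketched as follows. This is a minimal illustration rather than the routine used by any particular software package: each “tree” is represented by a plain function that maps a row of covariates to a predicted label, and the toy tree, data, and seed are hypothetical.

```python
import random

def misclassification_rate(tree, X, y):
    """Share of rows for which tree(row) differs from the observed label."""
    return sum(tree(row) != label for row, label in zip(X, y)) / len(y)

def permutation_importance(tree, X_oob, y_oob, j, rng):
    """Equation 7 for one tree: OOB error after permuting column j minus the original OOB error."""
    baseline = misclassification_rate(tree, X_oob, y_oob)
    shuffled = [row[j] for row in X_oob]
    rng.shuffle(shuffled)
    X_perm = [row[:j] + [value] + row[j + 1:] for row, value in zip(X_oob, shuffled)]
    return misclassification_rate(tree, X_perm, y_oob) - baseline

def forest_importance(trees, oob_sets, j, rng):
    """Equation 8: average the per-tree importance of column j over the B trees."""
    scores = [permutation_importance(t, X, y, j, rng) for t, (X, y) in zip(trees, oob_sets)]
    return sum(scores) / len(scores)

# Hypothetical stand-in for a fitted tree: it splits on column 0 only and ignores column 1.
def toy_tree(row):
    return "depressed" if row[0] < 3 else "not depressed"

X_oob = [[1, 4], [2, 5], [4, 1], [5, 2], [1, 3], [5, 5]]
y_oob = [toy_tree(row) for row in X_oob]  # labels the toy tree classifies perfectly
rng = random.Random(0)
vi_used = forest_importance([toy_tree] * 3, [(X_oob, y_oob)] * 3, j=0, rng=rng)
vi_unused = forest_importance([toy_tree] * 3, [(X_oob, y_oob)] * 3, j=1, rng=rng)
print(vi_used, vi_unused)
```

Permuting a covariate that the tree never uses leaves every prediction unchanged, so its importance is exactly zero in this sketch, whereas permuting the splitting variable can only increase the error relative to the error-free baseline.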

Fig. 3.

Variable-importance plot from random forests for the hypothetical example predicting depression.

The RF algorithm involves three main tuning parameters (also referred to as “hyperparameters”) whose values researchers need to set when growing a forest: the tree depth (i.e., the number of splits along the longest path from the root node to a terminal node), the number of randomly selected covariates at each potential split (m), and the number of bootstrap samples (or the number of trees grown), B. To ensure the effectiveness of RFs, the hyperparameters’ values have to be carefully specified. Some common suggestions recommend $m = \sqrt{p}$ for classification and $m = p / 3$ for regression (Hastie et al., 2009, p. 592), but the optimal values depend on the specific context and data set. Hyperparameters can be empirically determined using cross-validation, although this process can become computationally expensive. A more detailed discussion on hyperparameter tuning is provided in the next section.
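For a sense of scale, the rule-of-thumb values of m can be tabulated for a few values of p. In this sketch, rounding down to an integer and enforcing a minimum of one covariate are our assumptions; the function name is hypothetical, and software defaults may differ.

```python
import math

def default_m(p, task):
    """Rule-of-thumb number of covariates tried at each split (rounded down, at least 1)."""
    if task == "classification":
        return max(1, math.isqrt(p))  # integer part of sqrt(p)
    if task == "regression":
        return max(1, p // 3)  # integer part of p / 3
    raise ValueError("task must be 'classification' or 'regression'")

for p in (4, 10, 50):
    print(p, default_m(p, "classification"), default_m(p, "regression"))
```

With only a handful of predictors, these defaults leave very few distinct values of m to tune over.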

Practical issues in applying RFs

Although decision trees and RFs have gained increasing popularity in psychological research, the effectiveness of RFs, which were originally developed outside psychological-research contexts, has not yet been thoroughly investigated. Important questions remain about their applicability and how they can be effectively used to address the practical challenges that are more specific to psychology studies. In applying RFs, psychologists need to navigate a series of methodological considerations that can substantially influence the effectiveness of the RF models and the interpretation of results.

Goals of implementing RFs

To begin, the purpose for using RFs can vary widely (Probst et al., 2019; Shmueli, 2010; Yarkoni & Westfall, 2017). Many researchers employ RFs as a predictive tool aiming to enhance accuracy in predicting future outcomes. Some predominantly rely on RFs because of their strength and flexibility in handling high-dimensional data in which higher-order nonlinearity and complex interactions are anticipated. Alternatively, others may choose to include RFs as part of a comparison between various predictive models, ranging from traditional parametric models (e.g., linear regression or logistic regression) to nonparametric models (e.g., bagging, boosting), to empirically determine which method offers the best predictive accuracy. A third application of RFs is in preliminary variable selection, in which they assess the relative importance of predictors in explaining the outcome variable rather than just serving as a predictive tool. This usually involves using RFs to identify important covariates that will be further examined in inferential statistical modeling (and sometimes predictive modeling as well). Whether the goal is to predict, to explain, to select features, or to achieve several of these at once, RFs’ performance and applicability need to be assessed in light of the specific research purpose (Shmueli, 2010; Yarkoni & Westfall, 2017).

Data dimensions

Given that psychological research often concerns a very specific target population and in-person data collection, the data conditions in psychological studies can be quite different from those in other fields. Some common challenges include small sample size, data dimensionality, and missing data. Psychologists often need to work with small data sets with a limited number of observations, particularly when data collection is costly. With traditional inferential statistics, small sample sizes can be challenging because they can introduce larger bias and reduce estimation precision. Do they also affect the performance of RFs as a predictive model or as a variable-selection method? Although RFs are believed to be applicable in “small n large p” conditions (Fife & D’Onofrio, 2023; Matsuki et al., 2016; Strobl et al., 2009), potential challenges may still exist, particularly in terms of model overfitting (e.g., Yarkoni & Westfall, 2017). Recall that RFs fundamentally rely on bootstrap sampling. Each individual tree is trained on a bootstrap sample, which, on average, is expected to contain approximately 63% of the distinct observations in the original data (if sampling with replacement). For small data sets, this can substantially reduce the effective training-set size for individual trees. With a limited training set, individual trees can be overfitted, especially if the trees are allowed to grow deep and large. This overfitting can be further exacerbated if irrelevant predictors are included in the training process without prescreening. For example, consider a small study with 30 participants, 10 of whom are clinically depressed. Suppose by pure chance, all of the depressed participants use iPhones to fill out the survey and some of the nondepressed participants use Android phones. With such a limited sample size, the individual tree may try to capture this random data pattern (e.g., suggesting that using an iPhone is an important predictor of depression), rendering the results less generalizable.
Unfortunately, sample size is not typically carefully evaluated and justified when predictive models are employed (Dhiman et al., 2023), and there are no clear guidelines regarding what should be considered a good sample size or how to plan for a sufficient sample size with predictive modeling. Some research has shown that small sample size can lead to larger bias when certain machine-learning algorithms are applied (e.g., support vector machine; Vabalas et al., 2019), but the methodological implications of small n for RFs in psychology studies remain unclear.
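The figure of approximately 63% follows from the probability that a given observation is drawn at least once in n draws with replacement, 1 − (1 − 1/n)^n, which approaches 1 − 1/e ≈ 0.632 as n grows. The short sketch below, with hypothetical function names and an arbitrary seed, evaluates this expression and checks it by simulation for the 30-participant example.

```python
import random

def expected_unique_fraction(n):
    """Probability that a given observation appears at least once in a bootstrap sample of size n."""
    return 1 - (1 - 1 / n) ** n

def simulated_unique_fraction(n, reps, rng):
    """Average share of distinct original observations across reps bootstrap samples."""
    total = 0.0
    for _ in range(reps):
        total += len({rng.randrange(n) for _ in range(n)}) / n
    return total / reps

for n in (30, 100, 1000):
    print(n, round(expected_unique_fraction(n), 3))

sim = simulated_unique_fraction(30, reps=2000, rng=random.Random(0))
print(round(sim, 3))
```

For n = 30, the expected fraction is about 0.638, so each tree is grown on roughly 19 distinct participants.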

On the other hand, it is also not yet clear whether having a small number of predictors will affect the effectiveness of RFs. With high-dimensional data, the key methodological advantage of RFs over bagging is that they de-correlate the trees by randomly selecting a subset of predictors at each split. With the total number of predictors being limited, this advantage may diminish. The same predictors may frequently appear across multiple trees, making the trees in the ensemble less independent and potentially affecting the overall performance of the forest. To illustrate this consideration, we ran a simple simulation as a proof of concept: From p predictors, we randomly drew two independent sets of m candidate variables (without replacement) and counted the number of shared predictors; this process was repeated 1,000 times. With p = 10, the average overlap across replications was about 0.38, 2.50, and 6.39 when m = 2, 5, and 8, respectively; that is, with m/p = 0.5, roughly half of the five candidate predictors are expected to be shared by two trees, and with m/p = 0.8, about 80% of them are expected to be identical. In contrast, when p = 50, the average overlap between two independent draws dropped to about 0.09, 0.49, and 1.29 for m = 2, 5, and 8, respectively. Thus, the potential overlap between trees is substantially reduced.⁵ This example shows that when the total number of available predictors p is small, the m/p ratio can grow large quickly with even small changes in m, causing the trees to consider largely the same variables at each split and thus become more correlated, undermining the de-correlation property of RFs.
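A simulation of this kind can be reproduced in a few lines. The sketch below is our own reconstruction under stated assumptions, not the code used for the reported numbers: the function name and seed are hypothetical, so the simulated averages will differ slightly from those above. Under simple random sampling, the expected overlap is m²/p, which is printed alongside each simulated average.

```python
import random

def mean_overlap(p, m, reps=1000, seed=0):
    """Average number of predictors shared by two independent draws of m out of p."""
    rng = random.Random(seed)
    total = 0
    for _ in range(reps):
        first = set(rng.sample(range(p), m))
        second = set(rng.sample(range(p), m))
        total += len(first & second)
    return total / reps

for p in (10, 50):
    for m in (2, 5, 8):
        # m * m / p is the expected overlap under simple random sampling
        print(p, m, round(mean_overlap(p, m), 2), m * m / p)
```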

Imbalanced data

In psychology studies, imbalanced data sets are common, particularly when the outcome of interest is relatively rare or underrepresented in the population. For example, in mental-health research, the number of children with suicide attempts (e.g., see Harman et al., 2021) is substantially smaller than the number of healthy control subjects. With such class imbalance, the RF algorithm can be biased because it will prioritize the majority class, resulting in poor prediction of the minority class. This can be problematic because the minority class is more often the focal research interest (e.g., identifying at-risk children for potential suicide).

As a consequence, when the outcome is highly imbalanced, the overall misclassification rate, or accuracy, can be a very misleading indicator of model performance because it is dominated by the majority class. For example, if 95% of the training cases are individuals without suicide attempts, a model that predicts all cases as “no suicide attempts” can achieve a 95% accuracy (a low misclassification rate = 5%), but the model itself is completely useless for identifying individuals at risk (both sensitivity and precision are zero⁶). In such scenarios, alternative metrics that assess the model performance in each specific outcome class are preferable to better align with the intended research purposes, such as sensitivity (i.e., recall), specificity,⁷ precision, and F1 score.⁸ In addition, visual inspections can be informative as well. The precision-recall curve, which is conceptually similar to the receiver operating characteristic curve but more appropriate with imbalanced data, can be used to assess the model performance across varying decision thresholds.
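The 95%/5% example can be checked numerically. The following sketch uses hypothetical data (5 positive and 95 negative cases) and a hypothetical helper, classification_metrics, written in plain Python. One convention is assumed here: when a ratio has a zero denominator (e.g., precision when no case is predicted positive), it is reported as 0.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (recall), specificity, precision, and F1 for a binary outcome."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    def ratio(num, den):
        # Assumed convention: an undefined ratio (zero denominator) is reported as 0.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }

# Hypothetical data: 5% positive cases, and a model that always predicts the majority class.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Accuracy is high only because the majority class dominates; the class-specific metrics show that no positive case is identified.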

To address class imbalance, some resampling techniques can be applied (Chawla et al., 2002; Japkowicz, 2000; Ling & Li, 1998). One possible approach is to randomly oversample the minority class or randomly undersample the majority class (Japkowicz, 2000; Ling & Li, 1998). Random oversampling balances the data set by duplicating observations from the minority class until the resampled minority class consists of as many data points as the majority class; in contrast, random undersampling balances the data set by sampling only a smaller proportion from the majority class until its size matches that of the minority class. As a more sophisticated alternative, the synthetic minority oversampling technique (SMOTE; Chawla et al., 2002) further perturbs the data by randomly generating synthetic minority-class samples through interpolation between each minority unit and its nearest neighbors in the feature space, which can be identified using a distance metric, such as Euclidean distance, until the minority and majority classes are equal in size.
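The three resampling ideas can be sketched in plain Python. This is an illustrative simplification, not the reference SMOTE implementation: the function names, the simulated two-dimensional data, and the choice of k = 5 are assumptions, and the SMOTE-like function picks base cases at random rather than cycling through every minority case.

```python
import math
import random

def random_oversample(minority, majority, rng):
    """Duplicate randomly chosen minority cases until both classes are equal in size."""
    n_extra = len(majority) - len(minority)
    return minority + [rng.choice(minority) for _ in range(n_extra)]

def random_undersample(minority, majority, rng):
    """Keep a random subset of the majority class that matches the minority class size."""
    return rng.sample(majority, len(minority))

def smote_like(minority, n_target, k, rng):
    """Add synthetic minority cases by interpolating toward one of the k nearest neighbors."""
    synthetic = []
    while len(minority) + len(synthetic) < n_target:
        base = rng.choice(minority)
        others = [x for x in minority if x is not base]
        # k nearest minority neighbors of the base case by Euclidean distance
        neighbors = sorted(others, key=lambda x: math.dist(base, x))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + gap * (nb - b) for b, nb in zip(base, neighbor)))
    return minority + synthetic

# Hypothetical data: 10 minority and 90 majority cases with two features each.
rng = random.Random(1)
minority = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(10)]
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(90)]

oversampled = random_oversample(minority, majority, rng)
undersampled = random_undersample(minority, majority, rng)
smoted = smote_like(minority, len(majority), k=5, rng=rng)
print(len(oversampled), len(undersampled), len(smoted))  # 90 10 90
```

In practice, resampling of this kind is usually applied to the training data only, so that the evaluation data keep the original class distribution.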

Missing data

Missing data are also typical in psychological studies involving human subjects because of participant nonresponse, dropouts, or other practical issues. There are different strategies for handling missing data in RFs, each with its own advantages and disadvantages. Listwise deletion is the most widely used approach, which simply discards all observations with incomplete data before fitting RF models. However, listwise deletion can be a huge waste of available information and can be infeasible if the sample size is small to begin with.

Among other more modern missing-data-handling methods, by far the most popular approach in CART is surrogate splits (Hapfelmeier et al., 2014; Hothorn et al., 2006). With this approach, missing data are handled internally in each tree. Surrogates are defined locally at each split; when the best splitting variable is selected, other candidate splitting variables that best mimic the current optimal splitting result are ranked and labeled as the surrogates. When the primary splitting predictor is missing for a given observation, the RF algorithm instead uses the best surrogate available to assign this observation further down the tree. The methodological challenges of surrogate splits have been noted, however, given the computational burden, and thus, many other approaches have been proposed (Tang & Ishwaran, 2017). Instead of using surrogate splits, some software (e.g., the ranger R package) implements an algorithm in which missing values are initially ignored when evaluating potential splits. Once the best split is determined, observations with missing values on the splitting variable are temporarily assigned to each child node in turn, and the assignment that optimizes the split results is chosen. This optimal assignment is then stored as the default assignment so that when new observations (i.e., future data points) with missing values for that variable come in, they are directed to the chosen child node during prediction. Alternatively, some other software (e.g., the partykit R package) offers an option to treat the missing value as a unique value/category in its own right. This approach is referred to as “missingness incorporated in attributes” (MIA; Twala et al., 2008), which treats missing status as informative and allows it to predict the outcome. All these approaches handle missingness internally without the need to remove or impute the missing data.
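The idea of learning a default direction for missing values can be illustrated with a toy sketch. It is not the code of any package: the function names, the use of Gini impurity as the split criterion, and the small data set are assumptions for illustration. The split threshold is taken as already chosen from the complete cases; the sketch only decides where the cases with missing values should go.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 2 * p1 * (1 - p1)

def weighted_gini(left, right):
    """Size-weighted average impurity of two child nodes."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

def learn_missing_direction(x, y, threshold):
    """Choose the child node for cases with missing x that gives the lower impurity."""
    left = [yi for xi, yi in zip(x, y) if xi is not None and xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if xi is not None and xi > threshold]
    missing = [yi for xi, yi in zip(x, y) if xi is None]
    impurity_if_left = weighted_gini(left + missing, right)
    impurity_if_right = weighted_gini(left, right + missing)
    return "left" if impurity_if_left <= impurity_if_right else "right"

# Hypothetical data: None marks a missing value on the splitting variable.
x = [1, 2, 3, None, None, 8, 9, 10]
y = [0, 0, 0, 1, 1, 1, 1, 1]
default_direction = learn_missing_direction(x, y, threshold=5)
print(default_direction)  # right
```

Here the two cases with missing values resemble the right child node in their outcome, so sending them right keeps both child nodes pure; the stored direction would then be reused for new cases with a missing value on this variable.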

An alternative and common approach for handling missing data is to preimpute the missing data and fit the analytical RF model to the complete preimputed data. In some software implementations (e.g., the randomForest R package), the preimputation involves iterative steps that start with simple imputation in which the median or the mode of observed values is used to impute the missing variable. An RF is grown using this simple-imputed data, and a proximity matrix is calculated based on how frequently two cases share the same terminal node across the trees. The proximity matrix is then used to update the initially imputed values (e.g., the missing variable is imputed as a weighted mean of the observed values across all the other cases, weighted by the proximity measure). This process is iterated until convergence. Another available option for preimputing missing data is the missForest algorithm (Stekhoven & Bühlmann, 2012), which treats missing data as a prediction problem. It grows an RF for each variable with missing values using all other variables as predictors; the trained forest is then used to predict the missing values. This process continues until convergence is achieved. A computationally faster version of missForest has also been proposed as mForest, which employs multivariate forests for imputation and thus reduces the number of RFs needed (Tang & Ishwaran, 2017).

Missing values can also be imputed adaptively during the analytic process (i.e., on the fly) rather than beforehand. For instance, the on-the-fly-imputation (Ishwaran et al., 2008; Tang & Ishwaran, 2017) algorithm uses only complete data to determine the best split at each step. Once a splitting variable is selected, for observations with missing data on this selected splitting variable, a random value is drawn from the nonmissing in-bag data to “impute” for this missing value. This observation is then assigned to a child node according to this temporarily imputed value. These temporary imputed values are discarded after this observation is passed down to a child node; thus, the missingness in this observation is preserved onward.
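The routing step of on-the-fly imputation can be sketched as follows. This is a simplified illustration for a single split, assuming the threshold has already been chosen from the complete cases; the function name and the toy data are hypothetical, and real implementations operate within the full tree-growing procedure.

```python
import random

def route_with_on_the_fly_imputation(x_in_bag, threshold, rng):
    """Assign cases to child nodes, drawing temporary values for missing entries."""
    observed = [v for v in x_in_bag if v is not None]   # nonmissing in-bag values
    assignments = []
    for v in x_in_bag:
        # A missing entry borrows a randomly drawn observed value for this split only.
        temp = v if v is not None else rng.choice(observed)
        assignments.append("left" if temp <= threshold else "right")
    # x_in_bag itself is never modified, so missingness is preserved for later splits.
    return assignments

rng = random.Random(42)
x_in_bag = [1.0, None, 4.5, 7.2, None, 9.1]             # hypothetical in-bag values
assignments = route_with_on_the_fly_imputation(x_in_bag, threshold=5.0, rng=rng)
print(assignments)
print(x_in_bag)  # the None entries are still present
```

Because the drawn values are discarded after routing, the same case may receive a different temporary value at a later split on the same variable.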

Although missing data have been studied extensively in conventional inferential-statistical modeling (e.g., Enders, 2022, 2025), it remains unclear which approach is best suited for missing-data handling when implementing RFs in psychological studies. The optimal choice likely depends on the underlying missingness mechanism (missing completely at random, missing at random, missing not at random; Rubin, 1976), the missing-data rate, and the specific structure of the data.

Software Implementations

Besides the theoretical considerations noted above, on a more practical level, psychologists must also decide on which software package to use. Different packages come with different modeling engines, base learners, default configurations, and supported functionalities. These choices can directly affect model results. Thus, researchers need to make informed decisions such that the software implementation is best aligned with the research goals.

Software tools

Many common software environments, such as Python, R, SAS, MATLAB, and SPSS, can be used to implement the RF algorithms. Among them, R and Python are two of the most popular statistical tools and programming languages, both of which are open-source environments and offer various specialized packages for RFs.

In R, multiple stand-alone packages have been developed for different modeling engines. For instance, the classic RF algorithm (Breiman, 2001) that uses CART as the base learner is implemented in both the randomForest package (Liaw & Wiener, 2002) and the ranger package (Wright & Ziegler, 2017); ranger is a computationally faster option. Alternatively, conditional inference forests, which use CTree as the base learner and select splits via statistical tests, are implemented in the partykit package and in its predecessor, the party package (Hothorn et al., 2006; Hothorn & Zeileis, 2015). On the other hand, there are also two wrapper packages, caret (Kuhn, 2008) and tidymodels (Kuhn & Wickham, 2020), that integrate a collection of R packages for machine-learning tasks. These wrapper packages in R do not have their own unique modeling engines for RFs; rather, they are designed to call up the external modeling engines in R⁹ while facilitating a more unified workflow of model specification, data preprocessing, hyperparameter tuning, and model fitting. Similarly, in Python, the Scikit-learn library implements the classic CART-based RF algorithm via the RandomForestClassifier and RandomForestRegressor classes (Pedregosa et al., 2011). For a summary of these popular packages, including their base learners, key function arguments, and default hyperparameter values, see Table 1.

Table 1.

Commonly Used Software Packages for Random Forests

Software package		R			Python
Software package		randomForest	ranger	partykit	Scikit-learn
Key function name		randomForest()	ranger()	cforest()	RandomForestClassifier()	RandomForestRegressor()
Base learner algorithm		CART	CART	CTree	CART
Hyperparameter arguments and defaults	Number of trees	ntree = 500	num.trees = 500	ntree = 500	n_estimators = 100
	Number of variables randomly selected	mtry = sqrt(p); mtry = p/3	mtry = sqrt(p)	mtry = sqrt(p)	max_features = "sqrt"	max_features = n_features
	Sample with replacement	replace = TRUE	replace = TRUE	replace = FALSE
	Size of the bootstrap sample	sampsize = n	sample.fraction = 1	fraction = 0.632	max_samples = None
	Minimum size of terminal nodes	nodesize = 1; nodesize = 5	min.bucket = 1	minbucket = 7	min_samples_leaf = 1
	Maximum number of terminal nodes	maxnodes = NULL			max_leaf_nodes = None
	Minimal node size for split		min.node.size = 1; min.node.size = 5	minsplit = 20	min_samples_split = 2
	Maximal tree depth		max.depth = NULL	maxdepth = Inf	max_depth = None
Software package		R
Software package		caret			tidymodels
Key function name		train(. . ., method = "rf", . . .)	train(. . ., method = "ranger", . . .)	train(. . ., method = "cforest", . . .)	rand_forest(. . .) %>% set_engine("randomForest")	rand_forest(. . .) %>% set_engine("ranger")	rand_forest(. . .) %>% set_engine("partykit")
Base learner algorithm		CART	CART	CTree	CART	CART	CTree
Hyperparameter arguments and defaults	Number of trees	ntree = 500	num.trees = 500	ntree = 500	trees = 500	trees = 500	trees = 500
	Number of variables randomly selected	By default, caret conducts a grid search with three levels for a better mtry value.			mtry = sqrt(p); mtry = p/3	mtry = sqrt(p)	mtry = sqrt(p)
	Sample with replacement	replace = TRUE	replace = TRUE	replace = TRUE	Can be set up in set_engine() accordingly. The argument names and default values are defined locally in each engine.
	Size of the bootstrap sample	sampsize = n	sample.fraction = 1	fraction = 0.632
	Minimum size of terminal nodes	nodesize = 1; nodesize = 5	min.bucket = 1	minbucket = 7
	Maximum number of terminal nodes	maxnodes = NULL
	Minimal node size for split		min.node.size = 1; min.node.size = 5	minsplit = 20	min_n = 10; min_n = 5	min_n = 10; min_n = 5	min_n = 20
	Maximal tree depth		max.depth = NULL	maxdepth = 0	Can be set up in set_engine() accordingly. The argument names and default values are defined locally within each engine.

Note: For cells with two default values, the first is the default for classification tasks, and the second is the default for regression tasks. CART = classification and regression tree; CTree = conditional inference tree.

Besides the differences outlined in Table 1 that are directly relevant to model building, these packages also vary in their approaches to handling data, which can also affect the results. For example, the options available for dealing with imbalanced data can differ substantially from package to package. Typically, data processing and model fitting are separate steps performed sequentially. In the R environment, the two wrapper packages (caret and tidymodels) are designed to streamline the entire process; thus, both provide convenient internal functions that can be used to oversample or undersample the training data to address the class imbalance (e.g., step_upsample, step_downsample, and step_smote in tidymodels; upSample and downSample in caret). In contrast, other stand-alone R packages generally do not offer such fully built-in functionality for handling imbalanced data explicitly before model fitting. Although some limited controls are possible in randomForest and ranger, such as stratified bootstrap sampling with equal sizes at the tree level (as an approximation of simple undersampling), these cannot sufficiently replace the resampling techniques described earlier. Therefore, researchers usually need to rely on external R packages to perform the data-resampling step. Examples of external packages that can be used for resampling include ROSE (Lunardon et al., 2014) and smotefamily (Siriseriwan, 2024). In Python, on the other hand, the imbalanced-learn library (Lemaître et al., 2017) provides useful functions for random oversampling (RandomOverSampler), random undersampling (RandomUnderSampler), and SMOTE (SMOTE) before the RF model is fit via Scikit-learn.

Packages also differ substantially in how they handle missing data. In terms of the stand-alone modeling engines, the randomForest R package cannot handle missing values during the analysis process, and therefore, users need to either remove missing data or impute missing data beforehand as a separate step before passing data to the model. It provides internal functions for this purpose, including the na.roughfix() function that does simple imputation for the outcome and predictors and the rfImpute() function that can iteratively impute the missing data in predictors based on proximity measures. But researchers can also employ any external software for missing-data imputation. For instance, a more advanced extension for RF-based missing-data imputation, the missForest algorithm, is implemented in the missForest R package (Stekhoven & Bühlmann, 2012). The ranger and partykit R packages, on the other hand, offer better internal functionalities for handling missing data on the fly without the need for preimputation or preprocessing. In ranger, this can be specified via the na.action = "na.learn" argument, which, as detailed in the previous section, allows the RF algorithm to dynamically “learn” the optimal branch for observations to go down the tree when there are missing values in a splitting variable. In the partykit R package, surrogate splits and/or the MIA procedure can be enabled to handle missing data in predictor variables (by setting a positive integer for the maxsurrogate argument or setting MIA = TRUE, respectively). However, neither ranger nor partykit allows missing values in the outcome variable, so any observations with missing outcomes must be removed or imputed before model fitting.

When the wrapper R packages are used, researchers have three general options to handle missing data. They can choose to rely on the engine’s native functionalities as described above, use the wrapper’s preprocessing tools to process missing values directly in the pipeline, or impute the missing values using other preferred packages before model fitting. To begin with, in caret and tidymodels, missing data can be handled in the same way as in each specific modeling engine as long as the aforementioned arguments are passed to the corresponding modeling engine.¹⁰ Alternatively, both caret and tidymodels provide their own built-in functionalities for preprocessing missing data (e.g., preProcess() in caret; the step_impute_*() and step_unknown() functions in tidymodels).

Finally, in Python, starting from Version 1.4 of the Scikit-learn library, the RandomForestClassifier and RandomForestRegressor classes also support internal missing-data handling when the predictor variables contain missing data. The underlying procedure is similar to what is implemented in the ranger package with na.learn, in which missing data are handled on the fly. Likewise, missing data in outcome variables are not allowed.

Hyperparameter tuning

After the data are properly preprocessed and loaded into the software of choice, the next step in implementing the RF algorithm is to determine the values of the hyperparameters. Current software usually provides default values for the hyperparameters. Table 1 provides a summary of the hyperparameter options and default values used in the stand-alone modeling engine packages (randomForest, ranger, partykit), the wrapper R packages (caret and tidymodels), and Python’s Scikit-learn library. If users do not explicitly specify any hyperparameter values, the software will automatically use its default settings. However, the optimal hyperparameter values are inherently data dependent, and most likely, the default setup will not yield the best performance of RFs. A more robust approach is to tune the hyperparameters either by manual cross-validation or via automated optimization procedures. Ultimately, the choice has to be carefully considered because the values of the hyperparameters can strongly influence the performance of RFs.

Regarding the number of randomly selected covariates for candidate splitting (m), although the default setting is a convenient option (typically $m = \sqrt{p}$ for classification and $m = p / 3$ for regression), it is by no means universally optimal. In certain cases, for example, if most of the variables in the data are relevant and important for predicting the outcome, setting a smaller m can benefit the model prediction such that not all trees are dominated by the strongest predictor; on the contrary, if there are many irrelevant variables in the data, using a larger m can improve the prediction results by ensuring that at least some important predictors are included in each tree (Probst et al., 2019).
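As a small worked example of these defaults, the sketch below computes m for a given p and task. The helper name default_mtry is hypothetical, and the rounding rule (rounding down, with a minimum of 1) is an assumption modeled on the randomForest R package; other packages may round differently.

```python
import math

def default_mtry(p, task):
    """Conventional default m: sqrt(p) for classification, p/3 for regression."""
    if task == "classification":
        return max(math.floor(math.sqrt(p)), 1)   # assumed rounding: floor, at least 1
    if task == "regression":
        return max(p // 3, 1)                     # assumed rounding: floor, at least 1
    raise ValueError("task must be 'classification' or 'regression'")

for p in (6, 10, 20, 45, 212):
    m_class = default_mtry(p, "classification")
    m_reg = default_mtry(p, "regression")
    print(f"p={p:3d}: classification m={m_class} (m/p={m_class / p:.2f}), "
          f"regression m={m_reg} (m/p={m_reg / p:.2f})")
```

The printed m/p ratios show that the same default rule implies a much larger share of candidate predictors per split when p is small than when p is large.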

Tree complexity is another important hyperparameter in decision trees and RFs, which is usually controlled by setting the terminal-node size. Setting a smaller terminal-node size produces a larger tree with more splits, whereas a larger terminal-node size will effectively limit the tree depth. In many software packages, the terminal-node size is set to 1 for classification and 5 for regression by default. In addition to terminal-node size, other hyperparameters can also be used to control the tree complexity. These typically include the smallest node size required for a possible further split¹¹ (e.g., minsplit in the partykit R package), the total number of terminal nodes in a tree (e.g., maxnodes in the randomForest R package), or more straightforwardly, the maximum depth of a tree (e.g., maxdepth in the partykit R package). All these hyperparameters can be specified in combination for an RF model, and a tree will stop splitting further as soon as any one of these stopping conditions is met.

The number of trees is also a key factor to consider in growing an RF. Most software implementations default to 500 trees. But unlike other tuning parameters, increasing the number of trees does not generally lead to overfitting. In fact, research suggests that it is preferable to set it to a reasonably large value for the best predictive performance (Probst & Boulesteix, 2018) as long as the computational resource allows. However, once the prediction performance stabilizes, adding more trees will have diminishing returns in improving the predictive accuracy.¹²

The performance of RF models is highly dependent on the specification of hyperparameters, and therefore, to achieve the optimal results, they must be carefully tuned to adapt to the data and the modeling context. For example, having a smaller data set with smaller n may necessitate fitting fewer trees in a forest, enforcing a larger size of terminal nodes, and limiting the depth of each tree to avoid overfitting. On the other hand, having a smaller number of predictors may require the users to increase the number of randomly selected variables at each step when growing an RF. In the presence of missing data, it is also crucial to confirm whether the chosen software supports internal missing-data handling, and if so, researchers may need to specify additional tuning parameters as needed (e.g., the maximum number of surrogates). Finally, all tuning parameters can affect the computational time, which can be an important practical consideration for researchers. Employing a smaller m, limiting the tree depth, and growing fewer trees in an RF can reduce the computational burden.

With all that, are there any principled ways to find the optimal values of the hyperparameters? Several different strategies can be used for hyperparameter tuning (Owen, 2022). The most straightforward approach is manual tuning, in which researchers manually change the value of one or more tuning parameters and check which value results in the most accurate prediction, usually via cross-validation or OOB error. An extension of manual search is grid search, which automatically loops through all the possible hyperparameter-value combinations in a predefined search space set by the researcher (e.g., tree depths of 5, 10, 15 combined with 50, 100, 200 trees, yielding nine combinations to test). Essentially, grid search is an automated version of manual hyperparameter tuning that iterates over each combination using nested loops. Although both manual search and grid search are conceptually straightforward and easy to implement, they can be time-consuming and rely on researchers to provide reasonably good candidate values to do the comparison in the first place. As the number of hyperparameters and possible values increases, they can become computationally expensive very quickly. Extending the grid search, a more efficient option for automated hyperparameter tuning is random search, which randomly picks hyperparameter values from given probability distributions (e.g., a uniform distribution) rather than relying on any specific user-supplied values. At each iteration, a value for each tuned hyperparameter is randomly and independently picked. Random search thus requires less prior knowledge from the users about the hyperparameter values and is more computationally efficient because it does not test all possible combinations of the values. Random search is also reported to frequently outperform a basic grid search when dealing with many hyperparameters or wide ranges of values (Owen, 2022). However, researchers still have to manually define the total number of iterations (or the number of randomly sampled hyperparameter values) when implementing random search.
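The contrast between grid search and random search can be sketched in plain Python. To keep the example self-contained, a made-up scoring function (toy_cv_score) stands in for the cross-validated or OOB performance of an RF; its shape and its peak (depth 10, 120 trees) are purely hypothetical, as are the function names and search ranges.

```python
import itertools
import random

def toy_cv_score(max_depth, n_trees):
    """Hypothetical stand-in for cross-validated accuracy (peaks at depth 10, 120 trees)."""
    return 0.9 - 0.002 * (max_depth - 10) ** 2 - 0.000004 * (n_trees - 120) ** 2

def grid_search(depths, tree_counts):
    """Evaluate every combination in the user-supplied grid."""
    candidates = list(itertools.product(depths, tree_counts))
    best = max(candidates, key=lambda c: toy_cv_score(*c))
    return best, len(candidates)

def random_search(depth_range, tree_range, n_iter, seed=0):
    """Evaluate n_iter combinations drawn independently from uniform integer ranges."""
    rng = random.Random(seed)
    candidates = [(rng.randint(*depth_range), rng.randint(*tree_range))
                  for _ in range(n_iter)]
    best = max(candidates, key=lambda c: toy_cv_score(*c))
    return best, len(candidates)

# Grid from the text: depths 5, 10, 15 crossed with 50, 100, 200 trees.
grid_best, grid_evals = grid_search((5, 10, 15), (50, 100, 200))
rand_best, rand_evals = random_search((2, 20), (50, 500), n_iter=9)
print("grid:  ", grid_best, "after", grid_evals, "evaluations")
print("random:", rand_best, "after", rand_evals, "evaluations")
```

With the same budget of nine evaluations, grid search can only return one of the supplied combinations, whereas random search may land on values the researcher did not think to list; which one scores better depends on the draw.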

Manual search, grid search, and random search can all be considered as exhaustive search strategies. In addition to these exhaustive search approaches, another well-established and successful tuning strategy is sequential model-based optimization (SMBO; Jones et al., 1998), also referred to as Bayesian optimization (BO), which employs a more adaptive strategy through iterations such that the next iteration is informed by previous iterations. In its implementation, SMBO starts by drawing several random values from the hyperparameter space and evaluating the RF performance accordingly. A surrogate model (a probabilistic regression model, such as a Gaussian process) is fit to these initial training results, roughly assessing how changes in hyperparameters affect the model’s prediction accuracy. It then proposes the next set of hyperparameter values within the predefined hyperparameter space, where the proposed values have the best expected prediction result under the current surrogate model. The proposed hyperparameter values are empirically evaluated for prediction accuracy on the training data, and this new training result is added to the previous training results to further update the surrogate model. This process repeats iteratively to find the optimal hyperparameter values (for a more detailed and intuitive illustration of SMBO, see Appendix A in the Supplemental Material available online). SMBO is more computationally efficient than the other approaches, making it particularly appealing when there are a large number of possible hyperparameter configurations. It does, however, require the researchers to have a better statistical knowledge of the process to properly implement the procedure.
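The SMBO loop can be illustrated with a deliberately simplified sketch. Two substitutions are made and should be kept in mind: instead of a Gaussian process, the surrogate is a kernel-weighted average of the scores observed so far, and instead of expected improvement, the next candidate is chosen by the predicted score plus a small bonus for being far from already evaluated points. The objective toy_score is a made-up stand-in for RF performance as a function of m, and all names and constants are hypothetical.

```python
import math
import random

def toy_score(mtry):
    """Hypothetical stand-in for RF performance as a function of m (peaks at m = 12)."""
    return 0.85 - 0.0005 * (mtry - 12) ** 2

def surrogate_predict(x, history, bandwidth=3.0):
    """Kernel-weighted mean of observed scores, plus distance to the nearest evaluated point."""
    weights = [math.exp(-((x - xi) ** 2) / (2 * bandwidth ** 2)) for xi, _ in history]
    mean = sum(w * yi for w, (_, yi) in zip(weights, history)) / sum(weights)
    nearest = min(abs(x - xi) for xi, _ in history)
    return mean, nearest

def smbo(candidates, objective, n_init=3, n_iter=7, kappa=0.01, seed=0):
    rng = random.Random(seed)
    # Step 1: evaluate a few randomly drawn starting values.
    history = [(x, objective(x)) for x in rng.sample(candidates, n_init)]
    for _ in range(n_iter):
        tried = {x for x, _ in history}
        untried = [x for x in candidates if x not in tried]
        if not untried:
            break

        def acquisition(x):
            # Step 2: the surrogate predicts the score; the bonus rewards unexplored regions.
            mean, distance = surrogate_predict(x, history)
            return mean + kappa * distance

        # Step 3: evaluate the most promising untried value and update the history.
        proposal = max(untried, key=acquisition)
        history.append((proposal, objective(proposal)))
    best = max(history, key=lambda item: item[1])
    return best, history

candidates = list(range(1, 31))            # hypothetical search space: m = 1, ..., 30
best, history = smbo(candidates, toy_score)
print("evaluated m values:", [x for x, _ in history])
print("best m found:", best[0], "with score", round(best[1], 4))
```

Each proposal depends on all earlier evaluations, which is what distinguishes this loop from grid search and random search; a full implementation would replace the surrogate and the acquisition rule with the probabilistic versions described above.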

For a summary of the hyperparameter-tuning options available across the various packages in R and Python, see Table 2. The packages differ in terms of the available tuning options and the flexibility of automated hyperparameter search. Of all the packages reviewed here, only tidymodels offers native SMBO/BO options. When using other packages, researchers have to employ external packages or libraries that specialize in these optimization techniques. For example, the randomForest, ranger,¹³ party, and caret R packages can be used along with the mlrMBO package (Bischl et al., 2018) and mlr package (Bischl et al., 2016) to perform SMBO/BO. Likewise, the scikit-optimize (skopt; Head et al., 2021) and Optuna (Akiba et al., 2019) libraries in Python also offer automated hyperparameter optimization via SMBO that can be applied to Scikit-learn’s RF models.

Table 2.

Hyperparameter Tuning for Random Forest in Commonly Used Software Packages

		Hyperparameter-tuning options
Software package		Manual search	Grid search	Random search	Internal SMBO/BO	External SMBO/BO
R	randomForest	Users can manually loop over different values of hyperparameters and compare the OOB error or other evaluation measures	The helper function tuneRF() can be used to find an optimal value for mtry.	x	x	tuneParams() in mlrMBO package; tuneRanger() in tuneRanger package (for ranger package only)
	ranger		x	x	x
	partykit		x	x	x
	caret		Users can call up the grid-search procedure by specifying the tuneGrid or tuneLength argument in the train() function; more detailed control can be specified in the trControl argument.	Users can call up the random-search procedure by specifying search = "random" in the trControl argument.	x
	tidymodels		Users can set relevant hyperparameter arguments to tune() in rand_forest(. . .) and conduct grid search using the tune_grid() function, with the search grid defined by grid_regular().	Users can set relevant hyperparameter arguments to tune() in rand_forest(. . .) and conduct random search using the tune_grid() function, with the search grid defined by grid_random().	Users can set relevant hyperparameter arguments to tune() in rand_forest(. . .) and conduct BO using the tune_bayes() function.
Python	Scikit-learn		GridSearchCV()	RandomizedSearchCV()	x	BayesSearchCV() from scikit-optimize library; study.optimize() from Optuna library.

Note: x = not internally supported; SMBO = sequential model-based optimization; BO = Bayesian optimization; OOB = out-of-bag.

Review of RFs in Empirical Psychological Studies

In the previous sections, we reviewed the theoretical foundations and key practical considerations for applying RFs. Although RF methods hold great promise for advancing the field of psychological science, their effective implementation requires thoughtful considerations of the specific research contexts, data conditions, and software choices. Our review also underscores a critical gap in the literature: There is a lack of systematic methodological investigations of RFs in psychological-research contexts; the field calls for practical guidelines tailored to the unique challenges inherent to psychological research. To help bridge this gap, we reviewed a large collection of published empirical psychological studies that used RFs as part of their data analyses. By documenting current empirical practices through a systematic review, our goal is to provide applied researchers with not only examples of effective implementation and common pitfalls but also, more importantly, a data-driven reference for methodologists to design future studies that are context-relevant and can directly address the field’s most pressing needs.

The articles reviewed in the current study were selected following the procedure described below. As the first step, an advanced search for research articles was conducted in APA PsycInfo on January 1, 2023, using the keyword “random forest” from years 2020 to 2022. This procedure resulted in 733 published articles. From this pool, articles were dropped if they failed to meet any one of the following screening criteria: (a) It must be an empirical study (e.g., meta-analyses and systematic reviews were dropped), (b) it must be psychological research (e.g., research in other fields, such as computer science and medicine, was dropped), and (c) it must have used RF models as part of data analyses. In the end, a total of 637 published research articles were selected for review, which consisted of 708 studies because multiple studies can be separately reported in one single article. The 708 published empirical studies were defined as the analytical sample for this review. The results of this review are summarized in this section.

In terms of the prediction tasks, the majority of the published studies (75.71%) used RFs for classifying categorical outcomes, whereas 24.29% used RFs for regression tasks predicting continuous outcomes. Regarding software implementation, only 414 studies (58.47%) specified the software used for fitting RF models, and 41.53% of the studies did not provide software information (Fig. 4). Among the 414 studies that reported software details, the vast majority used R or Python, accounting for 50.48% and 44.44%, respectively. A few studies also used MATLAB (e.g., Lohani & Rana, 2023) or SPSS (e.g., Gök et al., 2023).

Fig. 4.

Software used in applied-psychological studies.

Depending on the nature of the research context, the analytical sample sizes varied drastically from one study to another. They ranged from fewer than 10 (e.g., Abreu et al., 2021; Ranjan et al., 2021) to more than 8 million (e.g., Shiner et al., 2022). For the distribution of the analytical sample sizes across the reviewed studies, see Table 3 and Figure 5. Across all the studies, the median sample size was 585; the 10th, 25th, 75th, and 90th percentile sample sizes were 52, 134, 3,309, and 19,804, respectively. Although most studies employing RFs involved large samples, small sample sizes were not uncommon; overall, 5.51% of the studies used fewer than 30 participants, and 12.57% had sample sizes between 30 and 100. Among the 128 studies that involved a sample size $n \leq 100$, the mean sample size was 53. It should be noted that 7.34% of the studies did not report the analytical sample sizes.

Table 3.

Analytical Sample Sizes in Applied-Psychological Studies

Sample size	Number of studies	Percentage
0–30	39	5.51%
30–100	89	12.57%
100–300	118	16.67%
300–1,000	138	19.49%
1,000–5,000	130	18.36%
Above 5,000	142	20.06%
Unreported	52	7.34%

Fig. 5.

Analytical sample sizes in applied-psychological studies.

Not surprisingly, depending on the research context, the number of features also varied widely between studies. Some studies had a collection of fewer than 10 features (e.g., Smucny et al., 2021), whereas others had a large number of input features greater than 30,000 (e.g., Dai et al., 2021). But it was most common for empirical studies using RFs to include fewer than 50 input features, as evidenced in 51.28% of the studies. Across all the studies, the median number of features was 20; the 10th, 25th, 75th, and 90th percentiles of the number of features were 6, 10, 45, and 212, respectively. For the total number of input features and the counts of continuous and categorical features separately, see Figure 6 and Tables 4 and 5. Again, a substantial proportion of the studies (30.65%) did not report the number of input features in their RF models.

Fig. 6.

Total number of features used in applied-psychological studies.

Table 4.

Number of Continuous Features Used in Applied-Psychological Studies

	Classification problems		Regression problems
Number of features	Number of studies	Percentage	Number of studies	Percentage
0–10	96	17.91%	45	26.16%
10–50	189	35.26%	45	26.16%
50–100	29	5.41%	9	5.23%
100–300	25	4.66%	9	5.23%
Above 300	33	6.16%	5	2.91%
Unreported	164	30.60%	59	34.30%

Table 5.

Number of Categorical Features Used in Applied-Psychological Studies

	Classification problems		Regression problems
Number of features	Number of studies	Percentage	Number of studies	Percentage
0–10	134	25.00%	47	27.33%
10–50	42	7.84%	7	4.07%
50–100	6	1.12%	0	0%
Above 100	1	0.19%	0	0%
Unreported	353	65.86%	118	68.60%

To examine the prevalence of large-n-small-p versus small-n-large-p problems, we computed the n/p ratio (the ratio between the sample size n and the total number of input features p) for each study, which is visually summarized in Figure 7. The ratio varied widely across the studies. For example, Dai et al. (2021) included 31,672 features for predicting depression based on data from 189 subjects, which is small-n-large-p; in contrast, Götz et al. (2020) studied the prediction of personality using only 13 input features with data from 3,387,014 individuals, which is large-n-small-p. Again, 33.33% of the reviewed studies did not provide adequate information regarding n or p. For the studies that did report the relevant information, a substantial proportion (24.44%) was considered to be dealing with the small-n-large-p scenario (n/p ratio falling below 10; see Matsuki et al., 2016); on the other hand, a large proportion of the studies (42.23%) operated with large-n-small-p (n/p ratio above 10, n >> p), which is, in general, more typical in machine-learning applications. Across all the studies, the median n/p ratio was 21.9; the 10th, 25th, 75th, and 90th percentiles of the n/p ratio were 1.53, 4.61, 109, and 633, respectively.
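As a worked illustration of the n/p ratio, the following minimal Python sketch reproduces the classification of the two example studies named above. The function name and labels are illustrative and not part of the original analysis, and assigning a ratio of exactly 10 to the large-n-small-p group is an assumption, because the text specifies only "below 10" and "above 10."

```python
# Worked example of the n/p ratio for the two studies named in the text.
# The cutoff of 10 follows the text (see Matsuki et al., 2016); the function
# name and labels are illustrative only.

def np_ratio_label(n, p, cutoff=10.0):
    """Return the n/p ratio and the scenario label used in this review."""
    ratio = n / p
    # Assumption: a ratio exactly equal to the cutoff is treated as large-n-small-p.
    label = "small-n-large-p" if ratio < cutoff else "large-n-small-p"
    return ratio, label

studies = {
    "Dai et al. (2021)": (189, 31_672),        # n = 189, p = 31,672
    "Götz et al. (2020)": (3_387_014, 13),     # n = 3,387,014, p = 13
}

for name, (n, p) in studies.items():
    ratio, label = np_ratio_label(n, p)
    print(f"{name}: n/p = {ratio:.4g} ({label})")
```

Both studies fall far from the cutoff, so the boundary convention does not affect how they are classified.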

Fig. 7.

The n/p ratio in applied-psychological studies.

Regarding the hyperparameters, we focused on two key hyperparameters of RF models in our review—number of trees (ntree) and number of randomly selected covariates for candidate splitting (m). Among the reviewed studies, 69.07% of them did not report the number of trees (Fig. 8), and 89.12% of the studies omitted details about the strategy for random feature selection at each split. Even among those that did provide this information, many relied on default software settings: 37.9% of them used the default number of trees (e.g., setting the number of trees to 500 in R and 100 in Scikit-learn), and 58.44% of them used the default setting for m (e.g., $m = \sqrt{p}$ for classifications and $m = p / 3$ for regressions, with p being the total number of features). Only a minority of studies offered explicit rationales for selecting optimal hyperparameter values, mostly through manual search, by using k-fold cross-validation (e.g., Gomes et al., 2023), comparing the OOB error (e.g., Marengo et al., 2022), or evaluating on a separate validation set (Pradier et al., 2021). Overall, however, this summary can be quite limited because most studies failed to either clearly document the hyperparameter values or justify their specification adequately.
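To make the default rules for m concrete, the sketch below computes them for a given number of features p. It is a minimal illustration rather than any package's actual code: the function name is hypothetical, and the rounding (rounding down, with a floor of 1 for regression) follows the convention of the R randomForest package. Defaults differ across packages and versions, which is one reason to report the value actually used.

```python
import math

def default_m(p, task):
    """Default number of candidate features per split, per the rules quoted above.

    Rounding down, and the floor of 1 for regression, mirror the R
    randomForest convention; other software may round or default differently.
    """
    if p < 1:
        raise ValueError("p must be a positive integer")
    if task == "classification":
        return math.isqrt(p)        # floor(sqrt(p))
    if task == "regression":
        return max(p // 3, 1)       # max(floor(p / 3), 1)
    raise ValueError("task must be 'classification' or 'regression'")

# p = 20 is the median number of features reported in this review.
for task in ("classification", "regression"):
    print(task, default_m(20, task))
```

With p = 20, the two rules give different numbers of candidate features, so a statement that "default settings were used" is informative only if the software, version, and task type are also reported.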

Fig. 8.

The number of trees used in applied-psychological studies.

For the missing-data-handling strategies used in the applied studies, see Figure 9. As shown, 87.15% of the studies did not report the missing-data percentage, and 67.80% did not mention how missing data were handled in RF models. Out of the 228 studies that mentioned the missing-data-handling approaches, the largest share (42.98%) used listwise deletion and discarded cases with missing data, and 42.54% applied some form of imputation to fill in missing data before model fitting¹⁴ (e.g., k-nearest neighbor imputation; Ye et al., 2023). Some studies adopted a mixed strategy such that they discarded variables conditionally on the missing-data percentage and imputed missingness only if the variables had a missing rate below a certain threshold (e.g., variables with more than 25% missing data were excluded from analysis, and the others were imputed; Karabacak & Margetis, 2024).
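The mixed strategy described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure of any cited study: the function name and toy data are hypothetical, and median imputation is a stand-in chosen for brevity (the cited examples used other methods, such as k-nearest neighbor imputation).

```python
from statistics import median

def drop_then_impute(columns, max_missing=0.25):
    """Drop variables whose missing rate exceeds max_missing; impute the rest.

    columns: dict mapping a variable name to a list of numbers, with None
    marking a missing value. Median imputation is used only for brevity.
    """
    kept = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        if not values or not observed:
            continue  # nothing to impute from: exclude the variable
        missing_rate = (len(values) - len(observed)) / len(values)
        if missing_rate > max_missing:
            continue  # too much missingness: exclude the variable
        fill = median(observed)
        kept[name] = [fill if v is None else v for v in values]
    return kept

# Hypothetical toy data: 'age' has 1 of 5 values missing (20%),
# 'income' has 2 of 5 values missing (40%).
data = {
    "age": [21, 34, None, 45, 29],
    "income": [None, 52_000, None, 61_000, 48_000],
}
cleaned = drop_then_impute(data)
print(cleaned)
```

In a predictive workflow, the fill values would normally be estimated on the training data only and then applied to the test data, so that information from held-out cases does not leak into model fitting.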

Fig. 9.

Missing-data-handling approaches.

Discussion

Machine-learning techniques are playing an increasingly important role in transforming psychological science and serve as a great addition to conventional statistical methods and explanatory modeling (Dwyer et al., 2018; Orrù et al., 2020; Rosenbusch et al., 2021; Sleek, 2023; Vélez, 2021; Yarkoni & Westfall, 2017). For instance, they have been shown to be effective in enhancing the diagnostic, prognostic, and treatment decisions in clinical settings, particularly by tailoring personalized intervention strategies to meet the needs of the individual patient (Dwyer et al., 2018).

In this article, we focus on a widely used machine-learning technique, RF, and its applications in empirical psychological studies. This review highlights both the widespread adoption and substantial methodological inconsistencies currently present in the field. Although RF methods hold considerable promise in advancing psychological science, our findings reveal several critical gaps and methodological challenges that warrant attention.

A notable concern identified in this review is the inconsistent reporting and lack of methodological transparency across studies. The quality of technical reporting is poor overall, which was also found in previous work that reviewed the application of predictive modeling in clinical studies (e.g., Bouwmeester et al., 2012; Mallett et al., 2010). Many articles omitted essential methodological details when applying RFs, such as the software used, hyperparameter values, hyperparameter-tuning strategies, and handling of missing data. The omission of this key information can contribute to a replication crisis because replication by future researchers is impossible without these methodological details. The lack of transparency also severely limits the generalizability of findings, makes results difficult to compare across studies, and thus hinders the potential of machine-learning methods to transform psychological science as a field.

Furthermore, our review reveals the diverse data conditions and research contexts under which RF methods are applied in psychological studies, reflecting substantial variability in sample sizes and feature input. Although most RF applications are conducted in large-n scenarios, our literature review shows that a nontrivial proportion of psychological studies operate with small samples. The small-n-large-p condition has long been a common challenge in conventional statistical modeling. Currently, it remains unclear whether RFs are robust under this condition. Given the various purposes RF models possibly serve (including prediction, classification, missing-data imputation, and feature selection), systematically evaluating their performance in small-sample scenarios for different purposes is much needed in future methodological investigations.

Addressing these gaps requires establishing clear, psychology-specific practice guidelines. We recommend that future applied research prioritize rigorous reporting standards, including detailed documentation of software choices, the base learner used for growing trees, hyperparameter decisions, data-preprocessing procedures, and other important technical details that are relevant in a study. Most importantly, future research should develop and adopt standardized reporting frameworks for machine-learning applications in psychology. Existing guidelines, such as the TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis; Collins et al., 2015) statement, provide a useful reference. But psychology-specific guidelines that address the unique considerations in psychological-research contexts are still needed. In addition, given the considerable variability across available software tools, psychological researchers should also consider preregistering their analysis plans. In particular, researchers should preregister the key aspects of their RF models, including the intended purpose of using RF (e.g., prediction vs. feature selection), the hyperparameter-tuning strategy, the cross-validation plan, model-performance evaluation criteria, and alternative comparison models if applicable. We also encourage sharing analysis code whenever feasible; importantly, the code should explicitly document the hyperparameter values (whether or not defaults are used) and include the missing-data processing steps when applicable. This will complement, although not replace, the formal reporting of key analytic decisions in the main text. Together, these practices will significantly enhance reproducibility, transparency, and collective scientific advancement in the field.
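One lightweight way to act on these recommendations in shared analysis code is to keep the key analytic decisions in a single structured record that is saved alongside the results. The sketch below is only an illustration: the class, its field names, and all values are hypothetical and do not represent an established reporting standard.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class RFReport:
    """Hypothetical record of the analytic details discussed above."""
    purpose: str                          # e.g., "prediction" or "feature selection"
    software: str                         # package name and version
    base_learner: str                     # e.g., "CART" or "conditional inference tree"
    n_trees: int                          # number of trees (ntree)
    m: int                                # candidate features per split
    used_defaults: bool                   # True if n_trees and m are software defaults
    tuning_strategy: Optional[str]        # None if no tuning was done
    validation: str                       # e.g., "OOB error", "10-fold cross-validation"
    n: int                                # analytical sample size
    p: int                                # total number of input features
    missing_data_percent: Optional[float]
    missing_data_handling: Optional[str]

# All values below are placeholders for illustration.
report = RFReport(
    purpose="prediction",
    software="<package name and version>",
    base_learner="CART",
    n_trees=500,
    m=4,
    used_defaults=True,
    tuning_strategy=None,
    validation="OOB error",
    n=585,
    p=20,
    missing_data_percent=3.2,
    missing_data_handling="listwise deletion",
)

# Serialize the record so it can be stored with the analysis output.
print(json.dumps(asdict(report), indent=2))
```

A record like this does not replace in-text reporting, but it makes it harder to leave a default value or a preprocessing step undocumented.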

Furthermore, for future methodological work, systematic investigations into the performance of RFs under varying empirical conditions, such as varying sample sizes, feature dimensionality, and missing-data-handling strategies, are urgently needed. Such methodological investigations will help guide psychological researchers to implement RF methods more effectively, thus maximizing their research contributions. It is our hope that the review of applied studies can provide a useful data-driven reference for designing methodological investigations with greater practical relevance. Developing tutorial materials is another important task for methodological work, including tutorials that feature psychological-data examples and clearly demonstrate the workflow (from data preparation to results interpretation) and tutorials that introduce state-of-the-art RF techniques. In particular, many modern, advanced techniques remain unfamiliar to psychological researchers, who, without such knowledge, often default to more convenient yet suboptimal practices (e.g., listwise deletion for missing data). Accessible tutorials would therefore bridge the knowledge gap and promote methodological rigor.

Note, however, that although in this article we focus on RFs as a widely adopted analytical tool in psychological research, there are some limitations and inherent biases in the traditional RF algorithms, particularly with CART as the base learner. One major concern is selection bias when growing the trees. CART’s greedy search over all possible splits was found to favor covariates with many possible splits (or categories) and variables with many missing values (Hothorn et al., 2006; Kim & Loh, 2001; Strobl et al., 2007, 2008). This would not only bias the VI measures but also limit the interpretability of the model. In addition, the bootstrap-sampling-with-replacement procedure commonly used in RFs was also found to introduce bias by favoring covariates with more categories, thereby artificially inflating their importance (Strobl et al., 2007). Furthermore, the typical permutation-based VI measures can also yield biases because of the correlations among covariates (Strobl et al., 2008). Specifically, the VI measure reflects not only a variable’s unique association with the outcome but also its association with other correlated predictors that are related to the outcome, thus leading to an overestimation of a covariate’s independent importance. To address these issues, alternative frameworks, such as conditional inference forests, have been developed; these reduce the bias in variable selection by using a statistical-inference-based approach for splitting rather than a greedy search (Hothorn et al., 2006) and are thus recommended in the literature (e.g., Strobl et al., 2007). Methodological investigations further recommend using conditional inference forests with subsampling without replacement to further reduce the bias (Strobl et al., 2007) and employing conditional permutation schemes for VI (Nason et al., 2004; Strobl et al., 2008) to more accurately evaluate the importance of a covariate conditional on other covariates. Although a comprehensive examination of relevant methods is beyond the scope of this article, researchers should be aware of the potential biases when implementing RF. This again underscores the importance of thoughtful model specification rather than relying solely on software defaults.

Besides RFs, other ensemble methods, including boosting algorithms, such as AdaBoost (Freund & Schapire, 1997; Schapire, 2013), Gradient Boosting Machine (Friedman, 2001), XGBoost (Chen & Guestrin, 2016), and LightGBM (Ke et al., 2017), are powerful alternatives that also have great potential for helping advance psychological science. Unlike bagging and RFs, which build trees independently, boosting algorithms build trees sequentially, with each iteration depending on the previous one. Each RF or boosting algorithm offers its own advantages. For example, LightGBM is designed for computational efficiency and reduces computation time, particularly with large data sets. Although boosting methods often show excellent prediction accuracy, they do require more careful parameter tuning. In contrast, RFs tend to be more robust with less intensive tuning and are found to perform competitively across a wide range of problems (Bentéjac et al., 2021). Nonetheless, given the important role that boosting methods play in machine learning, we encourage future research to further investigate the application of boosting in psychological science and to develop practical guidelines for its more effective use.

We acknowledge that this study is limited by the currently available data in several aspects. These limitations point to important directions for future research that can build on our findings. First, with the limited information, we were unable to identify more nuanced relations between key practical aspects of applying RFs. As an example, we were unable to find a clear relation between sample size and missing-data-handling approaches, largely because the majority of reviewed studies did not report how missing data were handled. As the reporting of data analyses becomes more transparent and better guided by clearer guidelines, future reviews with more complete data can examine such nuanced relations. Second, we did not systematically collect data on more detailed application-level characteristics that are important for understanding how RFs are used in practice. Future work could extend the current study by explicitly reviewing and synthesizing additional aspects in applying RFs, including but not limited to the analytic role of RFs (e.g., whether they are used for feature selection, stand-alone prediction, or benchmarking/comparison with other methods), common competitor models being compared with RFs (e.g., parametric models, regularized regressions, boosting), validation strategies (e.g., separate testing set, k-fold cross-validation, OOB error), and the metrics and decision thresholds used to evaluate good model performance. Third, we did not stratify findings by psychological subfields in the current study. Future systematic reviews can address this limitation by conducting subfield-specific analyses to identify potential challenges that are more unique or prominent in certain fields. For example, imbalanced data may be more commonly encountered in clinical psychology, particularly when diagnosis is the goal. Such subfield analyses would provide more useful practical guidance for researchers working in specific areas of psychological science. Finally, in the current study, we chose to focus on reviewing the information formally reported in the main text of published articles. We did not systematically analyze any external code files. Although code sharing is highly encouraged for transparency and reproducibility, it cannot replace the clear in-text reporting of key analytic details. But future work may benefit from extending our approach by systematically reviewing the external syntax files to look for additional implementation details.

In conclusion, RFs offer promising methodological opportunities to advance psychological science; however, fully achieving their potential depends on methodological rigor and reporting transparency in the field. We hope this review motivates meaningful discussions and highlights potential directions for future research to enhance the application of RFs and machine-learning methods, in general, in psychological studies.

Supplemental Material

Supplemental material, sj-pdf-1-amp-10.1177_25152459251404358 for Advancing Psychological Research With Random Forests: A Review of Methods, Tools, and Applications by Yi Feng, Han Du, Jiarui Song, Yina Sun, Yiting Wang and Aedan Joel in Advances in Methods and Practices in Psychological Science

Footnotes

Transparency

Action Editor: Yasemin Kisbu-Sakarya

Editor: David A. Sbarra

Author Contributions

Yi Feng: Conceptualization; Data curation; Formal analysis; Investigation; Methodology; Project administration; Software; Supervision; Validation; Visualization; Writing – original draft; Writing – review & editing.

Han Du: Conceptualization; Data curation; Investigation; Methodology; Project administration; Resources; Supervision; Writing – review & editing.

Jiarui Song: Investigation; Writing – review & editing.

Yina Sun: Investigation; Writing – review & editing.

Yiting Wang: Investigation; Writing – review & editing.

Aedan Joel: Investigation; Writing – review & editing.

Notes

References

Abreu

Jorge

Leal

Koenig

Figueiredo

(2021). EEG microstates predict concurrent fMRI dynamic functional connectivity states. Brain Topography, 34(1), 41–55. https://doi.org/10.1007/s10548-020-00805-1

Akiba

Sano

Yanase

Ohta

Koyama

(2019). Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2623–2631). Association for Computing Machinery. https://doi.org/10.1145/3292500.3330701

Bentéjac

Csörgő

Martínez-Muñoz

(2021). A comparative analysis of gradient boosting algorithms. Artificial Intelligence Review, 54(3), 1937–1967. https://doi.org/10.1007/s10462-020-09896-5

Betz

L. T.

Rosen

Salokangas

R. K. R.

Kambeitz

(2022). Disentangling the impact of childhood abuse and neglect on depressive affect in adulthood: A machine learning approach in a general population sample. Journal of Affective Disorders, 315, 17–26. https://doi.org/10.1016/j.jad.2022.07.042

Bischl

Lang

Kotthoff

Schiffner

Richter

Studerus

Casalicchio

Jones

Z. M.

(2016). mlr: Machine learning in R. Journal of Machine Learning Research, 17(170), 1–5.

Bischl

Richter

Bossek

Horn

Thomas

Lang

(2018). mlrMBO: A modular framework for model-based optimization of expensive black-box functions. arXiv. https://doi.org/10.48550/arXiv.1703.03373

Bouwmeester

Zuithoff

N. P. A.

Mallett

Geerlings

M. I.

Vergouwe

Steyerberg

E. W.

Altman

D. G.

Moons

K. G. M.

(2012). Reporting and methods in clinical prediction research: A systematic review. PLoS Medicine, 9(5), Article e1001221. https://doi.org/10.1371/journal.pmed.1001221

Breiman

(1996). Bagging predictors. Machine Learning, 24(2), 123–140. https://doi.org/10.1007/BF00058655

Breiman

(2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324

10.

Breiman

Friedman

Stone

C. J.

Olshen

R. A.

(1984). Classification and regression trees. Taylor & Francis.

11.

Brick

T. R.

Koffer

R. E.

Gerstorf

Ram

(2018). Feature selection methods for optimal design of studies for developmental inquiry. The Journals of Gerontology Series B: Psychological Sciences and Social Sciences, 73(1), 113–123. https://doi.org/10.1093/geronb/gbx008

12.

Browne

M. W.

(2000). Cross-validation methods. Journal of Mathematical Psychology, 44(1), 108–132. https://doi.org/10.1006/jmps.1999.1279

13.

Chawla

N. V.

Bowyer

K. W.

Hall

L. O.

Kegelmeyer

W. P.

(2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953

14.

Chen

Guestrin

(2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794). Association for Computing Machinery. https://doi.org/10.1145/2939672.2939785

15.

Collins

G. S.

Reitsma

J. B.

Altman

D. G.

Moons

K. G.

(2015). Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement. BMC Medicine, 13, Article 1. https://doi.org/10.1186/s12916-014-0241-z

16.

Dai

Zhou

Wang

(2021). Improving depression prediction using a novel feature selection algorithm coupled with context-aware analysis. Journal of Affective Disorders, 295, 1040–1048. https://doi.org/10.1016/j.jad.2021.09.001

17.

Dhiman

Bullock

Sergeant

J. C.

Riley

R. D.

Collins

G. S.

(2023). Sample size requirements are not being considered in studies developing prediction models for binary outcomes: A systematic review. BMC Medical Research Methodology, 23, Article 188. https://doi.org/10.1186/s12874-023-02008-1

18.

Dwyer

D. B.

Falkai

Koutsouleris

(2018). Machine learning approaches for clinical psychology and psychiatry. Annual Review of Clinical Psychology, 14, 91–118. https://doi.org/10.1146/annurev-clinpsy-032816-045037

19.

Enders

C. K.

(2022). Applied missing data analysis (2nd ed.). Guilford Press.

20.

Enders

C. K.

(2025). Missing data: An update on the state of the art. Psychological Methods, 30(2), 322–339. https://doi.org/10.1037/met0000563

21.

Fife

D. A.

D’Onofrio

(2023). Common, uncommon, and novel applications of random forest in psychological research. Behavior Research Methods, 55(5), 2447–2466. https://doi.org/10.3758/s13428-022-01901-9

22.

Freund

Schapire

R. E.

(1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139. https://doi.org/10.1006/jcss.1997.1504

23.

Friedman

J. H.

(2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232.

24.

Fryda

LeDell

Gill

Aiello

Candel

Click

Kraljevic

Nykodym

Aboyoun

Kurka

Malohlava

Poirier

Wong

Rehak

Eckstrand

Hill

Vidrio

Jadhawani

, . . . H2O.ai. (2024). H2o: R Interface for the “H2o”, scalable machine learning platform (Version 3.44.0.3) [Computer software]. https://cran.r-project.org/web/packages/h2o/index.html

25.

Gashler

Giraud-Carrier

Martinez

(2008). Decision tree ensemble: Small heterogeneous is better than large homogeneous. In 2008 Seventh International Conference on Machine Learning and Applications (pp. 900–905). IEEE. https://doi.org/10.1109/ICMLA.2008.154

26.

Gök

Akkuş

E. B.

Kavak

Kasap

(2023). Investigation of the variables affecting primary school teachers’ state of anxiety and motivation in mathematics teaching through data mining methods. Current Psychology, 42(31), 27678–27693. https://doi.org/10.1007/s12144-022-03711-w

27.

Gomes

S. R. B. S.

von Schantz

Leocadio-Miguel

(2023). Predicting depressive symptoms in middle-aged and elderly adults using sleep data and clinical health markers: A machine learning approach. Sleep Medicine, 102, 123–131. https://doi.org/10.1016/j.sleep.2023.01.002

28.

Götz

F. M.

Stieger

Gosling

S. D.

Potter

Rentfrow

P. J.

(2020). Physical topography is associated with human personality. Nature Human Behaviour, 4(11), 1135–1144. https://doi.org/10.1038/s41562-020-0930-x

29.

Hapfelmeier

Hothorn

Ulm

Strobl

(2014). A new variable importance measure for random forests with missing data. Statistics and Computing, 24(1), 21–34. https://doi.org/10.1007/s11222-012-9349-1

30.

Harman

Kliamovich

Morales

A. M.

Gilbert

Barch

D. M.

Mooney

M. A.

Ewing

S. W. F.

Fair

D. A.

Nagel

B. J.

(2021). Prediction of suicidal ideation and attempt in 9 and 10 year-old children using transdiagnostic risk features. PLOS ONE, 16(5), Article e0252114. https://doi.org/10.1371/journal.pone.0252114

31.

Hastie

Tibshirani

Friedman

(2009). The elements of statistical learning. Springer. https://doi.org/10.1007/978-0-387-84858-7

32.

Head

Kumar

Nahrstaedt

Louppe

Shcherbatyi

(2021). scikit-optimize/scikit-optimize (v0.9.0). Zenodo. https://doi.org/10.5281/zenodo.5565057

33.

T. K.

(1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832–844. https://doi.org/10.1109/34.709601

34.

Hothorn

Hornik

Zeileis

(2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651–674. https://doi.org/10.1198/106186006X133933

35.

Hothorn

Zeileis

(2015). Partykit: A modular toolkit for recursive partytioning in R. Journal of Machine Learning Research, 16(1), 3905–3909.

36.

Ishwaran

Kogalur

U. B.

Blackstone

E. H.

Lauer

M. S.

(2008). Random survival forests. The Annals of Applied Statistics, 2(3), 841–860. https://doi.org/10.1214/08-AOAS169

37.

Japkowicz

(2000). Learning from imbalanced data sets: A comparison of various strategies. In AAAI Workshop on Learning from Imbalanced Data Sets (Vol. 68, pp. 10–15). AAAI. https://cdn.aaai.org/Workshops/2000/WS-00-05/WS00-05-003.pdf

38.

Jones

D. R.

Schonlau

Welch

W. J.

(1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4), 455–492. https://doi.org/10.1023/A:1008306431147

39.

Karabacak

Margetis

(2024). Prognosis at your fingertips: A machine learning-based web application for outcome prediction in acute traumatic epidural hematoma. Journal of Neurotrauma, 41(1–2), 147–160. https://doi.org/10.1089/neu.2023.0122

40.

Meng

Finley

Wang

Chen

Liu

T.-Y.

(2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30. https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html

41.

Kim

Loh

W.-Y.

(2001). Classification trees with unbiased multiway splits. Journal of the American Statistical Association, 96(454), 589–604. https://doi.org/10.1198/016214501753168271

42.

Koul

Becchio

Cavallo

(2018). Cross-validation approaches for replicability in psychology. Frontiers in Psychology, 9, Article 117. https://doi.org/10.3389/fpsyg.2018.01117

43.

Kuhn

(2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28, 1–26. https://doi.org/10.18637/jss.v028.i05

44.

Kuhn

Wickham

(2020). Tidymodels: A collection of packages for modeling and machine learning using tidyverse principles [Computer software]. https://www.tidymodels.org

45.

Kuhn

Yan

Pawley

, & Posit Software, PBC. (2024). agua: “tidymodel”, integration with “h2o” (Version 0.1.4) [Computer software]. https://cran.r-project.org/web/packages/agua/index.html

46.

Lemaître

Nogueira

Aridas

C. K.

(2017). Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research, 18(17), 1–5.

47.

Liaw

Wiener

(2002). Classification and regression by randomForest. R News, 2(3), 18–22.

48.

Ling

C. X.

(1998). Data mining for direct marketing: Problems and solutions. In KDD, 98: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (pp. 73–79). American Association for Artificial Intelligence. https://www.csd.uwo.ca/~xling/papers/kdd98.pdf

49.

Lohani

D. C.

Rana

(2023). ADHD diagnosis using structural brain MRI and personal characteristic data with machine learning framework. Psychiatry Research: Neuroimaging, 334, Article 111689. https://doi.org/10.1016/j.pscychresns.2023.111689

50.

Lunardon

Menardi

Torelli

(2014). ROSE: A package for binary imbalanced learning. R Journal, 6(1), 79–89. https://doi.org/10.32614/RJ-2014-008

51.

Mallett

Royston

Dutton

Waters

Altman

D. G.

(2010). Reporting methods in studies developing prognostic models in cancer: A review. BMC Medicine, 8(1), Article 20. https://doi.org/10.1186/1741-7015-8-20

52.

Malley

J. D.

Kruppa

Dasgupta

Malley

K. G.

Ziegler

(2018). Probability machines. Methods of Information in Medicine, 51, 74–81. https://doi.org/10.3414/ME00-01-0052

53.

Marengo

Angelo Fabris

Longobardi

Settanni

(2022). Smartphone and social media use contributed to individual tendencies towards social media addiction in Italian adolescents during the COVID-19 pandemic. Addictive Behaviors, 126, Article 107204. https://doi.org/10.1016/j.addbeh.2021.107204

54.

Matsuki

Kuperman

Van Dyke

J. A.

(2016). The random forests statistical technique: An examination of its value for the study of reading. Scientific Studies of Reading, 20(1), 20–33. https://doi.org/10.1080/10888438.2015.1107073

55.

Mienye

I. D.

Jere

(2024). A survey of decision trees: Concepts, algorithms, and applications. IEEE Access, 12, 86716–86727. https://doi.org/10.1109/ACCESS.2024.3416838

56.

Nason

Emerson

LeBlanc

(2004). CARTscans: A tool for visualizing complex models. Journal of Computational and Graphical Statistics, 13(4), 807–825. https://doi.org/10.1198/106186004X11417

57.

Orrù

Monaro

Conversano

Gemignani

Sartori

(2020). Machine learning in psychometrics and psychological research. Frontiers in Psychology, 10, Article 2970. https://doi.org/10.3389/fpsyg.2019.02970

58.

Owen

(2022). Hyperparameter tuning with Python: Boost your machine learning model’s performance via hyperparameter tuning. Packt Publishing Ltd.

59.

Pedregosa

Varoquaux

Gramfort

Michel

Thirion

Grisel

Blondel

Prettenhofer

Weiss

Dubourg

Vanderplas

Passos

Cournapeau

Brucher

Perrot

Duchesnay

(2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, 2825–2830.

60.

Pradier

M. F.

Hughes

M. C.

McCoy

T. H.

Barroilhet

S. A.

Doshi-Velez

Perlis

R. H.

(2021). Predicting change in diagnosis from major depression to bipolar disorder after antidepressant initiation. Neuropsychopharmacology, 46(2), 455–461. https://doi.org/10.1038/s41386-020-00838-x

61.

Probst

Boulesteix

A.-L.

(2018). To tune or not to tune the number of trees in random forest. Journal of Machine Learning Research, 18(181), 1–18.

62. Probst, Wright, M. N., & Boulesteix, A.-L. (2019). Hyperparameters and tuning strategies for random forest. WIREs Data Mining and Knowledge Discovery, 9(3), Article e1301. https://doi.org/10.1002/widm.1301

63. Qiu, Wang, Lan, Miao, Pan, Sun, Wang, Zhao, & Zhu (2022). Explore the influencing factors and construct random forest models of post-stroke depression at 3 months in males and females. BMC Psychiatry, 22(1), Article 811. https://doi.org/10.1186/s12888-022-04467-0

64. Quinlan, J. R. (2014). C4.5: Programs for machine learning. Elsevier.

65. Ranjan, Singh, V. P., Mishra, R. B., Thakur, A. K., & Singh, A. K. (2021). Sentence polarity detection using stepwise greedy correlation based feature selection and random forests: An fMRI study. Journal of Neurolinguistics, 59, Article 100985. https://doi.org/10.1016/j.jneuroling.2021.100985

66. Rosenbusch, Soldner, Evans, A. M., & Zeelenberg (2021). Supervised machine learning methods in psychology: A practical introduction with annotated R code. Social and Personality Psychology Compass, 15(2), Article e12579. https://doi.org/10.1111/spc3.12579

67. Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.

68. Ryo & Rillig, M. C. (2017). Statistically reinforced machine learning for nonlinear patterns and variable interactions. Ecosphere, 8(11), Article e01976. https://doi.org/10.1002/ecs2.1976

69. Schapire, R. E. (2013). Explaining AdaBoost. In Schölkopf, Luo, & Vovk (Eds.), Empirical inference (pp. 37–52). Springer. https://doi.org/10.1007/978-3-642-41136-6_5

70. Segal, M. R. (2004). Machine learning benchmarks and random forest regression. eScholarship. https://escholarship.org/uc/item/35x3v9t4

71. Shiner, Peltzman, Cornelius, S. L., Gui, Jiang, Riblet, Gottlieb, D. J., & Watts, B. V. (2022). Influence of contextual factors on death by suicide in rural and urban settings. The Journal of Rural Health, 38(2), 336–345. https://doi.org/10.1111/jrh.12579

72. Shmueli (2010). To explain or to predict? Statistical Science, 25(3), 289–310. https://doi.org/10.1214/10-STS330

73. Siriseriwan (2024). smotefamily: A collection of oversampling techniques for class imbalance problem based on SMOTE (Version 1.4.0) [Computer software]. https://cran.r-project.org/web/packages/smotefamily/index.html

74. Sleek, B. S. (2023). How machine learning is transforming psychological science. APS Observer, 36. https://www.psychologicalscience.org/observer/machine-learning-transforming-psychological-science

75. Smucny, Davidson, & Carter, C. S. (2021). Comparing machine and deep learning-based algorithms for prediction of clinical improvement in psychosis with functional magnetic resonance imaging. Human Brain Mapping, 42(4), 1197–1205. https://doi.org/10.1002/hbm.25286

76. Stekhoven, D. J., & Bühlmann (2012). MissForest—Non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112–118. https://doi.org/10.1093/bioinformatics/btr597

77. Stone (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society Series B: Methodological, 36(2), 111–133. https://doi.org/10.1111/j.2517-6161.1974.tb00994.x

78. Strobl, Boulesteix, A.-L., Kneib, Augustin, & Zeileis (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9(1), Article 307. https://doi.org/10.1186/1471-2105-9-307

79. Strobl, Boulesteix, A.-L., Zeileis, & Hothorn (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1), Article 25. https://doi.org/10.1186/1471-2105-8-25

80. Strobl, Malley, & Tutz (2009). An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods, 14(4), 323–348. https://doi.org/10.1037/a0016973

81. Tang & Ishwaran (2017). Random forest missing data algorithms. Statistical Analysis and Data Mining: The ASA Data Science Journal, 10(6), 363–377. https://doi.org/10.1002/sam.11348

82. Touw, W. G., Bayjanov, J. R., Overmars, Backus, Boekhorst, Wels, & van Hijum, S. A. F. T. (2013). Data mining in the life sciences with random forest: A walk in the park or lost in the jungle? Briefings in Bioinformatics, 14(3), 315–326. https://doi.org/10.1093/bib/bbs034

83. Twala, B. E. T. H., Jones, M. C., & Hand, D. J. (2008). Good methods for coping with missing data in decision trees. Pattern Recognition Letters, 29(7), 950–956. https://doi.org/10.1016/j.patrec.2008.01.010

84. Vabalas, Gowen, Poliakoff, & Casson, A. J. (2019). Machine learning algorithm validation with a limited sample size. PLOS ONE, 14(11), Article e0224365. https://doi.org/10.1371/journal.pone.0224365

85. Vélez, J. I. (2021). Machine learning based psychology: Advocating for a data-driven approach. International Journal of Psychological Research, 14(1), 6–11.

86. Wang, King, Haw, & Leung (2023). What explains Macau students’ achievement? An integrative perspective using a machine learning approach. Journal for the Study of Education and Development, 46(1), 71–108. https://doi.org/10.1080/02103702.2022.2149120

87. Wright, M. N., & Ziegler (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77, 1–17. https://doi.org/10.18637/jss.v077.i01

88. Yarkoni & Westfall (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspectives on Psychological Science, 12(6), 1100–1122. https://doi.org/10.1177/1745691617693393

89. Ye, E. M., Sun, Krishnamurthy, P. V., Adra, Ganglberger, Thomas, R. J., Lam, A. D., & Westover, M. B. (2023). Dementia detection from brain activity during sleep. Sleep, 46(3), Article zsac286. https://doi.org/10.1093/sleep/zsac286
