Mathematical Propositions for Feature Engineering in Decision Tree Models

Abstract

Knowing the mathematical properties of machine learning models can provide effective guidance in feature engineering and selection. This paper presents mathematical propositions for identifying redundant variables in decision tree models. The first proposition demonstrates that if one variable is an order-preserving one-to-one correspondence of another, the performance of the model remains unaffected even when one of the two variables is removed. The second proposition reveals that if one variable is an order-preserving mapping of another and this mapping reduces the cardinality of the variable, the model’s performance remains unchanged upon the removal of the variable with lower cardinality. We provide formal mathematical proofs for both propositions and support our findings with simulation-based experiments. These results demonstrate that, within the standard CART-style framework of axis-aligned, greedily constructed decision trees, common order-preserving transformations (such as min–max scaling or logarithmic transformation) do not alter the set of feasible partitions and therefore leave predictive performance unchanged. Furthermore, the results suggest that rank-based correlation measures, such as Spearman’s rank correlation coefficient, can serve as an effective tool for identifying redundant variables under this modeling framework.

Keywords

decision tree; feature engineering; machine learning

1. Introduction

Decision tree models are widely employed in the early stages of modeling to explain how target variables are related to input variables due to their intuitive structure and inherent interpretability (Breiman et al., 1984; Hastie et al., 2009). They are also frequently used to identify potential relationships among input variables. Building upon these characteristics, ensemble methods such as Random Forests and Gradient Boosted Trees extend the basic decision tree framework to achieve robust predictive performance by effectively mitigating both bias and variance (Breiman, 2001; Friedman, 2001).

Decision tree–based methods and their ensembles have been widely adopted across diverse disciplines (Costa & Pedreira, 2023; Mienye & Jere, 2024). In healthcare, they have been applied to clinical decision support, disease risk prediction, and prognostic modeling (Chrimes, 2023; Podgorelec et al., 2002). In finance, gradient boosting methods such as XGBoost have become standard tools for credit scoring and fraud detection (Chang et al., 2018; Chen & Guestrin, 2016). In real estate, random forest–based approaches have been used for mass appraisal and house price valuation (Hong et al., 2020). Tree-based models have also been employed in environmental sciences for flood susceptibility mapping and land-use classification (Lee et al., 2017; Mobley et al., 2021).

Despite these strengths, decision trees are inherently prone to overfitting, especially when constructed with high-dimensional or noisy features. A common yet often overlooked source of such overfitting arises from information redundancy among features, where multiple variables encode overlapping aspects of the same underlying information. If highly correlated variables are included in a model, the explanatory power derived from the qualitatively identical information may be split between the two variables. Such redundancy may not significantly affect in-sample predictive accuracy, yet it can introduce substantial distortion in out-of-sample prediction and secondary analyses, such as simulations based on the trained model or assessments of variable importance (Gregorutti et al., 2017; Li et al., 2019).

As a result, careful feature selection becomes critical. Several theoretical approaches have been proposed to improve feature selection for decision trees, including impurity-based measures (e.g., information gain, Gini index), feature importance ranking, and regularization techniques embedded within tree-building algorithms (Breiman, 2001; Breiman et al., 1984; Zhang & Gionis, 2023). These methods aim to identify features that contribute most significantly to the decision-making process of the model.

However, these approaches rely on empirical procedures: the model is trained directly on data, and the outcomes or feature importances are examined afterwards. The problem arises when many features are correlated: their importance tends to be diluted across variables, making it difficult to identify the truly influential ones. In other words, the observed importance distribution is already affected by correlations, rendering it an imperfect tool for causal interpretation.

To mitigate this issue, one must identify and account for strong correlations among features prior to model training. In classical statistical analysis, indicators such as the correlation coefficient and the Variance Inflation Factor (VIF) are employed to preemptively diagnose multicollinearity. However, these measures are inherently grounded in the linear regression framework and are not directly applicable to nonlinear, partition-based methods like tree-based algorithms. A unified and generally accepted framework for detecting redundant features in tree-based models has yet to be established.

This paper addresses this gap by introducing mathematical propositions that characterize how certain types of features become structurally redundant within a well-defined class of decision tree models, namely CART-style trees constructed via axis-aligned, greedy split selection. The first proposition shows that if one variable is an order-preserving bijective mapping of another, the performance of the model remains unaffected even when one of the two variables is removed. The second proposition reveals that if one variable is an order-preserving surjective mapping of another, the performance of the model remains unchanged upon the removal of the image variable.

While the analysis is intentionally restricted to this modeling framework, it provides clear theoretical guidance for feature engineering practices commonly employed in CART-style decision trees and their widespread implementations. Within this CART-style setting, our propositions imply that common order-preserving transformations—such as min–max scaling or logarithmic transformation—do not change the split structure and thus offer no systematic benefit for predictive performance. They also provide practical guidelines for feature selection in real-world contexts, such as dealing with time-series variables that exhibit monotonic movements. Furthermore, they suggest that Spearman’s rank correlation coefficient can serve as an effective measure for identifying redundant variables in decision tree models.

2. Related Work

Feature selection for decision tree models has been a long-standing area of research, approached from a variety of perspectives including statistical heuristics, correlation analysis, optimization, and robustness. Existing research on feature selection has predominantly addressed the variable selection problem from an empirical perspective, typically evaluating features through performance-based criteria or statistical associations observed in data. For instance, De Mántaras introduced a distance-based attribute selection measure to mitigate the bias of information gain toward multi-valued attributes (De Mántaras, 1991). Correlation-based feature selection (CFS) evaluates subsets of features by balancing individual predictive power against inter-feature redundancy (Hall, 2000), and has influenced many subsequent filter-based approaches. Optimization-based strategies have also been explored. Bredensteiner proposed feature minimization within decision trees, targeting parsimonious models that achieve compact representations without significant loss of accuracy (Bredensteiner, 1999). Tuv et al. further advanced this line of research by introducing an ensemble-based feature selection framework that integrates artificial contrast variables and redundancy elimination (Tuv et al., 2009). Their method identifies compact yet informative feature subsets and demonstrates superior robustness on high-dimensional and noisy datasets compared to traditional filter and wrapper approaches. Izza et al. also improved interpretability by introducing PI-explanations, which provide shorter path representations of decision tree predictions by removing redundant conditions (Izza et al., 2020).

Although these approaches often improve interpretability or empirical performance, they typically rely on statistical associations among features and outcomes rather than addressing redundancy through the intrinsic structural properties of decision tree models. As a result, redundancy is typically identified only after a model has been trained, making it difficult to detect structurally equivalent feature representations in advance and potentially leading to unnecessary model complexity during training. In this context, we provide a mathematical characterization of feature redundancy in decision tree models. We formalize conditions under which alternative feature representations become structurally equivalent or redundant from the perspective of decision tree construction, thereby offering a systematic theoretical account of when feature transformations preserve or restrict the expressive capacity of tree-based models.

The propositions developed in this paper formalize commonly stated but rarely proven properties of decision trees. For example, it is well known in practitioner-oriented literature and textbooks that standard decision trees are insensitive to monotone transformations of individual features. Breiman et al. (1984) and Hastie et al. (2009) note that tree-based models depend primarily on the ordering of feature values rather than their absolute scale. Similar remarks appear in software documentation, including the scikit-learn user guide (Pedregosa et al., 2011). Also, some studies have shown that impurity- and importance-based measures may suffer from systematic biases, particularly in the presence of correlated or high-cardinality features (Strobl et al., 2007). These observations are typically stated informally, without a precise theoretical characterization of when two features are equivalent or redundant in terms of the structure and expressive power of decision tree models. To the best of our knowledge, there has been no systematic theoretical treatment that formalizes redundancy induced by order-preserving transformations through explicit propositions and proofs.

Our analysis can be applied to characterize how discretization and binning operations interact with the expressive capacity of decision tree models. Classical studies have investigated supervised and unsupervised discretization strategies as a means of improving predictive performance or model interpretability (Dougherty et al., 1995). These methods typically aim to balance information preservation with model simplicity by reducing continuous variables to a finite set of intervals. While discretization is often treated as a preprocessing choice, its interaction with the expressive capacity of decision tree models has not been fully characterized from a structural perspective.

By linking these theoretical results to practical feature engineering operations—such as scaling, discretization, and binning—our work complements existing empirical approaches by clarifying when alternative feature representations genuinely expand, preserve, or restrict the expressive capacity of decision tree models. This perspective provides practitioners with principled guidance for selecting or transforming features prior to model training.

3. Decision Trees and Redundant Variables

To discuss the redundancy of a variable in a binary decision tree, we first introduce the relevant notation and definitions. The initial sample set, denoted by $S_{0}$ , comprises samples indexed from $s_{1}$ through $s_{n}$ , that is, $S_{0} = \{s_{1}, s_{2}, \dots, s_{n}\}$ . Each sample $s_{i}$ is a vector in an $m$ -dimensional space. The $j$ th element of a sample is called variable $j$ , and the value of variable $j$ for sample $s_{i}$ is denoted by $x_{j}^{i}$ . We also define the variable space $X_{j}$ as the set of all feasible values of $x_{j}^{i}$ . For any variable $j$ , the variable space $X_{j}$ is a subset of the real numbers.

Before stating the propositions, we summarize the main modeling assumptions under which the theoretical results are derived. We restrict attention to single CART-style decision trees with binary, axis-aligned splits, where each internal node selects a univariate threshold split by maximizing an impurity-based gain function. The gain function is assumed to be defined solely through the induced partition of the sample and to be invariant to monotone reparameterizations of the splitting variable. These assumptions encompass standard impurity measures used in practice, including squared error for regression and Gini index or entropy for classification. The following statements formalize them:

Assumption (Tree class) We consider single decision trees trained under the standard CART-style induction rule. In particular, each internal node performs an axis-aligned binary split based on a single variable and a threshold, i.e., rules of the form $x_{j} < {\bar{x}}_{j}$ versus $x_{j} \geq {\bar{x}}_{j}$ . The split is selected greedily at each node by maximizing a local impurity-based gain criterion computed on the samples reaching that node.

Binary trees partition a given set of samples into two groups based on whether the variable values of the samples exceed a certain threshold. For the analysis, we separate the task of partitioning the given samples from the task of determining which partitioning criterion is superior. First, the partitioning is defined by a partition function, $β$ . Let a set of samples be denoted as $S_{z}$ , the variable on which the condition is applied as $j$ , and the threshold point for partitioning as ${\bar{x}}_{j}$ . The partition function can then be defined as follows:

\begin{aligned} β (S_{z}, j, {\bar{x}}_{j}) &= (S_{l} (S_{z}, j, {\bar{x}}_{j}), S_{h} (S_{z}, j, {\bar{x}}_{j})), \\ \text{where } S_{l} (S_{z}, j, {\bar{x}}_{j}) &\equiv \{ s_{i} \in S_{z} \mid x_{j}^{i} < {\bar{x}}_{j} \}, \\ \text{and } S_{h} (S_{z}, j, {\bar{x}}_{j}) &\equiv \{ s_{i} \in S_{z} \mid x_{j}^{i} \geq {\bar{x}}_{j} \} . \end{aligned}

(1)

The above parentheses $(\cdot)$ denote an ordered pair, and $S_{l} (S_{z}, j, {\bar{x}}_{j})$ and $S_{h} (S_{z}, j, {\bar{x}}_{j})$ are referred to as the partitions of $S_{z}$ .¹ The partition function serves the sole purpose of dividing the samples into two mutually exclusive groups. Thus, by definition (1), the following is always satisfied:

S_{l} \cup S_{h} = S_{z} \quad \text{and} \quad S_{l} \cap S_{h} = \emptyset .

(2)
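As a minimal sketch of definition (1), the partition function can be written in a few lines of Python. The function name `partition`, the tuple representation of samples, and the toy values below are illustrative assumptions, not part of the paper's implementation.

```python
from typing import List, Sequence, Tuple

Sample = Sequence[float]  # one sample s_i: a vector of m variable values


def partition(samples: List[Sample], j: int, threshold: float) -> Tuple[List[Sample], List[Sample]]:
    """Partition function beta: return the ordered pair (S_l, S_h) for variable j."""
    s_low = [s for s in samples if s[j] < threshold]    # x_j^i <  threshold
    s_high = [s for s in samples if s[j] >= threshold]  # x_j^i >= threshold
    return s_low, s_high


# Hypothetical toy sample set S_z with m = 2 variables.
S_z = [(0.2, 5.0), (0.7, 1.0), (0.4, 3.0), (0.9, 2.0)]
S_l, S_h = partition(S_z, j=0, threshold=0.5)

# Property (2): the two parts cover S_z and do not overlap.
covers = sorted(S_l + S_h) == sorted(S_z)
disjoint = not (set(S_l) & set(S_h))
print(S_l, S_h, covers, disjoint)
```

Because every sample satisfies exactly one of the two conditions, property (2) holds for any choice of variable and threshold.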

Next, the information gain obtainable from a partition $β (S_{z}, j, {\bar{x}}_{j})$ of the given sample can be expressed by a gain function, $I$ . $I (β (S_{z}, j, {\bar{x}}_{j}))$ denotes the gain for a set of samples $S_{z}$ , a variable $j$ , and a threshold point ${\bar{x}}_{j} \in X_{j}$ . The gain function $I$ is discrete with respect to the sample set $S_{z}$ and the variable index $j$ , and it is a step function of the split point ${\bar{x}}_{j}$ , because the induced partition changes only when the threshold crosses an observed value. In general, the information gain takes only nonnegative values. Under these conditions, the results do not depend on the specific choice of impurity measure. Any impurity criterion that evaluates split quality solely as a function of the induced partition (such as Gini index, entropy, misclassification error, or squared error) leads to identical conclusions.

Then, the optimal partitioning condition with respect to a set of samples $S_{z}$ can be defined as the pair $(j^{*}, {\bar{x}}_{j}^{*})$ such that,

(j^{*}, {\bar{x}}_{j}^{*}) = \operatorname*{arg\,max}_{j, {\bar{x}}_{j}} I (β (S_{z}, j, {\bar{x}}_{j})) .

(3)

In general, a mathematical expression for a binary decision tree can rely on the optimal partition condition defined above alone. However, since this study focuses on analyzing the efficiency of using a variable, it is helpful to define the space of feasible gains for each variable as an instrumental concept. Based on the notation for $I$ , we denote the gain space for a variable $j$ as $Θ_{j} (S_{z})$ . The gain space is the set of all gains attainable by splitting a given set $S_{z}$ on variable $j$ , which can be expressed as follows:

Θ_{j} (S_{z}) \equiv \{ I (β (S_{z}, j, {\bar{x}}_{j})) \mid {\bar{x}}_{j} \in X_{j} \} , \quad \text{with } S_{z} \text{ and } j \text{ held fixed} .

(4)

Further, the optimal gain by using a variable $j$ with a given set $S_{z}$ , $I_{j}^{*} (S_{z})$ , can also be defined as $I_{j}^{*} (S_{z}) \equiv max (Θ_{j} (S_{z}))$ .
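The gain space in (4) and the optimal gain $I_{j}^{*} (S_{z})$ can be illustrated with a small sketch. It assumes squared error as the impurity measure and restricts the threshold candidates to the observed values of the variable, which is sufficient here because the gain is a step function of the threshold. The names `sse`, `gain`, `gain_space`, and `optimal_gain`, as well as the toy data, are hypothetical.

```python
from typing import Dict, List, Sequence


def sse(values: Sequence[float]) -> float:
    """Sum of squared deviations from the mean; 0.0 for an empty set."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def gain(X: List[Sequence[float]], y: List[float], j: int, threshold: float) -> float:
    """I(beta(S_z, j, threshold)) with squared error as the impurity (assumed choice)."""
    low = [yi for row, yi in zip(X, y) if row[j] < threshold]
    high = [yi for row, yi in zip(X, y) if row[j] >= threshold]
    return sse(y) - sse(low) - sse(high)


def gain_space(X: List[Sequence[float]], y: List[float], j: int) -> Dict[float, float]:
    """Theta_j(S_z), keyed by threshold. Only observed values are tried: thresholds
    between two consecutive observed values induce the same partition as the larger one."""
    return {t: gain(X, y, j, t) for t in sorted({row[j] for row in X})}


def optimal_gain(X: List[Sequence[float]], y: List[float], j: int) -> float:
    """I_j^*(S_z): the maximum of the gain space."""
    return max(gain_space(X, y, j).values())


# Hypothetical sample set with two variables and a numeric target.
X = [(1.0, 7.0), (2.0, 3.0), (3.0, 9.0), (4.0, 1.0)]
y = [1.0, 1.0, 5.0, 5.0]
theta_0 = gain_space(X, y, 0)
print(theta_0)
print(optimal_gain(X, y, 0), optimal_gain(X, y, 1))
```

In this toy set, variable 0 separates the two target levels perfectly at the threshold 3.0, so its optimal gain equals the total squared error of the target, while variable 1 attains a strictly smaller optimal gain.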

Equation (3) states the principle by which a binary decision tree recursively generates branches. Lastly, defining a decision tree requires the establishment of stopping criteria. These criteria can be expressed as a set $D$ of conditions $d_{i} \in D$ . Examples of such conditions include constraints on the minimum number of samples required to generate further partitions (the minimum number of samples to split) or limits on the number of consecutive partitions allowed (the depth of the tree).

Based on the denotations and definitions described above, we can express the training of a decision tree as below:

Training of Decision Tree Let $S_{0}$ represent the initial set of given training samples, $β$ the partition function, $(j^{*}, {\bar{x}}_{j}^{*})$ the optimal partition condition, and $D$ a given set of stopping criteria. The training of a binary decision tree is a sequential process that begins with the sample set $S_{0}$ and involves recursively applying the partition with respect to the optimal condition, $β (\cdot, j^{*}, {\bar{x}}_{j}^{*})$ , to each generated partition $S_{z} \subset S_{0}$ until a stopping criterion in $D$ is met.
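The training procedure just defined can be sketched as a short recursive function. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: it uses squared error as the gain, takes observed values as threshold candidates, breaks ties in favor of the first candidate, and models the stopping criteria $D$ with two hypothetical parameters, `max_depth` and `min_samples_split`, plus a stop when no split has positive gain.

```python
from typing import List, Optional, Sequence, Tuple


def sse(values: Sequence[float]) -> float:
    """Sum of squared deviations from the mean of a non-empty list."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y) -> Optional[Tuple[int, float, float]]:
    """Equation (3): the (j*, threshold*) maximizing the gain, with its gain."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            low = [yi for row, yi in zip(X, y) if row[j] < t]
            high = [yi for row, yi in zip(X, y) if row[j] >= t]
            if not low:
                continue  # trivial partition: nothing is separated
            g = sse(y) - sse(low) - sse(high)
            if best is None or g > best[2]:  # ties: the first candidate wins
                best = (j, t, g)
    return best


def train(X, y, depth=0, max_depth=2, min_samples_split=2):
    """Recursively apply the optimal partition until a stopping criterion is met."""
    split = None
    if depth < max_depth and len(y) >= min_samples_split:  # stopping criteria D
        split = best_split(X, y)
    if split is None or split[2] <= 0.0:  # assumed extra stop: no positive gain
        return {"leaf": sum(y) / len(y)}
    j, t, _ = split
    low = [(row, yi) for row, yi in zip(X, y) if row[j] < t]
    high = [(row, yi) for row, yi in zip(X, y) if row[j] >= t]
    return {
        "var": j,
        "threshold": t,
        "low": train([r for r, _ in low], [v for _, v in low], depth + 1, max_depth, min_samples_split),
        "high": train([r for r, _ in high], [v for _, v in high], depth + 1, max_depth, min_samples_split),
    }


def predict(tree, row):
    """Follow the splits down to a leaf and return its mean target value."""
    while "leaf" not in tree:
        tree = tree["low"] if row[tree["var"]] < tree["threshold"] else tree["high"]
    return tree["leaf"]


X = [(1.0, 7.0), (2.0, 3.0), (3.0, 9.0), (4.0, 1.0)]
y = [1.0, 1.0, 5.0, 5.0]
tree = train(X, y)
print(tree)
```

On this toy data the root splits on variable 0, and both children become leaves because no further split yields a positive gain.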

Next, we can suggest the definition of a variable being meaningful or redundant in the training of the decision tree as follows:

Redundant Variable If there exists a sample set $S_{z}$ generated during the training of the decision tree for which $I_{j}^{*} (S_{z}) > I_{k}^{*} (S_{z})$ holds for every $k \neq j$ , then the variable $j$ is defined as meaningful in the decision tree. Otherwise, the variable $j$ is defined as redundant in the decision tree.

The implication of the above definition is straightforward. In a decision tree, a variable influences prediction only through the splits at which it is actually used: each internal node corresponds to a specific variable–threshold pair that contributes to the tree’s partition structure. For a variable to play any role in shaping this structure, it must either be selected as an optimal splitting condition at least once during training, or induce a partition that cannot be replicated by other variables. Conversely, if a variable is never chosen as the optimal condition at any partition, or if every partition it can induce can be replicated exactly by splits on other variables, then the variable does not contribute uniquely to the tree’s partition structure. Excluding such a variable leaves both the set of feasible partitions and the resulting prediction function unchanged.

To address variable selection and transformation issues more directly, we can rewrite the concept for the redundancy of a variable to focus on the relationships between variables as follows:

Substitutable A variable $j$ is said to be substitutable with a variable $k$ if $I_{j}^{*} (S_{z}) \leq I_{k}^{*} (S_{z})$ holds for every partition set $S_{z} \subset S_{0}$ .

Interchangeability Two variables $j$ and $k$ are interchangeable if $I_{j}^{*} (S_{z}) = I_{k}^{*} (S_{z})$ holds for every partition set $S_{z} \subset S_{0}$ .

These definitions admit a natural interpretation in terms of the partition spaces induced by variables in a decision tree. Interchangeability means that two variables generate exactly the same collection of feasible sample partitions at every node: any split achievable using one variable can be replicated using the other, leading to identical gain spaces and identical optimal gains. Substitutability is a weaker notion. In the notation of the definition above, variable $j$ is substitutable with variable $k$ , in particular, when every partition that can be induced by $j$ can also be induced by $k$ , so that the partition space of $j$ is contained in that of $k$ . As a result, the maximal gain achievable using $j$ cannot exceed that achievable using $k$ , although in finite-sample greedy tree construction either variable may still be selected at a given node. From the perspective of tree construction, interchangeability implies that the two variables are perfect substitutes: using one in place of the other leaves the set of possible tree structures unchanged. By contrast, substitutability implies a directional replacement relationship. Variable $k$ can fully stand in for variable $j$ without loss of expressive power, whereas replacing $k$ by $j$ may eliminate certain feasible splits and restrict the tree’s structural flexibility.
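The containment of partition spaces behind substitutability can be checked numerically on a toy example. The sketch below assumes a hypothetical binning `floor(3x)` as the order-preserving, many-to-one mapping and squared error as the gain; the names `partition_space`, `optimal_gain`, `x_fine`, and `x_binned` are illustrative only.

```python
from typing import FrozenSet, Sequence, Set


def partition_space(values: Sequence[float]) -> Set[FrozenSet[int]]:
    """Low-side index sets {i : v_i < t} over all thresholds t, excluding the empty set."""
    space = set()
    for t in set(values):
        low = frozenset(i for i, v in enumerate(values) if v < t)
        if low:
            space.add(low)
    return space


def sse(values: Sequence[float]) -> float:
    """Sum of squared deviations from the mean of a non-empty list."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def optimal_gain(values: Sequence[float], y: Sequence[float]) -> float:
    """Largest squared-error gain over the variable's partition space."""
    best = 0.0
    for low in partition_space(values):
        y_low = [y[i] for i in range(len(y)) if i in low]
        y_high = [y[i] for i in range(len(y)) if i not in low]
        best = max(best, sse(y) - sse(y_low) - sse(y_high))
    return best


x_fine = [0.1, 0.35, 0.5, 0.62, 0.8, 0.95]   # original variable
x_binned = [int(v * 3) for v in x_fine]      # hypothetical binning floor(3x): order-preserving, not one-to-one
y = [1.0, 1.0, 2.0, 5.0, 5.0, 5.0]

contained = partition_space(x_binned) <= partition_space(x_fine)
print(x_binned, contained)
print(optimal_gain(x_binned, y), optimal_gain(x_fine, y))
```

Every partition reachable through the binned variable is also reachable through the original one, so the binned variable cannot achieve a higher optimal gain. In this particular data set the inequality is strict, because the best split of the original variable falls inside a bin.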

Naturally, interchangeability is a special case of substitutability. The above definitions are alternative formulations of the definition of a redundant variable. Thus, in accordance with these definitions, the following corollary can be obtained.

Corollary 1

If a variable $j$ is substitutable or interchangeable with a variable $k$ , the variable $j$ is redundant in the decision tree.

4. Propositions and Discussion

This section presents two propositions that formalize the conditions under which variables become structurally redundant in CART-style decision trees. Proposition 1 addresses the case of order-preserving bijective mappings, establishing that such transformations yield interchangeable variables that produce identical feasible partitions. Proposition 2 extends this analysis to order-preserving surjective mappings, showing that the variable with lower cardinality is substitutable. Each proposition is followed by a formal proof and illustrative examples that demonstrate its practical implications for feature engineering.

Under the axis-aligned greedy split rule defined in (1)–(4), and based on the properties defined in Section 3, we establish the following proposition:

Proposition 1
Suppose there are two variables $j$ and $k$ with $j \neq k$ . If the variable $k$ is an order-preserving bijective function of the variable $j$ , the variables $j$ and $k$ are interchangeable.

Proof. Let $f : X_{j} \to X_{k}$ be the order-preserving one-to-one correspondence, so that $x_{k}^{i} = f (x_{j}^{i})$ for every sample $s_{i}$ . The variables $j$ and $k$ are interchangeable if $I_{j}^{*} (S_{z}) = I_{k}^{*} (S_{z})$ holds for every $S_{z} \subset S_{0}$ . Suppose that $I (β (S_{z}, j, {\bar{x}}_{j})) = I (β (S_{z}, k, f ({\bar{x}}_{j})))$ holds for all ${\bar{x}}_{j} \in X_{j}$ . Because $f$ is bijective, every threshold in $X_{k}$ can be written as $f ({\bar{x}}_{j})$ for some ${\bar{x}}_{j} \in X_{j}$ , so the gain spaces coincide, $Θ_{j} (S_{z}) = Θ_{k} (S_{z})$ , and $I_{j}^{*} (S_{z}) = I_{k}^{*} (S_{z})$ follows. By the definition of $I$ , the two gains are identical if $β (S_{z}, j, {\bar{x}}_{j}) = β (S_{z}, k, f ({\bar{x}}_{j}))$ . By the definition of $β$ , this equality holds if $S_{l} (S_{z}, j, {\bar{x}}_{j}) = S_{l} (S_{z}, k, f ({\bar{x}}_{j}))$ and $S_{h} (S_{z}, j, {\bar{x}}_{j}) = S_{h} (S_{z}, k, f ({\bar{x}}_{j}))$ hold for all ${\bar{x}}_{j} \in X_{j}$ . If $S_{l} (S_{z}, j, {\bar{x}}_{j}) = S_{l} (S_{z}, k, f ({\bar{x}}_{j}))$ holds, then $S_{h} (S_{z}, j, {\bar{x}}_{j}) = S_{h} (S_{z}, k, f ({\bar{x}}_{j}))$ follows automatically, since $S_{l} \cup S_{h} = S_{z}$ and $S_{l} \cap S_{h} = \emptyset$ by (2). The equality $S_{l} (S_{z}, j, {\bar{x}}_{j}) = S_{l} (S_{z}, k, f ({\bar{x}}_{j}))$ holds if and only if both $s_{i} \in S_{l} (S_{z}, j, {\bar{x}}_{j}) \Rightarrow s_{i} \in S_{l} (S_{z}, k, f ({\bar{x}}_{j}))$ and $s_{i} \in S_{l} (S_{z}, k, f ({\bar{x}}_{j})) \Rightarrow s_{i} \in S_{l} (S_{z}, j, {\bar{x}}_{j})$ hold. Because $f$ is order-preserving, any $s_{i}$ with $x_{j}^{i} < {\bar{x}}_{j}$ also satisfies $x_{k}^{i} = f (x_{j}^{i}) < f ({\bar{x}}_{j}) = {\bar{x}}_{k}$ , which gives $s_{i} \in S_{l} (S_{z}, j, {\bar{x}}_{j}) \Rightarrow s_{i} \in S_{l} (S_{z}, k, f ({\bar{x}}_{j}))$ . Moreover, since $f$ is an order-preserving bijection, $f^{- 1}$ exists and is also order-preserving. Hence any $s_{i}$ with $x_{k}^{i} < f ({\bar{x}}_{j})$ also satisfies $x_{j}^{i} = f^{- 1} (x_{k}^{i}) < f^{- 1} (f ({\bar{x}}_{j})) = {\bar{x}}_{j}$ , which gives $s_{i} \in S_{l} (S_{z}, k, f ({\bar{x}}_{j})) \Rightarrow s_{i} \in S_{l} (S_{z}, j, {\bar{x}}_{j})$ .

It is useful to provide intuition for Proposition 1. In practical terms, an order-preserving mapping refers to any transformation that maintains the rank ordering of observations. Common examples include min–max scaling, standardization, and logarithmic transformation applied to strictly positive variables. Although such transformations may substantially change the numerical scale of a variable, they do not alter the relative ordering of samples and therefore do not introduce new splitting possibilities in axis-aligned decision trees.

In an axis-aligned decision tree, each split compares the value of a single variable to a threshold, thereby partitioning samples solely based on their relative ordering along that variable. If one variable is transformed into another through an order-preserving one-to-one mapping, the ordering of samples is unchanged. Consequently, for any threshold on the original variable, there exists a corresponding threshold on the transformed variable that induces exactly the same partition of the data. As a result, the set of feasible splits—and hence the split opportunities available to the tree—remains identical under such transformations.
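The threshold correspondence described above can be verified on a handful of values. The following sketch enumerates the partitions induced by a variable and compares them across a logarithmic transformation, min–max scaling, and, as a contrast, a transformation that is not order-preserving. The function name `partition_space` and the sample values are assumptions made for illustration.

```python
import math
from typing import FrozenSet, Sequence, Set


def partition_space(values: Sequence[float]) -> Set[FrozenSet[int]]:
    """Low-side index sets {i : v_i < t} over all thresholds t, excluding the empty set."""
    space = set()
    for t in set(values):
        low = frozenset(i for i, v in enumerate(values) if v < t)
        if low:
            space.add(low)
    return space


x = [0.9, 0.15, 0.4, 0.62, 0.27]                   # hypothetical strictly positive values
log_x = [math.log(v) for v in x]                   # order-preserving bijection
lo, hi = min(x), max(x)
minmax_x = [(v - lo) / (hi - lo) for v in x]       # order-preserving bijection
folded_x = [(v - 0.5) ** 2 for v in x]             # NOT order-preserving

same_log = partition_space(x) == partition_space(log_x)
same_minmax = partition_space(x) == partition_space(minmax_x)
same_folded = partition_space(x) == partition_space(folded_x)

# Threshold correspondence: threshold t on x matches threshold f(t) on f(x).
t = 0.5
low_x = {i for i, v in enumerate(x) if v < t}
low_log = {i for i, v in enumerate(log_x) if v < math.log(t)}
print(same_log, same_minmax, same_folded, low_x == low_log)
```

The two order-preserving transformations reproduce the partition space of the original variable, while the folded variable induces different partitions and can therefore change the tree.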

Therefore, the notion of redundancy considered here can be understood as structural redundancy. Structural redundancy refers to a situation in which a variable does not make a substantive contribution to the structural expressiveness of a decision tree. Specifically, when a variable $X$ is said to be structurally redundant, this means that, when evaluated in terms of the collection of all partition structures (or tree topologies) that the tree can generate, the inclusion of $X$ does not expand the set of structurally feasible trees in any way.

In other words, if the entire partition space that could be induced using a given variable can already be fully realized through other variables, or through alternative representations of those variables, then the variable is unnecessary from the perspective of the tree’s structural expressiveness. In such cases, the variable does not enable any new partitions nor does it give rise to tree structures that were previously infeasible. All possible arrangements of branches and leaves, along with the corresponding ways of partitioning the data, can already be reproduced without the presence of that variable.

Accordingly, structural redundancy can be defined as the property that a variable fails to expand the theoretical expressive power of a tree model—namely, the set of attainable partition structures. In this sense, a structurally redundant variable is one that does not provide additional opportunities for structural discovery in the formation of the tree.

We now discuss the practical implications of this proposition. First, the proposition can be used to interpret the role of common reversible feature engineering techniques (such as logarithmic transformation, squaring of a nonnegative variable, min–max scaling, or standardization) from a structural perspective. As discussed, when a variable is transformed through a one-to-one, order-preserving mapping, the resulting feature does not expand the set of partitions that a decision tree can represent in an idealized, implementation-agnostic sense. In this respect, such transformations are structurally redundant: they do not alter the hypothesis space of the tree, nor do they introduce fundamentally new splitting possibilities beyond those already available through the original variable.

Consider the following example.
Example 1
We define a random nonlinear function of a variable $x$ defined between $0$ and $1$ as follows:
$y = a x^{3} + b x^{2} + c x + d$
(5)

The shape of this function changes with the values of $a$ , $b$ , $c$ , and $d$ . We train regression trees on random samples and assess predictive performance by R-squared. In this simulation, we compare the case where the tree is trained on the relationship between $y$ and $x$ against the cases where $x^{2}$ or $\sqrt{x}$ is used as the input instead. The simulation is repeated $100$ times with random values for $a$ , $b$ , $c$ , and $d$ chosen between $-10$ and $10$ . We draw $1,000$ samples for training and evaluate the predictive performance on $30$ evenly spaced points in the domain. The resulting distribution of R-squared values is shown in Table 1, and the implementation details are listed in Table 2. The results show that the transformations of $x$ yield the same predictions and therefore the same performance.
Table 1.
Example 1 - Predictive Performance Across Transformations.

| Metric | $x$ | $\sqrt{x}$ | $x^{2}$ |
|---|---|---|---|
| $R^{2}$ | 0.999986 (0.000015) | 0.999986 (0.000015) | 0.999986 (0.000015) |
| MSE | 0.000091 (0.000124) | 0.000091 (0.000124) | 0.000091 (0.000124) |
| RMSE | 0.007829 (0.005503) | 0.007829 (0.005503) | 0.007829 (0.005503) |

Notes: Results are based on 100 simulations of Example 1. Each entry reports the mean and standard deviation (in parentheses) of the predictive metric computed on 30 evenly spaced evaluation points. The target variable is generated as $y = a x^{3} + b x^{2} + c x + d$ , where coefficients are drawn independently from a uniform distribution on $[- 10, 10]$ .
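A reduced, standard-library version of this experiment can be sketched as follows. It is not the scikit-learn setup reported in Table 2: it assumes a single simulation run with a fixed seed, 200 training samples, a depth limit of 3, and in-sample evaluation at the training points, and the helper names `fit_predict` and `r_squared` are hypothetical. The tree is grown on the rank order of the feature, which is exactly the quantity that order-preserving transformations leave unchanged.

```python
import math
import random


def fit_predict(feature, y, max_depth=3):
    """Fit a depth-limited 1-D regression tree by greedy squared-error splits
    and return the in-sample prediction for every training point."""
    order = sorted(range(len(y)), key=lambda i: feature[i])
    pred = [0.0] * len(y)

    def sse(idx):
        mean = sum(y[i] for i in idx) / len(idx)
        return sum((y[i] - mean) ** 2 for i in idx)

    def grow(idx, depth):
        best_gain, best_cut = 0.0, None
        if depth < max_depth and len(idx) >= 2:
            total = sse(idx)
            for cut in range(1, len(idx)):
                if feature[idx[cut - 1]] == feature[idx[cut]]:
                    continue  # tied values cannot be separated by a threshold
                g = total - sse(idx[:cut]) - sse(idx[cut:])
                if g > best_gain:
                    best_gain, best_cut = g, cut
        if best_cut is None:  # leaf: predict the mean of the node
            mean = sum(y[i] for i in idx) / len(idx)
            for i in idx:
                pred[i] = mean
            return
        grow(idx[:best_cut], depth + 1)
        grow(idx[best_cut:], depth + 1)

    grow(order, 0)
    return pred


def r_squared(y, pred):
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, pred))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


rng = random.Random(0)  # fixed seed, assumed for reproducibility
a, b, c, d = (rng.uniform(-10, 10) for _ in range(4))
x = [rng.random() for _ in range(200)]
y = [a * v ** 3 + b * v ** 2 + c * v + d for v in x]

features = {"x": x, "sqrt(x)": [math.sqrt(v) for v in x], "x^2": [v * v for v in x]}
preds = {name: fit_predict(values, y) for name, values in features.items()}
r2 = {name: r_squared(y, p) for name, p in preds.items()}
print(r2)
```

Because the three representations sort the samples in the same order, the greedy search visits the same candidate partitions and the fitted values at the training points coincide. Predictions at unseen points additionally depend on how an implementation places thresholds between observed values, which this sketch does not model.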

Table 2.
Example 1 - Decision Tree Implementation Details and Hyperparameter Settings Used.

| Item | Value |
|---|---|
| Implementation | scikit-learn |
| scikit-learn version | 1.8.0 |
| Criterion | squared_error |
| Splitter | best |
| Max depth | None (unrestricted) |
| Min samples split | 2 |
| Min samples leaf | 1 |
| Min weight fraction leaf | 0.0 |
| Max features | None |
| Max leaf nodes | None |
| Min impurity decrease | 0.0 |
| Pruning strategy | Cost-complexity pruning |
| ccp_alpha | 0.0 |

Proposition 1 has implications not only for the transformation of variables but also for their selection. Such cases arise when, although two variables represent conceptually different properties, one can be expressed as a monotonic function of the other within the range captured by the sample data. For instance, variables often exhibit trends over time. If a variable increases strictly over the period covered by the sample data, it essentially constitutes an order-preserving one-to-one mapping of the time points, which implies that a decision tree trained with this variable cannot differ from a tree trained with the time variable. Consider the following example.
Example 2
Consider a decision tree to forecast the Consumer Price Index (CPI) based on the money supply (M2).² The training utilizes monthly data observed from January 2022 to December 2022, as detailed in Table 3.
Table 3.
Example 2 - Monthly M2 and CPI Data for 2022.

| Period | M2 | CPI |
|---|---|---|
| Jan-22 | 19,323.5 | 282.3 |
| Feb-22 | 19,560.9 | 284.5 |
| Mar-22 | 19,800.2 | 287.5 |
| Apr-22 | 20,125.6 | 288.7 |
| May-22 | 20,429.9 | 291.3 |
| Jun-22 | 20,473.4 | 294.9 |
| Jul-22 | 20,625.3 | 294.9 |
| Aug-22 | 20,836.0 | 295.2 |
| Sep-22 | 20,965.8 | 296.3 |
| Oct-22 | 21,146.2 | 297.8 |
| Nov-22 | 21,320.4 | 298.6 |
| Dec-22 | 21,500.4 | 298.8 |

Notes: M2 is measured in billions of U.S. dollars, while the CPI is an index with a base value of 100. Both the M2 and CPI were obtained from the economic statistics database of the Federal Reserve Bank of St. Louis.

An examination of the M2 money supply reveals a consistent upward trend across the months, albeit with varying rates of increase. According to Proposition 1, this suggests that the information regarding the M2 supply is qualitatively equivalent to that provided by the monthly time points. In Figure 1, we compare decision trees trained on the relationship between the M2 money supply and CPI with those trained on the relationship between the monthly time points and CPI. Both trees are found to branch at the same split point.
Figure 1.
Example 2 - Comparison of trained decision trees. (a) Decision tree with M2. (b) Decision tree with integerized month.
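The comparison in Example 2 can be sketched with the Table 3 data. The following is a minimal illustration, not the implementation behind Figure 1: it searches only for the first (root) split by minimizing the within-child sum of squared errors, and the helper names `sse` and `best_split` are hypothetical. Because M2 is strictly increasing in the month index over the sample, the two searches should select the same partition of the twelve observations, although the thresholds are expressed on different scales.

```python
# Table 3 data: monthly M2 (billions of USD) and CPI for 2022.
m2 = [19323.5, 19560.9, 19800.2, 20125.6, 20429.9, 20473.4,
      20625.3, 20836.0, 20965.8, 21146.2, 21320.4, 21500.4]
cpi = [282.3, 284.5, 287.5, 288.7, 291.3, 294.9,
       294.9, 295.2, 296.3, 297.8, 298.6, 298.8]
month = list(range(1, 13))  # integerized month, 1 = Jan-22


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, left_indices) of the SSE-minimizing split x <= threshold."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for pos in range(1, len(order)):
        lo, hi = x[order[pos - 1]], x[order[pos]]
        if lo == hi:
            continue  # tied values cannot be separated by a threshold
        left, right = order[:pos], order[pos:]
        cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
        if best is None or cost < best[0]:
            best = (cost, (lo + hi) / 2, frozenset(left))
    return best[1], best[2]


thr_m2, left_m2 = best_split(m2, cpi)
thr_month, left_month = best_split(month, cpi)

print("M2 threshold:", thr_m2, "| month threshold:", thr_month)
print("same partition:", left_m2 == left_month)
```

The thresholds differ because they live on different scales, but the set of observations sent to each child is the same, which is the sense of equivalence in Proposition 1.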

The proposition presented above analyzes, from a mathematical perspective, the role of order-preserving mappings between variables in decision trees. Nevertheless, even when two variables satisfy such a relationship, their use in an implemented decision tree algorithm does not necessarily result in complete predictive equivalence. This discrepancy arises because the actual tree induction procedure does not operate by directly enumerating and comparing all possible partitions of the sample space, as assumed in the theoretical formulation, but instead searches over a finite set of candidate threshold values that implicitly induce partitions.

In practical CART implementations (including common software libraries such as scikit-learn), the algorithm evaluates splits of the form ${A \leq θ}$ for a finite collection of threshold candidates $θ$ . These candidates are not arbitrary real numbers, but are typically constructed as midpoints between consecutive ordered observations. Specifically, letting $a_{(i)}$ and $a_{(i + 1)}$ denote adjacent order statistics of variable $A$ , a representative threshold candidate is given by
$θ_{i}^{(A)} = \frac{a_{(i)} + a_{(i + 1)}}{2} .$
All values within the open interval $(a_{(i)}, a_{(i + 1)})$ induce the same partition of the sample, and the midpoint is chosen as a convenient numerical representative. The algorithm then computes the corresponding impurity reduction associated with the induced partition.
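As a minimal sketch of the midpoint rule described above (the function names and the sample values are hypothetical, and this is not the source code of any particular library), candidate thresholds can be generated from the distinct sorted values, and any threshold inside the same open interval can be checked to induce the same partition:

```python
def midpoint_thresholds(values):
    """Midpoints between consecutive distinct sorted values."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


def left_side(values, theta):
    """Indices of observations sent to the left child by the rule value <= theta."""
    return frozenset(i for i, v in enumerate(values) if v <= theta)


a = [3.0, 1.0, 4.0, 1.0, 5.5]          # hypothetical sample of variable A
thetas = midpoint_thresholds(a)
print(thetas)                           # [2.0, 3.5, 4.75]

# Every threshold strictly between the adjacent values 3.0 and 4.0
# sends the same observations to the left child as the midpoint 3.5.
print([left_side(a, t) == left_side(a, 3.5) for t in (3.1, 3.5, 3.9)])
```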

At the theoretical level, if $B = f (A)$ is a strictly monotone transformation of $A$ , Proposition 1 guarantees that for every partition induced by $A$ , there exists a corresponding partition induced by $B$ that is identical as a set-theoretic division of the sample. However, at the level of implementation, the reproduction of the same partition relies on whether an appropriate threshold value is present among the finite set of candidates actually evaluated by the algorithm.

In particular, when $f$ is nonlinear, the threshold corresponding to $θ_{i}^{(A)}$ in the transformed scale is
$f (θ_{i}^{(A)}) = f (\frac{a_{(i)} + a_{(i + 1)}}{2}),$
whereas the threshold candidate generated directly from variable $B$ is
$θ_{i}^{(B)} = \frac{b_{(i)} + b_{(i + 1)}}{2} = \frac{f (a_{(i)}) + f (a_{(i + 1)})}{2} .$
For a nonlinear $f$ , these two quantities generally differ:
$f (\frac{a_{(i)} + a_{(i + 1)}}{2}) \neq \frac{f (a_{(i)}) + f (a_{(i + 1)})}{2} .$
Although $θ_{i}^{(B)}$ still lies within the interval $(b_{(i)}, b_{(i + 1)})$ and therefore induces the same partition in exact arithmetic, practical implementations rely on finite-precision floating-point representations. When the scale of the data is large or the transformation $f$ significantly compresses or expands distances, rounding errors may cause $θ_{i}^{(B)}$ to be represented very close to, or even numerically indistinguishable from, one of the boundary values. Since CART implementations typically apply comparison rules of the form “ $\leq$ ” versus “>”, such numerical effects can lead to subtle differences in how observations are assigned to child nodes, resulting in partitions that deviate from their theoretical counterparts.
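The gap between $f (θ_{i}^{(A)})$ and $θ_{i}^{(B)}$ can be illustrated numerically. The sketch below assumes $f (x) = x^{2}$ on a few hypothetical positive values; it only checks that the two thresholds differ while both fall inside $(b_{(i)}, b_{(i + 1)})$ and yield the same partition in this small, well-scaled case. It does not reproduce the rounding failures discussed above, which require much more extreme scales.

```python
def f(x):
    """Strictly increasing on the positive reals (the squared transformation)."""
    return x * x


a = [8.0, 9.5, 11.0, 14.2]              # hypothetical positive values of A, sorted
b = [f(v) for v in a]                   # B = f(A)

rows = []
for a_lo, a_hi, b_lo, b_hi in zip(a, a[1:], b, b[1:]):
    theta_a = (a_lo + a_hi) / 2         # candidate generated from A
    theta_b = (b_lo + b_hi) / 2         # candidate generated from B
    mapped = f(theta_a)                 # A's candidate carried to B's scale
    same_partition = [v <= theta_a for v in a] == [v <= theta_b for v in b]
    rows.append((mapped, theta_b, b_lo < mapped < b_hi, same_partition))

for row in rows:
    print(row)
```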

An additional source of divergence arises from tie-breaking rules. In practice, it is common for multiple variable–threshold combinations to yield identical or nearly identical impurity reductions. This situation is especially likely when two variables $A$ and $B = f (A)$ are informationally equivalent in the sense of Proposition 1. Nevertheless, the algorithm must select a single split, and this selection is governed by implementation-specific rules—such as variable ordering, search order, or the first candidate encountered—that are extraneous to the theoretical definition of CART. As a result, one variable may be consistently selected while the other is never chosen, despite their conceptual equivalence.
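The effect of tie-breaking can be sketched with one hypothetical rule, "the first candidate encountered wins", implemented by a strict inequality in the search loop. This rule is an assumption made for illustration and is not claimed to be the rule of any specific library. With $B = A^{2}$ on positive data, every split on $B$ ties exactly with the corresponding split on $A$, so the selected variable depends only on the order in which the features are listed.

```python
def sse(values):
    """Sum of squared errors around the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_feature(features, y):
    """Greedy search over (name, column) pairs; strict '<' keeps the first of tied candidates."""
    best = None
    for name, x in features:
        order = sorted(range(len(x)), key=lambda i: x[i])
        for pos in range(1, len(order)):
            if x[order[pos - 1]] == x[order[pos]]:
                continue
            left, right = order[:pos], order[pos:]
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, name)
    return best[1]


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # hypothetical variable A
b = [v ** 2 for v in a]                 # B = f(A), strictly increasing on these values
y = [1.1, 0.9, 1.0, 3.2, 2.9, 3.1]      # hypothetical targets

print(best_feature([("A", a), ("B", b)], y))   # A
print(best_feature([("B", b), ("A", a)], y))   # B
```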

Because CART employs a greedy, node-wise optimization strategy, even minor differences in early partitions can propagate through the tree. A small deviation in the first split alters the subset of observations reaching subsequent nodes, thereby changing the set of candidate splits and potentially leading to a substantially different tree structure. Consequently, although two variables $A$ and $B = f (A)$ are guaranteed to admit identical partitions at the theoretical level, the use of finite candidate thresholds, floating-point arithmetic, and greedy split selection implies that their empirical learning trajectories need not coincide in practice.

These considerations suggest that, although equivalence between variables related by order-preserving mappings holds strongly at the theoretical level, it may not always be reproduced perfectly in practice under implemented algorithms and finite-sample settings. This observation does not undermine the validity of the theoretical results; rather, it highlights the need to examine how stably such results are realized in actual data analysis environments.

Accordingly, in order to assess whether the theoretical properties are largely preserved under real data and implemented CART algorithms, we consider the following illustrative example.
Example 3
We use the Portuguese “Vinho Verde” wine quality dataset from the UCI Machine Learning Repository, which comprises red and white variants, and focus on the white wine subset. This subset consists of 4,898 observations, each describing physicochemical properties of wine samples (such as acidity, residual sugar, density, sulphates, and alcohol content) along with a sensory quality score assigned by human experts. Following common practice, we treat the quality score as a continuous outcome and consider a regression setting.

Our analysis focuses on the variable alcohol, which takes strictly positive values in the data (ranging from 8 to 14.2). We compare two model specifications that differ only in the representation of this variable: (i) the raw alcohol content, and (ii) its squared transformation. Since the mapping $x \mapsto x^{2}$ is strictly monotone increasing on the positive real line, the two representations are theoretically order-preserving and therefore admit identical sample partitions under the conditions of Proposition 1.

For each trial, the data are randomly split into training and test sets, with 90% of the observations used for training and the remaining 10% held out for evaluation. A CART-style decision tree regressor is fitted using identical hyperparameters across specifications (maximum depth of 5 and a minimum leaf size of 10). In the transformed specification, the original alcohol variable is replaced by its squared counterpart, while all other covariates remain unchanged. This procedure is repeated over 100 independent random splits to assess the stability of the comparison.

Predictive performance is evaluated on the test set using $R^{2}$ , mean absolute error (MAE), and root mean squared error (RMSE). Rather than requiring exact numerical equality, we assess whether differences in predictive performance are practically insignificant, using a tolerance of 0.01 for $R^{2}$ and 5% of the typical error magnitude for MAE and RMSE.
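The practical-insignificance rule can be written as a short check. The sketch below is an assumption about how such a rule might be coded; the function name and dictionary keys are hypothetical. The relative tolerance is taken with respect to the raw-specification error, and the inputs are the across-trial means reported in Table 4, used purely as illustrative numbers rather than the per-trial values to which the rule is actually applied.

```python
def practically_equal(raw, transformed, *, r2_tol=0.01, rel_tol=0.05):
    """Apply the tolerance rule to dicts with keys 'r2', 'mae', 'rmse'."""
    return {
        "r2": abs(raw["r2"] - transformed["r2"]) <= r2_tol,
        "mae": abs(raw["mae"] - transformed["mae"]) <= rel_tol * raw["mae"],
        "rmse": abs(raw["rmse"] - transformed["rmse"]) <= rel_tol * raw["rmse"],
    }


# Illustrative inputs: across-trial means from Table 4 (not per-trial values).
raw = {"r2": 0.297043, "mae": 0.579979, "rmse": 0.745606}
squared = {"r2": 0.297040, "mae": 0.579986, "rmse": 0.745607}

print(practically_equal(raw, squared))
# {'r2': True, 'mae': True, 'rmse': True}
```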

Across all trials, the predictive performance of the two specifications is virtually indistinguishable. The maximum absolute difference observed between the raw and squared specifications is approximately 0.002 for $R^{2}$ , 0.0011 for MAE, and 0.0010 for RMSE, all well below the predefined tolerance thresholds (0.01 for $R^{2}$ and 5% of the raw-specification error for MAE and RMSE). Indeed, according to the practical insignificance criteria, the two specifications are classified as equivalent in 100% of the trials for all three performance metrics.

These findings are also reflected in the aggregated results. The average $R^{2}$ across trials is 0.297 for both specifications, with nearly identical standard deviations. Similarly, the mean MAE and RMSE differ only at the fifth decimal place or beyond, well below any level of practical relevance. Overall, this example demonstrates that, in a realistic data environment and under an implemented CART algorithm, the equivalence implied by order-preserving transformations is not only theoretically valid but also empirically dominant, with any deviations being negligible in magnitude (Tables 4–6).
Table 4.
Example 3 - Predictive Performance Comparison for Wine Quality Experiments.

Metric Raw alcohol Squared alcohol

$R^{2}$ 0.297043 (0.037445) 0.297040 (0.037425)

MAE 0.579979 (0.022588) 0.579986 (0.022569)

RMSE 0.745606 (0.029295) 0.745607 (0.029270)

Notes: Performance comparison between the raw alcohol feature and its squared transformation over 100 random train–test splits. Reported values are the mean and standard deviation (in parentheses) across trials.

Table 5.
Example 3 - Rates of Practically Insignificant Differences between the Two Specifications over 100 Trials.

Criterion Same rate Same count

$| Δ R^{2} | \leq 0.01$ 1.000 100

$| Δ MAE | \leq 0.05 \times {MAE}_{raw}$ 1.000 100

$| Δ RMSE | \leq 0.05 \times {RMSE}_{raw}$ 1.000 100

Notes: For MAE and RMSE, the tolerance is defined as 5% of the corresponding error magnitude in the raw-alcohol specification (trial-by-trial).

Table 6.
Example 3 - Decision Tree Implementation Details and Hyperparameter Settings Used in the Wine Quality Experiment.

Item Value

Implementation scikit-learn

scikit-learn version 1.8.0

Criterion squared_error

Splitter best

Max depth 5

Min samples split 2

Min samples leaf 10

Min weight fraction leaf 0.0

Max features None

Max leaf nodes None

Min impurity decrease 0.0

Pruning strategy Cost-complexity pruning

ccp_alpha 0.0

Proposition 2
Suppose there are two variables $j$ and $k$ with $j \neq k$ . If variable $k$ is an order-preserving surjective mapping of variable $j$ and the variable space $X_{k}$ is a countable set, then variable $k$ is substitutable with variable $j$ .

Proof. Before presenting the formal proof, it is useful to outline the main idea. The key step is to show that every partition induced by the lower-cardinality variable $k$ can be replicated by an appropriate threshold on the higher-cardinality variable $j$ . Because the mapping from $j$ to $k$ is order-preserving and surjective, each value of $k$ corresponds to a nonempty interval of $j$ -values. By selecting the minimal preimage of a given threshold in $k$ , we can construct an equivalent split using $j$ . This establishes that the set of feasible partitions generated by $k$ is a subset of those generated by $j$ .

Define the set of all feasible splits of $S_{z}$ using variable $i$ as $ω_{i} (S_{z}) \equiv \{β (S_{z}, i, {\bar{x}}_{i}) \mid {\bar{x}}_{i} \in X_{i}\}$ , where $i = j, k$ . The variable $k$ is substitutable if $I_{k}^{} (S_{z}) \in Θ_{j} (S_{z})$ is satisfied for any $S_{z} \subset S_{0}$ . Note that $I_{k}^{} (S_{z}) \in Θ_{j} (S_{z})$ is satisfied when $ω_{k} (S_{z}) \subset ω_{j} (S_{z})$ , because the latter implies $Θ_{k} (S_{z}) \subset Θ_{j} (S_{z})$ . Since $S_{l} \cup S_{h} = S_{z}$ and $S_{l} \cap S_{h} = \emptyset$ , the equality $\{s_{i} \in S_{z} \mid x_{j}^{i} \geq {\tilde{x}}_{j}\} = \{s_{i} \in S_{z} \mid x_{k}^{i} \geq {\bar{x}}_{k}\}$ also implies $\{s_{i} \in S_{z} \mid x_{j}^{i} < {\tilde{x}}_{j}\} = \{s_{i} \in S_{z} \mid x_{k}^{i} < {\bar{x}}_{k}\}$ . Hence, if for every ${\bar{x}}_{k} \in X_{k}$ there exists ${\tilde{x}}_{j} \in X_{j}$ such that $\{s_{i} \in S_{z} \mid x_{j}^{i} \geq {\tilde{x}}_{j}\} = \{s_{i} \in S_{z} \mid x_{k}^{i} \geq {\bar{x}}_{k}\}$ , then $β (S_{z}, k, {\bar{x}}_{k}) \in ω_{j} (S_{z})$ for all ${\bar{x}}_{k} \in X_{k}$ , which is equivalent to $ω_{k} (S_{z}) \subset ω_{j} (S_{z})$ .

To show that such an ${\tilde{x}}_{j} \in X_{j}$ exists, that is, that $\{s_{i} \in S_{z} \mid x_{j}^{i} \geq {\tilde{x}}_{j}\} = \{s_{i} \in S_{z} \mid x_{k}^{i} \geq {\bar{x}}_{k}\}$ , define the set $χ_{j} (x_{k}) = \{x_{j} \in X_{j} \mid f (x_{j}) = x_{k}\}$ , which collects all values of the original variable $j$ that are mapped to the same transformed value $x_{k}$ . Because the mapping $f$ from $x_{j}$ to $x_{k}$ is surjective, $χ_{j} (x_{k}) \neq \emptyset$ for all $x_{k} \in X_{k}$ . The assumption that $X_{k}$ is countable ensures that the elements of $χ_{j} (x_{k})$ can be ordered and that a well-defined minimal element exists, which allows the construction of a threshold on $j$ that exactly reproduces the split induced by $k$ . Without countability, such a minimal representative need not exist, and the argument would require additional technical conditions.

First, we show that if $s_{i} \in S_{h} (S_{z}, j, \min χ_{j} ({\bar{x}}_{k}))$ , then $s_{i} \in S_{h} (S_{z}, k, {\bar{x}}_{k})$ . Suppose, to the contrary, that there exists $s_{i} \in S_{h} (S_{z}, j, \min χ_{j} ({\bar{x}}_{k}))$ such that $s_{i} \notin S_{h} (S_{z}, k, {\bar{x}}_{k})$ . Membership $s_{i} \in \{s_{i} \in S_{z} \mid x_{j}^{i} \geq \min χ_{j} ({\bar{x}}_{k})\}$ implies $s_{i} \in \{s_{i} \in S_{z} \mid x_{k}^{i} \geq {\bar{x}}_{k}\}$ because $f (\min χ_{j} ({\bar{x}}_{k})) = {\bar{x}}_{k}$ and $f$ is order-preserving. On the other hand, $s_{i} \notin S_{h} (S_{z}, k, {\bar{x}}_{k})$ means $s_{i} \in S_{l} (S_{z}, k, {\bar{x}}_{k}) = \{s_{i} \in S_{z} \mid x_{k}^{i} < {\bar{x}}_{k}\}$ , which contradicts $s_{i} \in \{s_{i} \in S_{z} \mid x_{k}^{i} \geq {\bar{x}}_{k}\}$ .

Second, we show that if $s_{i} \in S_{h} (S_{z}, k, {\bar{x}}_{k})$ , then $s_{i} \in S_{h} (S_{z}, j, \min χ_{j} ({\bar{x}}_{k}))$ . Suppose, to the contrary, that there exists $s_{i} \in S_{h} (S_{z}, k, {\bar{x}}_{k})$ such that $s_{i} \notin S_{h} (S_{z}, j, \min χ_{j} ({\bar{x}}_{k}))$ . Then $s_{i} \in S_{l} (S_{z}, j, \min χ_{j} ({\bar{x}}_{k}))$ . Because $X_{k}$ is a countable set ordered by $f$ from $x_{j}$ , we have $\{s_{i} \in S_{z} \mid x_{j}^{i} < \min χ_{j} ({\bar{x}}_{k})\} = \{s_{i} \in S_{z} \mid x_{j}^{i} \leq \max χ_{j} ({\bar{x}}_{k} - δ)\}$ , where ${\bar{x}}_{k} - δ$ is the largest element of $X_{k}$ that satisfies ${\bar{x}}_{k} - δ < {\bar{x}}_{k}$ . But $s_{i} \in \{s_{i} \in S_{z} \mid x_{j}^{i} \leq \max χ_{j} ({\bar{x}}_{k} - δ)\}$ implies $s_{i} \in \{s_{i} \in S_{z} \mid x_{k}^{i} \leq {\bar{x}}_{k} - δ\}$ , which contradicts $s_{i} \in S_{h} (S_{z}, k, {\bar{x}}_{k}) = \{s_{i} \in S_{z} \mid x_{k}^{i} \geq {\bar{x}}_{k}\}$ .

Together, these two steps show $S_{h} (S_{z}, j, \min χ_{j} ({\bar{x}}_{k})) \subset S_{h} (S_{z}, k, {\bar{x}}_{k})$ and $S_{h} (S_{z}, k, {\bar{x}}_{k}) \subset S_{h} (S_{z}, j, \min χ_{j} ({\bar{x}}_{k}))$ , which implies $S_{h} (S_{z}, k, {\bar{x}}_{k}) = S_{h} (S_{z}, j, \min χ_{j} ({\bar{x}}_{k}))$ . Hence, for every ${\bar{x}}_{k} \in X_{k}$ there exists ${\tilde{x}}_{j} \in X_{j}$ that satisfies $\{s_{i} \in S_{z} \mid x_{j}^{i} \geq {\tilde{x}}_{j}\} = \{s_{i} \in S_{z} \mid x_{k}^{i} \geq {\bar{x}}_{k}\}$ .
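The minimal-preimage construction used in the proof can be checked on a small example. The sketch below is an illustration under stated assumptions: the mapping $f$ is taken to be five-year age binning, the ages are hypothetical, and the preimage $χ_{j} ({\bar{x}}_{k})$ is restricted to values observed in the sample rather than the full variable space $X_{j}$ . For each level ${\bar{x}}_{k}$ , the threshold $\min χ_{j} ({\bar{x}}_{k})$ on $j$ should reproduce the upper child $S_{h}$ obtained by thresholding $k$ at ${\bar{x}}_{k}$ .

```python
def f(age):
    """Order-preserving, many-to-one map: lower edge of the five-year bin."""
    return (age // 5) * 5


ages = [21, 23, 27, 30, 34, 38, 41]       # hypothetical sample values of variable j
bins = [f(a) for a in ages]               # corresponding values of variable k

checks = {}
for x_bar_k in sorted(set(bins)):
    chi = [a for a in ages if f(a) == x_bar_k]        # sample preimage of x_bar_k
    x_tilde_j = min(chi)                              # threshold on j
    high_j = {i for i, a in enumerate(ages) if a >= x_tilde_j}
    high_k = {i for i, b in enumerate(bins) if b >= x_bar_k}
    checks[x_bar_k] = (x_tilde_j, high_j == high_k)

print(checks)
```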

The intuition behind Proposition 2 is that an order-preserving surjective transformation reduces the set of feasible split locations available to a decision tree. Unlike the one-to-one mappings of Proposition 1, such a transformation may merge multiple distinct values into the same transformed value: for example, rounding a continuous variable to one decimal place, converting income into brackets (low/medium/high), or binning age into five-year intervals. These transformations keep the overall order but make the variable coarser by reducing the number of distinct levels (cardinality). In CART-style trees, each variable defines a collection of candidate threshold splits, and the quality of a split is evaluated by the impurity reduction it induces. Because values that are collapsed into a single level can no longer be separated, certain threshold positions that are available under the original variable no longer exist under its surjective image.

A simple schematic example illustrates this point. Consider five observations with variable $j$ taking values $\{1.1, 1.4, 1.7, 2.2, 2.6\}$ , and define $k = \operatorname{round} (j)$ , which maps these values to the smaller set of ordered levels $\{1, 2, 3\}$ . The key point is not inclusion between the value sets of $j$ and $k$ , but inclusion between the partition spaces they induce. Any threshold split defined on $k$ (for example, $k \geq 2$ ) can be replicated by an appropriate threshold split on $j$ (such as $j \geq 1.5$ ), whereas the converse is not true: $j$ admits additional split locations within levels of $k$ that cannot be represented using $k$ . This asymmetry reflects the fact that discretization through rounding, which is an order-preserving surjective mapping, can only restrict, and never expand, the space of feasible partitions, consistent with Proposition 2.
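The inclusion between partition spaces in this schematic example can be verified by enumeration. The following sketch uses the five values given above; the helper name is hypothetical, and thresholds are restricted to values that occur in the sample, with each split represented by its upper child $\{i : x^{i} \geq t\}$ .

```python
def high_side_partitions(values):
    """Upper children {i : values[i] >= t} over thresholds t taken from the sample values."""
    return {
        frozenset(i for i, v in enumerate(values) if v >= t)
        for t in set(values)
    }


j = [1.1, 1.4, 1.7, 2.2, 2.6]
k = [round(v) for v in j]                 # [1, 1, 2, 2, 3]

parts_j = high_side_partitions(j)
parts_k = high_side_partitions(k)

print(len(parts_k), len(parts_j))         # 3 5
print(parts_k <= parts_j)                 # True: every split on k is available on j
print(parts_j <= parts_k)                 # False: j admits splits that k cannot express
```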

From an informational perspective, this means that the set of impurity reductions achievable by splitting on the transformed variable is a subset of those achievable by splitting on the original variable. The proposition therefore does not assert that the higher-cardinality variable will necessarily be selected by a greedy algorithm in finite samples, but rather that it weakly dominates its surjective image in terms of the best split quality that can be achieved in principle. In other words, reducing cardinality through an order-preserving mapping can only restrict, and never expand, the space of attainable partitions.

This perspective suggests practical implications for feature selection. If multiple predictors are linked by order-preserving relationships, then, in principle, the higher-cardinality predictor offers at least as many feasible split locations as its lower-cardinality counterparts. Thus, keeping the most fine-grained version is often sufficient for preserving split opportunities, while additional coarser variants may contribute little new information under axis-aligned threshold splits.

Discretization and binning provide common and intuitive examples of order-preserving surjective transformations, in which multiple adjacent values are deliberately collapsed into the same level, thereby reducing the number of distinct split locations available to the tree. More generally, however, Proposition 2 applies to any order-preserving transformation that reduces the effective cardinality of a variable, regardless of whether the resulting representation is explicitly discrete or remains continuous.

The following examples illustrate these implications in settings such as rounding-based discretization and a simple time-series substitution case.
Example 4
Consider a scenario where the relationship between $x$ and $y$ is identical to that in Example 1. The data-generating process follows equation (5), where $a$ , $b$ , $c$ , and $d$ are independently drawn from a uniform distribution on $[- 10, 10]$ in each simulation. In this experiment, we compare the raw variable $x$ with its discretized (quantized) versions obtained by rounding $x$ to three and two decimal places. Rounding is a simple discretization rule that induces an order-preserving surjective map from a continuous variable to a finite set; consistent with Proposition 2, this reduction in cardinality restricts the set of feasible split points and can degrade predictive performance. We train and evaluate regression trees using each version of the feature, repeating the procedure for 100 simulations. As shown in Figure 2 and Table 7, predictive performance deteriorates monotonically as discretization becomes coarser. In all simulations and for all evaluation metrics considered ( $R^{2}$ , MSE, and RMSE), performance is highest with the unrounded $x$ , followed by rounding to three decimal places, and lowest with rounding to two decimal places.
Figure 2.
Example 4 - Performance degradation under discretization of the input variable. Across all metrics, predictive performance deteriorates monotonically as discretization becomes coarser. (a) $R^{2}$ , (b) MSE and (c) RMSE.

Table 7.
Example 4 - Performance Degradation under Rounding.

Metric $x$ round $(x, 3)$ round $(x, 2)$

$R^{2}$ 0.999986 0.999982 0.999809

(0.000015) (0.000019) (0.000125)

MSE $9.10 \times 10^{- 5}$ $1.19 \times 10^{- 4}$ $1.326 \times 10^{- 3}$

( $1.24 \times 10^{- 4}$ ) ( $1.58 \times 10^{- 4}$ ) ( $1.521 \times 10^{- 3}$ )

RMSE 0.007829 0.009013 0.031239

(0.005503) (0.006150) (0.018813)

Notes: Entries report the mean (first line) and standard deviation (in parentheses, second line) across 100 simulations. The raw feature $x$ is used without rounding, while round $(x, 3)$ and round $(x, 2)$ discretize $x$ to three and two decimal places, respectively. Settings for the experiments are the same as those used in Example 1. Consistent with Proposition 2, performance deteriorates as rounding becomes coarser.
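The loss of candidate split points under rounding can be made concrete by counting distinct values. The sketch below is only illustrative: the design of $x$ (1,000 draws from a uniform distribution on $[0, 1]$ with a fixed seed) is an assumption and is not the design used in Example 4. With $m$ distinct values, at most $m - 1$ thresholds separate adjacent values, so a lower cardinality means fewer candidate splits.

```python
import random

rng = random.Random(0)                               # fixed seed, for reproducibility only
x = [rng.uniform(0.0, 1.0) for _ in range(1000)]     # assumed design, not the paper's

card_raw = len(set(x))                               # distinct raw values
card_3 = len({round(v, 3) for v in x})               # after rounding to 3 decimals
card_2 = len({round(v, 2) for v in x})               # after rounding to 2 decimals

print("distinct values:", card_raw, card_3, card_2)
print("candidate thresholds:", card_raw - 1, card_3 - 1, card_2 - 1)
```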

Example 5
Consider a decision tree that forecasts the price of Bitcoin using the base interest rate (i.e., the federal funds rate) as the predictor. The price of Bitcoin and the federal funds rate were both recorded on a daily basis throughout 2022, from January 1 to December 31. Figure 3 presents the Bitcoin price and the federal funds rate during this period. First, the federal funds rate (shown in panel (b)) consistently increased over time, suggesting that it could potentially be substituted by a time variable. Second, the federal funds rate did not change daily. Given its role as the benchmark interest rate, the federal funds rate acts as an anchor for monetary policy direction. This feature implies that while differences in the federal funds rate can be identified by dates, the converse is not true; differences identified by date do not necessarily correspond to changes in the federal funds rate. Thus, we can expect the federal funds rate to be redundant, because a simple time variable can capture at least as much information as the federal funds rate.
Figure 3.
Example 5 - Time series for Bitcoin price and Federal funds rate. (a) Bitcoin price and (b) federal funds rate.

To demonstrate, we conducted 100 simulations in which we randomly selected half of the samples to train on the relationship between the federal funds rate, time (day), and Bitcoin prices and then predicted for the remaining evaluation samples. The federal funds rate exhibits an average cardinality of fewer than 10 distinct values across experiments, whereas the time index (day) has a cardinality of roughly 180 (Table 8), reflecting the fact that the policy rate changes only intermittently while time progresses daily. Consistent with Proposition 2, this severe reduction in cardinality restricts the feasible split points available to the tree. As expected, in all 100 simulations, the model using time as the predictor outperforms the model using the federal funds rate across all performance metrics, including $R^{2}$ , MSE, and RMSE (Figure 4 and Table 8). The mean $R^{2}$ for the time-based model is $0.987$ with a standard deviation of $0.003$ , whereas the corresponding values for the rate-based model are $0.955$ and $0.005$ , respectively.
Figure 4.
Example 5 - Performance degradation with coarse variable. Across all metrics, predictive performance deteriorates monotonically as discretization becomes coarser. (a) $R^{2}$ , (b) MSE and (c) RMSE.

Table 8.
Example 5 - Predictive Performance of the Time Index and Federal Funds Rate in Predicting Bitcoin Prices.

Metric Federal Funds Rate (FFR) Day (Time Index)

Cardinality 9.65 183.00

(0.50) (0.00)

$R^{2}$ 0.954756 0.987083

(0.004792) (0.003044)

MSE $4.70 \times 10^{6}$ $1.34 \times 10^{6}$

( $5.14 \times 10^{5}$ ) ( $3.16 \times 10^{5}$ )

RMSE 2164.489 1150.227

(118.198) (133.162)

Notes: Entries report the mean (first line) and standard deviation (in parentheses, second line) across 100 Monte Carlo replications. In each replication, half of the observations are randomly selected for training and the remaining half are used for evaluation. Cardinality denotes the number of distinct feature values in the training sample (i.e., the number of feasible split points available to the tree). Settings for the experiments are the same as those used in Example 1. In all 100 replications, using the day index strictly outperforms using the federal funds rate across all metrics ( $R^{2}$ , MSE, and RMSE).

Proposition 2 characterizes a fundamentally different type of result from Proposition 1. Whereas Proposition 1 establishes a strict equivalence between variables under order-preserving bijections, Proposition 2 addresses a weaker, yet important, notion of informational dominance. Its central implication can be summarized as follows: if the best achievable split quality can be compared exactly, then a variable with higher cardinality is never inferior to its order-preserving surjective image.

Crucially, this statement does not concern the success or failure of a particular tree-building algorithm in finite samples. Rather, it is a claim about the inclusion relationship between information sets, formalized through the feasible gain spaces of the variables. Specifically, Proposition 2 proves that when a variable $k$ is an order-preserving surjective mapping of another variable $j$ , the set of all impurity reductions attainable by splitting on $k$ is a subset of those attainable by splitting on $j$ , i.e.,
$Θ_{k} (S_{z}) \subset Θ_{j} (S_{z}),$
(6)
which immediately implies
$\max Θ_{k} (S_{z}) \leq \max Θ_{j} (S_{z}) .$
(7)
This is an upper-bound result: it states that the maximal split quality achievable using the lower-cardinality variable cannot exceed that achievable using the higher-cardinality variable.
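The inclusion of gain sets and the resulting upper bound can be checked numerically on a toy sample. In the sketch below, the impurity reduction is taken to be the decrease in the sum of squared errors, which is an assumption made for illustration; the targets $y$ are hypothetical, and the feature values reuse the rounding example with $k = \operatorname{round} (j)$ .

```python
def sse(values):
    """Sum of squared errors around the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def gain_set(x, y):
    """SSE decreases of all non-trivial splits {x < t} / {x >= t}, t taken from sample values."""
    total = sse(y)
    gains = set()
    for t in set(x):
        low = [yi for xi, yi in zip(x, y) if xi < t]
        high = [yi for xi, yi in zip(x, y) if xi >= t]
        if low and high:                      # skip splits with an empty child
            gains.add(total - sse(low) - sse(high))
    return gains


j = [1.1, 1.4, 1.7, 2.2, 2.6]
k = [round(v) for v in j]                     # order-preserving surjective image of j
y = [0.5, 1.5, 1.0, 3.0, 2.0]                 # hypothetical targets

theta_j = gain_set(j, y)
theta_k = gain_set(k, y)

print(theta_k <= theta_j)                     # inclusion of gain sets, as in (6)
print(max(theta_k) <= max(theta_j))           # upper bound, as in (7)
```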

For this informational dominance to translate into actual split selection, however, an additional condition is required—namely, the exact comparison of impurity values across candidate splits. In practical tree induction algorithms, impurity reductions are computed using sample-based estimates,
$\hat{I} (β (S_{z}, j, {\bar{x}}_{j})),$
(8)
which depend on the finite sample $S_{z}$ and are therefore subject to sampling variability, noise, and estimation error. As a result, the empirical ordering of candidate splits need not coincide with their true ordering at the population level. In particular, it is entirely possible in finite samples that
$\hat{I} (β_{1}) > \hat{I} (β_{2}) \quad \text{while} \quad I (β_{1}) < I (β_{2}),$
(9)
so that a suboptimal split appears preferable due to estimation noise.

By contrast, exact comparison of impurity values refers to an idealized setting in which impurity reductions are evaluated without estimation error. This can be interpreted in two equivalent ways. First, at the population level, impurity reduction is defined in terms of expectations under the true data-generating distribution and thus reflects an intrinsic property of the distribution rather than a random sample. Second, in the infinite-sample limit, standard law-of-large-numbers arguments imply that the sample-based impurity estimates converge almost surely to their population counterparts for all candidate splits. In this limit, the relative ranking of splits is recovered exactly, and the informational dominance established in Proposition 2 is fully realized.

The practical difficulty arises because realistic tree induction operates neither at the population level nor in the infinite-sample limit. In finite samples, variables with higher cardinality generate a larger number of candidate splits, which increases both the opportunity for genuine improvements and the risk that noise-driven splits appear spuriously attractive. Consequently, the dominance result of Proposition 2 should not be interpreted as a guarantee that higher-cardinality variables will always yield better predictive performance in finite-sample greedy tree construction. Rather, it should be understood as an informational benchmark: under exact impurity comparison, higher cardinality weakly dominates its order-preserving surjective image, but this dominance may be attenuated or obscured in practice by estimation noise, sample size limitations, or additional algorithmic constraints.

In this sense, Proposition 2 complements Proposition 1 by clarifying not an invariance property, but a directional one. It delineates the conditions under which reducing cardinality through order-preserving mappings necessarily restricts the space of achievable splits, while also highlighting why such restrictions may sometimes appear beneficial in finite samples due to implicit regularization effects.
Example 6
We consider a real-world regression problem using the Student Performance Dataset from the UCI Machine Learning Repository, which contains academic and socio-demographic information on secondary-school students in Portugal. The target variable is the final mathematics grade $G 3$ , recorded on an integer scale from 0 to 20. Among the available predictors, the first- and second-term grades $G 1$ and $G 2$ are known to be highly informative for predicting $G 3$ , as they summarize students’ accumulated academic performance over time.

To isolate the effect of discretization, we restrict the feature set to $(G 1, G 2)$ and compare two representations of the same underlying information. In the first specification, the raw integer-valued variables $G 1$ and $G 2$ are used directly. In the second specification, both variables are discretized by binning them into 4 intervals, yielding coarser, low-cardinality representations. Apart from this transformation, the learning environment is kept identical. In both cases, a regression decision tree is trained to predict $G 3$ .
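A binning rule of this kind can be sketched as follows. The bin edges are an assumption: the sketch uses four equal-width bins over the nominal grade scale $[0, 20]$ (edges at 5, 10, and 15), which may differ from the edges actually used in the experiment, and the function name is hypothetical. The check confirms that the mapping is order-preserving and reduces the 21 possible grades to 4 levels.

```python
def bin_grade(g, n_bins=4, lo=0, hi=20):
    """Equal-width bin index for a grade on [lo, hi]; the top edge falls in the last bin."""
    width = (hi - lo) / n_bins
    return min(int((g - lo) // width), n_bins - 1)


grades = list(range(0, 21))                   # possible integer grades 0..20
binned = [bin_grade(g) for g in grades]

print(binned)
print("order-preserving:", binned == sorted(binned))
print("cardinality:", len(set(grades)), "->", len(set(binned)))
```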

The experiment is conducted in a Monte Carlo fashion. In each replication, 80% of the observations are randomly selected for training and the remaining 20% are used for evaluation. This procedure is repeated 200 times. For each run, we record $R^{2}$ , mean squared error (MSE), and root mean squared error (RMSE) for both the raw and binned specifications.

Figure 5 reports the distributions of the performance metrics across replications. Across all three metrics, the distribution corresponding to the raw $(G 1, G 2)$ representation is systematically shifted in a favorable direction relative to the binned version: median $R^{2}$ is higher, while median MSE and RMSE are lower. At the same time, the two distributions partially overlap, reflecting the inherent variance of tree-based models and the fact that coarser representations can occasionally reduce overfitting in particular train–test splits.
Figure 5.
Example 6 - Performance degradation under binning.

These results are consistent with Proposition 2. Binning $G 1$ and $G 2$ induces an order-preserving surjective mapping that reduces feature cardinality and hence restricts the set of feasible split points available to the tree. This contraction of the hypothesis space increases approximation error and lowers expected predictive accuracy. Nevertheless, the raw specification does not dominate the binned specification in 100% of replications; the advantage of using raw features is therefore a tendency rather than a deterministic outcome. This partial overlap is natural for high-variance learners such as decision trees, as coarser representations may occasionally improve out-of-sample performance by mitigating overfitting in specific train–test splits. Overall, the systematic shift of the performance distribution in favor of the raw variables supports the theoretical ordering implied by the inclusion $Θ_{binned} \subset Θ_{raw}$ (Tables 9 and 10).
Table 9.
Example 6 - Predictive Performance of Raw and Binned $(G 1, G 2)$ for $G 3$ .

Metric	Raw $(G 1, G 2)$	Binned $(G 1, G 2)$
$R^{2}$ (mean, sd)	0.7893 (0.0593)	0.7707 (0.0525)
$R^{2}$ (median)	0.7942	0.7746
$R^{2}$ (min–max)	0.5851–0.9343	0.6064–0.8939
MSE (mean, sd)	4.2752 (1.3173)	4.6516 (1.2416)
MSE (median)	4.0733	4.6119
MSE (min–max)	1.8656–10.8134	2.2387–10.2497
RMSE (mean, sd)	2.0418 (0.3270)	2.1374 (0.2890)
RMSE (median)	2.0182	2.1470
RMSE (min–max)	1.3658–3.2884	1.4962–3.2015

Notes: Entries are based on 200 Monte Carlo replications with 80% training and 20% test splits. The binned variables are obtained by discretizing $G 1$ and $G 2$ into four equal-width bins. Raw variables retain the original integer grade scale. The settings for the experiments are the same as those used in Example 1.

Table 10.
Example 6 - Frequency that Raw $(G 1, G 2)$ Outperforms Binned $(G 1, G 2)$ .

Statistic	$R^{2}$	MSE	RMSE
Share of experiments where raw is better (%)	74.5	74.5	74.5

Notes: Each entry reports the fraction (in percent) of Monte Carlo replications in which the raw-feature model strictly outperforms the binned-feature model. For $R^{2}$ , “better” means larger; for MSE and RMSE, “better” means smaller. “All metrics” indicates replications in which raw outperforms binned simultaneously in $R^{2}$ , MSE, and RMSE. The summary is based on 200 replications in the provided results file.

The preceding examples focus on single decision trees, for which Proposition 2 implies only a stochastic ordering: although a higher–cardinality representation dominates a coarser one in expectation, individual train–test splits may occasionally favor the latter due to variance and overfitting effects. This behavior was evident in Example 6, where the raw $(G 1, G 2)$ representation was superior on average but did not dominate the binned version in every replication.

An important question is how this relationship changes under ensembling. Methods such as bagging and random forests are designed precisely to reduce the variance of high–variance base learners by averaging across many bootstrap samples. From a bias–variance perspective, discretization increases bias by shrinking the hypothesis space, whereas ensembling primarily reduces variance while leaving bias largely unchanged. This suggests that, when trees are aggregated, the advantage of higher–cardinality representations might be amplified: the variance penalty associated with rich feature spaces is mitigated by averaging, while their expressive power is retained. This can be expressed as follows:
Remark (Ensembling as an expectation operator over hypothesis spaces)

Let $A$ denote a randomized tree–learning algorithm that maps a training sample $S$ and a feature representation $X$ into a predictor $f = A (S, X) \in Θ_{X}$ , where $Θ_{X}$ denotes the set of functions representable by trees constructed from $X$ . For a fixed data–generating process and loss function $L$ , define the generalization risk

$$R (X) = E_{S} [L (A (S, X))],$$

where the expectation is taken over random training samples.

Proposition 2 implies that if $X^{'}$ is an order–preserving surjective transformation of $X$ , then $Θ_{X^{'}} \subset Θ_{X}$ and hence $R (X) \leq R (X^{'})$ in expectation. However, for a single realization of $S$ , the realized risks $L (A (S, X))$ and $L (A (S, X^{'}))$ need not be ordered, reflecting the sampling variability of high–variance learners.

Now consider an ensemble of $B$ independently trained trees,

$$\bar{f}_{B} (X) = \frac{1}{B} \sum_{b = 1}^{B} A (S_{b}, X),$$

where $\{S_{b}\}_{b = 1}^{B}$ are bootstrap resamples of the original dataset. The ensemble risk is

$$R_{B} (X) = E [L (\bar{f}_{B} (X))] .$$

As $B$ increases, $R_{B} (X)$ converges to the risk of the expectation of $A (S, X)$, thereby averaging out sample–specific variance while preserving the approximation properties induced by $Θ_{X}$.

Since discretization replaces $X$ by $X^{'}$ with $Θ_{X^{'}} \subset Θ_{X}$ , it follows heuristically that

$$R_{B} (X) \leq R_{B} (X^{'}) \quad \text{with increasing probability as } B \to \infty,$$

even when the corresponding inequality need not hold for $B = 1$. In other words, ensembling acts as an operator that reinforces the stochastic dominance implied by Proposition 2 for single trees, leading to a more stable ordering across feature representations.
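The variance-reduction step in this Remark can be made concrete with a standard identity for averages of correlated predictors (see, e.g., Hastie et al., 2009). As an illustrative assumption that is not part of the argument above, fix a query point $x_{0}$, write $T_{b}$ for the prediction of the $b$-th tree at $x_{0}$, and suppose the $T_{b}$ are identically distributed with variance $\sigma^{2}$ and pairwise correlation $\rho \geq 0$. Then

$$Var \left( \frac{1}{B} \sum_{b = 1}^{B} T_{b} \right) = \frac{1}{B^{2}} \left[ B \sigma^{2} + B (B - 1) \rho \sigma^{2} \right] = \rho \sigma^{2} + \frac{1 - \rho}{B} \sigma^{2}, \qquad E \left[ \frac{1}{B} \sum_{b = 1}^{B} T_{b} \right] = E [T_{1}] .$$

Under this assumption the term $(1 - \rho) \sigma^{2} / B$ vanishes as $B \to \infty$, while the mean, and hence the bias, does not depend on $B$. The residual term $\rho \sigma^{2}$ does not vanish, which is one reason to expect the ordering between representations to remain probabilistic rather than absolute even for large ensembles.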

Although this argument is heuristic rather than a formal theorem, it may lead to an empirical prediction: under bagging or random forests, representations with larger cardinality should dominate coarser discretizations much more decisively than in single–tree models. In particular, one would expect ensembling to reinforce the tendency for the raw representation in Example 6 to outperform its binned counterpart across replications. The next example investigates this conjecture empirically.

Example 7
This experiment extends Example 6, which compared raw and binned representations of the UCI student performance data using a single decision tree. We apply nested bagging to the same data and experimental setup. The target variable is the final grade $G 3$ , and the predictors are $G 1$ and $G 2$ . Two representations are compared: the raw numerical variables $(G 1, G 2)$ and their discretized versions obtained by equal–width binning into four bins per variable. In each Monte Carlo replication, 80% of the observations are randomly selected for training and the remaining 20% are used for evaluation.

For each replication, a bagging ensemble of $B_{max} = 200$ decision trees is trained separately for the raw and binned representations. Predictions for smaller ensemble sizes $B \in \{10, 25, 50, 100, 200\}$ are then obtained by averaging the first $B$ trees of the same ensemble (nested bagging). This design ensures that differences across $B$ reflect only the effect of ensembling rather than changes in the underlying bootstrap samples. Across 200 replications, performance is evaluated using $R^{2}$ , MSE, and RMSE.
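The nested evaluation can be sketched with a hand-rolled bagging loop. This is an illustration under stated assumptions, not the authors' implementation: it draws bootstrap samples directly with NumPy instead of using a library bagging class, so that the per-tree predictions are explicit. The function name `nested_bagging_predictions` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def nested_bagging_predictions(X_train, y_train, X_test,
                               sizes=(10, 25, 50, 100, 200), seed=0):
    """Fit max(sizes) bootstrap trees once; return {B: mean prediction of the first B trees}."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    per_tree = []
    for _ in range(max(sizes)):
        # Bootstrap resample of the training set (with replacement, same size).
        boot = rng.integers(0, n, size=n)
        tree = DecisionTreeRegressor(random_state=0)
        tree.fit(X_train[boot], y_train[boot])
        per_tree.append(tree.predict(X_test))
    per_tree = np.asarray(per_tree)  # shape (n_trees, n_test)
    # Row B-1 of the running mean is the ensemble prediction of the first B trees.
    counts = np.arange(1, len(per_tree) + 1)[:, None]
    running_mean = np.cumsum(per_tree, axis=0) / counts
    return {B: running_mean[B - 1] for B in sizes}, per_tree


# Synthetic integer-valued predictors and a noisy linear target (assumption).
rng = np.random.default_rng(1)
X = rng.integers(0, 21, size=(250, 2)).astype(float)
y = 0.3 * X[:, 0] + 0.7 * X[:, 1] + rng.normal(scale=1.0, size=250)

preds, per_tree = nested_bagging_predictions(X[:200], y[:200], X[200:])
for B, p in preds.items():
    print(B, "test MSE:", np.mean((y[200:] - p) ** 2))
```

Because every ensemble size reuses the same first $B$ trees, changes in the test error across $B$ come only from averaging more trees, not from redrawing bootstrap samples.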

The results are summarized in Table 11. The share of replications in which the raw representation outperforms the binned representation is non-decreasing in the bagging size, rising from 78.5% at $B = 10$ to 83.0% at $B = 200$ (it is flat at 81.5% between $B = 25$ and $B = 50$). This pattern is consistent with the theoretical intuition: while discretization does not guarantee inferior performance for a single tree, aggregating many trees suppresses variance and increasingly exposes the structural advantage of higher–cardinality representations implied by Proposition 2. Nevertheless, the dominance is probabilistic rather than absolute, reflecting the fact that even coarse binning retains substantial predictive information in this dataset. The experimental settings are listed in Table 12.
Table 11.
Example 7 - Frequency that Raw $(G 1, G 2)$ Outperforms Binned $(G 1, G 2)$ under Nested Bagging (%).

Bagging size $B$	10	25	50	100	200
Share raw better ( $R^{2}$ )	78.5	81.5	81.5	82.0	83.0
Share raw better (MSE)	78.5	81.5	81.5	82.0	83.0
Share raw better (RMSE)	78.5	81.5	81.5	82.0	83.0
Share raw better (All)	78.5	81.5	81.5	82.0	83.0

Notes: Entries report the fraction of Monte Carlo replications in which the raw representation $(G 1, G 2)$ strictly outperforms the binned representation (four equal–width bins per variable). For $R^{2}$ , “better” means larger; for MSE and RMSE, “better” means smaller. “All” counts replications in which raw dominates binned simultaneously in all three metrics. Bagging sizes are evaluated in a nested manner: a single ensemble of 200 trees is trained in each replication, and performance for smaller $B$ is obtained by averaging the first $B$ trees.

Table 12.
Example 7 - Settings for the Nested Bagging.

Component	Setting
Number of experiments	200
Training fraction	0.8
Shuffling per replication	True
Binning method	Equal-width discretization
Number of bins	4
Bagging sizes $B$	$\{10, 25, 50, 100, 200\}$
Maximum ensemble size $B_{max}$	200
Base learner	Decision tree regressor
Tree maximum depth	None (unrestricted)
Minimum samples per split	2
Minimum samples per leaf	1
Bagging bootstrap	True
Bagging max_samples	1.0
Bagging max_features	1.0
Parallel jobs	-1 (all available cores)

Notes: Nested bagging is used: for each Monte Carlo replication, an ensemble of $B_{max} = 200$ trees is trained once, and performance for smaller $B$ is obtained by averaging the first $B$ trees. This ensures that differences across $B$ reflect only the effect of ensembling rather than changes in bootstrap samples.

5. Conclusion

This study introduces two mathematical propositions that characterize the behavior of feature transformations and substitutions within the standard CART-style framework of axis-aligned, greedily induced decision trees. The first proposition demonstrates that if one variable is an order-preserving bijective function of another, the two variables are interchangeable under axis-aligned threshold splits, yielding identical feasible partitions and hence identical split opportunities. The second proposition shows that if one variable is an order-preserving surjective mapping of another that reduces cardinality, the lower-cardinality variable is substitutable: its feasible partitions form a subset of those of the higher-cardinality variable, so it can be removed without degrading the model. Consequently, the propositions provide theoretical guidance for identifying redundant variables in model specifications.

In addition to the theoretical results, we conduct empirical validation using both synthetic and real-world datasets. The experimental results confirm the theoretical findings that, in CART-style axis-aligned trees, order-preserving one-to-one transformations do not affect the model’s predictive performance, while cardinality-reducing mappings such as rounding, discretization, and binning lead to systematic degradation.
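The invariance claim can be checked on a small synthetic example. The sketch below fits the same depth-limited regression tree on a feature, on its logarithm, and on its min-max scaled version, and compares the fitted values on the training sample. The data and the helper name `fitted_train_predictions` are illustrative assumptions. The comparison is restricted to training points because scikit-learn places thresholds midway between adjacent observed values, so an unseen point lying between two training values may be routed differently after a nonlinear transformation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.permutation(np.arange(1, 301)).astype(float)  # distinct positive values
y = np.sin(x / 40.0) + rng.normal(scale=0.1, size=x.size)


def fitted_train_predictions(feature):
    """Fit a depth-4 regression tree on one feature and return its training predictions."""
    col = feature.reshape(-1, 1)
    tree = DecisionTreeRegressor(max_depth=4, random_state=0)
    return tree.fit(col, y).predict(col)


pred_raw = fitted_train_predictions(x)
pred_log = fitted_train_predictions(np.log(x))  # strictly increasing bijection
pred_minmax = fitted_train_predictions((x - x.min()) / (x.max() - x.min()))

print("raw vs log     :", np.allclose(pred_raw, pred_log))
print("raw vs min-max :", np.allclose(pred_raw, pred_minmax))
```

Identical fitted values indicate that the three trees induce the same partition of the training sample, which is what the first proposition describes.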

Based on the theoretical and empirical results of this paper, several practical guidelines emerge for practitioners using decision tree–based models:

(i) Avoid including multiple monotone or order-preserving transformations of the same underlying feature (e.g., raw values, logarithms, ranks, or discretized versions), as these variables generate redundant or nested partition spaces and do not expand the expressive power of the model.

(ii) When order-preserving mappings exist between features, prioritize variables with higher effective cardinality, since they induce strictly larger sets of feasible partitions and weakly dominate lower-cardinality alternatives in terms of achievable split quality.

(iii) Use rank-based measures, such as Spearman’s rank correlation, as a simple diagnostic tool to screen for near-monotone relationships between features before training tree-based models.

(iv) Be cautious when applying discretization or rounding as a preprocessing step, as such transformations tend to restrict the set of attainable partitions in decision trees.
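Guideline (iii) can be sketched as a simple pairwise screen. The code below is one possible implementation under stated assumptions: the function name `near_monotone_pairs` and the cutoff of 0.9 are hypothetical choices, and the data are synthetic. A high rank correlation only suggests a near-monotone relationship; it does not by itself establish an exact order-preserving mapping.

```python
import numpy as np
from scipy.stats import spearmanr


def near_monotone_pairs(X, names, threshold=0.9):
    """Return (name_i, name_j, rho) for feature pairs with Spearman rho >= threshold."""
    pairs = []
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            rho, _ = spearmanr(X[:, i], X[:, j])
            if rho >= threshold:
                pairs.append((names[i], names[j], float(rho)))
    return pairs


rng = np.random.default_rng(0)
x = rng.uniform(1.0, 100.0, size=500)
X = np.column_stack([
    x,                                   # raw feature
    np.log(x),                           # order-preserving bijection of x
    np.digitize(x, [25.0, 50.0, 75.0]),  # order-preserving, cardinality-reducing
    rng.normal(size=500),                # unrelated feature
])
names = ["x", "log_x", "x_binned", "noise"]

for a, b, rho in near_monotone_pairs(X, names):
    print(f"{a} ~ {b}: rho = {rho:.3f}")
```

Among flagged pairs, guideline (ii) then suggests keeping the member with the larger number of distinct values, which can be compared with `len(np.unique(column))`.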

The theoretical results presented in this paper are inherently limited to the classical CART-style framework. Because each internal node employs a single-variable, axis-aligned threshold split selected greedily based on a local impurity-based criterion, the propositions characterize invariance and substitutability only within this specific mode of tree induction. Interaction effects among variables arise solely through recursive partitioning across tree depths, and not through multivariate split conditions at a single node. As a result, the propositions should not be interpreted as applying to oblique or multivariate split trees, globally optimized tree structures, or tree models with additional regularization or monotonicity constraints.

Finally, although our propositions are derived for single trees, we have added a supplementary discussion and an empirical bagging experiment to examine how these representational effects behave under ensembling. While this analysis is necessarily informal and does not constitute a formal extension of the theory, the results suggest that aggregating trees tends to make the representational disadvantages of lower-cardinality variables more visible by reducing variance. A rigorous theoretical treatment of ensemble methods remains an important direction for future research.

Footnotes

Acknowledgements

This research was supported by No. 202500810001 of Handong Global University Research Grants.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Bredensteiner, E. J. (1999). Feature minimization within decision trees (Technical Report). Department of Computer Science, University of Colorado Boulder.

2. Breiman (2001). Random forests. Machine Learning, 45(1), 5–32.

3. Breiman, Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Wadsworth Statistics/Probability Series.

4. Chang, Y. C., Chang, K. H., & G. J. (2018). Application of extreme gradient boosting trees in the construction of credit risk assessment models for financial institutions. Applied Soft Computing, 73, 112–116.

5. Chen & Guestrin (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 785–794). ACM (Association for Computing Machinery).

6. Chrimes (2023). Using decision trees as an expert system for clinical decision support for COVID-19. Interactive Journal of Medical Research, 12, e42540.

7. Costa, V. G., & Pedreira, C. E. (2023). Recent advances in decision trees: An updated survey. Artificial Intelligence Review, 56, 4765–4800.

8. De Mántaras, R. L. (1991). A distance-based attribute selection measure for decision tree induction. Machine Learning, 6, 81–92.

9. Dougherty, Kohavi, & Sahami (1995). Supervised and unsupervised discretization of continuous features. In Proceedings of the twelfth international conference on machine learning (pp. 194–202).

10. Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.

11. Gregorutti, Michel, & Saint-Pierre (2017). Correlation and variable importance in random forests. Statistics and Computing, 27(3), 659–678.

12. Hall, M. A. (2000). Correlation-based feature selection for machine learning [Doctoral dissertation]. University of Waikato.

13. Hastie, Tibshirani, & Friedman (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). Springer.

14. Hong, Choi, & Kim (2020). A house price valuation based on the random forest approach: The mass appraisal of residential property in South Korea. International Journal of Strategic Property Management, 24(3), 140–152.

15. Izza, Ignatiev, & Marques-Silva (2020). On explaining decision trees. arXiv preprint arXiv:2010.11034.

16. Lee, Kim, J. C., Jung, H. S., Lee, M. J., & Lee (2017). Spatial prediction of flood susceptibility using random-forest and boosted-tree models in Seoul metropolitan city, Korea. Geomatics, Natural Hazards and Risk, 8(2), 1185–1203.

17. Wang, Basu, & Kumbier (2019). A debiased MDI feature importance measure for random forests. Advances in Neural Information Processing Systems, 32, 8047–8057.

18. Mienye, I. D., & Jere (2024). A survey of decision trees: Concepts, algorithms, and applications. IEEE Access, 12, 86716–86727.

19. Mobley, Sebastian, Blessing, Highfield, W. E., Stearns, & Brody, S. D. (2021). Quantification of continuous flood hazard using random forest classification and flood insurance claims at large spatial scales: A pilot study in southeast Texas. Natural Hazards and Earth System Sciences, 21, 807–822.

20. Pedregosa, Varoquaux, Gramfort, Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

21. Podgorelec, Kokol, Stiglic, & Rozman (2002). Decision trees: An overview and their use in medicine. Journal of Medical Systems, 26(5), 445–463.

22. Strobl, Boulesteix, A. L., Zeileis, & Hothorn (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1), 25.

23. Tuv, Borisov, Runger, & Torkkola (2009). Feature selection with ensembles, artificial variables, and redundancy elimination. The Journal of Machine Learning Research, 10, 1341–1366.

24. Zhang & Gionis (2023). Regularized impurity reduction: Accurate decision trees with complexity guarantees. Data Mining and Knowledge Discovery, 37(1), 434–475.