Abstract
Knowing mathematical properties about machine learning models can provide effective guidance in feature engineering and selection. This paper presents mathematical propositions for identifying redundant variables in decision tree models. The first proposition demonstrates that if one variable is an order-preserving one-to-one corresponding of another, the performance of the model remains unaffected even when one of the two variables is removed. The second proposition reveals that if one variable is an order-preserving mapping of another and this mapping reduces the cardinality of the variable, the model’s performance remains unchanged upon the removal of the variable with lower cardinality. We provide formal mathematical proofs for both propositions and support our findings with simulation-based experiments. These results demonstrate that, within the standard CART-style framework of axis-aligned, greedily constructed decision trees, common order-preserving transformations—such as min–max scaling or logarithmic transformation—do not alter the set of feasible partitions and therefore leave predictive performance unchanged. Furthermore, the results suggest that rank-based correlation measures, such as Spearman’s rank correlation coefficient, can serve as an effective tool for identifying redundant variables under this modeling framework.
Introduction
Decision tree models are widely employed in the early stages of modeling to explain how target variables are related to input variables due to their intuitive structure and inherent interpretability (Breiman et al., 1984; Hastie et al., 2009). They are also frequently used to identify potential relationships among input variables. Building upon these characteristics, ensemble methods such as Random Forests and Gradient Boosted Trees extend the basic decision tree framework to achieve robust predictive performance by effectively mitigating both bias and variance (Breiman, 2001; Friedman, 2001).
Decision tree–based methods and their ensembles have been widely adopted across diverse disciplines (Costa & Pedreira, 2023; Mienye & Jere, 2024). In healthcare, they have been applied to clinical decision support, disease risk prediction, and prognostic modeling (Chrimes, 2023; Podgorelec et al., 2002). In finance, gradient boosting methods such as XGBoost have become standard tools for credit scoring and fraud detection (Chang et al., 2018; Chen & Guestrin, 2016). In real estate, random forest–based approaches have been used for mass appraisal and house price valuation (Hong et al., 2020). Tree-based models have also been employed in environmental sciences for flood susceptibility mapping and land-use classification (Lee et al., 2017; Mobley et al., 2021).
Despite these strengths, decision trees are inherently prone to overfitting, especially when constructed with high-dimensional or noisy features. A common yet often overlooked source of such overfitting arises from information redundancy among features, where multiple variables encode overlapping aspects of the same underlying information. If highly correlated variables are included in a model, the explanatory power derived from the qualitatively identical information may be split between the two variables. Such redundancy may not significantly affect in-sample predictive accuracy, yet it can introduce substantial distortion in out-of-sample prediction and secondary analyses, such as simulations based on the trained model or assessments of variable importance (Gregorutti et al., 2017; Li et al., 2019).
As a result, careful feature selection becomes critical. Several theoretical approaches have been proposed to improve feature selection for decision trees, including impurity-based measures (e.g., information gain, Gini index), feature importance ranking, and regularization techniques embedded within tree-building algorithms (Breiman, 2001; Breiman et al., 1984; Zhang & Gionis, 2023). These methods aim to identify features that contribute most significantly to the decision-making process of the model.
However, this approach relies on empirical procedures—training the model directly on data and then examining the outcomes or measuring feature importance. The problem arises when many features are correlated: their importance tends to be diluted across variables, making it difficult to identify the truly influential ones. In other words, the observed importance distribution is already affected by correlations, rendering it an imperfect tool for causal interpretation.
To mitigate this issue, one must identify and account for strong correlations among features prior to model training. In classical statistical analysis, indicators such as the correlation coefficient and the Variance Inflation Factor (VIF) are employed to preemptively diagnose multicollinearity. However, these measures are inherently grounded in the linear regression framework and are not directly applicable to nonlinear, partition-based methods like tree-based algorithms. A unified and generally accepted framework for detecting redundant features in tree-based models has yet to be established.
This paper addresses this gap by introducing mathematical propositions that characterize how certain types of features become structurally redundant within a well-defined class of decision tree models, namely CART-style trees constructed via axis-aligned, greedy split selection. The first proposition shows that if one variable is an order-preserving bijective mapping of another, the performance of the model remains unaffected even when one of the two variables is removed. The second proposition reveals that if one variable is an order-preserving surjective mapping of another, the performance of the model remains unchanged upon the removal of the image variable.
While the analysis is intentionally restricted to this modeling framework, it provides clear theoretical guidance for feature engineering practices commonly employed in CART-style decision trees and their widespread implementations. Within this CART-style setting, our propositions imply that common order-preserving transformations—such as min–max scaling or logarithmic transformation—do not change the split structure and thus offer no systematic benefit for predictive performance. They also provide practical guidelines for feature selection in real-world contexts, such as dealing with time-series variables that exhibit monotonic movements. Furthermore, they suggest that Spearman’s rank correlation coefficient can serve as an effective measure for identifying redundant variables in decision tree models.
Related Work
Feature selection for decision tree models has been a long-standing area of research, approached from a variety of perspectives including statistical heuristics, correlation analysis, optimization, and robustness. Existing research on feature selection has predominantly addressed the variable selection problem from an empirical perspective, typically evaluating features through performance-based criteria or statistical associations observed in data. For instance, De Mántaras introduced a distance-based attribute selection measure to mitigate the bias of information gain toward multi-valued attributes (De Mántaras, 1991). Correlation-based feature selection (CFS) evaluates subsets of features by balancing individual predictive power against inter-feature redundancy (Hall, 2000), and has influenced many subsequent filter-based approaches. Optimization-based strategies have also been explored. Bredensteiner proposed feature minimization within decision trees, targeting parsimonious models that achieve compact representations without significant loss of accuracy (Bredensteiner, 1999). Tuv et al. further advanced this line of research by introducing an ensemble-based feature selection framework that integrates artificial contrast variables and redundancy elimination (Tuv et al., 2009). Their method identifies compact yet informative feature subsets and demonstrates superior robustness on high-dimensional and noisy datasets compared to traditional filter and wrapper approaches. Izza et al. also improved interpretability by introducing PI-explanations, which provide shorter path representations of decision tree predictions by removing redundant conditions (Izza et al., 2020).
Although these approaches often improve interpretability or empirical performance, they typically rely on statistical associations among features and outcomes rather than addressing redundancy through the intrinsic structural properties of decision tree models. As a result, redundancy is typically identified only after a model has been trained, making it difficult to detect structurally equivalent feature representations in advance and potentially leading to unnecessary model complexity during training. In this context, we provide a mathematical characterization of feature redundancy to decision tree models. We formalize conditions under which alternative feature representations become structurally equivalent or redundant from the perspective of decision tree construction, thereby offering a systematic theoretical account of when feature transformations preserve or restrict the expressive capacity of tree-based models.
The propositions developed in this paper formalize commonly stated but rarely proven properties of decision trees. For example, it is well known in practitioner-oriented literature and textbooks that standard decision trees are insensitive to monotone transformations of individual features. Breiman et al. (1984) and Hastie et al. (2009) note that tree-based models depend primarily on the ordering of feature values rather than their absolute scale. Similar remarks appear in software documentation, including the scikit-learn user guide (Pedregosa et al., 2011). Also, some studies have shown that impurity- and importance-based measures may suffer from systematic biases, particularly in the presence of correlated or high-cardinality features (Strobl et al., 2007). These observations are typically stated informally, without a precise theoretical characterization of when two features are equivalent or redundant in terms of the structure and expressive power of decision tree models. To the best of our knowledge, there has been no systematic theoretical treatment that formalizes redundancy induced by order-preserving transformations through explicit propositions and proofs.
Our analysis can be applied to characterize how discretization and binning operations interact with the expressive capacity of decision tree models. Classical studies have investigated supervised and unsupervised discretization strategies as a means of improving predictive performance or model interpretability (Dougherty et al., 1995). These methods typically aim to balance information preservation with model simplicity by reducing continuous variables to a finite set of intervals. While discretization is often treated as a preprocessing choice, its interaction with the expressive capacity of decision tree models has not been fully characterized from a structural perspective.
By linking these theoretical results to practical feature engineering operations—such as scaling, discretization, and binning—our work complements existing empirical approaches by clarifying when alternative feature representations genuinely expand, preserve, or restrict the expressive capacity of decision tree models. This perspective provides practitioners with principled guidance for selecting or transforming features prior to model training.
Decision Trees and Redundant Variables
In order to discuss the redundancy of a variable in a binary decision tree, let us present the related denotations and definitions. The initial sample set, denoted by
Before stating the propositions, we summarize the main modeling assumptions under which the theoretical results are derived. We restrict attention to single CART-style decision trees with binary, axis-aligned splits, where each internal node selects a univariate threshold split by maximizing an impurity-based gain function. The gain function is assumed to be defined solely through the induced partition of the sample and to be invariant to monotone reparameterizations of the splitting variable. These assumptions encompass standard impurity measures used in practice, including squared error for regression and Gini index or entropy for classification. The following statements formalize them:
Binary trees partition a given set of samples into two groups based on whether the variable values of the samples exceed a certain threshold. For analysis, we can define separately the task of partitioning the given samples and the task of determining which criterion for sample partitioning is superior. At first, the partitioning can be separately defined by a partition function,
The above parentheses
Next, the information gain obtainable from a partition
Then, the optimal partitioning condition with respect to a set of samples
In general, a mathematical expression for binary decision tree can sufficiently rely on the optimal partition condition defined above. However, since this study focus on analyzing the efficiency of using a variable, it is helpful to define the set of space of feasible gain for each variable as an instrumental concept. Based on the denotation of
Further, the optimal gain by using a variable
The equation (3) illustrates the principles by which a binary decision tree recursively generates branches. Lastly, defining a decision tree requires the establishment of stopping criteria. These criteria can be expressed as a union set
Based on the denotations and definitions described above, we can express the training of a decision tree as below:
Next, we can suggest the definition of a variable being meaningful or redundant in the training of the decision tree as follows:
The implication of the above definition is straightforward. In a decision tree, a variable influences prediction only through the splits at which it is actually used: each internal node corresponds to a specific variable–threshold pair that contributes to the tree’s partition structure. For a variable to play any role in shaping this structure, it must either be selected as an optimal splitting condition at least once during training, or induce a partition that cannot be replicated by other variables. Conversely, if a variable is never chosen as the optimal condition at any partition, or if every partition it can induce can be replicated exactly by splits on other variables, then the variable does not contribute uniquely to the tree’s partition structure. Excluding such a variable leaves both the set of feasible partitions and the resulting prediction function unchanged.
To address variable selection and transformation issues more directly, we can rewrite the concept for the redundancy of a variable to focus on the relationships between variables as follows:
These definitions admit a natural interpretation in terms of the partition spaces induced by variables in a decision tree. Interchangeability means that two variables generate exactly the same collection of feasible sample partitions at every node: any split achievable using one variable can be replicated using the other, leading to identical gain spaces and identical optimal gains. Substitutability is a weaker notion. Variable
Naturally, interchangeability is a special case of substitutability. The above definitions are other version of the definition of redundancy of a variable. Thus, In accordance with the definitions, the following corollary can be obtained.
If a variable
This section presents two propositions that formalize the conditions under which variables become structurally redundant in CART-style decision trees. Proposition 1 addresses the case of order-preserving bijective mappings, establishing that such transformations yield interchangeable variables that produce identical feasible partitions. Proposition 2 extends this analysis to order-preserving surjective mappings, showing that the variable with lower cardinality is substitutable. Each proposition is followed by a formal proof and illustrative examples that demonstrate its practical implications for feature engineering.
Based on the properties defined in Section 3, we present two propositions that provide theoretical insight into feature engineering for decision tree models. Under the axis-aligned greedy split rule defined in (1)–(4), we establish the following proposition:
Suppose there are two variables
Proof. Let
It is useful to provide intuition for Proposition 1. In practical terms, an order-preserving mapping refers to any transformation that maintains the rank ordering of observations. Common examples include min–max scaling, standardization, and logarithmic transformation applied to strictly positive variables. Although such transformations may substantially change the numerical scale of a variable, they do not alter the relative ordering of samples and therefore do not introduce new splitting possibilities in axis-aligned decision trees.
In an axis-aligned decision tree, each split compares the value of a single variable to a threshold, thereby partitioning samples solely based on their relative ordering along that variable. If one variable is transformed into another through an order-preserving one-to-one mapping, the ordering of samples is unchanged. Consequently, for any threshold on the original variable, there exists a corresponding threshold on the transformed variable that induces exactly the same partition of the data. As a result, the set of feasible splits—and hence the split opportunities available to the tree—remains identical under such transformations.
Therefore, the notion of redundancy considered here can be understood as structural redundancy. Structural redundancy refers to a situation in which a variable does not make a substantive contribution to the structural expressiveness of a decision tree. Specifically, when a variable
In other words, if the entire partition space that could be induced using a given variable can already be fully realized through other variables, or through alternative representations of those variables, then the variable is unnecessary from the perspective of the tree’s structural expressiveness. In such cases, the variable does not enable any new partitions nor does it give rise to tree structures that were previously infeasible. All possible arrangements of branches and leaves, along with the corresponding ways of partitioning the data, can already be reproduced without the presence of that variable.
Accordingly, structural redundancy can be defined as the property that a variable fails to expand the theoretical expressive power of a tree model—namely, the set of attainable partition structures. In this sense, a structurally redundant variable is one that does not provide additional opportunities for structural discovery in the formation of the tree.
We now discuss the practical implications of this proposition. First, the above proposition can be used to interpret the role of common reversible feature engineering techniques—such as logarithmic transformation, squaring, min–max scaling, or standardization—from a structural perspective. As discussed, when a variable is transformed through a one-to-one, order-preserving mapping, the resulting feature does not expand the set of partitions that a decision tree can represent in an idealized, implementation-agnostic sense. In this respect, such transformations are structurally redundant: they do not alter the hypothesis space of the tree, nor do they introduce fundamentally new splitting possibilities beyond those already available through the original variable.
Consider the following example.
We define a random nonlinear function of a variable
Example 1 - predictive Performance Across Transformations.
Notes: Results are based on 100 simulations of Example 1. Each entry reports the mean and standard deviation (in parentheses) of the predictive metric computed on 30 evenly spaced evaluation points. The target variable is generated as
Example 1 - Decision Tree Implementation Details and Hyperparameter Settings Used.
Proposition 1 has implications not only for the transformation of variables but also for their selection. Such cases arise when, although the variables represent conceptually different properties, one variable can be expressed as a monotonic function of the other within the range captured by the sample data. For instance, variables often exhibit trends over time. If a variable change monotonically over the period in the sample data analyzed, this essentially constitutes an order-preserving one-to-one mapping of time points, which implies that the decision tree trained with the variable could not make a difference with the tree trained with the time variable. Consider an example as the follow.
Example 2 - Monthly M2 and CPI Data for 2022.
Notes: M2 is measured in billions of U.S. dollars, while the CPI is an index with a base value of 100. Both the M2 and CPI were obtained from the economic statistics database of the Federal Reserve Bank of St. Louis.
An examination of the M2 money supply reveals a consistent upward trend across the months, albeit with varying rates of increase. According to Proposition 1, this suggests that the information regarding the M2 supply is qualitatively equivalent to that provided by the monthly time points. In Figure 1, we compare decision trees trained on the relationship between the M2 money supply and CPI with those trained on the relationship between the monthly time points and CPI. Both trees are found to branch at the same split point.
Example 2 - Comparison of trained decision tree. (a) Decision tree with M2. (b) Decision tree with integerized month.
The proposition presented above analyze, from a mathematical perspective, the role of order-preserving mappings between variables in decision trees. Nevertheless, even when two variables satisfy such a relationship, their use in an implemented decision tree algorithm does not necessarily result in complete predictive equivalence. This discrepancy arises because the actual tree induction procedure does not operate by directly enumerating and comparing all possible partitions of the sample space, as assumed in the theoretical formulation, but instead searches over a finite set of candidate threshold values that implicitly induce partitions.
In practical CART implementations (including common software libraries such as
At the theoretical level, if
In particular, when
An additional source of divergence arises from tie-breaking rules. In practice, it is common for multiple variable–threshold combinations to yield identical or nearly identical impurity reductions. This situation is especially likely when two variables
Because CART employs a greedy, node-wise optimization strategy, even minor differences in early partitions can propagate through the tree. A small deviation in the first split alters the subset of observations reaching subsequent nodes, thereby changing the set of candidate splits and potentially leading to a substantially different tree structure. Consequently, although two variables
These considerations suggest that, although equivalence between variables related by order-preserving mappings holds strongly at the theoretical level, it may not always be reproduced perfectly in practice under implemented algorithms and finite-sample settings. This observation does not undermine the validity of the theoretical results; rather, it highlights the need to examine how stably such results are realized in actual data analysis environments.
Accordingly, in order to assess whether the theoretical properties are largely preserved under real data and implemented CART algorithms, we consider the following illustrative example.
We use the red and white variants of the Portuguese “Vinho Verde” wine dataset from the UCI Machine Learning Repository, focusing on the white wine subset. The dataset consists of 4,898 observations, each describing physicochemical properties of wine samples—such as acidity, residual sugar, density, sulphates, and alcohol content—along with a sensory quality score assigned by human experts. Following common practice, we treat the quality score as a continuous outcome and consider a regression setting.
Our analysis focuses on the variable
For each trial, the data are randomly split into training and test sets, with 90% of the observations used for training and the remaining 10% held out for evaluation. A CART-style decision tree regressor is fitted using identical hyperparameters across specifications (maximum depth of 5 and a minimum leaf size of 10). In the transformed specification, the original
Predictive performance is evaluated on the test set using
Across all trials, the predictive performance of the two specifications is virtually indistinguishable. The maximum absolute difference observed between the raw and squared specifications is approximately 0.002 for
Example 3 - Predictive Performance Comparison for Wine Quality Experiments.
Notes: Performance comparison between the raw alcohol feature and its squared transformation over 100 random train–test splits. Reported values are the mean and standard deviation (in parentheses) across trials.
Example 3 - Rates of Practically Insignificant Differences between the Two Specifications over 100 Trials.
Notes: For MAE and RMSE, the tolerance is defined as 5% of the corresponding error magnitude in the raw-alcohol specification (trial-by-trial).
Example 3 - Decision Tree Implementation Details and Hyperparameter Settings Used in the Wine Quality Experiment.
Suppose there are two variables
Proof. Before presenting the formal proof, it is useful to outline the main idea. The key step is to show that every partition induced by the lower-cardinality variable
To show that there exists
Then, we can show that, if
We can also show, if
It shows
The intuition behind Proposition 2 is that an order-preserving surjective transformation reduces the set of feasible split locations available to a decision tree. An order-preserving surjective mapping additionally merges multiple distinct values into the same transformed value—for example, rounding a continuous variable to one decimal place, converting income into brackets (low/medium/high), or binning age into five-year intervals. These transformations keep the overall order but make the variable coarser by reducing the number of distinct levels (cardinality). In CART-style trees, each variable defines a collection of candidate threshold splits, and the quality of a split is evaluated by the impurity reduction it induces. When one variable is a surjective, order-preserving mapping of another, multiple distinct values of the original variable are collapsed into a single value of the transformed variable. As a result, certain threshold positions that are available under the original variable no longer exist under its surjective image.
A simple schematic example illustrates this point. Consider five observations with variable
From an informational perspective, this means that the set of impurity reductions achievable by splitting on the transformed variable is a subset of those achievable by splitting on the original variable. The proposition therefore does not assert that the higher-cardinality variable will necessarily be selected by a greedy algorithm in finite samples, but rather that it weakly dominates its surjective image in terms of the best split quality that can be achieved in principle. In other words, reducing cardinality through an order-preserving mapping can only restrict, and never expand, the space of attainable partitions.
This perspective suggests practical implications for feature selection. If multiple predictors are linked by order-preserving relationships, then, in principle, the higher-cardinality predictor offers at least as many feasible split locations as its lower-cardinality counterparts. Thus, keeping the most fine-grained version is often sufficient for preserving split opportunities, while additional coarser variants may contribute little new information under axis-aligned threshold splits.
Discretization and binning provide common and intuitive examples of order-preserving surjective transformations, in which multiple adjacent values are deliberately collapsed into the same level, thereby reducing the number of distinct split locations available to the tree. More generally, however, Proposition 2 applies to any order-preserving transformation that reduces the effective cardinality of a variable, regardless of whether the resulting representation is explicitly discrete or remains continuous.
The following examples illustrate these implications in settings such as rounding-based discretization and a simple time-series substitution case.
Consider a scenario where the relationship between Example 4 - Performance degradation under discretization of the input variable. Across all metrics, predictive performance deteriorates monotonically as discretization becomes coarser. (a) Example 4 - Performance Degradation under Rounding. Notes: Entries report the mean (first line) and standard deviation (in parentheses, second line) across 100 simulations. The raw feature 
Consider a decision tree to forecast the price of Bitcoin using the base interest rate (i.e. Federal Funds Rate) as predictor. The prices of Bitcoin and the federal funds rate were both recorded on a daily basis throughout the year 2022, from January 1 to December 31. Figure 3 presents the Bitcoin price and the federal funds rate during this period. Firstly, we can see that the federal funds rate (shown in panel (b)) consistently increased over time, suggesting that it could potentially be substituted by time variable. Secondly, the changes in the federal funds rate did not occur daily. Given its role as the benchmark interest rate, the federal funds rate acts as an anchor for monetary policy direction. This feature of the federal funds rate implies that while differences in the federal funds rate can be identified by dates, the converse is not true; differences identified by date do not necessarily correspond to changes in the federal funds rate. Thus, we can expect that the federal funds rate is redundant because that using a simple time variable could potentially capture more information than using the federal funds rate.
Example 5 - Time series for Bitcoin price and Federal funds rate. (a) Bitcoin price and (b) federal funds rate.
To demonstrate, we conducted 100 simulations where we randomly selected half of the samples to train on the relationship between the federal funds rate, time(day), and Bitcoin prices and then predicted for the remaining evaluation samples. The federal funds rate exhibits an average cardinality of only 9.64 across experiments, whereas the time index (day) has a cardinality of 182, reflecting the fact that the policy rate changes only intermittently while time progresses daily. Consistent with Proposition 2, this severe reduction in cardinality restricts the feasible split points available to the tree. As expected, in all 100 simulations, the model using time as the predictor outperforms the model using the federal funds rate across all performance metrics, including Example 5 - Performance degradation with coarse variable. Across all metrics, predictive performance deteriorates monotonically as discretization becomes coarser. (a) Example 5 - Predictive Performance of the Time Index and Federal Funds Rate in Predicting Bitcoin Prices. Notes: Entries report the mean (first line) and standard deviation (in parentheses, second line) across 100 Monte Carlo replications. In each replication, half of the observations are randomly selected for training and the remaining half are used for evaluation. Cardinality denotes the number of distinct feature values in the training sample (i.e., the number of feasible split points available to the tree). Setting for the experiments are same with those used in Example 1. In all 100 replications, using the day index strictly outperforms using the federal funds rate across all metrics (
*Proposition 2 characterizes a fundamentally different type of result from Proposition 1. Whereas Proposition 1 establishes a strict equivalence between variables under order-preserving bijections, Proposition 2 addresses a weaker, yet important, notion of informational dominance. Its central implication can be summarized as follows: if the best achievable split quality can be compared exactly, then a variable with higher cardinality is never inferior to its order-preserving surjective image.
Crucially, this statement does not concern the success or failure of a particular tree-building algorithm in finite samples. Rather, it is a claim about the inclusion relationship between information sets, formalized through the feasible gain spaces of the variables. Specifically, Proposition 2 proves that when a variable
For this informational dominance to translate into actual split selection, however, an additional condition is required—namely, the exact comparison of impurity values across candidate splits. In practical tree induction algorithms, impurity reductions are computed using sample-based estimates,
By contrast, exact comparison of impurity values refers to an idealized setting in which impurity reductions are evaluated without estimation error. This can be interpreted in two equivalent ways. First, at the population level, impurity reduction is defined in terms of expectations under the true data-generating distribution and thus reflects an intrinsic property of the distribution rather than a random sample. Second, in the infinite-sample limit, standard law-of-large-numbers arguments imply that the sample-based impurity estimates converge almost surely to their population counterparts for all candidate splits. In this limit, the relative ranking of splits is recovered exactly, and the informational dominance established in Proposition 2 is fully realized.
The practical difficulty arises because realistic tree induction operates neither at the population level nor in the infinite-sample limit. In finite samples, variables with higher cardinality generate a larger number of candidate splits, which increases both the opportunity for genuine improvements and the risk that noise-driven splits appear spuriously attractive. Consequently, the dominance result of Proposition 2 should not be interpreted as a guarantee that higher-cardinality variables will always yield better predictive performance in finite-sample greedy tree construction. Rather, it should be understood as an informational benchmark: under exact impurity comparison, higher cardinality weakly dominates its order-preserving surjective image, but this dominance may be attenuated or obscured in practice by estimation noise, sample size limitations, or additional algorithmic constraints.
In this sense, Proposition 2 complements Proposition 1 by clarifying not an invariance property, but a directional one. It delineates the conditions under which reducing cardinality through order-preserving mappings necessarily restricts the space of achievable splits, while also highlighting why such restrictions may sometimes appear beneficial in finite samples due to implicit regularization effects.
We consider a real-world regression problem using the Student Performance Dataset from the UCI Machine Learning Repository, which contains academic and socio-demographic information on secondary-school students in Portugal. The target variable is the final mathematics grade
To isolate the effect of discretization, we restrict the feature set to
The experiment is conducted in a Monte Carlo fashion. In each replication, 80% of the observations are randomly selected for training and the remaining 20% are used for evaluation. This procedure is repeated 200 times. For each run, we record
Figure 5 reports the distributions of the performance metrics across replications. Across all three metrics, the distribution corresponding to the raw Example 6 - Performance degradation under binning.
Example 6 - Predictive Performance of Raw and Binned
Notes: Entries are based on 200 Monte Carlo replications with 80% training and 20% test splits. The binned variables are obtained by discretizing
Example 6 - Frequency that Raw
Notes: Each entry reports the fraction (in percent) of Monte Carlo replications in which the raw-feature model strictly outperforms the binned-feature model. For
The preceding examples focus on single decision trees, for which Proposition 2 implies only a stochastic ordering: although a higher–cardinality representation dominates a coarser one in expectation, individual train–test splits may occasionally favor the latter due to variance and overfitting effects. This behavior was evident in Example 6, where the raw
An important question is how this relationship changes under ensembling. Methods such as bagging and random forests are designed precisely to reduce the variance of high–variance base learners by averaging across many bootstrap samples. From a bias–variance perspective, discretization increases bias by shrinking the hypothesis space, whereas ensembling primarily reduces variance while leaving bias largely unchanged. This suggests that, when trees are aggregated, the advantage of higher–cardinality representations might be amplified: the variance penalty associated with rich feature spaces is mitigated by averaging, while their expressive power is retained. It can be expressed as the follows:
Let
Proposition 2 implies that if
Now consider an ensemble of
Since discretization replaces
Although this argument is heuristic rather than a formal theorem, it may lead to an empirical prediction: under bagging or random forests, representations with larger cardinality should dominate coarser discretizations much more decisively than in single–tree models. In particular, one would expect ensembling to reinforce the tendency for the raw representation in Example 6 to outperform its binned counterpart across replications. The next example investigates this conjecture empirically.
This experiment extends Example 6, which compared raw and binned representations of the UCI student performance data using a single decision tree. We apply nested bagging to the same data and experimental setup. The target variable is the final grade
For each replication, a bagging ensemble of
Example 7 - Frequency that Raw
Notes: Entries report the fraction of Monte Carlo replications in which the raw representation
Example 7 - Settings for the Nested Bagging.
Notes: Nested bagging is used: for each Monte Carlo replication, an ensemble of
This study introduces two mathematical propositions that characterize the behavior of feature transformations and substitutions within the standard CART-style framework of axis-aligned, greedily induced decision trees. The first proposition demonstrates that if one variable is an order-preserving bijective function of another, both variables are interchangeable under axis-aligned threshold splits, yielding identical feasible partitions and hence identical split opportunities. The second proposition shows that one variable is an order-preserving surjective mappings of another, it is substitutable. Consequently, the propositions provide theoretical guidance for identifying redundant variables in model specifications.
In addition to the theoretical results, we conduct empirical validation using both synthetic and real-world datasets. The results of experiments confirm the theoretical findings that, in CART-style axis-aligned trees, order-preserving transformations do not affect the model’s predictive performance, while order-reducing mappings such as rounding, discretization, and binning lead to systematic degradation.
Based on the theoretical and empirical results of this paper, several practical guidelines emerge for practitioners using decision tree–based models: Avoid including multiple monotone or order-preserving transformations of the same underlying feature (e.g., raw values, logarithms, ranks, or discretized versions), as these variables generate redundant or nested partition spaces and do not expand the expressive power of the model. When order-preserving mappings exist between features, prioritize variables with higher effective cardinality, since they induce strictly larger sets of feasible partitions and weakly dominate lower-cardinality alternatives in terms of achievable split quality. Use rank-based measures, such as Spearman’s rank correlation, as a simple diagnostic tool to screen for near-monotone relationships between features before training tree-based models. Be cautious when applying discretization or rounding as a preprocessing step, as such transformations tend to restrict the set of attainable partitions in decision trees.
The theoretical results presented in this paper are inherently limited to the classical CART-style framework. Because each internal node employs a single-variable, axis-aligned threshold split selected greedily based on a local impurity-based criterion, the propositions characterize invariance and substitutability only within this specific mode of tree induction. Interaction effects among variables arise solely through recursive partitioning across tree depths, and not through multivariate split conditions at a single node. As a result, the propositions should not be interpreted as applying to oblique or multivariate split trees, globally optimized tree structures, or tree models with additional regularization or monotonicity constraints.
Finally, although our propositions are derived for single trees, we have added a supplementary discussion and an empirical bagging experiment to examine how these representational effects behave under ensembling. While this analysis is necessarily informal and does not constitute a formal extension of the theory, the results suggest that aggregating trees tends to make the representational disadvantages of lower-cardinality variables more visible by reducing variance. A rigorous theoretical treatment of ensemble methods remains an important direction for future research.
Footnotes
Acknowledgements
This research was supported by No. 202500810001 of Handong Global University Research Grants.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
