Abstract
Recent advances in machine learning and artificial intelligence, together with the availability of billions of high-frequency data signals, have made model selection a challenging and pressing need. However, most of the model selection methods available in modern finance are subject to backtest overfitting: the probability that one will select a financial strategy that outperforms during backtest but underperforms in practice. We evaluate the performance of the novel model confidence set (MCS) introduced in Hansen et al. (2011a) in a simple machine learning trading strategy problem. We find that MCS is not robust to multiple testing and that it requires a very high signal-to-noise ratio to be utilizable. More generally, we raise awareness of the limitations of model selection in finance.
Introduction
With recent advances in machine learning, parallel computing, and large historical millisecond-based financial datasets, it is not rare for industry engineers to backtest hundreds or thousands of different investment strategies in order to search for the most profitable model. Likewise, the availability of unprecedented quantities of individual-level data also means that A/B experimentation and data-driven designs are becoming the gold standard in online platforms, retail, technology companies, medicine, and even policy (Lazer et al. (2009), Kohavi et al. (2007), Bakshy et al. (2014), Varian (2014), Bastani & Bayati (2015), Aparicio & Prelec (2017), Athey (2017), Lada et al. (2018)). But as we test an increasing number of strategies and predictive features, or repeat the same experiment many times, it becomes more likely that some of the estimated effects will appear extraordinarily large purely by chance. How, then, should we evaluate and select the right models? And in which situations does it matter? The multiple testing problem is now more pervasive and affects practitioners and academics alike. It is therefore important that analysts use tougher standards to test their models in a robust and unbiased way.
This paper evaluates the performance of the ’model confidence set’ (MCS) introduced in Hansen et al. (2011a). The MCS procedure, described in Section 2, starts with a collection of models and sequentially prunes the worst performing models one by one, according to some user-defined loss function, until the first non-rejection takes place. These surviving models, found to be statistically similar, define the estimated model confidence set.
In the introduction to their article, for instance, the authors suggest that MCS can be used to select ’treatment effects’ or ’trading rules with the best Sharpe ratio’ (p. 454). Our simulations suggest that the coverage properties in MCS are not adequate to winnow out trading strategies in practice. Analysts may use MCS for initial screening or forecasting combination, but not as sufficient evidence to select investment strategies. Similarly, academics should not rely solely on MCS as sufficient evidence to defend a given macroeconomic or forecasting model. We hope that our discussions here raise awareness of the limitations of similar model selection methods, but also of the need for further research in this area.
This paper relates to an extensive literature on model selection and forecast evaluation in economics (Corradi & Distaso (2011), Elliott & Timmermann (2016), Clark & McCracken (2013)). More generally, our work relates to a deeper discussion of the implications and challenges of data-driven model selection (Leeb & Potscher (2005)). Concerns about false discoveries due to p-hacking, or data snooping, are not limited to finance, but arguably affect all observational or experimental studies (Ioannidis (2005), John et al. (2012), Simonsohn et al. (2014)). A growing variety of methods address the multiple testing problem (White (2000), Benjamini & Hochberg (1995), Benjamini & Yekutieli (2001), Storey (2002), Romano & Wolf (2005), Romano et al. (2010); see also Bailey et al. (2014) and Harvey et al. (2016) for recent methodologies in finance).
The rest of the paper is structured as follows. Section 2 describes the MCS and its limitations. Section 3 presents simulation results from the perspective of selecting financial strategies. Section 4 concludes.
MCS
We begin this Section with a brief overview of the model confidence set (MCS) from Hansen et al. (2011b), and then introduce the main limitations of applying it to a forecasting problem. We encourage the reader to see Hansen et al. (2011a), Hansen et al. (2011b), and Hansen et al. (2014) for additional details on the methodology. We stress that our discussions here should not be understood as a naive critique of the MCS. MCS presents a substantial contribution to the model selection problem; we find, however, that the requirements in MCS are not adequate for many of the modern model-selection problems faced in practice.
MCS starts with a collection of models
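Following the notation in Hansen et al. (2011a), MCS is based on an equivalence test, $\delta_{\mathcal{M}}$, and an elimination rule, $e_{\mathcal{M}}$, where $\mathcal{M}^0$ denotes the initial collection of models. (i) Initially set $\mathcal{M} = \mathcal{M}^0$. (ii) Test the hypothesis $H_{0,\mathcal{M}}$ using $\delta_{\mathcal{M}}$ at level $\alpha$. (iii) If $H_{0,\mathcal{M}}$ is not rejected, define $\widehat{\mathcal{M}}^*_{1-\alpha} = \mathcal{M}$; otherwise, use $e_{\mathcal{M}}$ to eliminate a model from $\mathcal{M}$ and repeat from step (ii).
When the procedure ends, MCS yields the estimated model confidence set $\widehat{\mathcal{M}}^*_{1-\alpha}$.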
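The sequential elimination described in this Section can be illustrated with a short Python sketch. It is a simplified illustration rather than the implementation of Hansen et al. (2011a): we assume an i.i.d. bootstrap over time (the original procedure relies on a block bootstrap), a maximum-type t-statistic computed against the cross-sectional average loss, and elimination of the model with the largest standardized relative loss. The function name `mcs_sketch` and its parameters are hypothetical.

```python
import numpy as np

def mcs_sketch(losses, alpha=0.05, n_boot=500, seed=0):
    """Simplified MCS-style elimination on a (T, M) array of losses.

    Assumptions (ours, for illustration only): i.i.d. bootstrap over time,
    a max-type t-statistic, and one model removed per rejection.
    Returns the column indices of the surviving models.
    """
    rng = np.random.default_rng(seed)
    T, M = losses.shape
    alive = list(range(M))
    while len(alive) > 1:
        L = losses[:, alive]
        # Loss of each surviving model relative to the cross-sectional average.
        d = L - L.mean(axis=1, keepdims=True)          # shape (T, m)
        d_bar = d.mean(axis=0)                         # shape (m,)
        # Bootstrap the time averages by resampling rows with replacement.
        idx = rng.integers(0, T, size=(n_boot, T))
        boot_means = d[idx].mean(axis=1)               # shape (n_boot, m)
        se = boot_means.std(axis=0, ddof=1)
        t_stats = d_bar / se
        t_max = t_stats.max()
        # Null distribution of the max statistic (bootstrap means recentered).
        boot_t_max = ((boot_means - d_bar) / se).max(axis=1)
        p_value = (boot_t_max >= t_max).mean()
        if p_value >= alpha:
            break                                      # first non-rejection: stop
        # Elimination rule: drop the model with the largest t-statistic.
        alive.pop(int(np.argmax(t_stats)))
    return alive
```

With a loss matrix in which one column is lower than all others by a wide margin, the sketch eliminates every other model and returns that single column index. When the gap is small relative to the noise, many models survive, which is the low-power behaviour discussed in this paper.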
Although MCS is easy to compute (there are several statistical software packages available) and has many attractive features, we find that its use is limited in practice. The methodology requires the true superior models to have an unrealistically high signal-to-noise ratio. The low power of the test is in part due to not defining a benchmark. In Section 3, for example, we show that a superior model would need to have an annualized Sharpe ratio greater than 7 to be picked up as the single model in
Model selection criteria that do not severely penalize for multiple testing tend to select models that have experienced a high backtesting performance when, in reality, they are of the same quality as many others with a poorer performance. The problem is exacerbated with a large number N of trials, much as when testing individual coefficients in a regression. If there are dozens of coefficients, on average there will be a few that appear strongly significant. If we run hundreds of trading strategies, some of them will yield extraordinarily large Sharpe ratios and MCS will select them.
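The effect of the number of trials on the best observed Sharpe ratio can be checked with a small simulation. In this sketch every strategy has a true Sharpe ratio of zero by construction. The 1% daily volatility, the 252-day annualization, and the seed are illustrative assumptions of ours, not values taken from the simulations reported below.

```python
import numpy as np

rng = np.random.default_rng(42)
T, N = 250, 1000                      # one year of daily data, N trial strategies

# Every strategy has zero expected return, so any nonzero Sharpe ratio is noise.
daily = rng.normal(loc=0.0, scale=0.01, size=(T, N))
sharpe = np.sqrt(252) * daily.mean(axis=0) / daily.std(axis=0, ddof=1)

# Best in-sample Sharpe ratio among the first n strategies, for growing n.
best_so_far = np.maximum.accumulate(sharpe)
for n in (10, 100, 1000):
    print(f"best in-sample Sharpe among first {n:>4} strategies: {best_so_far[n - 1]:.2f}")
```

Because the maximum is taken over nested sets of strategies, the best in-sample Sharpe ratio can only increase with the number of trials, even though no strategy has any skill.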
Simulation exercise
This Section presents simulation results in a financial engineering problem. However, the results are relevant to a wider range of model selection applications. Data scientists are regularly testing usage time or conversion rates under different features via A/B experimentation in retail, online platforms, and mobile apps (Kohavi et al. (2007)); prediction methodologies to improve decision-making are also becoming popular in policy (Athey (2017)).
We take the standpoint of a hedge fund manager who has to choose between different investment strategies. A manager will typically simulate M different strategies, each of them generated using different features, data signals, and machine learning methods, and potentially choose those with the highest backtesting performance. We simulate M series of financial returns as follows.
Let M be the number of models (strategies) to be simulated. Assume each model m generates T returns according to a random walk with drift. We also assume that daily returns experience a Poisson jump-diffusion process, similar to Merton (1976). When this event takes place, returns jump upwards or downwards an amount equal to ten times the (daily) volatility. In discrete form,
We introduce one true superior strategy, which is defined as having a (’multiplier’) times higher expected returns. That is, using the notation from equation (1),
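A data-generating process of this kind can be sketched as follows. This is a minimal sketch under our own assumptions, not the exact specification used in the paper: daily returns are the sum of a drift, Gaussian noise, and an occasional jump of ten daily volatilities with a random sign. The function name `simulate_returns`, the 252-day scaling, and the daily jump probability `jump_prob` are hypothetical; the default values of μ, σ, a, M, and T follow the out-of-sample illustration discussed later in this Section.

```python
import numpy as np

def simulate_returns(M=100, T=250, mu=0.10, sigma=0.09, a=10.0,
                     jump_prob=0.01, seed=0):
    """Simulate a (T, M) array of daily returns; column 0 is the superior model.

    Assumptions (illustrative): annual drift mu and volatility sigma are
    scaled to daily values with 252 trading days, and jump_prob is the
    daily probability of a jump.
    """
    rng = np.random.default_rng(seed)
    mu_d = mu / 252
    sigma_d = sigma / np.sqrt(252)

    drift = np.full(M, mu_d)
    drift[0] = a * mu_d                        # superior model: a times the drift

    noise = sigma_d * rng.standard_normal((T, M))
    jump_hits = rng.random((T, M)) < jump_prob          # days on which a jump occurs
    jump_sign = rng.choice([-1.0, 1.0], size=(T, M))    # upward or downward
    jumps = 10.0 * sigma_d * jump_hits * jump_sign      # ten daily volatilities

    return drift + noise + jumps
```

Setting `a=1.0` yields the case in which all strategies are equally good, which is a useful check on how many models a selection procedure excludes by chance alone.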
In order to evaluate the performance of a strategy, we define the loss function as the excess returns over the expected returns:
Finally, for each Monte Carlo simulation and parameter combination, we apply the MCS procedure and analyze the in-sample as well as the out-of-sample performance of the selected and excluded models. As defined earlier,
To first illustrate the lack of power or signal-to-noise ratio problem in the MCS procedure, we narrow the simulations to the case where M = 50, 100 and T = 250 (about a year of daily trading data). Figure 1 shows the number of selected models in

Number of models selected in
We find that the superior model needs to have a Sharpe ratio greater than 7 to be picked up as the sole best model in
Figure 1 also suggests that the MCS’s threshold Sharpe ratios uniformly penalize for the number of trials. This concern can be related to a growing literature on the false discovery rate (FDR) or family-wise error rate (FWER). The most common example is that of using individual t-tests in multiple testing. Suppose that we backtest N independent investment strategies and find that the most profitable one has a Sharpe ratio that is highly significant at the 1% level. Even for small N, such as N = 25, the implied probability of observing such a t-statistic is high:
This concern is relevant here because the multiple testing problem is particularly worrisome in finance (Barras et al. (2010), Bailey et al. (2014), Lo (2016)). Hedge fund managers can be tempted to backtest hundreds of trading strategies, and then present to their clients those with the highest performance. By selecting investment i, where
Figure 2 generalizes the results from Fig. 1 using all parameter specifications. In particular, the 3D-surface shows the percentage of models included in
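For N independent tests at level α, the probability that at least one of them appears significant when no strategy has any skill is 1 − (1 − α)^N. The short sketch below evaluates this standard expression; the function name is hypothetical. For α = 1% and N = 25 it gives roughly 22%.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.01):
    """Probability that at least one of n_tests independent tests
    rejects at level alpha when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests

for n in (1, 25, 100, 1000):
    print(n, round(prob_at_least_one_false_positive(n), 3))
```

The probability approaches one quickly as N grows, which is why a single significant t-statistic among many trials carries little evidential weight.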

Generalizes Fig. 1 using all parameter specifications. Percentage of models included in
Finally, we illustrate what we observe out-of-sample when we use the MCS algorithm to select financial strategies. We restrict the data to the case where T = 250 in-sample observations, T = 125 out-of-sample observations, M = 100 initial models, one superior strategy with a = 10, and μ = 10% and σ = 9% (results are similar under alternative specifications). We first compare the in-sample and out-of-sample performance of the MCS selected models,
Figure 3 shows that the out-of-sample performance of the selected models in

The out-of-sample performance of the selected models in

The out-of-sample mean returns of the selected strategies,
We now discuss robustness results from two alternative specifications. First, we extend the simulation to select financial strategies based on a collection of Sharpe ratios. In particular, we follow the steps from Section 3 and simulate three years of daily returns. For each strategy, we compute twelve annualized Sharpe ratios based on their quarterly performance (similar results are obtained using monthly or bi-monthly SRs). We then compute the number of MCS selected models as a function of the superior model’s Sharpe ratio, its return multiplier a (relative to a baseline 10% annual return), as well as the out-of-sample performance of the selected (and excluded) trading strategies. In this case, the loss function is computed for each strategy-period SR as opposed to daily returns. The results, shown in Figs. 5–7, are similar to those in the previous Section. We find that MCS will select strategies with backtest overfitting and that the selected and excluded strategies perform equally well out-of-sample. Figure 9 (Appendix) also shows that MCS excludes a large fraction of models even when all strategies are equally good (a = 1).
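The construction of quarterly Sharpe ratios from daily returns can be sketched as follows. The conventions of 63 trading days per quarter and 252 per year, as well as the function name `quarterly_sharpe`, are our assumptions for illustration; with three years of daily data (756 observations) they produce twelve annualized Sharpe ratios per strategy.

```python
import numpy as np

def quarterly_sharpe(daily_returns, days_per_quarter=63, days_per_year=252):
    """Annualized Sharpe ratio of each quarter for each strategy.

    daily_returns: array of shape (T, M). Trailing days that do not fill
    a complete quarter are dropped. Returns an array of shape (Q, M).
    """
    r = np.asarray(daily_returns, dtype=float)
    n_quarters = r.shape[0] // days_per_quarter
    r = r[: n_quarters * days_per_quarter]
    # Split the time axis into consecutive, non-overlapping quarters.
    blocks = r.reshape(n_quarters, days_per_quarter, r.shape[1])
    mean = blocks.mean(axis=1)
    std = blocks.std(axis=1, ddof=1)
    return np.sqrt(days_per_year) * mean / std
```

Each row of the output is one quarter and each column one strategy, so the resulting array can be passed to a model selection procedure in place of daily returns.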

We simulate trading strategies following Section 3, and construct quarterly Sharpe ratios. We apply the MCS procedure to this collection of Sharpe ratios and compare the MCS selected and excluded models. Figure 5 shows that it takes a very large Sharpe ratio for MCS to pick up the superior model. Figure 9 generalizes this exercise for different parameter specifications.

Figures 6 and 7 show that the MCS selected strategies are subject to backtest overfitting, i.e. the strategies experience a larger in-sample Sharpe ratio although they are equally good out-of-sample. This figure shows the histogram of mean in-sample and out-of-sample Sharpe ratios of the MCS selected strategies (probability density function). Results are qualitatively similar to those in Section 3.

This figure shows the histogram of the mean out-of-sample Sharpe ratios of the MCS selected and excluded strategies (probability density function).
Finally, we discuss the case when MCS is used with the

We simulate trading strategies following Section 3 and use the Tmax,M statistic instead of TRange,M. This figure shows that, for in-sample Sharpe ratios below 10, almost all models are selected, regardless of the number of starting models m0.

Generalizes the results on the collection of quarterly Sharpe ratios using all parameter specifications. This figure shows the number of models in

We simulate trading strategies following Section 3 and use the Tmax,M statistic instead of TRange,M. See Section 3.3 for additional details. Figure 10 shows the probability that
Conclusion
Traditional testing and evaluation methods need to be reconsidered in light of recent advances in big data and technology. Portfolio managers, for instance, can now generate thousands of different trading strategies at little computational cost, and then present those with the highest backtesting performance to their investors. Similarly, data scientists in industry can design experiments to test each of the new features, and even repeat these experiments many times. In finance, this means that machine learning strategies will be subject to backtest overfitting: we will tend to select strategies that, out of so many, just happened to experience high backtesting performance. We therefore need new tools that can severely penalize for the multiplicity of trials but remain powerful enough to be utilized in practice.
We test the performance of the model confidence set introduced in Hansen et al. (2011a) using a variety of financial strategies simulated from the perspective of a portfolio manager. We find that MCS is not adequate to solve an analyst’s model selection problem, and more generally we hope that our work raises awareness of the challenges of model selection in modern finance.
Acknowledgments
We are grateful to Mike Lock, Isaiah Andrews, Valentina Corradi, Peter Hansen, Anna Mikusheva, Michael Lewis, and Kevin Sheppard for helpful comments. All views expressed in this paper are those of the authors, and do not necessarily reflect those of True Positive Technologies. All errors are our own.
