We consider the task of portfolio selection as a time series prediction problem. At each time-step we obtain prices of a universe of assets and are required to allocate our wealth across them with the goal of maximizing it, based on the historic price returns. We assume these returns are realizations of a general non-stationary stochastic process, and only assume they do not change significantly over short time scales. We follow a statistical learning approach, in which we bound the generalization error of a non-stationary stochastic process, using analogues of uniform laws of large numbers for non-i.i.d. random variables. We use the learning bounds to formulate an optimization algorithm for portfolio selection, and present favorable numerical results with financial data.
One of the main challenges in predicting financial markets is the non-stationary nature of the markets. One way to handle non-stationarity is to fit the data with a time series model, however, model assumptions are often simplistic and not valid in practice. An alternative approach to data modeling is algorithmic modeling Breiman (2001). In the i.i.d. setting, a method for deriving prediction algorithms is to use statistical learning theory to bound the generalization error by the empirical error and the complexity of the hypothesis class, which can lead to such algorithms, as in the case of the margin theories for the SupportVectorMachines and AdaBoost algorithms Mohri et al. (2012).
In Kuznetsov & Mohri (2015, 2020), the authors derived generalization bounds for non-stationary stochastic processes, without relying on assumptions of mixing or ergodicity that were assumed previously Dehling et al. (2002), Adams & Nobel (2010), Mohri & Rostamizadeh (2010). The bounds were derived using the non-i.i.d. measures of complexity introduced in Rakhlin et al. (2011, 2015a,b) based on binary trees. In the standard learning scenario with i.i.d. data, complexity measures as the Rademacher complexity are expressed as function of a sample of data. For non-stationary stochastic processes, the complexity measure has to capture the sequential nature of the data by some data structure, and the appropriate data structure is a binary tree. The other ingredient necessary to bound the generalization error introduced in Kuznetsov & Mohri (2015, 2020) is a measure of discrepancy between the future and past conditional distributions, which can be estimated from data. They used the bounds to derive a forecasting algorithm for kernel-based hypothesis with regression losses, in the non-stationary setting.
Here, we follow this approach and use the tree-based complexity measure in a concentration inequality from Rakhlin et al. (2015b), and the discrepancy measure, to bound the generalization error of a non-stationary stochastic process. We refine the bound for portfolio selection by bounding the sequential complexity of the hypothesis class. In portfolio selection without short sales, we wish to find the investment distribution across assets that will maximize the resulting wealth. In the framework of statistical learning, the hypothesis class is the probability simplex and the loss function is the negative wealth derived from the investment strategy. We use the learning bounds to obtain a sequential optimization problem for portfolio selection, this optimization problem is non-convex but it is possible to find a local minimum efficiently using the projected gradient descent algorithm. We present supportive numerical results with financial data.
The paper is organised a follows. In Section 2 we describe the framework of learning in a non-stationary environment, and the tools and concepts required to analyze it. We use a concentration inequality for a martingale difference sequence from Rakhlin et al. (2015b) to obtain a generalization bound for a non-stationary stochastic process in Section 3. In Section 4 we customize the generalization bound by bounding the sequential Rademacher complexity for portfolio selection. We use the learning bounds in Section 5 to formulate a sequential optimization problem which maximizes the expected wealth while taking into account the discrepancy between recent and earlier data, and provide affirmative numerical results. The paper is summarized in Section 6.
Preliminaries
Consider a scenario in which the learner receives a realization of some stochastic process , in a sequential manner, Z1, Z2, . . . , ZT ≡ Z1:T. The goal of the learner is to select a hypothesis , that minimizes the generalization error conditioned on the data observed so far , where is a loss function. The set of loss functions associated to is denoted by , so that is written as . This function is bounded |f (Z) |≤1, and we denote as the set of all functions from the space to the interval [a, b].
For i.i.d. random variables, statements about the convergence to zero of the supremum of the sum of the sequence are known as uniform laws of large numbers. In Rakhlin et al. (2015b), analogues of uniform laws of large numbers are derived for dependent random variables and the sum of the martingale difference sequence . In this section we describe the concepts and tools required to derive some of the results in Rakhlin et al. (2015b), and the ones required to analyze the problem of learning in a non-stationary environment described in Kuznetsov & Mohri (2015, 2020).
A -valued treezof depthT is a complete rooted binary tree with nodes labeled by elements of . The tree is identified with a sequence (z1, . . . , zT) of labeling functions which provide the labels for each node. A path of length T is given by the sequence ε = (ε1, . . . εT) ∈ {±1} T. We denote zt (ε) although zt depends only on (ε1, . . . εt-1), so the root of the tree is z1, the right child of the root is z2 (+1), the left child of the root is z2 (-1), and so on. Given a tree z and a function , the composition f ∘ z is a real-valued tree identified by the functions (f ∘ z1, . . . , f ∘ zT). The sequential Rachemacher complexity of a function class is
where ε = (ε1, . . . εT) ∈ {±1} T is a sequence of i.i.d. Rademacher random variables. For the reader interested in getting an idea how empirical processes with dependent random variables are bounded by this supremum over trees, we present an example for T = 2 in Appendix A.
A set V of -valued trees of depth T is a sequentialα-cover (with respect to the lp norm) of on a tree z of depth T if
and with respect to l∞ if
To put the definition (1) in words, pick any tree in the set , and any path ε in this tree, then there is a tree in the set V, and for the path ε in this tree, each of the nodes is no more than distance α away from the nodes of the original tree. The sequential covering number of a function class on a tree z is defined as size of the minimal sequential cover
and the corresponding maximal sequential covering number is defined as . The generalization error is studied through the supremum of the empirical process Kuznetsov & Mohri (2015, 2020)
with the vector q = (q1, . . . , qT) assigning weights to losses at different times. This empirical process is linked to a martingale difference sequence through the discrepancy between the future and past conditional distributions
which is a measure of the non-stationarity of the time series, and it can be estimated from data as follows. Since the suprema of sum is bounded by the sum of suprema we can bound the discrepancy by
If we take pt to be a uniform distribution over the last s points, , we can assume the second term on the right hand side, denoted discs, is small, indicating the sample does not change significantly in s ⪡ T time steps. It was shown in Kuznetsov & Mohri (2015, 2020) that this assumption is necessary for learning in this setting, since an upper bound on discs is related to a lower bound on the generalization error. Using the argument about suprema again, the first term on the right hand side is bounded by
and the second term in (3) is the empirical discrepancy, denoted , which is calculated from the available data. The first term in (3) is a supremum over a sum of a martingale difference sequence, which is bounded in the next section.
Generalization bound for non-stationary stochastic processes
In this section we upper-bound the generalization error of a non-stationary stochastic process by the supremum of a sum of a martingale difference sequence, which is further bounded by a concentration inequality from Rakhlin et al. (2015b). We denote u as a vector of uniform weights (ut = 1/T), and use the empirical discrepancy to express the following learning guarantee for non-stationary sequences.
Theorem 1.Let Z1:T be a sequence of random variables in with T ≥ 2, then, for any δ > 0 with probability 1 - δ the following inequality holds for all:
under the mild assumptions , and , where c is an absolute constant, and .
Proof. A similar approach was used in the proof of Theorem 7 in Kuznetsov & Mohri (2015). Since the difference of suprema is bounded by the suprema of difference we have
and the second term is bounded by Holder’s inequality,
This gives an upper bound expressed in terms of a martingale difference sequence,
and we can use Lemma 15 in Rakhlin et al. (2015b) to write the following concentration inequality,
which is valid for all , T ≥ 2, and any α ≥ 0, under the mild assumptions and , where c is an absolute constant and .
Similarly, we can bound the first term in (3),
and use the inequalities (2), (3), and (5) to obtain the concentration inequality
which is valid under the same assumptions as (4). To combine these two results, note that for two random variables a and b, if a + b > α at least one of the them has to be larger than α/2, so , where the last inequality is the union bound. This gives
and by substituting the definition of Φ (Z1:T) we obtain Theorem 1. □
Note that the bound (4) is different from the generalization bounds of Theorem 1 in Kuznetsov & Mohri (2015, 2020) in the sense that the complexity in those bounds is expressed in terms of expected covering numbers, which is a concept based on drawing random trees according to conditional distributions. Since they exploit information about the conditional distributions the data is drawn from, those bounds are tighter. However, commonly the conditional distributions are unknown, and the complexity terms are further bounded by the sequential Rademacher complexity.
Generalization bound for portfolio selection in non-stationary markets
We wish to customize Theorem 1 for the case of portfolio selection. We start by describing the framework and loss functions of sequential portfolio selection. At each trading period t we receive a vector of n asset returns, zt = pt/pt-1 - 1, where pt is the price, and need to find optimal portfolio weights in the simplex, , that would maximize the return. Over the time T, the wealth accumulated in a buy-and-hold trading strategy is . Maximizing this objective amounts to solving a linear optimization problem over the simplex and the solution is to simply to put all the money on the best performing asset. A different trading strategy is to create a constant rebalanced portfolio, which requires rebalancing the portfolio at each time step to maintain the weights fixed due to price changes. The wealth of this trading strategy is . The wealth of the best constant rebalanced portfolio is at least as much as for the buy-and-hold strategy, since the best performing asset is the maximal objective over the set of single-asset portfolios, which is included in the simplex.
Considering the log wealth of the constant rebalanced portfolio strategy leads to the additive loss function - log(w · (zt + 1)) = - log(w · zt + 1). This loss function is α-exp concave for α ≤ 1 Cesa- Bianchi & Lugosi (2006), a property that enables the application of algorithms from the online learning literature, which produce wealth asymptotic to the wealth of the best constant rebalanced portfolio in hindsight, that is, after future prices are revealed Cesa-Bianchi & Lugosi (2006), Cover (1991), Hazan et al. (2007), Li & Hoi (2014). We consider this loss function due to its favorable properties. We take the asset returns to be in the space , implying that none of the assets reach a price of zero. Since in general price is bounded from below by zero but not bounded from above, we assume 0 < l ≤ u. The maximal value of the constant rebalanced portfolio loss function is then m = max {log(1 + u) , | log(1 - l) |}.
After defining the hypothesis class, we can state the learning bound for portfolio selection in non-stationary markets in the following theorem.
Theorem 2.Let with n ≥ 4. Then, if Z1:T is a sequence of random variables in with T ≥ 11, for any δ > 0 with probability 1 - δ, the following inequality holds for all:
under the mild assumption , where , and c is an absolute constant.
Proof. Consider the linear class , with . To bound the sequential Rademacher complexity, note that , where , and {e1, . . . , en} are unit vectors in the positive cartesian directions. From proposition 14 in Rakhlin et al. (2015b), . Since is a finite set, we can use Equation (7) after Lemma 3 in Rakhlin et al. (2015b), and obtain
this bound was derived in Rakhlin & Sridharan (2014). To derive a lower bound, note that , where
is the standard empirical Rademacher complexity. This inequality is true because the supremum over the standard Rademacher complexity can be seen as the sequential Rademacher complexity with trees that are independent of ε, which are trees restricted to have identical values in a level of the tree. Taking the supremum over such restricted trees is smaller or equal to taking the supremum without this restriction.
To obtain a lower bound on , note that the expectation over Rademacher random variables is symmetric to a sign transformation so
We further bound where is restricted to have cash as one of the assets. This ensures that , and , so
and
Specifically for our class
where the last inequality is due to Khintchine-Kahane inequality Mohri et al. (2012). The sequential Rademacher complexity of the linear class is therefore bounded by
The normalized loss function class for constant rebalanced portfolios is , with defined above and . The sequential Rademacher complexity of this class is bounded using Lemma 13 in Rakhlin et al. (2015b) by
where is the Lipschitz constant of φ (x) in the space . This lemma requires , which is satisfied for T ≥ 8n/(n - 1). Note that in general this requirement is very mild since is a nonnegative, nondecreasing function of T for any class . To see that , note that for n ≥ 2, we can consider the case T = 1 to bound the maximal covering number from below, , and we have
To see that for n ≥ 4, consider the class , with and φ (x) as defined above, it is clear that . Consider the tree z = (z1, z2), with z1 = (- l, - l, u, u) ⊤, z2 (+1) = z2 (-1) = (- l, u, - l, u) ⊤. The four trees of the set are shown in Fig. 1. A minimal sequential 1/2-cover with respect to l∞ for is given by these four trees themselves. This shows that for n ≥ 4 and T ≥ 2, . Substituting the appropriate hypothesis class, and the bound (7) into Theorem 1, gives Theorem 2. □
The four trees of the set , defined in the text, we denote a = log(1 - l)/m, b = log(1 + u)/m. Since b - a > 1, there is no number that can 1/2-cover both a and b. Since no path is repeated in two trees, we have that . To see a case in which the minimal covering number is smaller than the size of set, note that there are eight possible three-node trees with possible values a and b, and they can be covered by these four trees. This is because if we pick any tree from the eight, and pick a path in this tree, this path exists in one of these four trees.
We extend Theorem 1 to hold uniformly over q in Appendix B, as done in Kuznetsov & Mohri (2015, 2020). This suggests devising an algorithm for optimizing over the parameters q and w to minimize the learning bounds, which is described in the next section.
Formulating the sequential optimization problem
We wish to use our learning bounds to obtain an algorithm for portfolio selection. This algorithm will include the instantaneous discrepancy Kuznetsov & Mohri (2015, 2020), which is independent of q,
with , the instantaneous discrepancy.
Based on Theorems 2 and 3, and the inequality (8), the sequential optimization problem we want to solve is
where the weights q are taken to be in the simplex, which facilitates a weighted average of historical losses while taking into account their instantaneous discrepancy. Multiplicative constants are absorbed into the regularization parameters λ1, λ2, determined by cross-validation. Note that calculating the instantaneous discrepancy dt requires solving a nonlinear constrained optimization problem for each data point, which might be computationally prohibitive. To avoid this, since the instantaneous discrepancy involves only a small number of samples, we can consider the return in a single time step to be much smaller than 1, and solve the linearized optimization problem, , which is just finding the maximal element of a vector.
The optimization objective in (9) is non-convex, and we use the projected gradient descent algorithm to reach a local minimum. The iterative step of the projected gradient descent algorithm is xk+1 = argminx∈C ∥ x - (xk - η ∇ g (xk)) ∥ 2, where g (x) is the optimization objective, and C is a convex set. This algorithm converges to a local minima of a constrained non-convex objctive, a proof is given in Bertsekas (1999). Since the regularization terms are not differentiable everywhere, we perform a differentiable approximation , as in Lee et al. (2006), with γ = 10-9. The projection step of the algorithm requires solving a quadratic constrained optimization problem, which might be computationally expensive, however, there is an efficient algorithm for projecting into the simplex Wang & Carreira- Perpiñán (2013). This algorithm can be applied to solve (9) because if we note x⊤ = (wq) , y⊤ = (ywyq), we have
After deriving an algorithm for portfolio selection, we evaluate it on a backtest. The data is generated from the daily adjusted closing prices of the top 10 market cap stocks of the S&P 500 index in 2006, which are: Exxon Mobile, General Electric, Microsoft, Citigroup, Bank of America, Procter & Gambel, Walmart, Johnson & Johnson, Pfizer, and AIG. The performance of an equal weight portfolio of these stocks is presented in Figure 2(a), the performance is inferior to that of the index, demonstrating that the dataset does not have a positive bias with respect to the benchmark, which is expected since the stocks were picked in 2006. The results of the backtest are shown in Figure 2(b), transaction costs are taken to be 0.5% of the portfolio turnover (including buy and sell transactions) as suggested in Bogle (2017), and we take l = u = 0.4, and s = 20. The figure shows the aggregated test samples of a walk-forward test with an in-sample training length of 1260 days (5 years), and an out-of-sample test length of 22 days (1 month). First, the data is trained on 1210 days and the last 50 days are used for selecting the hyperparameters λ1,2 = {0.05, 0.1, 0.5, 1}. After this parameter is set, the full training sample of 1260 days is retrained with this value, the resulting portfolio return of the next 22 days is recorded, and this process is repeated with data shifted 22 days forward.
A backtest of the algorithm (9). Figure 2(a) shows the difference between the S&P 500 index and an equal weight portfolio of its top assets, used for the backtest. In figure 2(b) we see that the discrepancy-based portfolio produced higher returns than other benchmarks, Sharpe ratios are reported in the main text. The instantaneous discrepancy is shown in 2(c) and features a batch of high discrepancy values around 2009. The corresponding values of the q variables are shown in 2(d) and we see lower values around 2009 as expected.
There are two benchmarks in the figure, the S&P 500 index, and a strategy that sets weights proportional to the compounded asset returns in the training dataset, where the weights of assets with negative returns are set to zero. We see that the discrepancy-based portfolio (DBP), obtained by solving the optimization problem (9), produces higher returns than these benchmarks. The Sharpe ratios of the strategies are: S&P 500: 0.7, compounded: 0.78, and DBP: 0.88. The instantaneous discrepancy dt, and qt variables for one training sample are shown in Figures 2(c) and 2(d), respectively. We see a batch of high discrepancy values around 2009, corresponding to the financial crisis, and that these dates have less weight in the optimization objective due to smaller or zero qt values, as expected. In Figures 3(a) and 3(b) we plot portfolio weights, each time the portfolios are rebalanced, for the DBP and compounded strategies. We see that the temporal weight patterns of both strategies are similar, except that the DBP strategy produces portfolios that are more localized on the leading assets. This is understood since the algorithm is maximizing compounded returns rather than distributing weights proportional to individual asset performance, also recall that the solution of the linearized objective in w is a single asset.
The portfolio weights of the of the DBP and compounded strategies, black is 1 and white is 0. We see that the weight-temporal patterns of both strategies are similar, except that the DBP strategy puts more weight on the leading assets.
Summary
We studied the problem of portfolio selection in non-stationary markets using a statistical learning approach. We bounded the generalization error of a non-stationary stochastic process by the sequential Rademacher complexity and the empirical discrepancy, and modified the bound for portfolio section by bounding the sequential Rademacher complexity. We used the learning bounds to obtain a sequential optimization algorithm for portfolio selection, and presented supportive numerical results with financial data.
Footnotes
Appendix
References
1.
Adams,T.M., Nobel,A.B.2010. Uniform convergence of vapnik–chervonenkis classes under ergodic sampling, Ann Probab38(4) 1345–1367.
Kuznetsov,V., Mohri,M.2015. Learning theory and algorithms for forecasting non-stationary time series, in ‘Advances in Neural Information Processing Systems 28’, pp. 541–549.
10.
Kuznetsov,V., Mohri,M.2020. Discrepancy-based theory and algorithms for forecasting non-stationary time series, Annals of Mathematics and Artificial Intelligence88(4), 367–399.
11.
Lee,S., Lee,H., Abbeel,P., Ng,A.Y.2006. Efficient l1 regularized logistic regression, in ‘In Proceedings of the Twenty-first National Conference on Artificial Intelligence (AAAI-06)’, pp. 1–9.
Rakhlin,A., Sridharan,K., Tewari,A.2011. Online learning: Stochastic, constrained, and smoothed adversaries, in ‘Advances in Neural Information Processing Systems 24’, pp. 1764–1772.
17.
Rakhlin,A., Sridharan,K., Tewari,A.2015a. Online learning via sequential complexities, Journal of Machine Learning Research16(6), 155–186.
18.
Rakhlin,A., Sridharan,K., Tewari,A.2015b. Sequential complexities and uniform martingale laws of large numbers, Probability Theory and Related Fields161(1), 111–153.
19.
Wang,W., Carreira-Perpiñán,M.2013. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. Available at https://arxiv.org/abs/1309.1541.