Portfolio optimization for cointelated pairs: SDEs vs Machine learning

Abstract

With the recent rise of Machine Learning (ML) as a candidate to partially replace classic Financial Mathematics (FM) methodologies, we investigate the performance of both in solving the problem of dynamic portfolio optimization in a continuous-time, finite-horizon setting for a portfolio of two intertwined assets.

In the Financial Mathematics approach we model the asset prices not via the common approaches used in pairs trading, such as high correlation or cointegration, but with the cointelation model of Mahdavi-Damghani (2013), which aims to reconcile short-term risk and long-term equilibrium. We maximize the overall P&L with a Financial Mathematics approach that dynamically switches between a mean-variance optimal strategy and a power utility maximizing strategy. We use a stochastic control formulation of the power utility maximization problem and numerically solve the resulting HJB equation with the Deep Galerkin method introduced in Sirignano and Spiliopoulos (2018).

We then turn to Machine Learning for the same P&L maximization problem and use clustering analysis to devise bands, combined with in-band optimization. Although this approach is model agnostic, results obtained with data simulated from the same cointelation model give a slight competitive advantage to the ML over the FM methodology.¹

Keywords

Pairs trading, cointelation, portfolio optimization, stochastic control, band-wise Gaussian mixture, deep learning

1 Introduction

The concepts of co-movement and correlation have been shown to be commonly misunderstood by practitioners (Mahdavi-Damghani et al. (2012)), which motivated the Cointelation model proposed in Mahdavi-Damghani (2013). We solve the portfolio optimization problem employing a more general set of admissible strategies than the long/short strategies used in pairs trading.

A pairs trading strategy involves matching a long position with a short position in two assets with a high correlation. Pairs trading was pioneered in the mid 1980s by a group of quantitative researchers from Morgan Stanley. For an introduction to pairs trading see Vidyamurthy (2004). The securities in a pairs trade must have a high positive correlation, which is the primary driver behind the strategy’s profits.

Pairs trading is based on the high historical correlation of two assets and a trader’s view that the two securities will maintain a specified correlation. A pairs trading strategy is applied when a trader identifies a correlation discrepancy. More specifically, the trader monitors performance of two historically correlated securities. When the correlation between the two securities temporarily weakens, i.e. the spread widens, the trader applies a trading strategy which shorts the high asset and buys the low asset. As the spread narrows again to some equilibrium value, a profit results.

However, many authors argue that correlation is an inappropriate measure of dependency in financial markets, since returns often exhibit a nonlinear co-dependence (e.g. Alexander (2001), Wilmott (2007)). Mahdavi-Damghani et al. (2012) showed that the correlation measured from the returns of mean-reverting processes is misleading: a strong positive correlation does not necessarily imply that two stochastic processes move in the same direction, and vice versa. Moreover, correlation measures short-term risk, whereas cointegration tests the long-term equilibrium relationships between assets and has been extensively used in low-frequency pairs trading (see Vidyamurthy (2004)). Cointegration tests do not measure how well two variables move together, but rather whether the difference between their means remains constant. Sometimes series with high correlation will also be cointegrated, and vice versa, but this is not always the case.

The cointelation model was introduced in Mahdavi-Damghani (2013) as a hybrid model which reconciles correlation and cointegration by capturing both short-term risk and long-term equilibrium. The rationale for the long-term risk is that during rare market crashes all asset prices fall. However, in more bullish periods, the short-term risk increases, the long-term risk becomes less pronounced and the “macro” driver less visible. These influences are accompanied by mean reversion forces from one asset to the other.

In this setting we consider a continuous-time, finite-horizon portfolio optimization problem for pairs of assets whose prices follow the cointelation model of Mahdavi-Damghani (2013). Generally, the optimization problem is to find the optimal control ${\tilde{w}}^{*} = \underset{\tilde{w} \in A}{argmax} U (X_{t}^{\tilde{w}}, Y_{t}^{\tilde{w}})$ (1) where U (·) is a utility function, $\tilde{w} = ({\tilde{w}}_{1}, {\tilde{w}}_{2})$ is a vector of proportions of wealth invested in each asset, and $A$ is the set of admissible strategies: either ${\tilde{w}}_{1} = - {\tilde{w}}_{2}$ (long/short) or ${\tilde{w}}_{1}, {\tilde{w}}_{2} > 0$ with ${\tilde{w}}_{1} + {\tilde{w}}_{2} = 1$ (long only).

We solve the portfolio optimization problem in (1) with Financial Mathematics and Machine Learning methodologies and compare their performance. In the Financial Mathematics approach we use SDE evolution of asset prices, whereas the Machine Learning approach does not assume an underlying model and applies generally to any pair of assets.

In Section 2 we review the cointelation model. In Section 3 we use the classical Financial Mathematics criteria: mean-variance optimization and power utility maximization. In Section 4 we use clustering analysis from Machine Learning to solve the P&L maximization problem. We present the results of each approach in Section 5 and discuss them comparatively.

2 Review of the cointelation model for pairs of assets

We first present the usual way correlation is calculated in the financial industry (see e.g. p.274 Wilmott (2007), Alexander (2001)). Assume we have two assets with prices modeled by stochastic processes $(X_{t})_{t \geq 0}$ and $(Y_{t})_{t \geq 0}$ on a probability space $(Ω, F, ℙ)$ . We have N observations of X and Y at intervals Δt, i.e. X (t_i) and Y (t_i) for all i = 1, . . . , N, with $Δ t = t_{i} - t_{i - 1}$ . Here Δt ∈ {1, 5, 22, 252} corresponds to daily, weekly, monthly and yearly data. The Δt-returns of assets X and Y at the i-th data point are $R_{X} (t_{i}, Δ t) = \frac{X (t_{i} + Δ t) - X (t_{i})}{X (t_{i})}$ (2) $R_{Y} (t_{i}, Δ t) = \frac{Y (t_{i} + Δ t) - Y (t_{i})}{Y (t_{i})} .$ (3) The sample volatilities of the time series of asset prices X and Y are then $σ_{X} (Δ t) = \sqrt{\frac{1}{Δ t (N - 1)} \sum_{i = 1}^{N} (R_{X} (t_{i}, Δ t) - {\bar{R}}_{X})^{2}}$ (4) $σ_{Y} (Δ t) = \sqrt{\frac{1}{Δ t (N - 1)} \sum_{i = 1}^{N} (R_{Y} (t_{i}, Δ t) - {\bar{R}}_{Y})^{2}},$ (5) where ${\bar{R}}_{X}$ and ${\bar{R}}_{Y}$ are the sample averages of all the returns in the series of X and Y, respectively. The sample covariance between the returns of assets X and Y is given by

$σ_{XY} (Δ t) = \frac{1}{Δ t (N - 1)} \sum_{i = 1}^{N} (R_{X} (t_{i}, Δ t) - {\bar{R}}_{X}) (R_{Y} (t_{i}, Δ t) - {\bar{R}}_{Y}) .$ (6) In this paper we consider the measured correlation, which is the sample cross-correlation given by $ρ_{XY} (Δ t) = \frac{σ_{XY} (Δ t)}{σ_{X} (Δ t) σ_{Y} (Δ t)} .$ (7)
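The computation in (2)-(7) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `delta_returns` and `measured_correlation` are hypothetical, prices are assumed to sit on a regular (e.g. daily) grid, and overlapping Δt-returns are taken at every observation.

```python
import numpy as np

def delta_returns(prices, dt):
    """Simple returns over a horizon of `dt` grid steps, as in (2)-(3)."""
    prices = np.asarray(prices, dtype=float)
    return (prices[dt:] - prices[:-dt]) / prices[:-dt]

def measured_correlation(x, y, dt=1):
    """Sample cross-correlation (7) of the dt-returns of two price series."""
    rx, ry = delta_returns(x, dt), delta_returns(y, dt)
    n = len(rx)
    # The common factor 1 / (dt * (n - 1)) of (4)-(6) cancels in the ratio (7);
    # it is kept so that each quantity matches its definition.
    scale = 1.0 / (dt * (n - 1))
    cov_xy = scale * np.sum((rx - rx.mean()) * (ry - ry.mean()))
    sig_x = np.sqrt(scale * np.sum((rx - rx.mean()) ** 2))
    sig_y = np.sqrt(scale * np.sum((ry - ry.mean()) ** 2))
    return cov_xy / (sig_x * sig_y)
```

Calling `measured_correlation(x, y, dt)` for `dt` in {1, 5, 22} gives the daily, weekly and monthly measured correlations of the same pair of series.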

For correlation to be an appropriate measure of co-dependence, the assumption of linear dependency between the series needs to be satisfied (see Chapter 1.4 Alexander (2001)). In financial markets returns often exhibit non-linear dependence, in which case correlation is a misleading measure of co-dependency, especially when used to capture long-term relationships between assets (see Mahdavi-Damghani et al. (2012) and Mahdavi-Damghani (2013)).

An alternative statistical measure to correlation is cointegration. If two time series X_t and Y_t are integrated² of order d and there exists β such that a linear combination X_t + βY_t is integrated of order less than d, then X_t and Y_t are cointegrated (see Engle and Granger (1991)). Since the spread of cointegrated asset prices is mean reverting, they have a common stochastic trend, i.e. the asset prices are ‘tied together’ in the long term, although they might drift apart in the short term (see Alexander (1999)). Because cointegration requires sophisticated statistical analysis, it has not been used as widely as correlation in the financial industry.

Although correlation and cointegration are related, they are different concepts. High correlation does not necessarily imply high cointegration, and neither does high cointegration imply high correlation (e.g. see Figure 4 in Mahdavi-Damghani et al. (2012)). Two assets may be perfectly correlated over short timescales yet diverge in the long run, with one growing and the other decaying. Conversely, two assets may follow each other, with a certain finite spread, but with any correlation, positive, negative or varying.

Mahdavi-Damghani (2013) proposed cointelation as a hybrid model that aims to mediate between correlation and cointegration. It captures both short-term and long-term relationships between the assets.

Definition 1. Consider a filtered probability space $(Ω, F, {(F_{t})}_{(t \geq 0)}, ℙ)$ , where $ℙ$ is the historical probability measure. The cointelation model for a pair of assets with prices X_t and Y_t is defined in Mahdavi-Damghani (2013) as

$\begin{matrix} {dX}_{t} = μ X_{t} dt + σ X_{t} {dW}_{t}, \\ {dY}_{t} = κ (X_{t} - Y_{t}) dt + η Y_{t} d {\tilde{W}}_{t}, \\ d 〈 W, \tilde{W} 〉_{t} = ρ dt, \end{matrix}$ (8) where $μ \in ℝ$ , σ > 0, X (t₀) = x₀ are the drift, diffusion coefficients and initial value of asset price X; 0 < κ ≤ 1, η > 0, Y (t₀) = y₀ > 0 are the rate of mean reversion, volatility and initial value of the asset price Y; $(\tilde{W} (t))_{t \geq 0}$ and (W (t)) _t≥0 are two correlated Brownian motions with constant correlation coefficient -1 ≤ ρ ≤ 1 that generate the filtration ${(F_{t})}_{(t \geq 0)}$ .

The processes $(X_{t})_{t \geq 0}$ and $(Y_{t})_{t \geq 0}$ are called the leading process and the lagging process, respectively, because the lagging process reverts around the leading process.
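To make the dynamics in (8) concrete, the following sketch simulates one path of the leading and lagging processes. It is a minimal illustration, assuming an Euler-Maruyama discretization (a choice made here, not taken from the source); the function name `simulate_cointelation` and its default arguments are hypothetical.

```python
import numpy as np

def simulate_cointelation(x0, y0, mu, sigma, kappa, eta, rho,
                          n_steps, dt=1.0, seed=0):
    """Euler-Maruyama paths of the leading (X) and lagging (Y) processes in (8)."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps + 1)
    y = np.empty(n_steps + 1)
    x[0], y[0] = x0, y0
    for i in range(n_steps):
        z1, z2 = rng.standard_normal(2)
        dw = np.sqrt(dt) * z1
        # Second Brownian increment, built to have correlation rho with dw.
        dw_tilde = np.sqrt(dt) * (rho * z1 + np.sqrt(1.0 - rho ** 2) * z2)
        x[i + 1] = x[i] + mu * x[i] * dt + sigma * x[i] * dw
        y[i + 1] = y[i] + kappa * (x[i] - y[i]) * dt + eta * y[i] * dw_tilde
    return x, y
```

With both volatilities set to zero the lagging process simply decays geometrically towards the leading one, which is a quick sanity check of the mean-reversion term.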

We present here the concepts of the inferred correlation function and the number of crosses formula introduced in Mahdavi-Damghani (2013) in order to devise a test of whether two assets are cointelated.

Let $ρ_{XY}^{*} (Δ t)$ be the inferred correlation function between two time series of cointelated asset prices, defined as follows $ρ_{XY}^{*} (Δ t) = sup_{0 < \tilde{Δ} t \leq Δ t} ρ_{XY} (\tilde{Δ} t) .$ (9) Sometimes there may not be enough data to calculate the Δt-inferred (measured) correlation of cointelated assets. In Mahdavi-Damghani (2013) the following approximation of the inferred correlation (9) was proposed after examining various data sets: $ρ_{XY}^{*} (Δ t) \approx ρ + (1 - ρ) [1 - exp (- λ κ (Δ t - 1))],$ (10) where κ ∈ [0, 1], λ > 0, ρ ∈ [-1, 1]. The parameter λ ≈ 1.75 for "regular financial data", although it is itself a function in general. Thus, if one does not have enough empirical data to calculate, for example, the yearly (252 days) inferred correlation, the formula in equation (10) allows one to approximate it using only the κ and ρ parameters of the cointelation model in (8) and setting Δt = 252, λ ≈ 1.75.
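As a small worked example of formula (10), the sketch below evaluates the approximate inferred correlation for a given horizon. The function name `inferred_correlation` is hypothetical, and the default λ = 1.75 is the value quoted above for regular financial data.

```python
import math

def inferred_correlation(dt, rho, kappa, lam=1.75):
    """Approximate inferred correlation of a cointelated pair, equation (10)."""
    return rho + (1.0 - rho) * (1.0 - math.exp(-lam * kappa * (dt - 1)))
```

At Δt = 1 the expression returns ρ itself, and as Δt grows it moves monotonically towards 1 at a speed governed by λκ.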

The motivation for the inferred correlation approximation formula (10) is that, in the discrete version of the processes in equation (8), the measured correlation increases as the time increment Δt increases (e.g. correlations calculated using daily, weekly, monthly returns). Moreover, the measured correlation of cointelated pairs converges to 1 faster as the speed of mean reversion κ increases. If we set ρ = -1 in (8), the inferred correlation of cointelated asset prices may cover the whole correlation spectrum [-1, 1] (see Figure 1).

Fig. 1

(Top) Simulated path of cointelation model (8) with ρ = -1, θ = 0.1, σ = 0.01; (Bottom) Corresponding measured correlation (7), which increases from -1 to 1 as the time increment increases.

Another way of testing whether two time series are cointelated is to study how many times the normalized series cross paths. If one discretizes equation (8), then one can approximate the expected number of times, Γ_x,y, that the two stochastic processes, x = X_{i∈[1,2,…N]} and y = Y_{i∈[1,2,…N]}, cross paths as follows $E [Γ_{x, y} (κ, N)] \approx N [γ (1 - κ) + \frac{1}{2} \sqrt{κ}]$ (11) where N is the length of the data, γ is a positive constant and κ is the speed of mean reversion in equation (8).

Compared with purely correlated SDEs (i.e. without the mean reversion component, κ = 0), the discretized cointelated SDEs cross paths more often, and the larger κ is, the more often the paths of the discretized SDEs cross each other per unit of time.
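The crossing count can be compared with approximation (11) along the following lines. This is a sketch under assumptions: the input series are taken to be already normalized, a crossing is counted as a sign change of the spread, and the constant γ must be supplied by the user since its value is not given here. Both function names are hypothetical.

```python
import numpy as np

def count_crossings(x, y):
    """Number of sign changes of the spread x - y along the sample."""
    spread = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    signs = np.sign(spread)
    signs = signs[signs != 0]  # ignore exact ties
    return int(np.sum(signs[1:] != signs[:-1]))

def expected_crossings(kappa, n, gamma):
    """Approximate expected number of crossings, equation (11)."""
    return n * (gamma * (1.0 - kappa) + 0.5 * np.sqrt(kappa))
```

Comparing `count_crossings(x, y)` on observed data with `expected_crossings(kappa, len(x), gamma)` is one way to check the second cointelation criterion for a candidate κ.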

Then two stochastic processes are cointelated (see Mahdavi-Damghani (2013)) if

the inferred correlation formula in equation (10) is verified;

the number of crosses formula in equation (11) is verified;

the underlying assets have a reasonable physical connection that would suggest their spread should mean revert (e.g. oil and BP share prices).

The parameters in cointelation model (8) can be estimated using the inferred correlation formula (10) and the number of crosses formula (11) (see Mahdavi-Damghani (2013)). Similarly to the variance reduction methodology described in Mahdavi-Damghani et al. (2012) and Mahdavi-Damghani (2013), we can define $B_{+} = | \frac{max (X_{t} - Y_{t}, t \in [0, T])}{2} |,$ (12) $B_{-} = | \frac{inf (X_{t} - Y_{t}, t \in [0, T])}{2} | .$ We note that the estimation of κ has a higher variance in the zone $Z_{ρ} = B_{+} > | X_{t} - Y_{t} | > B_{-},$ (13) whereas ρ, on the other hand, has good-quality samples there. The reverse is true in the zone $Z_{κ} = | X_{t} - Y_{t} | > B_{+} ⋃ | X_{t} - Y_{t} | < B_{-} .$ (14) We can therefore sample κ in Z_κ and ρ in Z_ρ. Figure 2 illustrates this.

Fig. 2

Hypothetical spread split into three different zones for risk management and/or trading purposes.

3 Financial Mathematics approach for portfolio optimization problem

We consider a portfolio of two assets and model their prices with the cointelation model in (8). We approach the optimization problem for this portfolio with classic Financial Mathematics criteria: mean-variance optimization and power utility maximization. Since cointelated assets are characterized by both correlation and mean-reversion components, we formulate the mean-variance optimization problem for long-only strategies and calculate the optimal strategies to profit from correlation. To profit from the mean-reversion property of the cointelated assets, we use a stochastic control formulation of the power utility maximization problem for long/short strategies and calculate the optimal weights. We then maximize portfolio P&L by dynamically switching between these two optimal strategies.

3.1 Mean-variance optimization

We first review fundamental notions and concepts for mean-variance optimization.

Returns: Consider a portfolio of n assets with initial capital V (0) and weights w₁, w₂, . . . , w_n such that $\sum_{i = 1}^{n} w_{i} = 1$ , where w_iV (0) is the amount invested in security i, for i = 1, 2, . . . , n, at time t = 0. Denoting by S_i (t) the price of security i at time t, the number of shares held in security i at time t = 0 is $n_{i} = \frac{w_{i} V (0)}{S_{i} (0)} .$ (15) The value of the portfolio at time t is $V (t) = \sum_{i = 1}^{n} n_{i} S_{i} (t) .$ (16) Given the number of shares n_i with i = 1, . . . , n, the percentage of the portfolio invested in asset i at time t is $w_{i} (t) = \frac{n_{i} S_{i} (t)}{\sum_{j = 1}^{n} n_{j} S_{j} (t)},$ (17) with $\sum_{i = 1}^{n} w_{i} (t) = 1$ . The rate of return of asset i at time t (i.e. over [t - Δt, t]) is given by $R_{i} (t) = \frac{S_{i} (t) - S_{i} (t - Δ t)}{S_{i} (t - Δ t)} = \frac{S_{i} (t)}{S_{i} (t - Δ t)} - 1 .$ (18) The rate of return of the portfolio, R_p (t), is then $R_{p} (t) = \frac{V (t) - V (t - Δ t)}{V (t - Δ t)} .$ (19) We can show that the return of the portfolio is a linear combination of the returns of the individual assets as follows

$\begin{matrix} R_{p} (t) & = & - 1 + \frac{V (t)}{V (t - Δ t)} \\ & = & - 1 + \sum_{i = 1}^{n} \frac{n_{i} S_{i} (t)}{\sum_{j = 1}^{n} n_{j} S_{j} (t - Δ t)} \\ & = & - 1 + \sum_{i = 1}^{n} \frac{n_{i} S_{i} (t - Δ t)}{\sum_{j = 1}^{n} n_{j} S_{j} (t - Δ t)} \frac{S_{i} (t)}{S_{i} (t - Δ t)} \\ & = & - 1 + \sum_{i = 1}^{n} w_{i} (t - Δ t) (R_{i} (t) + 1) \\ & = & \sum_{i = 1}^{n} w_{i} (t - Δ t) R_{i} (t), \end{matrix}$ (20) where, by (17), the weights are those held at the start of the interval [t - Δt, t] and the last step uses $\sum_{i = 1}^{n} w_{i} (t - Δ t) = 1$ . Sometimes it is more convenient to use log returns, which are defined for asset i by $r_{i} (t) = ln (\frac{S_{i} (t)}{S_{i} (t - Δ t)}) .$ (21) Over a short period of time the log return is approximately equal to the rate of return: $r_{i} (t) = ln (\frac{S_{i} (t)}{S_{i} (t - Δ t)}) = ln (R_{i} (t) + 1) \approx R_{i} (t) .$ (22) Therefore we do not distinguish between these two returns, as long as the time increment Δt is short enough for the returns to be small. Going forward we will use daily logarithmic returns. Thus, the return of the portfolio, r_p, at time t becomes $r_{p} = \sum_{i = 1}^{n} w_{i} r_{i} .$ (23)
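A short numerical check of the identity in (20) may help. The sketch below computes the portfolio return directly from portfolio values and as a weighted sum of asset returns; the function name and the sample numbers are hypothetical, and the weights are taken at the start of the interval, which is when the identity holds exactly for fixed share holdings.

```python
def portfolio_return_two_ways(shares, prices_prev, prices_now):
    """Return (direct, weighted): V(t)/V(t - dt) - 1 and sum_i w_i R_i."""
    v_prev = sum(n * s for n, s in zip(shares, prices_prev))
    v_now = sum(n * s for n, s in zip(shares, prices_now))
    direct = v_now / v_prev - 1.0
    # Weights from share holdings valued at the start of the interval.
    weights = [n * s / v_prev for n, s in zip(shares, prices_prev)]
    asset_returns = [s1 / s0 - 1.0 for s0, s1 in zip(prices_prev, prices_now)]
    weighted = sum(w * r for w, r in zip(weights, asset_returns))
    return direct, weighted
```

For example, holding 10 and 5 shares priced at 100 and 50, with prices moving to 110 and 45, both computations give a portfolio return of 6%.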

Expectation and variance of returns: By the linearity property of expected value operator, the expected return of portfolio, E (r_p), is

$E (r_{p}) = E (\sum_{i = 1}^{n} w_{i} r_{i}) = \sum_{i = 1}^{n} w_{i} E (r_{i}) = \sum_{i = 1}^{n} w_{i} μ_{i} = w^{⊤} μ,$ (24) where μ_i denotes the expected return of asset i and w^⊤ = [w₁, w₂, . . . , w_n], μ = [μ₁, μ₂, . . . , μ_n] ^⊤.

The variance of the return of portfolio, Var (r_p), is given by

$\begin{matrix} Var (r_{p}) & = & E [{(\sum_{i = 1}^{n} w_{i} r_{i} - E (r_{p}))}^{2}] \\ & = & E [{(\sum_{i = 1}^{n} w_{i} (r_{i} - E (r_{i})))}^{2}] \\ & = & E [(\sum_{i = 1}^{n} w_{i} (r_{i} - E (r_{i}))) (\sum_{j = 1}^{n} w_{j} (r_{j} - E (r_{j})))] \\ & = & \sum_{i = 1}^{n} \sum_{j = 1}^{n} w_{i} w_{j} \underbrace{E [(r_{i} - E (r_{i})) (r_{j} - E (r_{j}))]}_{: = σ (r_{i}, r_{j})} \\ & = & \sum_{i = 1}^{n} \sum_{j = 1}^{n} w_{i} w_{j} σ (r_{i}, r_{j}) = w^{⊤} Σ w, \end{matrix}$ (25)

where Σ denotes the covariance matrix of the asset returns, composed of all covariances σ (r_i, r_j) between the returns of assets i and j. The variances of the individual asset returns, σ (r_i, r_i), constitute the diagonal of the covariance matrix.
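In matrix form, (24) and (25) reduce to two products. The following sketch evaluates them with NumPy; the function name `portfolio_mean_variance` and the sample inputs in the usage note are hypothetical.

```python
import numpy as np

def portfolio_mean_variance(weights, mu, cov):
    """Expected return w^T mu (24) and variance w^T Sigma w (25) of a portfolio."""
    w = np.asarray(weights, dtype=float)
    mu = np.asarray(mu, dtype=float)
    cov = np.asarray(cov, dtype=float)
    return float(w @ mu), float(w @ cov @ w)
```

With equal weights, expected returns (0.1, 0.2), variances 0.04 and 0.09 and covariance 0.01, this gives an expected return of 0.15 and a variance of 0.25 · (0.04 + 0.09 + 2 · 0.01) = 0.0375.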

Optimal investment strategy using mean-variance criterion

We consider a portfolio consisting of two assets. The uncertainty is modelled by a probability space $(Ω, F, P)$ with a filtration $(F_{t})_{t \geq 0}$ generated by two-dimensional Brownian motion: $(W, \tilde{W})$ . Denote by X (t) and Y (t) the prices of two assets at time t, with dynamics following cointelation model in (8). The investment behavior is modelled by an investment strategy h = (h₁, h₂). Here, h_i ∈ [0, 1], i = 1, 2, denotes the percentage of total wealth invested in i-th asset (see equation (17)). Let h₁ (t) and h₂ (t) denote respectively the portfolio weights for assets X and Y at time t. The holdings are allowed to be adjusted continuously up to a fixed horizon T.

Denoting by $V^{h} (t)$ the value of the portfolio at time t associated with a strategy h, we have $V^{h} (t) = \frac{h_{1} (t) V^{h} (t)}{X (t)} X (t) + \frac{h_{2} (t) V^{h} (t)}{Y (t)} Y (t),$ (26) with initial wealth V^h (t₀) = v₀. We restrict our considerations to self-financing strategies, where the value of the portfolio changes only because the asset prices change, i.e. there is no inflow or withdrawal of money (see Harrison and Kreps (1979)). In this case the dynamics of the wealth process are ${dV}^{h} (t) = V^{h} (t) [h_{1} (t) \frac{dX (t)}{X (t)} + h_{2} (t) \frac{dY (t)}{Y (t)}] .$ (27)

Let $A^{1}$ denote the set of all admissible strategies, h = (h₁, h₂), satisfying:

Given v₀ > 0, the wealth process V^v₀,h (·) corresponding to v₀ and h satisfies $V^{v_{0}, h} (t) \geq 0, 0 \leq t \leq T,$ (28)

h_i (t) ≥0 for all i = 1, 2,

$\sum_{i = 1}^{2} h_{i} (t) = 1$ .

An investment strategy, $h \in A^{1}$ , is called optimal if there exists no other strategy $\tilde{h} \in A^{1}$ such that $E (r_{p} (\tilde{h})) \geq E (r_{p} (h))$ and $Var (r_{p} (\tilde{h})) \leq Var (r_{p} (h))$ with at least one inequality being strict (see Li and Ng (2000)).

We define a utility function, U (t, h), as in Bodie et al. (1999): $U (t, h) = 2 τ E [r_{p} (t)] - σ^{2} [r_{p} (t)],$ (29) where τ ≥ 0 is the risk tolerance coefficient. Then according to Garcia et al. (2017) we have the following proposition.

Proposition 1. [Mean-Variance Criterion] Finding an optimal strategy for the mean-variance criterion is equivalent to the utility maximization problem: $max_{h (t)} U (t, h)$ (30) with constraints

$\sum_{i = 1}^{N} h_{i} = 1$ ,

h_i ≥ 0 ∀ i.

and U (t, h) given in (29).

Thus we need to solve the optimization problem in equation (30). From equation (20) we have that the rate of return of our portfolio, R_p, over [t - Δt, t] is $R_{p} (t) = \frac{V^{h} (t) - V^{h} (t - Δ t)}{V^{h} (t - Δ t)} = \sum_{i = 1}^{2} h_{i} (t) R_{i} (t),$ (31) where R_i is the rate of return of the individual assets. The log return of our portfolio, r_p, is given by $r_{p} (t) = h_{1} r_{1} (t) + h_{2} r_{2} (t),$ (32) where r_i (t) ≈ R_i (t), as we showed in equation (22).

Lemma 1. Denote by V^h (t) the value of the portfolio corresponding to the admissible strategy $h \in A^{1}$ . Then:

The expectation of portfolio return over [t - Δt, t] is $E (r_{p} (t)) = h_{1} E [r_{X} (t)] + h_{2} E [r_{Y} (t)] .$ (33)

The variance of portfolio return over [t - Δt, t] is

$Var (r_{p} (t)) = h_{1}^{2} Var [r_{X} (t)] + h_{2}^{2} Var [r_{Y} (t)] + 2 h_{1} h_{2} Cov [r_{X} (t), r_{Y} (t)],$ (34)

where $r_{X} (t) = ln (\frac{X_{t}}{X_{t - Δ t}})$ and $r_{Y} (t) = ln (\frac{Y_{t}}{Y_{t - Δ t}})$ are the daily log returns of assets X and Y, and

$E (r_{X} (t)) = (μ - \frac{σ^{2}}{2}) Δ t$ is the expected return of the asset price X over the horizon [t - Δt, t];

$E (r_{Y} (t)) = ln (a e^{μ Δ t} + (Y_{0} - a) e^{- κ Δ t}) - \frac{c e^{(2 μ + σ^{2}) Δ t} + d e^{(μ - κ + σ η ρ) Δ t}}{2 (a e^{μ Δ t} + (Y_{0} - a) e^{- κ Δ t})^{2}} - ln (Y_{t - Δ t}) - \frac{(Y_{0}^{2} - c - d) e^{2 (η^{2} - κ) Δ t}}{2 (a e^{μ Δ t} + (Y_{0} - a) e^{- κ Δ t})^{2}} + \frac{1}{2}$ is the expected return of the asset price Y over the horizon [t - Δt, t];

$Var (r_{X} (t)) = σ^{2} Δ t$ is the variance of return of asset price X over the horizon [t - Δt, t];

$Var (r_{Y} (t)) = \frac{c e^{(2 μ + σ^{2}) Δ t}}{(a e^{μ Δ t} + (Y_{0} - a) e^{- κ Δ t})^{2}} + \frac{d e^{(μ + σ η ρ - κ) Δ t}}{(a e^{μ Δ t} + (Y_{0} - a) e^{- κ Δ t})^{2}} + \frac{(Y_{0}^{2} - c - d) e^{2 (η^{2} - κ) Δ t}}{(a e^{μ Δ t} + (Y_{0} - a) e^{- κ Δ t})^{2}} - 1$ is the variance of return of asset price Y over the horizon [t - Δt, t];

$Cov (r_{X} (t), r_{Y} (t)) = ln (\frac{b e^{(μ + σ^{2}) Δ t} + (X_{0} Y_{0} - b) e^{(σ η ρ - κ) Δ t}}{a X_{0} e^{2 μ Δ t} + (X_{0} Y_{0} - a X_{0}) e^{(μ - κ) Δ t}})$ is the covariance of returns of the two asset prices X and Y over the horizon [t - Δt, t].

Proof. See Appendix A

The optimal weights for mean-variance criterion were derived in Soeryana et al. (2017). We state the following proposition from Soeryana et al. (2017) applied to the cointelation model (8).

Proposition 2. The optimal solution for the problem in (30) for cointelation model (8) is:

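$\begin{matrix} h^{*} (t) = & \frac{1}{e^{'} Σ^{- 1} (t) e} Σ^{- 1} (t) e \\ & + τ [Σ^{- 1} (t) M (t) - \frac{e^{'} Σ^{- 1} (t) M (t)}{e^{'} Σ^{- 1} (t) e} Σ^{- 1} (t) e], \end{matrix}$ (35) where e = (1, 1)' is the vector of ones, M (t) is the vector of expected asset returns, Σ (t) is the covariance matrix of the asset returns and τ is the risk tolerance coefficient in (29).

Substituting the formulas of Lemma 1 for the expectation, variance and covariance of the asset returns into equation (35), we obtain the optimal strategy for the mean-variance optimization problem. We will present numerical examples in Section 5.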
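Formula (35) translates directly into linear algebra. The sketch below is an illustration only: it assumes that e is the vector of ones and M the vector of expected returns, the function name `mean_variance_weights` is hypothetical, and the non-negativity constraint h_i ≥ 0 is not enforced, so the output should be checked against it separately.

```python
import numpy as np

def mean_variance_weights(cov, mean, tau):
    """Weights from (35): minimum-variance portfolio plus tau times a return tilt."""
    cov = np.asarray(cov, dtype=float)
    mean = np.asarray(mean, dtype=float)
    e = np.ones(len(mean))
    inv_e = np.linalg.solve(cov, e)      # Sigma^{-1} e
    inv_m = np.linalg.solve(cov, mean)   # Sigma^{-1} M
    min_var = inv_e / (e @ inv_e)
    tilt = inv_m - (e @ inv_m) / (e @ inv_e) * inv_e
    return min_var + tau * tilt
```

For τ = 0 the function returns the minimum-variance portfolio, and because the tilt term sums to zero the weights add up to one for any τ.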

3.2 Stochastic control for pairs trading

Power utility maximization problem

We now use a stochastic control approach to the power utility maximization problem. Here we mainly follow Mudchanatongsuk et al. (2008), but with modified dynamics for the asset prices. More specifically, they assume that the price of one of the assets follows a geometric Brownian motion and model the log-spread as an Ornstein-Uhlenbeck process. We, however, assume that the dynamics of the asset prices are governed by the cointelation model in equation (8), where one of the assets follows a geometric Brownian motion and the second asset mean reverts around the first one.

Let $(Ω, F, P)$ be a complete probability space with a filtration $(F_{t})_{t \geq 0}$ generated by two-dimensional Brownian motion: $(W, \tilde{W})$ . We consider the same market as in Subsection 3.1: two assets which follow the cointelation model (8).

We assume an initial wealth v₀ > 0 at time t = 0. The initial wealth is held in a margin account. For simplicity we assume that the interest rate for the margin account is zero, r = 0. Margining constraints can be financially quite punitive. The holdings are allowed to be adjusted continuously up to a fixed horizon T. The investment behavior is modelled by an investment strategy π = (π₁, π₂), where π_i (t), i = 1, 2, denotes the percentage of total wealth invested in the i-th asset at time t (see equation (17)), so that π₁ (t) and π₂ (t) are respectively the portfolio weights for assets X and Y at time t. We only allow pairs trading: short one of the assets and long the other in equal dollar amount, i.e. π₁ (t) = - π₂ (t). In addition, we restrict our considerations to self-financing strategies.

We define admissible control and controlled process as in Korn and Kraft (2002).

Definition 2. [Control] Given a subset U of $ℝ^{2}$ , we denote by $U_{0}$ the set of all progressively measurable processes π = {π_t, t ≥ 0} valued in U. The elements of $U_{0}$ are called control processes.

Denote by V^π (t) the value of the portfolio corresponding to strategy π at time t, which is given by $V^{π} (t) = \frac{π_{1} (t) V^{π} (t)}{X (t)} X (t) + \frac{π_{2} (t) V^{π} (t)}{Y (t)} Y (t) .$ (36) The dynamics of the portfolio value V^π associated with strategy π = (π₁, π₂) are given by ${dV}^{π} (t) = V^{π} (t) [π_{1} (t) \frac{dX (t)}{X (t)} + π_{2} (t) \frac{dY (t)}{Y (t)}] .$ (37) Substituting the dynamics of X (t) and Y (t) from (8) into (37) and using π₂ (t) = - π₁ (t), we get:

$\begin{matrix} {dV}^{π} (t) = V^{π} (t) [π_{1} (μ dt + σ dW (t)) \\ - π_{1} (κ (\frac{X (t)}{Y (t)} - 1) dt + η d \tilde{W} (t))] . \end{matrix}$ (38)

Lemma 2. Denote $Z (t) : = \frac{X (t)}{Y (t)}$ . For the cointelation model (8), Z (t) has the dynamics

$\begin{matrix} dZ (t) = & [μ + η^{2} - σ η ρ - κ (Z (t) - 1)] Z (t) dt \\ & + Z (t) (σ dW (t) - η d \tilde{W} (t)) . \end{matrix}$ (39)

Proof. By Itô’s quotient rule:

$\begin{matrix} d (\frac{X (t)}{Y (t)}) = & \frac{dX (t)}{X (t)} \frac{X (t)}{Y (t)} - \frac{dY (t)}{Y (t)} \frac{X (t)}{Y (t)} \\ & + \frac{d 〈 Y, Y 〉_{t}}{Y (t)^{2}} \frac{X (t)}{Y (t)} - \frac{d 〈 X, Y 〉_{t}}{X (t) Y (t)} \frac{X (t)}{Y (t)} . \end{matrix}$ (40) Writing this in terms of Z (t) gives

$\begin{matrix} dZ (t) = & Z (t) (μ dt + σ dW (t) - κ (Z (t) - 1) dt - η d \tilde{W} (t) + \frac{η^{2} Y^{2} (t)}{Y^{2} (t)} dt - σ η ρ dt) \\ = & [μ + η^{2} - σ η ρ - κ (Z (t) - 1)] Z (t) dt + (σ dW (t) - η d \tilde{W} (t)) Z (t), \end{matrix}$ (41) which proves the lemma.
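The result of the lemma can be checked numerically by simulating the ratio in two ways with the same Brownian increments: once from the dynamics derived for Z, and once as X/Y from model (8). The sketch below is only a sanity check under assumptions: it uses an Euler-Maruyama scheme, the function name and parameter values are hypothetical, and the two paths agree only up to discretization error.

```python
import numpy as np

def simulate_ratio_two_ways(mu, sigma, kappa, eta, rho, x0, y0,
                            T=1.0, n_steps=5000, seed=1):
    """Return (Z from its own SDE, X/Y from model (8), max gap along the path)."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x, y, z = x0, y0, x0 / y0
    max_gap = 0.0
    for _ in range(n_steps):
        z1, z2 = rng.standard_normal(2)
        dw = np.sqrt(dt) * z1
        dw_t = np.sqrt(dt) * (rho * z1 + np.sqrt(1.0 - rho ** 2) * z2)
        # Dynamics of Z: drift from the lemma, diffusion sigma dW - eta dW~.
        drift = (mu + eta ** 2 - sigma * eta * rho - kappa * (z - 1.0)) * z
        z_new = z + drift * dt + z * (sigma * dw - eta * dw_t)
        # Dynamics of X and Y from model (8).
        x_new = x + mu * x * dt + sigma * x * dw
        y_new = y + kappa * (x - y) * dt + eta * y * dw_t
        x, y, z = x_new, y_new, z_new
        max_gap = max(max_gap, abs(z - x / y))
    return z, x / y, max_gap
```

A small `max_gap` relative to the level of Z (which stays close to 1 here) indicates that the drift correction η² − σηρ and the minus sign on the η term are consistent with model (8).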

For each control process $π \in U_{0}$ we rewrite the dynamics of two-dimensional state process, P = (V^π, Z), as follows $dP (t) = a (t, P (t), π (t)) dt + b (t, P (t), π (t)) dB (t) .$ (42) with initial value of P (t₀) = p₀ and $B = (W, \tilde{W})$ being the two-dimensional Brownian motion. The process P is called the controlled process. Let [t₀, T] with 0≤ t₀ < T < ∞ be the relevant time interval and define $Q : = [t_{0}, T) \times ℝ^{2}$ . The coefficient functions $a : Q \times U \to ℝ^{2},$ (43) $b : Q \times U \to ℝ^{2 \times 2},$ are all continuous. Further, for all π ∈ U let a (· , · , π) and b (· , · , π) be in C¹ (Q). We then define

Definition 3. [Admissible control] Denote by $A^{2}$ the set of all admissible controls. A control {π (t) } _t∈[t₀,T] is called admissible if the following conditions hold:

$\forall k \in ℕ$ the integrability condition $E (\int_{t_{0}}^{T} | π (s) |^{k} ds) < \infty$ (44) is satisfied,

the corresponding state process P^π satisfies $E^{t_{0}, p_{0}} (sup_{t \in [t_{0}, T]} | P^{π} (t) |^{k}) < \infty,$

only pairs trading is allowed: short one of the assets and long the other, $π_{1} = - π_{2} .$ (45)

Since we consider a self-financing portfolio, by equation (45) the dynamics of the state process, P = (V^π, Z), become $\begin{matrix} {dV}^{π} (t) = & V^{π} (t) [π_{1} [μ - κ (Z (t) - 1)] dt + π_{1} [σ dW (t) - η d \tilde{W} (t)]], & V^{π} (0) = v_{0}, \\ dZ (t) = & [μ + η^{2} - σ η ρ - κ (Z (t) - 1)] Z (t) dt + [σ dW (t) - η d \tilde{W} (t)] Z (t), & Z (0) = z_{0} . \end{matrix}$

Optimal investment strategy

We assume that an investor’s preference is represented by the power utility function $U (x) = \frac{1}{γ} x^{γ},$ (46) with x ≥ 0 and risk aversion parameter γ < 1. Our aim is to maximize the objective functional J over all admissible controls, i.e. to determine an admissible control π (·) such that for each initial value (t₀, v₀, z₀) the utility functional below is maximized: $J (t_{0}, v_{0}, z_{0}; π) : = E [U (V^{π} (T)) | V_{t_{0}} = v_{0}, Z_{t_{0}} = z_{0}] .$ (47) The optimization problem is to find $\tilde{v} (t, v, z)$ and $π^{*} \in A^{2}$ such that $\tilde{v} (t, v, z) : = sup_{π (\cdot) \in A^{2}} J (t, v, z; π) = J (t, v, z; π^{*}) .$ (48) Consider a function G (t, v, z) such that G ∈ C^1,2 (Q). The Hamilton-Jacobi-Bellman (HJB) equation corresponding to the stochastic control problem (48) is

$\frac{\partial G}{\partial t} (t, v, z) + sup_{π \in A^{2}} L^{π} G (t, v, z) = 0,$ (49) subject to terminal condition $G (T, v, z) = v^{γ} .$ (50) The infinitesimal generator, $L^{π} G (t, v, z)$ in (49) associated with the two dimensional state process P = (V, Z) is given by

$\begin{matrix} L^{π} G (t, v, z) = & \frac{1}{2} [π_{1}^{2} (σ^{2} - 2 σ η ρ + η^{2}) v^{2} G_{vv} + 2 π_{1} (σ^{2} - 2 σ η ρ + η^{2}) v z G_{vz} \\ & + (σ^{2} - 2 σ η ρ + η^{2}) z^{2} G_{zz}] + π_{1} [μ - κ (z - 1)] v G_{v} \\ & + [μ + η^{2} - σ η ρ - κ (z - 1)] z G_{z} . \end{matrix}$ (51)

Theorem 1. If there exists an optimal control π^* (·) then G coincides with the value function: $G (t, v, z) = \tilde{v} (t, v, z) = J (t, v, z; π^{*}) .$

Using the separation ansatz $G (t, v, z) = v^{γ} f (t, z)$ we reduce the 3-dimensional HJB equation in (49) to the following 2-dimensional PDE:

$\begin{matrix} \tilde{σ} (γ - 1) f f_{t} - \frac{1}{2} {\tilde{σ}}^{2} γ z^{2} f_{z}^{2} - \frac{1}{2} γ [μ - κ (z - 1)]^{2} f^{2} \\ + \frac{1}{2} {\tilde{σ}}^{2} (γ - 1) z^{2} f f_{zz} - \tilde{σ} γ [μ - κ (z - 1)] z f f_{z} \\ + \tilde{σ} (γ - 1) [μ + η^{2} - σ η ρ - κ (z - 1)] z f f_{z} = 0, \\ f (T, z) = 1, (t, z) \in [0, T] \times ℝ, \end{matrix}$ (52) where $\tilde{σ} = σ^{2} - 2 σ η ρ + η^{2}$ .

The issue at this stage is that this PDE does not have a closed-form solution. It is a non-standard PDE: it is not high dimensional, but it is nonlinear, which makes finite difference methods and other standard numerical methods inadequate. For this reason we propose to use the “Deep Galerkin Method” to solve the PDE in (52). Once the solution is found, we can write the optimal strategy as

$\begin{matrix} π_{1}^{*} & = - \frac{\tilde{σ} z G_{vz} + [μ - κ (z - 1)] G_{v}}{\tilde{σ} v G_{vv}} \\ & = - \frac{\tilde{σ} z (f_{z} v^{γ - 1} γ) + [μ - κ (z - 1)] (f v^{γ - 1} γ)}{\tilde{σ} v (f v^{γ - 2} γ (γ - 1))} \\ & = - \frac{\tilde{σ} z f_{z} + [μ - κ (z - 1)] f}{\tilde{σ} f (γ - 1)} \\ & = - \frac{z f_{z}}{(γ - 1) f} - \frac{[μ - κ (z - 1)]}{\tilde{σ} (γ - 1)} . \end{matrix}$ (53) See Appendix B for the details.
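Once values of f and its derivative f_z are available (for instance from a numerical solver), the last line of (53) is a one-line computation. The sketch below is illustrative: the function name `optimal_pi1` and the numbers in the usage note are hypothetical, and `f` and `f_z` are assumed to be supplied by the caller at the point (t, z) of interest.

```python
def optimal_pi1(z, f, f_z, mu, sigma, kappa, eta, rho, gamma):
    """Optimal weight in asset X from the last line of (53)."""
    sigma_tilde = sigma ** 2 - 2.0 * sigma * eta * rho + eta ** 2
    drift = mu - kappa * (z - 1.0)
    return -z * f_z / ((gamma - 1.0) * f) - drift / (sigma_tilde * (gamma - 1.0))
```

For instance, with f = 1, f_z = 0, z = 1, μ = 0.05, σ = η = 0.2, ρ = 0 and γ = −1, the first term vanishes and the weight is 0.05 / (0.08 · 2) = 0.3125. When z is far above 1 the drift term turns negative and the weight in X becomes a short position, as expected for a mean-reverting ratio.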

3.3 Deep learning for solving PDE in stochastic control

Without an analytical solution to the non-standard 2-dimensional PDE in (52), we approximate the solution with the “Deep Galerkin Method” (DGM) algorithm proposed in Sirignano and Spiliopoulos (2018). DGM merges the Galerkin method with deep neural networks. The Galerkin method is a popular numerical method which seeks a reduced-form solution to a PDE as a linear combination of basis functions. DGM uses a deep neural network instead of a linear combination of basis functions. The algorithm is trained on batches of randomly sampled time and space points, and is therefore mesh-free.

Brief review of DGM

In the general case, consider a PDE with d spatial dimensions:

$\begin{matrix} \frac{\partial u}{\partial t} (t, x) + L u (t, x) = 0, & (t, x) \in [0, T] \times Ω, \\ u (t, x) = g (t, x), & x \in \partial Ω, \\ u (0, x) = u_{0} (x), & x \in Ω, \end{matrix}$ (54) where $x \in Ω \subset ℝ^{d}$ and $L$ is an operator collecting all the other partial derivatives. The goal is to approximate $u (t, x)$ with a deep neural network $f (t, x; θ)$, where $θ \in ℝ^{K}$ are the neural network parameters. We want to minimize the objective function associated with problem (54), which consists of three parts:

A measure of how well the approximation satisfies the PDE: ${∥ \frac{\partial f}{\partial t} (t, x; θ) + L f (t, x; θ) ∥}_{[0, T] \times Ω, ν_{1}}^{2} .$ (55)

A measure of how well the approximation satisfies the boundary condition: ${∥ f (t, x; θ) - g (t, x) ∥}_{[0, T] \times \partial Ω, ν_{2}}^{2} .$ (56)

A measure of how well the approximation satisfies the initial condition: ${∥ f (0, x; θ) - u_{0} (x) ∥}_{Ω, ν_{3}}^{2} .$ (57)

Here all three errors are measured in the L²-norm, i.e. ${∥ f (y) ∥}_{Y, ν}^{2} = \int_{Y} | f (y) |^{2} ν (y) dy$ with $ν (y)$ being a density on the region $Y$.

The sum of all three terms above gives us the objective function associated with the training of the neural network:

$\begin{matrix} J (f) & = & {∥ \frac{\partial f}{\partial t} (t, x; θ) + L f (t, x; θ) ∥}_{[0, T] \times Ω, ν_{1}}^{2} \\ & & + {∥ f (t, x; θ) - g (t, x) ∥}_{[0, T] \times \partial Ω, ν_{2}}^{2} \\ & & + {∥ f (0, x; θ) - u_{0} (x) ∥}_{Ω, ν_{3}}^{2} . \end{matrix}$ (58) Thus, the goal is to find a set of parameters θ such that the function f (t, x ; θ) minimizes the error J (f). When the dimension d is large, estimating θ by directly minimizing J (f) is infeasible. We therefore minimize J (f) with a machine learning approach, stochastic gradient descent, using a sequence of time and space points drawn at random. The DGM algorithm is described in Algorithm 1 below.

Algorithm 1 Deep Galerkin Method

Require: $L f (), u (), g ()$

Ensure: $L_{n}^{1} + L_{n}^{2} + L_{n}^{3}$ is minimized

Generate random points:

1: $(t_{n}, x_{n}) \leftarrow U \sim {[0, 1]}^{2}$

2: $(τ_{n}, z_{n}) \leftarrow U \sim {[0, 1]}^{2}$

3: $w_{n} \leftarrow U \sim [0, 1]$

4: s_n ← ((t_n, x_n) , (τ_n, z_n) , w_n)

Calculate the squared error:

5: $L_{n}^{1} \leftarrow {(\frac{\partial f}{\partial t} (t_{n}, x_{n}; θ_{n}) + L f (t_{n}, x_{n}; θ_{n}))}^{2}$

6: $L_{n}^{2} \leftarrow {(f (τ_{n}, z_{n}; θ_{n}) - g (τ_{n}, z_{n}))}^{2}$

7: $L_{n}^{3} \leftarrow {(f (0, w_{n}; θ_{n}) - u_{0} (w_{n}))}^{2}$

8: $G (θ_{n}, s_{n}) \leftarrow L_{n}^{1} + L_{n}^{2} + L_{n}^{3}$

Take a descent step at the random points:

9: $\underset{θ_{n}}{argmin}\, G (θ_{n}, s_{n})$

10: $α_{n} \leftarrow λ α_{n - 1}$

11: $θ_{n + 1} \leftarrow θ_{n} - α_{n} \nabla_{θ} G (θ_{n}, s_{n})$

Repeat until the tolerance level $10^{- 8}$ for the convergence criterion is achieved
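The sampling and descent loop of Algorithm 1 can be sketched on a toy problem. The block below is only an illustration under strong assumptions: the neural network is replaced by a hypothetical one-parameter trial function f (t, x; θ) = exp(-θ t) sin(π x), the PDE is the heat equation written as ∂u/∂t + L u = 0 with L u = -∂²u/∂x², and the gradient in θ is taken by a central finite difference instead of backpropagation. The starting value, learning rate and decay factor are illustrative choices, not the values used in the paper.

```python
import math
import random

PI2 = math.pi ** 2

def f(t, x, theta):
    """One-parameter trial solution; stands in for the neural network."""
    return math.exp(-theta * t) * math.sin(math.pi * x)

def pde_residual(t, x, theta):
    """df/dt + L f for the heat equation, where L f = -d^2 f / dx^2."""
    df_dt = -theta * f(t, x, theta)
    d2f_dx2 = -PI2 * f(t, x, theta)
    return df_dt - d2f_dx2

def g(t, x):
    """Boundary data on x in {0, 1} (assumed zero for this toy problem)."""
    return 0.0

def u0(x):
    """Initial condition u(0, x) of the toy problem."""
    return math.sin(math.pi * x)

def sampled_loss(theta, s):
    """G(theta, s_n) = L1 + L2 + L3 evaluated at one set of random points."""
    (t, x), (tau, z), w = s
    l1 = pde_residual(t, x, theta) ** 2           # interior PDE residual
    l2 = (f(tau, z, theta) - g(tau, z)) ** 2      # boundary mismatch
    l3 = (f(0.0, w, theta) - u0(w)) ** 2          # initial-condition mismatch
    return l1 + l2 + l3

def train(theta=5.0, alpha=0.5, lam=0.999, steps=2000, seed=0, eps=1e-5):
    rng = random.Random(seed)
    for _ in range(steps):
        s = ((rng.random(), rng.random()),            # interior point (t_n, x_n)
             (rng.random(), rng.choice([0.0, 1.0])),  # boundary point (tau_n, z_n)
             rng.random())                            # initial-time point w_n
        # Central finite difference in theta replaces backpropagation here.
        grad = (sampled_loss(theta + eps, s) - sampled_loss(theta - eps, s)) / (2 * eps)
        alpha *= lam           # exponentially decaying learning rate
        theta -= alpha * grad  # stochastic gradient descent step
    return theta

theta_hat = train()
print(theta_hat, PI2)  # theta_hat should approach pi^2, the exact decay rate
```

In this toy setting the exact solution corresponds to θ = π², so the distance between `theta_hat` and `PI2` gives a direct check that the sampled loss is being driven towards zero.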

Remark 1. The learning rate, α_n, is a configurable hyperparameter 3 used in the training of neural networks that controls how much to change the model in response to the estimated error each time the model weights are updated. The learning rate has a small positive value, often in the range between 0.0 and 1.0. Similar to Al-Aradi et al. (2018), we set α₀ = 0.001. Note that our learning rate α_n must decrease with n (see Sirignano and Spiliopoulos (2018)), and a simple way to achieve this is an exponential decay where α_n ← α_{n-1} * λ with λ ∈ (0, 1).

The neural network (NN) architecture used in DGM is like a long short-term memory (LSTM) network though with small differences (see Sirignano and Spiliopoulos (2018)). We describe below the architecture of this NN:

$\begin{matrix} S^{1} = σ (w^{1} \cdot x + b^{1}) \\ Z^{l} = σ (u^{z, l} \cdot x + w^{z, l} \cdot S^{l} + b^{z, l}) & l = 1, \dots, L \\ G^{l} = σ (u^{g, l} \cdot x + w^{g, l} \cdot S^{l} + b^{g, l}) & l = 1, \dots, L \\ R^{l} = σ (u^{r, l} \cdot x + w^{r, l} \cdot S^{l} + b^{r, l}) & l = 1, \dots, L \\ H^{l} = σ (u^{h, l} \cdot x + w^{h, l} \cdot (S^{l} ⊙ R^{l}) + b^{h, l}) & l = 1, \dots, L \\ S^{l + 1} = (1 - G^{l}) ⊙ H^{l} + Z^{l} ⊙ S^{l} & l = 1, \dots, L \\ f (t, x, θ) = w \cdot S^{L + 1} + b \end{matrix}$

with ⊙ denoting Hadamard (element-wise) multiplication, L the number of layers and σ the activation function. The remaining superscripts refer to the neurons of our NN architecture shown in Figures 3 and 4.
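To make the layer equations concrete, the following is a minimal forward pass written with plain Python lists. It is a sketch under assumptions: the input x is taken to be the pair (t, z), tanh is used as the activation σ, and the helper names (`init_params`, `gate`, `dgm_forward`), the layer width and the initialization scale are hypothetical choices for illustration only. No training is performed here.

```python
import math
import random

def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def init_params(d_in, width, n_layers, rng, scale=0.1):
    """Random small weights and zero biases (an illustrative initialization)."""
    def mat(rows, cols):
        return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]
    layers = []
    for _ in range(n_layers):
        # Each gate k in {z, g, r, h} owns (u^{k,l}, w^{k,l}, b^{k,l}).
        layers.append({k: (mat(width, d_in), mat(width, width), [0.0] * width)
                       for k in ("z", "g", "r", "h")})
    return {"w1": mat(width, d_in), "b1": [0.0] * width, "layers": layers,
            "w": [rng.uniform(-scale, scale) for _ in range(width)], "b": 0.0}

def gate(p, x, s, act):
    """act(u . x + w . s + b), applied element-wise."""
    u, w, b = p
    return [act(a + c + d) for a, c, d in zip(matvec(u, x), matvec(w, s), b)]

def dgm_forward(params, x, act=math.tanh):
    # S^1 = act(w^1 . x + b^1)
    s = [act(a + b) for a, b in zip(matvec(params["w1"], x), params["b1"])]
    for layer in params["layers"]:
        z = gate(layer["z"], x, s, act)
        g = gate(layer["g"], x, s, act)
        r = gate(layer["r"], x, s, act)
        h = gate(layer["h"], x, [si * ri for si, ri in zip(s, r)], act)
        # S^{l+1} = (1 - G^l) * H^l + Z^l * S^l, element-wise
        s = [(1.0 - gi) * hi + zi * si for gi, hi, zi, si in zip(g, h, z, s)]
    # f(t, x; theta) = w . S^{L+1} + b
    return sum(wi * si for wi, si in zip(params["w"], s)) + params["b"]

params = init_params(d_in=2, width=8, n_layers=3, rng=random.Random(0))
value = dgm_forward(params, [0.5, 1.0])  # network output at (t, z) = (0.5, 1.0)
print(value)
```

Each hidden layer re-reads the original input x, which is the feature that distinguishes this architecture from a plain feed-forward stack.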

Fig. 3

Bird’s-eye perspective of overall DGM architecture Al-Aradi et al. (2018).

Fig. 4

Operations within a single DGM layer Al-Aradi et al. (2018)

Remark 2. Figure 3 gives a bird's-eye view of the DGM method (Al-Aradi et al. (2018), Sirignano and Spiliopoulos (2018)) and Figure 4 shows the details of a single layer. The rationale for this architecture is explained in Al-Aradi et al. (2018) and Sirignano and Spiliopoulos (2018).

Testing DGM on Merton problem

Because the DGM method is relatively new, we wanted to test the algorithm ourselves. The method has been tested independently on several nonlinear, high-dimensional PDEs, including nonlinear HJB equations, in Al-Aradi et al. (2018) and Sirignano and Spiliopoulos (2018). We tested the DGM algorithm on the HJB equation for the Merton problem. Figures 5 and 6 show the analytical solution surface and the surface approximated with DGM, respectively. We found the performance of the algorithm quite impressive: Figure 7 shows the difference between the analytical and the approximate solution, and most of the time the error is between 0% and 1%. The only criticism we can make, though a minor one, is that the approximate solution does not do as well around t = 0 (the maximum error of 4% occurs around t = 0). This corroborates the findings in Al-Aradi et al. (2018). The authors do not explain why this is the case, but we did not consider this small imperfection significant enough to abandon the methodology.

Fig. 5

Analytical solution of the Merton Problem.

Fig. 6

Approximate solution of Merton problem using DGM.

Fig. 7

Error between analytical and approximate solution of Merton problem.

Solution to our PDE problem using DGM

Recall that the PDE we want to solve is given in equation (52). In the absence of a closed-form solution to this PDE, we approximate the solution with the DGM algorithm described above. Figure 8 shows the approximate solution to the PDE in (52) for different parameter values. Recall that, once we have the numerical solution f of the PDE above, we obtain the optimal weights as follows:

Fig. 8

Approximate solutions to PDE (104) with DGM for four different scenarios of ρ and μ and fixed σ = 0.2, η = 0.19, γ = 0.5. (a) Approximate solution with low μ = 0.01 and low ρ = -0.5. (b) Approximate solution with low μ = 0.01 and high ρ = 0.5. (c) Approximate solution with high μ = 0.4 and low ρ = -0.5. (d) Approximate solution with high μ = 0.4 and high ρ = 0.5.

$\begin{matrix} π_{1}^{*} & = - \frac{\tilde{σ} z G_{vz} + [μ - κ (z - 1)] G_{v}}{\tilde{σ} v G_{vv}} \\ & = - \frac{\tilde{σ} z (f_{z} v^{γ - 1} γ) + [μ - κ (z - 1)] (f v^{γ - 1} γ)}{\tilde{σ} v (f v^{γ - 2} γ (γ - 1))} \\ & = - \frac{\tilde{σ} z f_{z} + [μ - κ (z - 1)] f}{\tilde{σ} f (γ - 1)} \\ & = - \frac{z f_{z}}{(γ - 1) f} - \frac{[μ - κ (z - 1)]}{\tilde{σ} (γ - 1)}, \end{matrix}$ (60) with $π_{1}^{*} = - π_{2}^{*}$ .
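As an illustration of how the last line of this formula can be evaluated once a numerical f is available, the sketch below takes any callable z ↦ f (t, z) at a fixed time t and approximates f_z by a central finite difference. The function name `optimal_weight`, the step `dz`, and the value κ = 0.1 used in the example call are assumptions made for illustration; in practice f would be the trained DGM network.

```python
import math

def optimal_weight(f, z, mu, kappa, sigma, eta, rho, gamma, dz=1e-4):
    """Evaluate pi_1^* = -z f_z / ((gamma-1) f) - [mu - kappa (z-1)] / (sigma_tilde (gamma-1)).

    `f` is a callable z -> f(t, z) at a fixed time t (hypothetical interface).
    Returns the pair (pi_1^*, pi_2^*) with pi_2^* = -pi_1^*.
    """
    sigma_tilde = sigma ** 2 - 2.0 * sigma * eta * rho + eta ** 2
    f_z = (f(z + dz) - f(z - dz)) / (2.0 * dz)   # central difference for f_z
    drift = mu - kappa * (z - 1.0)
    pi1 = -z * f_z / ((gamma - 1.0) * f(z)) - drift / (sigma_tilde * (gamma - 1.0))
    return pi1, -pi1

# Example with a placeholder surface f(z) = exp(0.2 z); not a solution of the PDE.
pi1, pi2 = optimal_weight(lambda z: math.exp(0.2 * z), z=1.0, mu=0.01, kappa=0.1,
                          sigma=0.2, eta=0.19, rho=0.5, gamma=0.5)
print(pi1, pi2)
```

When f is constant in z the first term vanishes, and the weight reduces to the drift of the spread scaled by the spread variance and the risk aversion.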

3.4 Dynamic Switching: optimal strategies of mean-variance and power utility

Although in the previous two cases we assume that an investor has certain risk preferences as modelled by a utility function (MVC and power utility), it is interesting to consider a limiting case in which the investor can always be persuaded to go for more money (the identity utility function U (x) = x, which is essentially the power utility function with risk aversion parameter γ = 1) when deciding between MVC and power utility. Assuming that an investor's preference is modelled either as in equation (29) or as in equation (46), we employ dynamic switching between the two optimal strategies in order to further improve the portfolio returns:

$ψ^{*} (t) = \left\{ \begin{matrix} π^{*} (t), & if\ V^{π^{*}} (t) \geq V^{h^{*}} (t), \\ h^{*} (t), & otherwise, \end{matrix} \right.$ (61) where $π^{*} (t)$ and $h^{*} (t)$ are given in equations (60) and (35), and $V^{π^{*}} (t)$ and $V^{h^{*}} (t)$ are given in equations (36) and (26). The motivation behind the dynamic switching is that the investor wants to benefit from both the mean-reversion and the correlation elements of the cointelation model (8). More specifically, as the spread between the two assets increases the investor implements pairs trading and makes a profit; otherwise the MVC approach is used. The portfolio return over the investment horizon [0, T] with T = 1000 days is $R (r_{p}) = \frac{V (T) - V (0)}{V (0)} .$ (62) We perform 500 simulations with the same model and present the average results in Table 1. The average return at terminal time T obtained with the dynamic switching strategy is higher than the average returns obtained with the MVC or the power utility maximizing optimal strategies.

Table 1

Average over 500 simulations of portfolio returns at terminal time T (day 1000) with dynamic switching (DS) is higher than average portfolio return with only stochastic control (SC) or only mean-variance-criterion (MVC)

Criterion	Average portfolio return $\hat{R} (r_{p})$
MVC	35%
SC	61%
DS	83%
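The switching rule and the return measure can be written in a few lines. The sketch below assumes that the two optimal weight paths and the two portfolio value paths have already been computed on a common daily grid; the function names and the numbers in the example are hypothetical, and the return follows the usual convention (V (T) - V (0)) / V (0).

```python
def dynamic_switch(pi_star, h_star, v_pi, v_h):
    """Pick pi*(t) when V^{pi*}(t) >= V^{h*}(t), otherwise h*(t), at every time step."""
    return [p if vp >= vh else h
            for p, h, vp, vh in zip(pi_star, h_star, v_pi, v_h)]

def portfolio_return(values):
    """Return over the horizon: (V(T) - V(0)) / V(0)."""
    return (values[-1] - values[0]) / values[0]

# Illustrative inputs only (three time steps).
psi = dynamic_switch(pi_star=[0.5, 0.6, 0.7], h_star=[0.1, 0.2, 0.3],
                     v_pi=[100.0, 99.0, 105.0], v_h=[100.0, 101.0, 104.0])
print(psi)
print(portfolio_return([100.0, 110.0, 183.0]))
```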

4 Machine Learning formulation of the optimization problem

4.1 The portfolio optimization problem

We assume an initial wealth ${\tilde{w}}_{0} > 0$ at time t = 0. The investment behavior is modeled by an investment strategy w = (w₁, w₂), where w₁ (t) and w₂ (t) denote the percentages of wealth invested in assets X and Y respectively at time t. Let V (t) denote the portfolio value at time t and V^PnL (t) := V (t) - V (0) denote the profit and loss (P&L) over [0, t]. At each time t we allow either pairs trading, w₁ (t) = - w₂ (t), or long-only strategies without leverage, w₁ (t) + w₂ (t) = 1 with w₁ (t), w₂ (t) > 0.

The general optimization problem is to find an optimal strategy, w (t), such that the terminal P&L is maximized: $w^{*} (t) = \underset{w (t) \in A}{argmax} V^{PnL} (w, T),$ (63) where V^PnL (w, T) is the profit and loss corresponding to the strategy w at terminal time T. We use clustering analysis to devise the bands, and in each band we solve the following optimization problem: $w_{i}^{*} (t) = \underset{w_{i} (t) \in A}{argmax} V^{PnL} (w_{i}, t),$ (64) where i = 1, . . . , n indexes the bands and V^PnL (w_i, t) is the profit and loss corresponding to the strategy w_i at time t. The overall solution w^* is then obtained via a linear interpolation of the optimal weights per band $w_{i}^{*}$ . The advantage of the proposed method is that we do not impose a particular model on the asset prices, in line with the Data Curation methodologies discussed by Bailey Jr (2018). Only data observations are required to calculate the optimal weights, meaning that the complex SDE calibration is avoided.

4.2 Review of Band-Wise Gaussian Mixture model

We review the band-wise Gaussian mixture model because it inspires our method of selecting the bands. Consider a probability space $(Ω, F, ℙ)$ and let $(P_{t})_{t \geq 0}$ denote the asset price. Mahdavi-Damghani and Roberts (2017) have recently introduced a generalised bumping SDE for the price dynamics of the asset P_t. The SDE contains some secondary parameters whose purpose is empirical manual fitting. The generalised SDE is given by ${dP}_{t} = θ_{t, τ} (μ_{t, τ} - P_{t}) dt + σ P_{t}^{α} (1 - P_{t}^{2})^{β} {dW}_{t} .$ (65) Here $θ_{t, τ}$ is the speed of mean reversion, $μ_{t, τ}$ is the long-term mean, α is the positivity flag enforcer, β is the [-1, +1] boundary flag enforcer and $\bigcup_{i = t - τ}^{t} {dW}_{i}$ is the set of historical deviations of the assumed model’s distribution (e.g. all the historical absolute returns in the context of a normal diffusion).

This generalised SDE gives the cointelation model as a special case: take $θ_{t, τ} = - μ$, $μ_{t, τ} = 0$, α = 1 and β = 0 for the dynamics of X in (8); take $θ_{t, τ} = κ$, $μ_{t, τ} = X_{t}$, α = 1 and β = 0 for the dynamics of Y in (8). The SDE in (65) can also model:

Proportional returns (log-normal diffusion) when θ = 0, α = 1, β = 0,

Absolute returns (normal diffusion) when θ = 0, α = 0, β = 0,

Mean reverting returns where we enforce positivity of the returns (e.g. CIR Cox et al. (1985) diffusion when α = 1/2 and β = 0),

Mean reverting returns where we do not enforce positivity of the returns (e.g. OU Uhlenbeck and Ornstein (1930) diffusion when α = 0 and β = 0).
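The special cases listed above can be explored with a simple Euler-Maruyama discretization of (65). The sketch below assumes constant θ and μ (the paper allows them to depend on t and τ), and the function name, step size and parameter values are illustrative assumptions rather than the calibration used by the authors.

```python
import math
import random

def simulate_generalised_sde(p0, theta, mu, sigma, alpha, beta, n_steps, dt, seed=0):
    """Euler-Maruyama path of dP = theta (mu - P) dt + sigma P^alpha (1 - P^2)^beta dW.

    Assumes constant theta and mu. Fractional alpha or beta require the state to
    stay in the domain where the powers are real (e.g. P >= 0 for alpha = 1/2).
    """
    rng = random.Random(seed)
    path = [p0]
    for _ in range(n_steps):
        p = path[-1]
        dw = rng.gauss(0.0, math.sqrt(dt))              # Brownian increment
        diffusion = sigma * p ** alpha * (1.0 - p * p) ** beta
        path.append(p + theta * (mu - p) * dt + diffusion * dw)
    return path

# Log-normal diffusion: theta = 0, alpha = 1, beta = 0.
gbm = simulate_generalised_sde(1.0, 0.0, 0.0, 0.2, 1, 0, n_steps=252, dt=1 / 252)
# OU-type diffusion: alpha = 0, beta = 0, mean reverting towards mu = 1.
ou = simulate_generalised_sde(0.0, 1.0, 1.0, 0.1, 0, 0, n_steps=252, dt=1 / 252)
print(gbm[-1], ou[-1])
```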

In general, calibrating the parameters of the SDE in (65) to real data is complex. Instead, data are simulated with (65) and their empirical distribution is approximated, for the purpose of prediction, by a band-wise Gaussian mixture model. This is done for a sequence of bands created using a Machine Learning clustering method (see Mahdavi-Damghani and Roberts (2017)).

Algorithm 2 Band-Wise Gaussian Mixture (P, h)

Require: array P_1:n and number of bands h

Ensure: Ω^(1:h), $[B_{(1 : h)}^{+}, B_{(1 : h)}^{-}]$ are returned.

Sorting state:

1: P_(1:n) ← QuickSort(P_1:n)

2: $[B_{(1 : h)}^{+}, B_{(1 : h)}^{-}]$ ← FindPercentileBands(P_(1:n), h)

3: Ω^{(1:⌈n/h⌉)} ← []

Allocation state:

4: for j = 1 to h do

5: for i = 1 to n do

6: if $B_{(j)}^{-} \leq P_{(i)} < B_{(j)}^{+}$ then

7: Amend(Ω^(j), P_(i))

8: end if

9: end for

10: end for

Checking Approximation state:

11: ${\hat{μ}}_{1 : h} \leftarrow$ mean(Ω^(1:h))

12: ${\hat{σ}}_{1 : h} \leftarrow$ stdev(Ω^(1:h))

13: Print( $\cup_{i = 1}^{h} N ({\hat{μ}}_{i}, {\hat{σ}}_{i})$ )

Return state:

14: Ω^(1:h), $[B_{(1 : h)}^{+}, B_{(1 : h)}^{-}]$
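A compact version of this calibration can be sketched as follows. It is an illustration under assumptions: the built-in `sorted` replaces QuickSort, bands are formed by splitting the ordered sample into h groups of (nearly) equal cardinality, each band must contain at least two observations so that a sample standard deviation exists, and the function name and return layout are hypothetical.

```python
import statistics

def band_wise_gaussian_mixture(prices, h):
    """Split the ordered sample into h equal-cardinality bands and fit N(mean, stdev) per band.

    Returns (bands, params): bands[j] = (lower, upper) observed edges of band j,
    params[j] = (mean, sample standard deviation) of the observations in band j.
    """
    ordered = sorted(prices)                 # sorting state
    n = len(ordered)
    bands, params = [], []
    for j in range(h):                       # allocation state
        chunk = ordered[j * n // h:(j + 1) * n // h]
        bands.append((chunk[0], chunk[-1]))
        params.append((statistics.mean(chunk), statistics.stdev(chunk)))
    return bands, params

bands, params = band_wise_gaussian_mixture([7, 2, 9, 4, 1, 6, 3, 8, 5], h=3)
print(bands)   # observed lower and upper edge of each band
print(params)  # (mean, stdev) of the Gaussian fitted in each band
```

Splitting by rank rather than by value is what keeps the cardinality of each band approximately equal.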

Let $P = \{p_{1}, \dots, p_{n}\}$ be a set of empirical random variables sampled using equation (65) with cumulative distribution function F (p) and density f (p). Denote by $O = \{p_{(1)}, \dots, p_{(n)}\}$ the ordered set of P such that $p_{(1)} < p_{(2)} < \dots < p_{(n)}$, and let $O_{h}^{i} = \{p_{(⌈ n ((i - 1) + 1) / h ⌉)}, \dots, p_{(⌊ n (i) / h ⌋)}\} .$ Then the band-wise Gaussian mixture model for the empirical distribution function of the data simulated using the SDE in equation (65) is given as follows: ${\hat{F}}_{n} (p_{i} | F_{t}) = \frac{1}{n} \sum_{j = 1}^{h} \sum_{i = η}^{ζ} 1_{p_{i} \in O_{h}^{j}}$ (66) with η = ⌈ n ((i - 1) + 1) / h ⌉ and ζ = ⌊ n (i) / h ⌋. For example, in the case of h = 3 bands, using a Gaussian mixture such that

$\begin{matrix} {\hat{F}}_{n} (p_{i} | F_{t}) = N (- 3, 1) 1_{p_{t} \in O_{3}^{1}} \\ + N (0, 1) 1_{p_{t} \in O_{3}^{2}} + N (3, 1) 1_{p_{t} \in O_{3}^{3}}, \end{matrix}$ (67) we obtain the approximate stratification in Figure 9. The stratification is made so that the cardinality in each $O_{h}^{j}$ region remains approximately the same, as opposed to being the result of a geometrical separation function of $p_{(1)}$ and $p_{(n)}$. Theorem 1 in Mahdavi-Damghani and Roberts (2017) ensures a good approximation of the generalised SDE (65) by the Gaussian mixture model (66). The calibration for the band-wise Gaussian mixture is given in Algorithm 2. For our optimization problem we take a similar approach: we divide the range of observations into bands via the clustering algorithm and then perform an optimization in each band via perturbation of the weights.

Fig. 9

Two examples of Gaussian Mixture Simulations with different number of bands. (a) Empirical distribution of random variable sampled from cointelation model (8) in three different zones described in Figure 2. (b) Empirical distribution of random variable sampled from cointelation model (8) in five different zones: two additional zones were added to the initial three zones in Figure 2.

4.3 Optimal machine learning strategy

Based on the idea of the band-wise Gaussian mixture model, we use clustering analysis to create bands, not for the observed asset price data but for the spread between the two asset prices in (8), i.e. X_t - Y_t. Inside each band, instead of specifying the distribution as in the band-wise Gaussian mixture, we test a set of strategies and keep the one that maximizes the corresponding P&L. We record the optimal strategy within each band and, in live trading, whenever the spread of asset prices falls in a certain band we employ the optimal strategy for that band. We now present the trading signal that translates into an investment strategy in the machine learning approach.

The Bayesian set-up: From equation (8) we set $B_{t} = X_{t} - Y_{t}$ and consider the band levels $\{B_{n, t}^{+}, B_{n - 1, t}^{+}, \dots, B_{1, t}^{+}, B_{1, t}^{-}, \dots, B_{n - 1, t}^{-}, B_{n, t}^{-}\}$ such that $B_{n, t}^{+} > B_{n - 1, t}^{+} > \dots > B_{1, t}^{+} > 0 > B_{1, t}^{-} > \dots > B_{n - 1, t}^{-} > B_{n, t}^{-}$ . We know that, depending on the spread, the resulting approximated distribution of the samples differs (see Mahdavi-Damghani and Roberts (2017)). The calibration algorithm then consists of creating different zones and testing the performance 4 of the possible strategies (“Long/Long”, “Long/Short”, “Short/Long”, “Short/Short”) within these bands. We take a direct approach (see Remark 3) consisting of 3 strategies and their cumulative P&Ls. Fixing the bands [a_i, b_i], with i = 1, 2, . . . , n, we consider the following strategies:

Strategy $S^{+ +}$ in which we are long both X and Y at time t in between bands [a_i, b_i] and with P&L $V_{[a_{i}, b_{i}], t}^{+ +}$ .

Strategy $S^{+ -}$ in which we are long X and short Y at time t in between bands [a_i, b_i] and with P&L $V_{[a_{i}, b_{i}], t}^{+ -}$ .

Strategy $S^{- +}$ in which we are short X and long Y at time t in between bands [a_i, b_i] and with P&L $V_{[a_{i}, b_{i}], t}^{- +}$ .

The P&Ls corresponding to these strategies are defined as follows:

$\begin{matrix} V_{[a_{i}, b_{i}], T}^{+ +} & = & \sum_{t = 0}^{T} [w_{[a_{i}, b_{i}], t}^{+ +} Δ X_{t} + (1 - w_{[a_{i}, b_{i}], t}^{+ +}) Δ Y_{t}] 1_{a_{i} < Δ_{t} \leq b_{i}}, \\ V_{[a_{i}, b_{i}], T}^{+ -} & = & \sum_{t = 0}^{T} [w_{[a_{i}, b_{i}], t}^{+ -} Δ X_{t} - (1 - w_{[a_{i}, b_{i}], t}^{+ -}) Δ Y_{t}] 1_{a_{i} < Δ_{t} \leq b_{i}}, \\ V_{[a_{i}, b_{i}], T}^{- +} & = & \sum_{t = 0}^{T} [- w_{[a_{i}, b_{i}], t}^{- +} Δ X_{t} + (1 - w_{[a_{i}, b_{i}], t}^{- +}) Δ Y_{t}] 1_{a_{i} < Δ_{t} \leq b_{i}} . \end{matrix}$
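The three band P&Ls differ only in the signs applied to the two legs, which the following sketch makes explicit. It assumes a constant weight w inside the band (the formulas above allow a time-dependent weight) and takes the band indicator from a separately supplied spread series; the function name and the sample numbers are hypothetical.

```python
# Signs applied to (w * dX, (1 - w) * dY) for each strategy.
SIGNS = {"++": (1, 1), "+-": (1, -1), "-+": (-1, 1)}

def band_pnl(dx, dy, spread, a, b, w, strategy):
    """Cumulative P&L of one strategy, counting only times with a < spread <= b.

    dx, dy: price increments of X and Y; w: constant weight in [0, 1] (assumption).
    """
    sx, sy = SIGNS[strategy]
    return sum(sx * w * x + sy * (1.0 - w) * y
               for x, y, s in zip(dx, dy, spread) if a < s <= b)

dx, dy = [1.0, 2.0, 3.0], [1.0, 1.0, 1.0]
spread = [0.5, 2.0, 0.7]            # only the first and third steps fall in (0, 1]
for name in SIGNS:
    print(name, band_pnl(dx, dy, spread, a=0.0, b=1.0, w=0.5, strategy=name))
```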

Remark 3. We call this approach direct, since ideally the set of strategies should consist of a more granular weight distribution. However, for the sake of comparison with the Financial Mathematics approach we consider the same set of strategies: long only and long/short.

We denote the maximum P&L achieved by each of these strategies by $V_{[a_{i}, b_{i}], T}^{\mp \mp, *}$ , as given by equation (68), and define $S_{[a_{i}, b_{i}], T}^{* *}$ , with P&L $V_{[a_{i}, b_{i}], T}^{* *}$ (equation (69)), as the optimal strategy using Gaussian Learning in band [a_i, b_i]: $V_{[a_{i}, b_{i}], T}^{\mp \mp, *} = \underset{w_{[a_{i}, b_{i}], t \in [0, T]}^{\mp \mp}}{max} V_{[a_{i}, b_{i}], T}^{\mp \mp}, \quad w_{[a_{i}, b_{i}], t}^{\mp \mp} \in [0, 1],$ (68) $V_{[a_{i}, b_{i}], T}^{* *} = max (V_{[a_{i}, b_{i}], T}^{+ +, *}, V_{[a_{i}, b_{i}], T}^{+ -, *}, V_{[a_{i}, b_{i}], T}^{- +, *}) .$ (69) In live trading we recombine the optimal weights per band into an overall optimal solution via a linear interpolation: $w^{*} (t) = \sum_{i = 1}^{n} w_{i}^{*} 1_{\{(X_{t} - Y_{t}) \in [a_{i}, b_{i}]\}} .$ (70) Although we do not have a proof that the resulting interpolated strategy in (70) is optimal, we use it as a benchmark that still improves over the results obtained with the Financial Mathematics approach. Our goal is to apply the Machine Learning approach to a pair of assets that exhibit some dependence, but this approach can be used for any model, i.e. it is model agnostic.

Algorithm 3 Band-Wise ML for Cointelation (P, h)

Require: array P_1:n and number of bands h

Ensure: Ω^(1:h), $[B_{(1 : h)}^{+}, B_{(1 : h)}^{-}]$ are returned

Sorting state:

1: P_(1:n) ← QuickSort(P_1:n)

2: $[B_{(1 : \frac{h}{2})}^{+}, B_{(1 : \frac{h}{2})}^{-}]$ ← FindPercentileBands(P_(1:n), h)

3: B_(1:h) ← $[B_{(1 : \frac{h}{2})}^{+}, B_{(1 : \frac{h}{2})}^{-}]$

4: Ω^{(1:⌈n/h⌉)} ← []

Allocation state:

5: for j = 1 to h do

6: for i = 1 to n do

7: if P_(i) ∈ B_(j) then

8: Amend(Ω^(j), P_(i))

9: end if

10: end for

11: end for

Optimize the 3 types of P&L for each band:

12: for i = 1 to h do

13: $V_{B_{i}, T}^{+ +, *} \leftarrow \underset{w_{B_{i}, t \in [0, T]}^{+ +}}{argmax} V_{B_{i}, T}^{+ +}$

14: $V_{B_{i}, T}^{+ -, *} \leftarrow \underset{w_{B_{i}, t \in [0, T]}^{+ -}}{argmax} V_{B_{i}, T}^{+ -}$

15: $V_{B_{i}, T}^{- +, *} \leftarrow \underset{w_{B_{i}, t \in [0, T]}^{- +}}{argmax} V_{B_{i}, T}^{- +}$

16: end for

Rank and return best strategy for each band:

17: for i = 1 to h do

18: $V_{B_{i}, T}^{* *} \leftarrow max (V_{B_{i}, T}^{+ +, *}, V_{B_{i}, T}^{+ -, *}, V_{B_{i}, T}^{- +, *})$

19: $S_{T}^{*} \leftarrow$ ( $S_{B_{i}, T}^{+ +, *}, S_{B_{i}, T}^{+ -, *}, S_{B_{i}, T}^{- +, *}$ )

20: $S_{B_{i}, T}^{* *} \leftarrow$ returnCorrespondingStrat( $V_{B_{i}, T}^{* *}, S_{T}^{*}$ )

21: end for

Forecasting:

22: $signal^{S}, signal^{S_{l}} \leftarrow$ forecast( $S_{B_{i}, T}^{* *}, S_{t}, S_{l, t}$ )

Return buy/sell signals:

23: $signal^{S}, signal^{S_{l}}$

We further provide Algorithm 3 as the pseudo-code for the calibration process. Note that in both Algorithms 2 and 3 we have used QuickSort, which can be substituted by other sorting algorithms. Note also the use of self-explanatory functions such as returnCorrespondingStrat(x,y) in line 20 of Algorithm 3, which, given the set of strategies and the P&L returns, outputs the strategy that maximizes the P&L. The function forecast(x,y,z) in line 22 of Algorithm 3 takes as input the set of trained strategies and the current levels of X_t and Y_t and returns a prediction of where the signals for the latter two should be. Finally, the argmax function in lines 13-16 can be replaced by a simple for loop, but in the interest of not making the pseudocode too crowded we have kept it this way.
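The calibration and the live look-up can be put together in a short sketch, with the argmax written as an explicit loop over a weight grid. Several simplifying assumptions are made and none of them is claimed to match the authors' implementation: the weight is constant inside each band, bands are delimited by interior percentile cut points of the training spread, and the names `band_edges`, `calibrate`, `live_strategy` and the grid size are hypothetical.

```python
from bisect import bisect_right

STRATEGIES = {"++": (1, 1), "+-": (1, -1), "-+": (-1, 1)}

def band_edges(spread, h):
    """Interior cut points so that each of the h bands holds about n / h observations."""
    ordered = sorted(spread)
    n = len(ordered)
    return [ordered[j * n // h] for j in range(1, h)]

def calibrate(dx, dy, spread, h, grid=11):
    """For every band, grid-search the weight of each strategy and keep the best P&L."""
    edges = band_edges(spread, h)
    weights = [k / (grid - 1) for k in range(grid)]
    best = []
    for band in range(h):
        idx = [t for t, s in enumerate(spread) if bisect_right(edges, s) == band]
        candidates = []
        for name, (sx, sy) in STRATEGIES.items():
            for w in weights:
                pnl = sum(sx * w * dx[t] + sy * (1.0 - w) * dy[t] for t in idx)
                candidates.append((pnl, name, w))
        best.append(max(candidates, key=lambda c: c[0]))  # highest P&L in this band
    return edges, best

def live_strategy(edges, best, spread_now):
    """Return (strategy, weight) calibrated for the band containing the current spread."""
    _, name, w = best[bisect_right(edges, spread_now)]
    return name, w

# Tiny illustrative training set with two bands.
spread = [-1.0, -1.0, 1.0, 1.0]
dx = [-1.0, -1.0, 0.5, 0.5]
dy = [0.5, 0.5, -2.0, -2.0]
edges, best = calibrate(dx, dy, spread, h=2)
print(best)
print(live_strategy(edges, best, -0.3))
```

With a constant weight each band P&L is linear in w, so the grid search always selects an endpoint of [0, 1]; a time-varying weight, as in the formulas above, would require a richer search.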

Remark 4. Mahdavi-Damghani and Roberts (2017) show that a reasonable risk manager or trader can assume the generalized SDE (65) with β = 0 and α = 1 in order to enforce positivity for the simulated scenarios of the risk factor. This very reasonable assumption would crash the whole risk engine if it were no longer satisfied in the real markets. The approach we advocate would, however, be able to continue its dynamic learning without any problem, since it is model agnostic.

5 Numerical results

5.1 Simulation

Figure 10 illustrates the ML and the DS approaches on one single simulated path. Note that when implementing the ML approach with a horizon of 1000 days, we double this data for training, i.e. we use 2000 historical daily prices. We have performed two sets of 500 simulations and we have gathered their results in the following two examples.

Fig. 10

(a) one simulated scenario based on cointelation model (8) with parameters: μ = 0.05, σ = 0.17, η = 0.16, κ = 0.1, ρ = -0.6 and scaled spread: κ (X_t - Y_t); (b) portfolio return and optimal weight of asset X with Dynamic Switching approach; (c) portfolio return and optimal weight of asset X and Y with Machine Learning approach.

Example 1. We have simulated 500 paths of X and Y based on the cointelation model (8) with parameters μ = 0.05, σ = 0.17, η = 0.16, κ = 0.1, ρ = -0.6. Figure 11 illustrates that the Machine Learning approach with long/short strategies (ML_LS) on average performs slightly better in terms of P&L than the Stochastic Control approach (SC). However, based on the histogram, neither approach performs significantly better or significantly worse than the other at any time.

Fig. 11

Histogram of excess (P & L) for ML_LS vs SC at terminal time T.

Example 2. We have simulated 500 paths of X and Y based on the cointelation model (8) with parameters μ = 0.05, σ = 0.17, η = 0.16, κ = 0.1, ρ = -0.6. Figure 12 illustrates that the ML approach performs slightly better in terms of P&L than the FM approach about 55% of the time, while being outperformed the other 45% of the time. However, based on the histogram, we noticed that the ML approach is sometimes outperformed by a significantly larger margin than that by which it outperforms the FM approach.

Fig. 12

Histogram of excess (P & L) for ML vs FM at terminal time T.

From our performance histogram in Figures 11 and 12 we have concluded that for parameters μ = 0.05, σ = 0.17, η = 0.16, κ = 0.1, ρ = -0.6 of cointelation model (8) we have the following rankings for the approaches: $SC < {ML}_{LS} < FM < ML .$ The reason for ML with full set of strategies (long only and long/short) outperforming the DS most of the time might be the fact that in long only optimal strategies of ML approach we have more variety in weights, whereas the closed form formula (35) in FM gives us almost constant weights (with small fluctuations).

5.2 Market Data: Bitcoin vs NVDA

5.2.1 Rationale

During the peer-review process we were asked to apply the model to real market data, in addition to the simulations presented previously. The recommendation was to use a two-asset model with uncorrelated memoryless jumps in both the leading asset and the spread in order to produce a more realistic model. A pure cryptocurrency model was not recommended, so we thought a hybrid model would be best. We propose NVDA and Bitcoin as a cointelated pair for the following reason. When cryptocurrency prices go up, the incentive to mine cryptocurrencies increases, which in turn increases the value of companies involved in mining. The most well-known cryptocurrency is Bitcoin, and the most well-known mining equity remains NVDA. We therefore study this pair, assuming that the two assets are cointelated. The data set was downloaded from a free data provider (Yahoo Finance) for the last 5 years. Figure 15 displays the prices of Bitcoin and NVDA as well as their normalised spread over this period. Since the volatility of Bitcoin is significantly higher than that of NVDA, we proxy their current volatility by their rolling 6-month standard deviation. We then construct their spread using an inverse weighting function of their current volatility. Please refer to our GitHub code for full transparency on the normalisation process. We split the data roughly in half into “in sample” and “out of sample” sets (2 years and 3 months each, the first 6 months being “burnt” in order to calculate the current volatility). As per the described methodology, the bands were chosen to have the same size (0 to 33 percentile, 33 to 67 percentile and finally 67 to 100 percentile). We also incorporated a simple cost function (conservatively adding a couple of bps per position change: 4 in total assuming a long position becomes short and vice versa). We record our in-sample and out-of-sample Sharpe ratios in Table 2. We can observe a few points.
First, over-fitting is noticeable 5 since the Sharpe ratio (SR) of our overall strategy drops by 1.2. However, this is mitigated by the fact that the out-of-sample SR remains at 1.7, a very reasonable figure at low frequencies. In addition, all bands do reasonably well both in and out of sample. We take this as evidence that the over-fitting is reasonable. Figure 16 plots the spread of the cointelated pair (top) together with its 67 and 33 percentile bands.
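The cost adjustment and the Sharpe ratio can be illustrated as follows. This is a sketch under assumptions that are not taken from the authors' code: the signal takes values in {-1, 0, +1} and is applied to the return of the same period, the cost is 2 bps per unit of position change (so that a flip from long to short costs 4 bps), the strategy starts flat, and the Sharpe ratio is annualised with 252 periods per year and a zero risk-free rate.

```python
import math
import statistics

def net_returns(signal, spread_returns, cost_bps=2.0):
    """Per-period strategy returns after a proportional cost on every position change."""
    net, prev = [], 0                  # assume the strategy starts with no position
    for pos, r in zip(signal, spread_returns):
        cost = cost_bps * 1e-4 * abs(pos - prev)   # a flip +1 -> -1 costs 2 * cost_bps
        net.append(pos * r - cost)
        prev = pos
    return net

def sharpe_ratio(returns, periods_per_year=252):
    """Annualised Sharpe ratio with zero risk-free rate (assumed convention)."""
    return statistics.mean(returns) / statistics.stdev(returns) * math.sqrt(periods_per_year)

# Illustrative numbers only.
signal = [1, 1, -1, -1, 1]
spread_returns = [0.010, -0.004, -0.006, 0.002, 0.008]
net = net_returns(signal, spread_returns)
print(net)
print(sharpe_ratio(net))
```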

Fig. 13

In FM approach optimal long/short strategies are more volatile than optimal long only strategies.

Fig. 14

In ML approach optimal long/short strategies are slightly more volatile than optimal long only strategies.

Fig. 15

Cumulative returns in the last 5 years for: (top) Bitcoin (BTC), (middle) NVDA, (bottom) their volatility normalised log spread.

Table 2

In and out of sample sharpe ratios by band

Band	In Sample SR	Out of Sample SR
Overall	2.9	1.7
Band Low	1.3	0.5
Band Middle	1.4	1.4
Band High	2.2	1.0

Fig. 16

Strategy decomposition details: (top) volatility normalised log spread (in blue) with 50 percentile bands (in red), (middle) trading signal (+1 means long BTC and short NVDA and -1 the contrary), (bottom) cumulative return of the strategy.

6 Conclusion

6.1 Possible directions for future work

One direction for future work is to consider the portfolio optimization problem for an n-dimensional cointelation model. For instance, when n = 3 we can have something of the following form: $\begin{matrix} {dS}_{t}^{a} & = & σ S_{t}^{a} {dW}_{t}^{a} \\ {dS}_{t}^{b} & = & θ (S_{t}^{a} - S_{t}^{b}) dt + σ S_{t}^{b} {dW}_{t}^{b} \end{matrix}$ (71) ${dS}_{t}^{c} = θ (S_{t}^{a} - S_{t}^{c}) dt + σ S_{t}^{c} {dW}_{t}^{c}$ (72) A natural first question is how to model this triplet. For instance, would equation (72), with S^b and S^c reverting around S^a, be more in line with the pair from equation (8), or would it be better to have S^b reverting around S^a and S^c reverting around S^b? Are the two equivalent, or is one more useful? What happens as n increases? We plan to examine these questions in the future.
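The two candidate structures for the triplet can be compared by simulation. The sketch below uses an Euler discretization and rests on assumptions that the text leaves open: the three Brownian motions are taken to be independent, each diffusion term is scaled by the asset's own level, and the `chain` flag, function name and parameter values are hypothetical.

```python
import math
import random

def simulate_triplet(theta, sigma, n_steps, dt, chain=False,
                     s0=(1.0, 1.0, 1.0), seed=0):
    """Euler paths of a 3-asset cointelation-type system.

    chain=False: S^b and S^c both revert around S^a (star structure).
    chain=True:  S^b reverts around S^a and S^c reverts around S^b (chain structure).
    Independent Brownian motions are assumed for illustration.
    """
    rng = random.Random(seed)
    a, b, c = s0
    path = [(a, b, c)]
    sq = math.sqrt(dt)
    for _ in range(n_steps):
        dwa, dwb, dwc = (rng.gauss(0.0, sq) for _ in range(3))
        anchor_c = b if chain else a       # level around which S^c mean reverts
        a_new = a + sigma * a * dwa
        b_new = b + theta * (a - b) * dt + sigma * b * dwb
        c_new = c + theta * (anchor_c - c) * dt + sigma * c * dwc
        a, b, c = a_new, b_new, c_new
        path.append((a, b, c))
    return path

star = simulate_triplet(theta=0.1, sigma=0.17, n_steps=1000, dt=1 / 252)
chained = simulate_triplet(theta=0.1, sigma=0.17, n_steps=1000, dt=1 / 252, chain=True)
print(star[-1])
print(chained[-1])
```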

6.2 Summary

We have studied the portfolio optimization problem of two assets that follow the cointelation model using two approaches: Financial Mathematics and Machine Learning. We first implemented the FM approach, where we use classic financial mathematics criteria: mean-variance and power utility maximization. Without an analytical solution to the PDE (52), we resort to the DGM method, a deep learning algorithm, to solve it numerically. The second approach we implemented is ML using clustering. The latter approach is easier to implement and is model agnostic, and therefore avoids the complex SDE calibration. In our case the Machine Learning approach slightly outperforms the Financial Mathematics approach.

Footnotes

Appendices

2 A time series X_t is integrated of order d if (1 - L)^d X_t is a stationary process. Here L is the lag operator.

3 In machine learning, a hyperparameter is a parameter whose value is set before the learning process begins, whereas the values of other parameters are derived via training.

4 E.g. Sharpe ratio maximization.

5 Note that this is almost always the case in a low signal-to-noise-ratio environment at the lower frequencies.

References

Al-Aradi, Correia, Naiff, Jardim, Saporito, 2018. Solving nonlinear and high-dimensional partial differential equations via deep learning, arXiv:1811.08782.

Alexander, 1999. Optimal hedging using cointegration, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 357, 2039–2058.

Alexander, 2001. Market models: a guide to financial data analysis, John Wiley & Sons, Inc., Chichester, West Sussex.

Bailey Jr, C., 2018. Research Data Curation Bibliography, Version 9.

Benaroya, Han, Nagurka, 2005. Probability Models in Engineering and Science, Vol. 40, CRC Press.

Bodie, Kane, Marcus, 1999. Investments. 4th edition, Irwin/McGraw-Hill, Chicago.

Cox, Ingersoll, Ross, 1985. A theory of the term structure of interest rates, Econometrica 53, 385–407.

Engle, Granger, 1991. Long-run Economic Relationships: Readings in Cointegration, Oxford University Press, Oxford, New York.

Garcia, González, Contreras, Custodio, 2017. Applying modern portfolio theory for a dynamic energy portfolio allocation in electricity markets, Electric Power Systems Research 150, 11–22.

Harrison, Kreps, 1979. Martingales and arbitrage in multiperiod securities markets, Journal of Economic Theory 20, 381–408.

Korn, Kraft, 2002. A stochastic control approach to portfolio problems with stochastic interest rates, SIAM Journal on Control and Optimization 4, 1250–1269.

Li, Ng, 2000. Optimal dynamic portfolio selection: Multiperiod mean-variance formulation, Mathematical Finance 10, 387–406.

Mahdavi-Damghani, 2013. The non-misleading value of inferred correlation: An introduction to the cointelation model, Wilmott Magazine 67, 0–61.

Mahdavi-Damghani, Roberts, 2017. A proposed risk modeling shift from the approach of stochastic differential equation towards machine learning clustering: Illustration with the concepts of anticipative and responsible var, SSRN:3039179.

Mahdavi-Damghani, Welch, O’Malley, Knights, 2012. The misleading value of measured correlation, Wilmott Magazine 61, 64–73.

Mudchanatongsuk, Primbs, Wong, 2008. Optimal pairs trading: A stochastic control approach, in American Control Conference, IEEE, Seattle, Washington, USA.

Sirignano, Spiliopoulos, 2018. DGM: A deep learning algorithm for solving partial differential equations, Journal of Computational Physics 375, 1339–1364.

Soeryana, Fadhlina, Sukono, Rusyaman, Supian, 2017. Mean-variance portfolio optimization by using time series approaches based on logarithmic utility function, in Materials Science and Engineering, IOP.

Uhlenbeck, Ornstein, 1930. On the theory of Brownian motion, Physical Review 36, 823–41.

Vidyamurthy, 2004. Pairs Trading: Quantitative Methods and Analysis, John Wiley & Sons, Inc., Hoboken, New Jersey.

Wilmott, 2007. Paul Wilmott Introduces Quantitative Finance, John Wiley & Sons Ltd., Chichester, West Sussex.