Abstract
With the recent rise of Machine Learning (ML) as a candidate to partially replace classic Financial Mathematics (FM) methodologies, we investigate the performances of both in solving the problem of dynamic portfolio optimization in continuous-time, finite-horizon setting for a portfolio of two assets that are intertwined.
In the Financial Mathematics approach we model the asset prices not via the common approaches used in pairs trading such as a high correlation or cointegration, but with the cointelation model in Mahdavi-Damghani (2013) that aims to reconcile both short-term risk and long-term equilibrium. We maximize the overall P&L with Financial Mathematics approach that dynamically switches between a mean-variance optimal strategy and a power utility maximizing strategy. We use a stochastic control formulation of the problem of power utility maximization and solve numerically the resulting HJB equation with the Deep Galerkin method introduced in Sirignano and Spiliopoulos (2018).
We turn to Machine Learning for the same P&L maximization problem and use clustering analysis to devise bands, combined with in-band optimization. Although this approach is model agnostic, results obtained with data simulated from the same cointelation model gives a slight competitive advantage to the ML over the FM methodology 1 .
Keywords
Introduction
The concept of co-movement and correlation have been shown to be usually misunderstood Mahdavi-Damghani et al. (2012) by practitioners yielding the proposed Cointelation model Mahdavi-Damghani (2013). We solve the portfolio optimization problem employing a more general set of admissible strategies than long/short strategies used in pairs trading.
A pairs trading strategy involves matching a long position with a short position in two assets with a high correlation. Pairs trading was pioneered in the mid 1980s by a group of quantitative researchers from Morgan Stanley. For an introduction to pairs trading see Vidyamurthy (2004). The securities in a pairs trade must have a high positive correlation, which is the primary driver behind the strategy’s profits.
Pairs trading is based on the high historical correlation of two assets and a trader’s view that the two securities will maintain a specified correlation. A pairs trading strategy is applied when a trader identifies a correlation discrepancy. More specifically, the trader monitors performance of two historically correlated securities. When the correlation between the two securities temporarily weakens, i.e. the spread widens, the trader applies a trading strategy which shorts the high asset and buys the low asset. As the spread narrows again to some equilibrium value, a profit results.
However, many authors argue that correlation is an inappropriate measure of dependency in financial markets, since returns often exhibit a nonlinear co-dependence (e.g. Alexander (2001), Wilmott (2007)). Mahdavi-Damghani et al. (2012) showed that the measured correlation taken from the returns of a mean-reverting processes is misleading: indeed a strong positive correlation does not necessarily imply that two stochastic processes move in the same direction and vice versa. Also Correlation measures the short term risk but Cointegration, on the other hand, tests the long-term equilibrium relationships between assets and has been extensively used in low frequency pairs trading (see Vidyamurthy (2004)). Cointegration tests do not measure how well two variables move together, but rather whether the difference between their means remains constant. Sometimes series with high correlation will also be cointegrated, and vice versa, but this is not always the case.
The cointelation model was introduced in Mahdavi-Damghani (2013) as a hybrid model which reconciles correlation and cointegration by capturing both short-term risk and long-term equilibrium. The rationale for the long term risk is that during the time of rare market crashes all assets prices fall. However, in the more bullish periods, the short term risk increases, the long term risk becomes less pronounced and the “macro” driver less visible. These influences are accompanied with mean reversion forces from one asset to the other.
In this setting we consider a continuous-time, finite horizon portfolio optimization problems for pairs of assets whose prices follow the cointelation model of Mahdavi-Damghani (2013). Generally, the optimization problem is to find the optimal control
We solve the portfolio optimization problem in (1) with Financial Mathematics and Machine Learning methodologies and compare their performance. In the Financial Mathematics approach we use SDE evolution of asset prices, whereas the Machine Learning approach does not assume an underlying model and applies generally to any pair of assets.
In Section 2 we review the cointelation model. In Section 3 we use the classical Financial Mathematics criteria: mean-variance optimization and power utility maximization. In Section 4 we use clustering analysis from Machine Learning to solve the P&L maximization problem. We present the results of each approach in Section 5 and discuss them comparatively.
We first present the usual way correlation is calculated in the financial industry (see e.g. p.274 Wilmott (2007), Alexander (2001)). Assume we have two assets with prices modeled by stochastic processes (X
t
) t≥0 and (Y
t
) t≥0 on a probability space
For correlation to be an appropriate choice of measure of co-dependence the assumption of linear dependency between series needs to be satisfied (see Chapter 1.4 Alexander (2001)). Often in financial markets with a non-linear dependence between returns, the correlation is an misleading measure of co-dependency, especially when used to capture long-term relationship between assets (see Mahdavi-Damghani et al. (2012) and Mahdavi-Damghani (2013)).
An alternative statistical measure to correlation is cointegration. If two time series X t and Y t are integrated 2 of order d and there exists β such that a linear combination X t + βY t is integrated of order less that d, then X t and Y t are cointegrated (see Engle and Granger (1991)). Since the spread of cointegrated asset prices is mean reverting, they have a common stochastic trend, i.e. the asset prices are ‘tied together’ in the long term, although they might drift apart in the short-term (see Alexander (1999)). Because the cointegration requires sophisticated statistical analysis, it has not been used as widely as correlation in the financial industry.
Although correlation and cointegration are related, they are different concepts. High correlation does not necessarily imply high cointegration, and neither does high cointegration imply high correlation (e.g. see Figure 4 in Mahdavi-Damghani et al. (2012)). Two assets may be perfectly correlated over short timescales yet diverge in the long run, with one growing and the other decaying. Conversely, two assets may follow each other, with a certain finite spread, but with any correlation, positive, negative or varying.
Mahdavi-Damghani (2013) proposed cointelation as a hybrid model that aims to mediate between correlation and cointegration. It captures both short-term and long-terms relationships between the assets.
The processes (X) t≥0 and (Y) t≥0 are called the leading process and the lagging process, respectively. This is due to the fact that the lagging process reverts around the leading process.
We present here the concepts of inferred correlation function and number of crosses formula introduced in Mahdavi-Damghani (2013) in order to device a test whether two pairs are cointelated.
Let
The motivation for inferred correlation approximation formula (10), is that in the discrete version of the processes in equation (8) the measured correlation increases as the time increment, Δt, increases (e.g. correlations calculated using daily, weekly, monthly returns). Moreover, the measured correlation of cointelated pairs will converge to 1 faster as the speed of mean reversion parameter κ increases. If we set ρ = -1 in (8), the inferred correlation of cointelated asset prices may cover the whole correlation spectrum [-1, 1] (see Figure 1).

(Up) Simulated path of cointelation model (8) with ρ = -1, θ = 0.1, σ = 0.01; (Down) Corresponding measured correlation (7) as a function of the time increment increases from -1 to 1.
Another way for testing if two times series are cointelated is to study how many times the normalized series cross paths. If one discretizes equation (8), then one can approximate the expectation of the number of times, Γx,y, the two stochastic process, x = Xi∈[1,2,…N] and y = Yi∈[1,2,…N], cross paths as follows
Compared to the number of times purely correlated SDEs (eg: without the mean reversion component, i.e. when κ = 0) the number of times the discrete version of the cointelated SDEs cross paths is larger than if they were random, and the bigger the κ the more often the paths of discretized SDEs cross each other per unit of time.
Then two stochastic processes are cointelated (see Mahdavi-Damghani (2013)) if Inferred correlation formula in equation (10) is verified; The number of crosses formula in equation (11) is verified; the underlying assets have a reasonable physical connection that would suggest their spread should mean revert (e.g. oil and BP share prices).
The parameters in cointelation model (8) can be estimated using the inferred correlation formula (10) and the number of crosses formula (11) (see Mahdavi-Damghani (2013)). Similarly to the variance reduction methodology described in Mahdavi-Damghani et al. (2012), Mahdavi-Damghani (2013), we can define

Hypothetical spread split in three different zones for risk management or/and trading purposes.
We consider the portfolio of two assets and model the their prices with the cointelation in (8). We approach the optimization problem of this portfolio with classic Financial Mathematics criteria: mean-variance and power utility maximization. Since the cointelated assets are characterized by both correlation and mean-reversion components, we formulate the mean-variance optimization problem for long only strategies and we calculate the optimal strategies to make profit on correlation. To make profit on mean-reversion property of the cointelated assets we use stochastic control formulation of the power utility maximization problem for long/short strategies and calculate the optimal weights. We then maximize portfolio P&L by dynamically switching between these two optimal strategies.
Mean-variance optimization
We first review fundamental notions and concepts for mean-variance optimization.
Returns: A portfolio considers a combination of n potential assets, with an initial capital V (0) and weights w1, w2, . . . , w
n
, such that
Expectation and variance of returns: By the linearity property of expected value operator, the expected return of portfolio, E (r p ), is
The variance of the return of portfolio, Var (r p ), is given by
where Σ denotes the covariance matrix of the asset returns, composed of all covariances between the returns of assets i and j defined as σ (r i , r j ). The variance of asset i’s return, which constitute the diagonal of the covariance matrix, is σ (r i , r i ).
Optimal investment strategy using mean-variance criterion
We consider a portfolio consisting of two assets. The uncertainty is modelled by a probability space
Denoting by
Let Given v0 > 0 the wealth process Vv0,h (·) corresponding to w0, h satisfies
h
i
(t) ≥0 for all i = 1, 2,
An investment strategy,
We define a utility function, U (t, h), as in Bodie et al. (1999):
h
i
≥ 0 ∀ i.
and U (t, h) given in (29).
Thus we have optimization problem in equation (30). From equation (20) we have that the rate of return of our portfolio, R
p
, over [t - Δt, t] is
The expectation of portfolio return over [t - Δt, t] is
The variance of portfolio return over [t - Δ, t] is
Var (r
X
(t)) = σ2Δt is the variance of return of asset price X over the horizon [t - Δt, t];
where
The optimal weights for mean-variance criterion were derived in Soeryana et al. (2017). We state the following proposition from Soeryana et al. (2017) applied to the cointelation model (8).
Replacing these formulas for expectation, variance and covariance of the returns of asset prices in equation (35), we get optimal strategies for mean-variance optimization problem. We will present numerical examples in Section 5.
Power utility maximization problem
We now use a stochastic control approach to the power utility maximization problem. Here we mainly follow Mudchanatongsuk et al. (2008), but with modified dynamics for asset prices. More specifically, they assume the price dynamics of one of the assets is a geometric Brownian motion and model the log-spread as an Ornstein-Uhlenbeck process. We, however, assume the dynamics of asset prices are governed by the cointelation model in equation (8), where one of the assets follow the geometric Brownian motion and the second asset mean reverts around the first one.
Let
We assume an initial wealth v0 > 0 at time t = 0. Initial wealth is held in a margin account. For simplicity we assume that the interest rate for margin account is 0, r = 0. Margining constraints can be quite punitive financial. The holdings are allowed to be adjusted continuously up to a fixed horizon T. The investment behavior is modelled by an investment strategy π = (π1, π2). Here, π i (t), i = 1, 2, denotes the percentage of total wealth invested in i-th asset at time t (see equation (17)). Let π1 (t), π2 (t) be respectively the portfolio weights for assets X and Y at time t. We only allow pairs trading: short one of the asset and long the other in equal dollar amount, i.e. π1 (t) = - π2 (t). In addition, we restrict our considerations to self-financing strategies.
We define admissible control and controlled process as in Korn and Kraft (2002).
Denote by V
π
(t) the value of portfolio corresponding to strategy π at time t, which is given by
For each control process
the corresponding state process P
π
satisfies
only pairs trading is allowed: short one of the asset and long the other
Since we consider a self-financing portfolio, then by equation (45) the dynamics of the state process, P = (V
π
, Z), becomes
Optimal investment strategy
We assume that an investor’s preference is represented by the power utility function
Using separation ansatz we reduce a 3-dimensional HJB equation in (49) to the following 2-dimensional PDE:
The issue at this stage is that this PDE does not have a closed for solution. This is a non standard PDE, which is not high dimensional but is nonlinear which makes using finite difference methods or any standard numerical methods inadequate. For this reason we propose to use the “Deep Galerkin Method” to solve the PDE in (52). Once the solution is found, we can write the optimal strategy as
Without an analytical solution to the non-standard 2-dimensional PDE in (52), we approximate the solution with the algorithm “Deep Galerkin Method” (DGM) proposed in Sirignano and Spiliopoulos (2018). DGM is a merger of the Galerkin method and deep neural network machine learning algorithm. The Galerkin method is a popular numerical method which seeks a reduced-form solution to a PDE as a linear combination of basis functions. The deep learning algorithm, or DGM, uses a deep neural network instead of a linear combination of basis functions. The algorithm is trained on batches of randomly sampled time and space points, therefore it is mesh free.
Brief review of DGM
In general case, consider a PDE with d spatial dimensions:
A measure of how well the approximation satisfies the PDE:
A measure of how well the approximation satisfies the boundary condition:
A measure of how well the approximation satisfies the initial condition:
Here all three errors are measured in terms of L2-norm, i.e.
The sum of all three terms above gives us the objective function associated with the training of the neural network:
1:
2:
3:
4: s n ← ((t n , x n ) , (τ n , z n ) , w n )
5:
6:
7:
8:
9:
10: α n ← αn-1 * λ
11: θn+1 ← θ n - α n ∇ θ G (θ n , s n )
The neural network (NN) architecture used in DGM is like a long short-term memory (LSTM) network though with small differences (see Sirignano and Spiliopoulos (2018)). We describe below the architecture of this NN:
with ⊙ denoting Hadamard multiplication, L number of layers and σ the activation function. The rest of the superscripts refer to the neurones for our NN architecture of Figures 3 and 4.

Bird’s-eye perspective of overall DGM architecture Al-Aradi et al. (2018).

Operations within a single DGM layer Al-Aradi et al. (2018)
Testing DGM on Merton problem
Because the DGM method was relatively new, we wanted to test the algorithm ourselves. More specifically the method was tested with several nonlinear, high-dimensional PDEs independently in Al-Aradi et al. (2018) and Sirignano and Spiliopoulos (2018), including the nonlinear HJB equations. We have tested the DGM algorithm on the HJB equation for the Merton problem ourselves. More specifically, Figures 5 and 6 show the plots of the analytical and approximated surface with DGM solution. We found the performance of the algorithm quite impressive. Indeed, Figure 7 shows the difference between the analytical and the approximate solution. Most of the time, the error is between 0% and 1%. The only criticism, though negligible that we can make is that the approximate solution does not do as well around t = 0 (the maximum error of 4% is around t = 0). This corroborates with the findings in Al-Aradi et al. (2018). The authors do not give an explanation of why this is the case but we did not think that this small imperfection was big enough for us to abandon the methodology.

Analytical solution of the Merton Problem.

Approximate solution of Merton problem using DGM.

Error between analytical and approximate solution of Merton problem.
Solution to our PDE problem using DGM
Recall the PDE we want to solve is given in equation (52). In the absence of a closed form solution to this PDE we approximate the solution with the DGM algorithm described above. Figure 8 shows the approximate solution to the PDE in (52) for different parameter values. Recall, once we have the numerical solution f for the PDE above, we obtain the optimal weights as following:

Approximate solutions to PDE (104) with DGM for four different scenarios of ρ and μ and fixed σ = 0.2, η = 0.19, γ = 0.5. (a) Approximate solution with low μ = 0.01 and low ρ = -0.5. (b) Approximate solution with low μ = 0.01 and high ρ = 0.5. (c) Approximate solution with high μ = 0.4 and low ρ = -0.5. (d) Approximate solution with high μ = 0.4 and high ρ = 0.5.
Although in the previous two cases we assume that an investor has a certain risk preferences as modelled by a utility function (MVC and power utility), it is interesting to consider a limiting case where the investor can be always persuaded to go for more money (identical utility function U (x) = x, which is essentially the power utility function with risk aversion parameter γ = 1) when deciding between MVC or power utility. Assuming that an investors’ preference is modelled either as in equation (29) or as in equation (46), in order to improve further the portfolio returns we employ dynamic switching between the two optimal strategies
Average over 500 simulations of portfolio returns at terminal time T (day 1000) with dynamic switching (DS) is higher than average portfolio return with only stochastic control (SC) or only mean-variance-criterion (MVC)
The portfolio optimization problem
We assume an initial wealth
The general optimization problem is to find an optimal strategy, w (t), such that the terminal P&L is maximized:
We review band-wise Gaussian mixture model because it inspires our method of selecting the bands. Consider a probability space
This generalised SDE gives as a special case the cointelation model: take θ = - μ, μ = 0, α = 1 and β = 0 for the dynamic of X in (8); take θt,τ = κ, μt,τ = X
t
, α = 1, β = 0 for the dynamics of Y in (8). The SDE in (65) can also model: Proportional returns (log-normal diffusion) when θ = 0, α = 1, β = 0. Absolute returns (normal diffusion) when θ = 0, α = 0, β = 0, Mean reverting returns where we enforce positivity of returns (e.g. CIR Cox et al. (1985) diffusion when α = 1/2 and β = 0), Mean reverting returns where we do not enforce positivity of the returns (e.g OU Uhlenbeck and Ornstein (1930) diffusion when α = 0 and β = 0).
In general calibrating parameters of the SDE in (65) to a real data is complex. Using data simulated with (65) their empirical distribution is approximated for the purpose of prediction by a band-wise Gaussian mixture model. This is done for a sequence of bands which are created using Machine Learning clustering method (see Mahdavi-Damghani and Roberts (2017)).
1: X(1:h) ← QuickSort(X1:n)
2:
3: Ω(1:⌈n/h⌉) ← []
4:
5:
6:
7: Amend(Ω(j), P(i))
8:
9:
10:
11:
12:
13: Print(
14: Ω(1:h),
Let

Two examples of Gaussian Mixture Simulations with different number of bands. (a) Empirical distribution of random variable sampled from cointelation model (8) in three different zones described in Figure 2. (b) Empirical distribution of random variable sampled from cointelation model (8) in five different zones: two additional zones were added to the initial three zones in Figure 2.
Based on the idea of band-wise Gaussian mixture model, we use clustering analysis to create bands, however not for the observed asset price data, but for the spread between two asset prices in (8), i.e. X t - Y t . Inside of each band instead of specifying the distribution as in band-wise Gaussian mixture, we test a set of strategies that maximizes the corresponding P&L. We record the optimal strategies within each band, and in live trading, whenever the spread of asset prices falls in a certain band we employ the optimal strategy for this specific band. We now present the trading signal that translates to investment strategy in machine learning approach.
The Bayesian set-up: We set from equation (8) B
t
= X
t
- Y
t
and have
Strategy S++ in which we are long both X and Y at time t in between bands [a
i
, b
i
] and with P&L Strategy S+- in which we are long X and short Y at time t in between bands [a
i
, b
i
] and with P&L Strategy S-+ in which we are short X and long Y at time t in between bands [a
i
, b
i
] and with P&L
The P&Ls corresponding to these strategies are defined as following:
We denote the maximum P&L achieved by each of these strategies by
1: P(1:h) ← QuickSort(P1:h)
2:
3: B(1:h) ←
4: Ω(1:⌈n/h⌉) ← []
5:
6:
7:
8: Amend(Ω(j), P(i))
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
22: signal
S
, signal
S
l
← forecast(
23: signal
S
,
We further provide Algorithm 3 as the pseudo-code for the calibration process. Note that in both Algorithms 2 and 3, we have used a QuickSort which can be substituted by other sorting algorithms. Note that the use of self explanatory functions such as returnCorrespondingStrat(x,y) in line 20 of Algorithm 3 which given the set of strategies and the P&L returns, as its name indicates, outputs the corresponding strategy that maximizes P&L. The function forecast(x,y,z) in line 22 of Algorithm 3 takes as input the set of trained strategies and the current level of X t and Y t and returns a prediction of where the signals for the latter two should be. Finally the use of the argmax function in lines 13-16 can be replaced by a simple for loop but in the interest of not making the pseudocode too crowded we have kept it this way.
Simulation
Figure 10 illustrates the ML and the DS approaches on one single simulated path. Note that when implementing the ML approach with a horizon of 1000 days, we double this data for training, i.e. we use 2000 historical daily prices. We have performed two sets of 500 simulations and we have gathered their results in the following two examples.

(a) one simulated scenario based on cointelation model (8) with parameters: μ = 0.05, σ = 0.17, η = 0.16, κ = 0.1, ρ = -0.6 and scaled spread: κ (X t - Y t ); (b) portfolio return and optimal weight of asset X with Dynamic Switching approach; (c) portfolio return and optimal weight of asset X and Y with Machine Learning approach.

Histogram of excess (P & L) for ML LS vs SC at terminal time T.

Histogram of excess (P & L) for ML vs FM at terminal time T.
From our performance histogram in Figures 11 and 12 we have concluded that for parameters μ = 0.05, σ = 0.17, η = 0.16, κ = 0.1, ρ = -0.6 of cointelation model (8) we have the following rankings for the approaches:
Rational
During the peer reviewing process we were ask to use the model with real market data (on top of the simulations presented previously). The recommendation was to use two-asset model with uncorrelated memoryless jumps in both the leading asset and the spread to produce more realistic model. A pure cryptocurrency model was not recommended so we thought a hybrid model would be best. We decided to propose NVDA and Bitcoin to be cointelated pairs for the following reason. When cryptocurrencies prices go up, then the incentive to mine cryptocurrencies increases which increases, in turn, the value of companies involved in mining. The most well known cryptocurrency is Bitcoin, and the most well known mining equity remains NVDA. We have decided to study this pair, assuming they are cointelated. The data set was downloaded using a free data provider (yahoo finance) for the last 5 years. In Figure 15 we have displayed the price of Bitcoin, NVDA as well as their normalised spread in our timescale. The volatility of bitcoin being significantly higher than the one of NVDA, we decided to proxy their current volatility as their rolling 6 months standard deviation. We then constructed their spread using an inverse weighting function of their current volatility. Please refer to our github code in order to gain full transparency for the normalisation process. We roughly separated the “in sample” and “out of sample” data in half (2 years and 3 month each: the first 6 months being “burnt” in order to calculate current volatility). As per the described methodology the bands were chosen to have the same size (0 to 33 percentile, 33 to 67 percentile and finally 67 to 100 percentile). We also incorporated a simple cost function (adding conservatively a couple of bps per position change: 4 in total assuming a long position would become short and vice versa). We recorded our in and out of sample sharpe ratios in Table 2. We can observe few point. First, over-fitting is noticeable 5 since the SR of our overall strategy drops by 1.2. However this is mitigated by the fact that the out of sample SR remain at 1.7, a very reasonable figure at the low frequencies. Added to this point, all bands are doing reasonably well in both the in and the out of sample. We take this as being an evidence of a reasonable over-fitting. Figure 16 plots the spread of the cointelated pair on top as well as its 67 and 33 percentile bands.

In FM approach optimal long/short strategies are more volatile than optimal long only strategies.

In ML approach optimal long/short strategies are slightly more volatile than optimal long only strategies.

Cumulative returns in the last 5 years for: (top) Bitcoin (BTC), (middle) NVDA, (bottom) their volatility normalised log spread.
In and out of sample sharpe ratios by band

Strategy decomposition details: (top) volatility normalised log spread (in blue) with 50 percentile bands (in red), (middle) trading signal (+1 means long BTC and short NVDA and -1 the contrary), (bottom) cumulative return of the strategy.
Possible directions for future work
One direction for future work is to consider portfolio optimization problem for n-dimensional cointelation model. For instance, when n = 3 we can have something of the following form:
We have studied the portfolio optimization problem of two assets that follow the cointelation model using two approaches: Financial Mathematics and Machine Learning. We first implemented the FM approach, where we use classic financial mathematics criteria: mean-variance and power utility maximization. Without an analytical solution to the PDE (52), we resort to the DGM method, a deep learning algorithm, to solve it numerically. The second approach we implemented is ML using clustering. The latter approach is easier to implement, it is model agnostic, therefore avoids the complex SDE calibration. In our case the Machine Learning approach slightly outperforms the Financial Mathematics approach.
Footnotes
Appendices
A time series X t is integrated of order d if (1 - L) d X t is a stationary process. Here L is a lag operator.
In machine learning, a hyperparameter is a parameter whose value is set before the learning process begins whereas, the values of other parameters are derived via training.
E.g: sharpe ratio maximization
Note that this is almost always true in an low signal to noise ratio in the lower frequencies.
