Abstract
In this paper we propose stochastic time series prediction by autoregressive Hidden Markov Models (AR-HMM). The model parameter estimation, hence the prediction, is carried out by Markov chain Monte Carlo (MCMC) sampling instead of finding a single maximum likelihood model. Estimating the whole distribution can provide us with more insight about the underlying stochastic process.
As opposed to trading directly on a financial instrument, the predicted future distribution of the underlying asset is then used for option portfolio optimization, where we consider a portfolio of plain vanilla put and call European options with different strike prices. The optimization itself is carried out using linear programming with optional risk constraints.
The nature of MCMC sampling of AR-HMMs exhibits algorithmic properties which make a massively parallel implementation feasible and beneficial. The models are implemented using Graphics Processing Units (GPU) to achieve superior performance. The performance of the novel methods has been extensively tested on real financial time series, such as SPY and USO, where they could secure a profit and outperformed the traditional maximum likelihood approaches.
Introduction
A Hidden Markov Model (HMM) (Baum and Petrie, 1966) is a statistical model which is widely used for predictions in diverse fields such as finance and econometrics (Sipos and Levendovszky, 2013; Hassan and Nath, 2005; Mamon and Elliott, 2007; Hamilton, 1990), genomics (Durbin, 1998), linguistics and speech recognition (Jurafsky and Martin, 2009; Rabiner, 1990). It is able to model changing stochastic properties over time. One of its possible generalizations is the autoregressive-HMM (AR-HMM) (Murphy, 2012), which can describe a wide range of random processes (Sipos and Levendovszky, 2015). The model and the notations are introduced in Section 2.
The model parameter estimation or model fitting, which is often referred to as learning, is of utmost importance when HMMs are used to predict future values of time series. Unfortunately, computationally efficient algorithms have not yet been developed for finding the best maximum likelihood estimator of the free parameters which reaches global optimum. Traditionally, the Baum-Welch expectation maximization algorithm (Baum and Petrie, 1966) is used, although it can easily get stuck in one of the local minima of the error function providing only suboptimal solutions.
In many applications of HMMs the goal is to predict the most probable state sequence. In turn, when using them for predicting time series, a single maximum likelihood (ML) model estimation is not sufficient to capture the stochastic behaviour. Considering a set of possible models and estimating the whole distribution of the future value can provide us with more insight about the underlying stochastic process than picking the single most likely model and its most likely outcome or expected value.
Markov chain Monte Carlo (MCMC) (Andrieu et al., 2003) is a method which, in our case, allows us sampling from the distribution of the AR-HMM model parameters, and therefore from the predictions, by constructing a Markov chain with the same probability distribution. The Metropolis-Hastings algorithm (Hastings, 1970) was used to carry out an MCMC sampling. This sampling process casts the learning phase unnecessary, directly providing a set of samples from the distribution of the future values of the time series instead. The concept of prediction by MCMC sampling approach for AR-HMMs is outlined in Section 3.
The nature of the required algorithms to perform MCMC sampling of HMMs, including linear algebraic operations and the forward-backward algorithm (Rabiner, 1990) for HMMs, exhibits such algorithmic properties that makes a massively parallel implementation feasible and beneficial. The models are implemented using the concept of general-purpose computing on Graphics Processing Units (GPU) and can achieve superior performance compared to traditional CPUs both in terms of both speed and power consumption (Nvidia, 2014). The computational model is mapped out and the parallel implementation details are described giving rise to enhanced modelling in Section 3.3.
In our paper the MCMC sampling of AR-HMMs is used for predicting financial time series. By using Monte Carlo methods it is straightforward to generate future trajectories for each sampled model. To exploit these predictions, a portfolio of different options on the underlying asset is introduced with a corresponding portfolio optimization function. During the optimization the expected payoff is maximized meanwhile having an upper bound on the maximal losses. The optimization itself is carried out by linear programming. Finally, a trading strategy is employed to secure profit using these optimal options portfolios based on predictions coming from sampling AR-HMMs by MCMC. The purpose of the paper is to show that predicting a whole distribution as opposed to a maximum likelihood estimation provides us with better predictions. Furthermore, an option portfolio based trading strategy exploiting this information can outperform traditional approaches on real financial time series. In addition, our approach can manage the associated risks in a more direct manner. Fig. 1 depicts our computational approach to achieve this.

Computational approach.
The results are treated in the following structure: in Section 2 the model is introduced including the definition of HMMs and AR-HMMs; in Section 3 we demonstrate how to use MCMC sampling of AR-HMMs for prediction, its parallel version with a GPU implementation is also discussed; in Section 4 portfolio optimization is investigated based on the predicted distribution; in Section 5 it is demonstrated that a good profit can be achieved by trading with derivatives on these underlyings by testing the performance on real financial time series, such as SPY and USO. finally, in Section 6, some conclusions are drawn.
In order to predict the future distribution of financial time series Hidden Markov Models (Banum and Petrie, 1966) (HMM) will be utilised. An HMM is a statistical model which is an extension of Markov chains. In the case of an HMM, the current state is no longer directly visible to the observer, but in each each state HMM emits an observable output sequence in the time period t = 1 . . T denoted by
The probability of emitting a specific output sequence is determined by the conditional probabilities
There are multiple approaches to model o
t
and describe its stochastics via Equation (3). Continuous approach is found to be more favourable to discrete modelling in Sipos and Levendovszky (2013). More specifically, the distribution of the observable output is given by a univariate Gaussian probability density function:
The hidden Markov chain and the observed layer of such HMMs are depicted by Fig. 2.

A Hidden Markov Model with continuous observations.
Unlike the standard HMM assumption, in the case of autoregressive-HMMs the emissions are not independent but also depend on the previous emission given the hidden state (Murphy, 2012). The probability of emitting a specific output is then determined by the following conditional probability
An AR-HMM can then be described by the following quantities.
In many cases the model parameters described by Θ are not known, therefore one needs to fit these parameters based on observed processes. Traditionally the likelihood (or in practice, due to the small order of magnitude of such probabilities, the log-likelihood) of the model is maximized based on the given observations (Rabiner and Juang, 1986):
In our application the model is used to predict the future values (i.e. ot+τ) of the time series. The traditional way to obtain such a prediction is to first get the maximum likelihood estimation of the model parameters Θ
opt
. Then for a given observation
While the maximum likelihood estimation methods described in the previous section are sufficiently suitable in many other applications, they have some key drawbacks, most importantly their sensitivity to the observations as argued in Sipos (2016). For example, noisy observations or a sample from a slightly different time window (e.g. next day) may result in a completely different model implying a rather different prediction. Such property is less appropriate for the purpose of algorithmic trading as alternating predictions will imply unstable trading signals. Also, in order to better estimate the risks less likely models should not been overseen by picking a single ML estimation. As a benchmark, even if the underlying model is known, when larger variance is present the ML estimation based on its generated observations could be a fairly different one.
To deal with these shortcomings, instead of forming a prediction based on a single maximum likelihood model, the intuitive idea of the proposed solution is to consider every possible latent model for a more reliable prediction:
The Metropolis-Hastings algorithm (Hastings, 1970), as a type of Markov chain Monte Carlo method (Andrieu et al., 2003) is a statistical sampling method. The algorithm allows us to draw samples of AR-HMMs (Θ) generated by their explicitly unknown probability distribution given the observed time window. This sequence of random samples provides an approximation for their distribution, and more importantly for the distribution of the predictions belonging to the sampled models. Let
In each step the model parameters are perturbed randomly, such that
After taking K samples, denote the set of sampled AR-HMMs as S0, one can estimate ot+τ as
Similarly,
As opposed to the sequential version, the parallel MCMC sampling runs from several (M) initial points at the same time. With a naive approach, we could simply consider the union of samples generated by each run:
In practice, representative samples can be taken after the sampling converged, hence a so-called burn-in period is needed before we actually consider the samples. In most cases, this state is reached relatively fast, however, certain initializations lead to a much larger number of steps until convergence. On the other hand, due to the nature of massively parallel architectures each MCMC run should take the same amount steps. Thus, sampling in a parallel way raises the question to decide which runs has converged and which has not.
In theory, the distribution of the samples should be the same after the sampling has converged. This motivates a method to compare the standard deviation of the samples in a given run with the standard deviation if samples from another run are appended (Cowles and Carlin, 1996). From M runs the largest subset should be selected where the pairwise changes in the standard deviation does not exceed a given threshold. The rest of the runs are then marked as not converged and those samples are not considered further.
Executing the parallel MCMC sampling described above on a parallel architecture gives the benefit of obtaining much more samples during the available computational time interval. Having more samples yields a lower uncertainty in the predictions. Alternatively, since computational time is a very important aspect of algorithmic trading, and a massively parallel software implementation would ensure fast execution. This would enable the use of MCMC sampling in higher frequency trading environments as well. The algorithms introduced in the previous chapters for AR-HMM based prediction allows parallel implementation and requires relatively few data transfer for the input and output variables (Sipos et al., 2017). These algorithmic properties make a massively parallel implementation feasible and beneficial. One can utilize the massively parallel architecture of a Graphics Processing Unit (GPU) to solve arbitrary numerical problems (Nvidia, 2014; Cook, 2013). In terms of floating-point operations per second, which is the primary measure of hardware performance for scientific or engineering computations, GPUs are superior compared to traditional CPUs, while also being favourable in terms of energy efficiency for many application. This motivates the growing interest for using the GPU in the field of computational finance (Pagès and Wilbertz, 2011). Other alternatives, like FPGAs, were not in the scope of this research due to their poorer availability and flexibility.
The parallelism of the proposed method is twofold with respect to granularity. Fine-grained parallelism occurs in each step of the MCMC sampling, (i) computing the model likelihood for a given AR-HMM by using the forward-backward algorithm, which is done with dynamic programming on N threads along the hidden states; and (ii) linear algebraic operations for calculating the prediction is also carried out in a parallel manner. Coarse-grained parallelism lies in the fact that we can run multiple MCMC algorithms simultaneously.
Accordingly, the computational framework is shown by the block diagram of Fig. 3. First, the forward-backward algorithm provides us with the state distributions, i.e. what is the probability of being in each state at each time instance. These probabilities combined with

Parallel computational approach for GPU.
The proposed parallel MCMC sampling of AR-HMMs can provide us with predictions regarding the future price distribution of an underlying financial instrument. In order to turn such information into actual financial gains a corresponding trading strategy is needed. As opposed to more traditional methods (Idvall and Jonsson, 2008), where only the future price is predicted, to exploit the additional information on its distribution and also manage the risks trading with options on the underlying is proposed.
For prediction based portfolio optimization, we consider a portfolio of plain vanilla put and call European options with different strike prices. Each option has the same predefined maturity T. Let
Using the samples of predicted trajectories of the underlying and the payoff function g conditional on the future value, one can construct the expected payoff of each option follows:
Besides maximizing the payoff, investors seek to strike a good balance between return and the associated risks. Trading with options allows us to limit the maximum amount of loss on a trade, introducing these bounds to the optimal portfolio yields the following constrained optimization problem:
This is a linear programming problem, which can be efficiently solved, for example, by the simplex method (Dantzig et al., 1955).
Finally, to exploit the portfolio optimized in this manner, a trading strategy is needed. One can carry out the MCMC based prediction and the subsequent portfolio optimization at an arbitrary time and with arbitrary maturities independently. In our analyses, as shown in the next section, we chose a week as a holding time and a sliding window trading strategy. In other words, we engage in the optimized option portfolio at the end of each week, and considering the profits and liabilities at the end of next week. The owned positions are never sold before maturity dates.
In this section, we analyse the numerical results obtained on the following Exchange Traded Funds (ETF): SPY and USO, in the period of January 1, 2014 to December 31, 2016 in daily resolution (Yahoo, 2017). These marketable securities can track bonds, commodities like West Texas Intermediate crude oil in case of USO or even indexes like S&P 500 in case of SPY. The data set contains both the basic product on which the prediction was made and the corresponding options derivatives also.
In each simulation the real bid-ask spread has been taken into account, meaning that the option is bought on the ask price and the option is sold on the bid price. In each week the agent can buy options for USD 100 (p max ) or 100 from the basic product. In case of MCMC, 2000 samples are made, the first 1000 samples are dropped as a burn-in and the second K = 1000 samples are kept for calculation. At MCMC M = 32 parallel runs are made which ensure the necessary robustness. The length of time-window is set to T = 128 days which is sufficiently long but at the same time outdated data points are not taken into account. The number of hidden states of the underlying Markov chain was set to N = 3.
At the end of each week (at each Friday) the predictions are made for 5 weekdays ahead (for the next Friday) in τ = 5 steps. When predicting more steps further, the AR-HMMs would reach their stationary state. Based on the prediction, the optimal option portfolio is selected by the linear programming method detailed in Section 4. The optimal portfolio, which might be a non-integer number due to the optimization method, is rounded to integer and bought. The option portfolio is held until expiration. Those options which contain intrinsic value are exercised (either it has a positive or negative outcome).
Fig. 4 contains the cumulative profit of weeks, the balance, and shows the effect on balance by setting different p maxloss on USO dataset. One can see that the larger the p maxloss parameter is the larger profit can be achieved, but the standard deviation is also increasing.

Results of the trading algorithm with portfolio optimization on USO historical options data set in case of different p maxloss .
Next, we compared the effect of introducing MCMC sampling instead of the traditional Maximum Likelihood estimation of AR-HMM model parameters (ML ETF). In the second case, the agent traded directly on the underlying ETF products. Also, when using MCMC, trading on the underlying (MCMC ETF) is compared to the proposed portfolio optimization method using options (MCMC options).
Fig. 5 and 6 show the cumulative results of 149 trading weeks. It is clear, that in case of using MCMC sampling and trading with options, the final and average profit are significantly higher in case of both assets. In the case of using Maximum Likelihood Estimation and trading with SPY basic product, the agent could not secure profit.

Results of the trading algorithms with portfolio optimization in case of MCMC and with basic product in case of MCMC and ML on USO historical options data sets (p maxloss = 500).

Results of the trading algorithms with portfolio optimization in case of MCMC and with basic product in case of MCMC and ML on SPY historical options data sets (p maxloss = 500).
We also evaluated the running time of different use-cases. The experiments were conducted on a PC with a second generation Intel i7 CPU and on one NVIDIA GeForce GTX 960 (2 GB) GPU. Fig. 7 shows the computation time needed to take one sample.

Average running time of sampling AR-HMMs as a function of number of parallel AR-HMMs.
The performance in the case of small number of HMMs is relatively low, which is not surprising in the light of the fact the GPU computing power remains underutilized. If we increase the number of optimized HMMs then more and more blocks will become active on the GPU which, in turn, will increase the performance. Because of the limitations of this GPU, the maximum number of parallel AR-HMMs is 96. In our scenario (get the predictions for 5 time steps ahead and sampling 32 AR-HMMs 2000 times) the average running time is 7.06 seconds. Thus, this enables to use the method for intra-day trading or any kind of higher frequency applications.
In this paper, novel algorithms were proposed for trading option portfolios based on predictions by AR-HMMs. First, the robustness of the prediction has been improved by applying MCMC sampling in the space of AR-HMM models as opposed to the traditional maximum likelihood approaches. As a second contribution, instead of exploiting predictions for the expected future values by trading directly on the underlying financial instruments, choosing portfolios of options containing calls and puts with different strike prices has allowed us to benefit from the additional information predicted for the future distribution. As the proposed methods benefit from a higher sample size, an efficient parallel implementation on GPU is also introduced.
The performance analysis demonstrated that the proposed algorithm could substantially increase the trading efficiency and the obtained profit proved to be better than the one obtained in the case of applying the traditional strategies. Moreover, the proposed portfolio optimization also lets the investors to set their risk aversion level by imposing an upper bound on the maximal loss per transaction.
As a direction for future research, using a more sophisticated objective function for portfolio optimization and correspondingly a more complex trading strategy could further improve the trading performance. The introduced GPU implementation also enables higher-frequency trading. Computational time also could be reduced by using more efficient algorithms, like the fixed lag smoothing or faster mixing MCMC implementations. In addition, other applications could also benefit from MCMC sampling AR-HMMs, where robust time series prediction is needed. One example could be modelling electricity consumption.
Footnotes
Acknowledgement
The research reported in this paper was supported by the Higher Education Excellence Program of the Ministry of Human Capacities in the frame of Artificial Intelligence research area of Budapest University of Technology (BME FIKP-MI/SC).
We gratefully acknowledge the support of NVIDIA Corporation with the donation of a GPU device helping this research.
