A new hybrid method of recurrent reinforcement learning and BiLSTM for algorithmic trading

Abstract

Algorithmic trading of financial assets has developed rapidly with the rise of deep learning. In particular, deep reinforcement learning, which combines deep learning and reinforcement learning, stands out among decision-making approaches because of its high performance, strong generalization, and high fitting ability. In this paper, we propose a hybrid method of recurrent reinforcement learning (RRL) and deep learning to address the algorithmic trading problem of determining the optimal trading position in the daily trading activities of the stock market. We adopt a deep neural network (DNN), a long short-term memory neural network (LSTM), and a bidirectional long short-term memory neural network (BiLSTM), respectively, to automatically extract higher-level abstract features from sequential trading data, and then generate optimal trading strategies by interacting with the environment in a reinforcement learning framework. In particular, the BiLSTM, which consists of two LSTM models running in opposite directions, can use information from both directions of the input sequence. In experiments, daily data of the Dow Jones, S&P500, and NASDAQ indices (from Jan-01, 2005 to Dec-31, 2020) are used to verify the performance of the proposed DNN-RL, LSTM-RL, and BiLSTM-RL trading systems. Experimental results show that the proposed methods significantly outperform the benchmark methods, such as RRL and Buy and Hold, with higher scalability and better robustness. Among them, BiLSTM-RL performs best.

Keywords

Reinforcement learning, deep learning, trading strategy, Sharpe ratio

1 Introduction

In recent years, algorithmic trading has played an important role in financial trading and has become a trend in today’s financial markets. The trading process can be described as a decision-making process that aims to maximize returns while reducing risk. Various financial assets are analyzed and traded within the framework of traditional algorithms such as GARCH [1, 2] and ARIMA [3, 4], as well as the latest machine learning-based algorithms [5, 6]. As part of a broader family of machine learning methods, deep learning is often used to forecast stock prices or trend movements in order to build financial trading strategies. But such methods mainly depend on the accuracy of the forecast and can easily lead to over-fitting. Also, these methods cannot handle continuous decision-making problems in financial markets.

Compared with forecast-based methods, deep reinforcement learning learns a mapping from state space to action space through continuous, online self-learning. DQN (Deep Q-Network), an early method that combined deep learning and reinforcement learning [7], outperformed humans on the Atari game platform [8]. AlphaGo, developed by the Google DeepMind team, drew further attention to deep reinforcement learning and became a milestone in the history of artificial intelligence [9]. Deep reinforcement learning has achieved remarkable success in solving complex sequential decision problems, and therefore more and more research is focusing on combining deep reinforcement learning with investment decision-making. Such a system can extract features directly from high-dimensional raw financial data in a deep learning module, and then find optimal dynamic trading strategies that maximize risk-adjusted returns by interacting with the environment in a reinforcement learning module.

In 2001, Moody et al. proposed the recurrent reinforcement learning (RRL) algorithm, with returns as the input and the differential Sharpe ratio as the objective function, for single assets and portfolios with a transaction cost of 0.005 [10]. However, that model extracts features from time series data in a linear manner, whereas financial markets are noisy and a nonlinear model is needed to extract higher-level features. In contrast, deep learning combines features through multilayer network structures and nonlinear transformations, giving it strong perception and representation capabilities. In this paper, we propose an extended RRL algorithm based on DNN, LSTM, and BiLSTM, respectively, to learn higher-level, abstract features from low-level raw time series data, and to find the optimal trading strategy by reinforcement learning with the Sharpe ratio as the objective function.

In summary, the main contributions of this paper are as follows:

We propose an improved deep reinforcement learning algorithm based on Moody’s RRL algorithm by incorporating DNN, LSTM, and BiLSTM into RRL to extract higher-level features.

Compared with Moody’s RRL, our method considers only the return sequence, without the previous moment’s trading position. Experimental results show that our method can achieve higher profits without previous trading positions.

We compare the performance of our DNN-RL, LSTM-RL, and BiLSTM-RL models with that of the baseline models. Since BiLSTM can capture more effective information by making full use of both directions, the results show that the BiLSTM-based approach can outperform other methods in the U.S. index market.

The remainder of this paper is arranged as follows. Section 2 generally reviews some related works on deep reinforcement learning for studying algorithmic trading. Section 3 depicts the details of RRL and presents new hybrid methods, DNN-RL, LSTM-RL, and BiLSTM-RL. Section 4 shows the experiments and the comparison of DNN-RL, LSTM-RL, and BiLSTM-RL methods with baseline models. Section 5 gives the conclusions and future work.

2 Related work

Traditional algorithmic trading is based on mathematical techniques or the experience of human experts, such as trend-following and mean-reversion strategies. Parisi et al. used moving averages and trading range breakout rules to construct stock trading decisions and found that the strategy achieved consistent profits [11]. Gerritsen et al. applied seven trend-following indicators to assess the profitability of technical trading rules on Bitcoin data, finding that trading range breakouts had significant predictive power for Bitcoin prices and were profitable [12]. More recently, artificial intelligence methods have come to play an important role in the study of trading in financial markets.

Among the different methods and techniques of machine learning, deep learning mainly fuses different data sources to extract time series patterns, forecast stock prices, and generate trading signals. Persio et al. considered MLP, CNN, and LSTM techniques to predict trends [13]. Zhou et al. proposed a prediction model combining social media information by analyzing stock-related tweets on Weibo. Experiments show that the proposed model can outperform models that rely only on pure financial series data [14]. Verma et al. presented an LSTM-based forecasting model to forecast stock movement direction [15]. Roondiwala et al. proposed a recurrent neural network (RNN) and Long Short-Term Memory (LSTM) approach to forecast stock market indices [16]. Althelaya et al. applied bidirectional and stacked LSTMs to short-term and long-term forecasting of financial time series. The experimental results show that the bidirectional and stacked LSTM outperformed shallow neural networks and vanilla LSTM models in predicting stock movements [17]. Huang et al. attempted to use Bayesian models to predict trading rules from trading patterns detected by the proposed double clustering algorithm [18]. Xu et al. introduced StockNet to forecast stock movements, using GRU to learn from tweets and historical prices [19]. Chen et al. investigated the use of an attention mechanism in LSTM networks for stock price trend prediction. The experimental results on the Hong Kong stock market show that the proposed model greatly enhanced the predictive performance of the LSTM [20]. Shen et al. proposed a comprehensive customization of feature engineering and an LSTM-based model to predict the price trend of the stock market [21]. AlAradi et al. proposed a framework combining LSTM and DNN to forecast price and movement direction by using multiple datasets including news sentiment, social sentiment, company earnings announcements, and technical indicators. The experimental results on the Apple Inc. stock show that the LSTM model performed better than the DNN model [22]. Mundra et al. proposed a hybrid approach to predict stock prices based on support vector machines (SVM) and LSTM. The experimental results showed that the proposed model has a prediction accuracy of 97% on the TATA Global Beverage stock dataset [23]. Mai et al. analyzed LSTM and BiLSTM to forecast stock prices, which allows the model to obtain the best performance by capturing the temporal evolution of information [24–26]. Ni et al. proposed a hybrid method for forecasting the stock market based on tweet embeddings and historical prices [27]. Yang et al. applied BiLSTM to financial time series prediction and compared it with LSTM, support vector regression (SVR), and the autoregressive integrated moving average (ARIMA) model. Experiments show that the BiLSTM model has the highest prediction accuracy [28]. Lee et al. applied technical indicators to LSTM-attention time series models for stock price forecasting and trading strategy design [29]. Chandola et al. proposed a hybrid deep learning model combining Word2Vec and LSTM algorithms to predict directional movements in stock market prices using financial time series and news headlines as inputs [30]. Md et al. proposed a novel optimization method based on a multi-layer Sequential Long Short Term Memory model for predicting stock prices by considering past performance information as well as past trends and patterns [31]. Bhandar et al. adopted an LSTM to predict the next-day closing price of the S&P 500 index using fundamental market data, macroeconomic data, and technical indicators [32].

Unlike the above methods, which directly forecast stock prices, reinforcement learning methods find the optimal dynamic trading strategy to maximize profits by interacting with the environment. Deng et al. introduced a recurrent deep neural network for real-time financial signal representation and proposed a task-aware backpropagation method to alleviate the gradient vanishing problem in DRL [33]. Lu proposed a model that combines a long short-term memory (LSTM) recurrent structure and reinforcement learning. Experiments show that the system created successful strategies to achieve human-level long-term reward in GBPUSD [34]. Almahdi et al. used the recurrent reinforcement learning method to solve a dynamic portfolio optimization problem. Experiments show that the expected maximum drawdown risk-based objective function has higher return performance compared to the previously proposed RRL objective functions [35]. Aboussalah et al. proposed a stacked deep dynamic recursive reinforcement learning (SDRRL) architecture to construct real-time optimal portfolios by capturing the latest market conditions and rebalancing the portfolios accordingly. Experiments show that the proposed SDRRL achieves better performance than three benchmarks, including the rolling horizon mean-variance optimization (MVO) model, the rolling horizon risk parity model, and the uniform buy-and-hold (UBAH) index [36]. Nguyen et al. investigated three DRL algorithms, namely Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor-Critic (SAC), for their ability to solve automated stock trading problems [37]. Li et al. proposed a set of ensemble automated trading strategies in a deep reinforcement learning-based framework, which combined the advantages of the three algorithms PPO, A2C, and SAC [38]. Ge et al. applied the A2C, PPO, DDPG, TD3, and SAC models to automate the trading of a single stock [39]. Li et al. proposed a new deep reinforcement learning model for stock trading, which fuses features extracted by deep neural networks from different data sources into the state of the stock market, after which the agent makes trading decisions through reinforcement learning [40]. Zou et al. proposed a deep reinforcement learning-based stock trading system with cascaded LSTM, which first uses an LSTM to extract time-series features from daily stock data and then feeds the extracted features to the agent for training, while the strategy function in reinforcement learning uses another LSTM for training [41]. Malibari et al. combined deep reinforcement learning with a transformer network to produce a decision transformer architecture for online trading [42]. Zhou et al. proposed an improved deep recurrent DRQN-ARBR model by changing the fully connected layer in the original model to an LSTM layer and selecting more technical indicators, especially the emotion indicator ARBR, to construct a trading strategy [43].

However, the RRL method uses only a simple fully connected, single-hidden-layer network to generate the trading signal jointly from the previous trading signal and the return sequence. In general, the more strategy rules and parameters a model has, the more likely it is to suffer from data accommodation bias. Moreover, according to the definition of the Markov decision process, when the environment state is fully observable, the optimal strategy depends only on the current state and not on historical information. Unlike RRL, we propose a simplification of RRL that automatically perceives dynamic market conditions through DNN/LSTM/BiLSTM, extracts feature information, generates trading signals directly, and optimizes the parameters of the trading system by backpropagation through time.

3 Methodology

3.1 Recurrent reinforcement learning

Intuitively, Moody’s RRL framework trains trading systems and portfolios by optimizing risk-adjusted investment returns (such as differential Sharpe ratio), taking into account the impact of transaction costs. It can be viewed as a stochastic control problem with investment decision-making. Fig. 1 shows the trading system based on recurrent reinforcement learning.

Fig. 1

Trading system based on recurrent reinforcement learning.

Moody’s RRL considered agents that trade fixed position sizes in a single security. It assumes that the trader can take a short, neutral, or long position with constant magnitude. The position F_t ∈ {-1, 0, 1} is established or maintained at the end of each time interval t and is evaluated at the end of period t + 1. Thus, it is possible to trade at the end of each period. The decision function is defined as follows, $F_{t} = F (θ_{t}; F_{t - 1}, I_{t}), \quad \text{with} \quad I_{t} = \{z_{t}, z_{t - 1}, \dots; y_{t}, y_{t - 1}, \dots\},$ (1) where θ_t denotes the (learned) model parameters at time t, I_t denotes the information set at time t, including present and past values of the price z_t, and an arbitrary number of other external variables represented by y_t.

Specifically, the decision function is formalized as follows,

$F_{t} = \tanh (μ F_{t - 1} + v_{0} r_{t} + v_{1} r_{t - 1} + \dots + v_{m} r_{t - m} + w),$ (2) where r_t are the returns of z_t, i.e., r_t = z_t - z_{t-1}, and the model parameters θ are the weights {μ, v_i, w}. Meanwhile, a trading system return R_t is realized at the end of the time interval (t - 1, t], including the profit or loss during (t - 1, t] and the transaction cost. With the notation defined above, the profit R_t at time point t can be denoted by $R_{t} = F_{t - 1} \cdot r_{t} - δ \cdot | F_{t} - F_{t - 1} |,$ (3) where the first term is the profit/loss from the market fluctuations, and the second term is the total transaction cost with the ratio δ.
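To make Equations (2) and (3) concrete, the following minimal Python sketch evaluates the decision function and the per-period return on a toy series. The function names, the parameter values, the return series, and the flat starting position F_0 = 0 are illustrative assumptions, not values from the experiments.

```python
import math

def rrl_position(prev_position, recent_returns, mu, v, w):
    """Eq. (2): F_t = tanh(mu * F_{t-1} + sum_i v_i * r_{t-i} + w).

    recent_returns[i] holds r_{t-i}, so it must have the same length as v.
    """
    signal = mu * prev_position + w
    signal += sum(v_i * r for v_i, r in zip(v, recent_returns))
    return math.tanh(signal)

def trading_return(prev_position, position, r_t, delta):
    """Eq. (3): R_t = F_{t-1} * r_t - delta * |F_t - F_{t-1}|."""
    return prev_position * r_t - delta * abs(position - prev_position)

# Toy example (all numbers are illustrative assumptions).
returns = [0.5, -0.2, 0.3, 0.1, -0.4]
mu, v, w, delta = 0.1, [1.0, 0.5], 0.0, 0.005

prev_position = 0.0   # assumed flat start, F_0 = 0
rewards = []
for t in range(1, len(returns)):
    window = [returns[t], returns[t - 1]]          # r_t, r_{t-1}
    position = rrl_position(prev_position, window, mu, v, w)
    rewards.append(trading_return(prev_position, position, returns[t], delta))
    prev_position = position
```

Because the first term of Equation (3) uses F_{t-1}, the first reward in this sketch contains only the transaction cost of opening the position.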

Therefore, the trading systems can be optimized by maximizing performance functions such as the Sharpe ratio [44], $S_{T} = \frac{\text{Average} (R_{t})}{\text{Standard Deviation} (R_{t})} = \frac{𝔼 [R_{t}]}{\sqrt{𝔼 [R_{t}^{2}] - (𝔼 [R_{t}])^{2}}},$ (4) where R_t is the return within trading period t, and $𝔼 [\cdot]$ denotes the expectation. However, online learning requires calculating the impact of the trading return R_t at time t on the Sharpe ratio. Therefore, Moody derived a new objective function, called the differential Sharpe ratio, for optimizing the performance of trading systems online. The differential Sharpe ratio is defined as $D_{t} = \frac{d {\hat{S}}_{t}}{d η} \Big|_{η = 0} = \frac{B_{t - 1} Δ A_{t} - \frac{1}{2} A_{t - 1} Δ B_{t}}{(B_{t - 1} - A_{t - 1}^{2})^{\frac{3}{2}}},$ (5) where $\begin{aligned} A_{t} & = A_{t - 1} + η (R_{t} - A_{t - 1}) = A_{t - 1} + η Δ A_{t}, \\ B_{t} & = B_{t - 1} + η (R_{t}^{2} - B_{t - 1}) = B_{t - 1} + η Δ B_{t}. \end{aligned}$ Here A_t and B_t are exponential moving estimates of the first and second moments of R_t, respectively, with adaptation rate η; they play the role of the sample averages $\frac{1}{T} \sum_{t = 1}^{T} R_{t}$ and $\frac{1}{T} \sum_{t = 1}^{T} R_{t}^{2}$ used in S_T. In modern portfolio theory, investment strategies with higher Sharpe ratios rely on less volatile trends to make profits. Finally, reinforcement learning maximizes the utility function using the gradient ascent method with online stochastic optimization. After a sequence of t periods, the gradient of S_t with respect to the system’s parameters θ is $\frac{d S_{t} (θ)}{d θ_{t}} \approx \frac{d S_{t}}{d R_{t}} \left\{ \frac{d R_{t}}{d F_{t}} \cdot \frac{d F_{t}}{d θ_{t}} + \frac{d R_{t}}{d F_{t - 1}} \cdot \frac{d F_{t - 1}}{d θ_{t - 1}} \right\}.$ (6) So, given a trading model F_t, the method adjusts the parameters θ_t to maximize S_t. The parameters are updated by $θ_{t} = θ_{t - 1} + ρ \frac{d S_{t} (θ_{t})}{d θ_{t}},$ (7) where ρ is the learning rate and θ_t denotes the (learned) model parameters at time t, including {μ, v_i, w}.
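As a small numerical illustration of Equations (4) and (5), the sketch below computes the Sharpe ratio of a return series and one step of the differential Sharpe ratio together with the moving-moment updates. The function names, the return values, and the rate η = 0.01 are assumptions chosen for illustration; the moving estimates are initialised from a short history, which is also an assumption.

```python
import math

def sharpe_ratio(returns):
    """Eq. (4): mean of R_t divided by its (population) standard deviation."""
    n = len(returns)
    mean = sum(returns) / n
    second_moment = sum(r * r for r in returns) / n
    return mean / math.sqrt(second_moment - mean * mean)

def differential_sharpe(r_t, a_prev, b_prev, eta):
    """Eq. (5) plus the moving-moment updates for A_t and B_t.

    Returns the tuple (D_t, A_t, B_t).
    """
    delta_a = r_t - a_prev
    delta_b = r_t * r_t - b_prev
    d_t = (b_prev * delta_a - 0.5 * a_prev * delta_b) / (b_prev - a_prev ** 2) ** 1.5
    a_t = a_prev + eta * delta_a
    b_t = b_prev + eta * delta_b
    return d_t, a_t, b_t

# Illustrative values only: initialise A and B from a short return history.
history = [0.02, -0.01, 0.03, 0.01]
a0 = sum(history) / len(history)
b0 = sum(r * r for r in history) / len(history)
d1, a1, b1 = differential_sharpe(0.02, a0, b0, eta=0.01)
```

A positive D_t indicates that the new return R_t would raise the moving Sharpe ratio, which is what makes it usable as a per-step reward signal.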

The RRL model considers the effects of transaction costs, market impact, and taxes in a trader’s decision-making. Thus, transaction costs and market impact on the trading system are taken into account in the decision function. However, the reward at the time t already takes into account the transaction costs. Therefore, there is no need to include the transaction costs in the decision function. The details are discussed in Section 3.2.

3.2 Proposed hybrid methods

Herein, we extend RRL by integrating DNN, LSTM, and BiLSTM. We adopt DNN, LSTM, and BiLSTM, respectively, to replace the decision function that maps the state space to the action space, and then maximize the Sharpe ratio by gradient ascent to generate optimal trading strategies.

A simple decision function is defined as $F_{t} = F (θ_{t}; I_{t}), \quad \text{with} \quad I_{t} = \{z_{t}, z_{t - 1}, \dots\},$ (8) where θ_t denotes the (learned) model parameters at time t, and I_t denotes the information set at time t, including current and past values of the price series.

Traders learn strategies through trial and error exploration, take action and receive positive or negative reinforcement depending on the results. A trading performance function U (θ), such as profit, utility, or risk-adjusted return, is used to directly optimize the trading system parameters θ. Therefore, we adopt reinforcement learning to update the weights in a deep neural network (such as DNN, LSTM, and BiLSTM) via gradient ascent in the utility function U (θ). Figure 2 shows the structure of DNN/LSTM/BiLSTM-RL.

Fig. 2

The structure of DNN/LSTM/BiLSTM-RL.

Some basic elements for constructing a trading system based on DNN/LSTM/BiLSTM-RL are listed as follows:

State All market information for underlying financial assets forms the state of the environment. Here, the returns of the past M trading days are integrated as the agent’s inputs at time t, defined as x_t = [r_t, ..., r_{t-M}].

Action The agent in the trading system tries to maximize the Sharpe ratio in the given time series (State). F_t represents the trading decision at time t, obtained from the network output y_t:
- if y_t ≤ -0.33 then F_t = -1, and the agent short-sells the considered security;
- if -0.33 < y_t < 0.33 then F_t = 0, and the agent does nothing;
- if y_t ≥ 0.33 then F_t = 1, and the agent long-buys the considered security.

Reward Function The reward function R_t obtained by the trading system at the time point t takes the form of Moody’s reward function. See Equation (3) for details.

Utility Function A trading system can be optimized by maximizing a performance function, such as a utility function of profit, wealth, or a risk-adjusted return like the Sharpe ratio. Here, we take the Sharpe ratio S_t.
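The State and Action elements above can be sketched in a few lines of Python. The helper names `make_state` and `to_position` are hypothetical, and the sketch assumes a window of exactly `window` returns ordered from newest to oldest; the ±0.33 thresholds are those given in the Action rule.

```python
def make_state(returns, t, window):
    """Return x_t = [r_t, r_{t-1}, ...] with `window` entries (newest first)."""
    if t - window + 1 < 0:
        raise ValueError("not enough history for the requested window")
    return [returns[t - k] for k in range(window)]

def to_position(y_t, threshold=0.33):
    """Map a network output y_t in [-1, 1] to a position in {-1, 0, 1}."""
    if y_t <= -threshold:
        return -1   # short
    if y_t >= threshold:
        return 1    # long
    return 0        # neutral

# Illustrative usage with made-up returns and outputs.
state = make_state([0.1, -0.2, 0.3, 0.05, -0.1], t=4, window=3)
signals = [to_position(y) for y in (-0.9, -0.1, 0.33)]
```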

In the following sections, we will focus on the details of integrating DNN, LSTM, and BiLSTM into RRL, which can effectively capture information on financial time series.

3.2.1 DNN-RL

A deep neural network (DNN) is a neural network with more than one hidden layer, also known as a deep feedforward network (DFN) or multi-layer perceptron (MLP). According to their position, the layers of a DNN are divided into the input layer, hidden layers, and output layer: the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers. Herein, we transform the 20 features of the network input into one output using a fully connected layer and map that output to [-1, 1] using the tanh activation function.

However, extracting features from time-series data, such as stock data, remains a challenging task, and DNNs are limited in their ability to deal with such data. When dealing with sequential data, a DNN cannot process each time step of the correlated time series and save the entire state of the sequence. Therefore, we further introduce LSTM/BiLSTM to efficiently capture both long-term and short-term information of financial time series. In the next section, we propose a new hybrid method based on recurrent reinforcement learning and LSTM/BiLSTM for algorithmic trading. The LSTM/BiLSTM is used to approximate the decision function, which maps the state space to the action space, and the Sharpe ratio is then maximized by the gradient ascent method to generate the best trading strategy. Our proposed method is inspired by the RRL method and takes advantage of the key property of LSTM/BiLSTM: their hidden layer state depends not only on the input at the current moment but also on the hidden layer state at the previous moment, which is used to approximate the decision function.

3.2.2 LSTM-RL

According to the time characteristics of financial data, we adopt Long Short Term Memory network to capture the temporal evolution of information [45]. LSTM can use its internal state (memory) to process variable length sequences of inputs. A common LSTM cell consists of a cell c_t, an input gate i_t, an output gate o_t, and a forget gate f_t. The cell memorizes values at arbitrary time intervals and the three gates regulate the flow of information into and out of the cell. Figure 3 shows the detailed structure of the LSTM.

Fig. 3

The structure of LSTM.

The cell ensures that the state of the memory cell c_t can remain constant from one time step to another. The input gate i_t allows incoming signals to change the state of the memory cell or blocks them. The output gate o_t allows the state of the memory cell to affect other neurons or blocks it. Finally, the forget gate f_t regulates the self-recurrent connection of the memory cell, allowing the cell to remember or forget its previous state as needed. Suppose the input is $x_{t} \in ℝ^{n \times d}$ (n: batch size, d: the number of input units) and the hidden state of the previous time step is $h_{t - 1} \in ℝ^{n \times h}$ (h: the number of hidden units). The input gate $i_{t} \in ℝ^{n \times h}$, the forget gate $f_{t} \in ℝ^{n \times h}$, the output gate $o_{t} \in ℝ^{n \times h}$, the candidate cell state ${\tilde{c}}_{t}$, the cell state c_t, and the hidden state h_t are calculated as follows: $\begin{aligned} i_{t} & = σ (x_{t} w_{xi} + h_{t - 1} w_{hi} + b_{i}), \\ f_{t} & = σ (x_{t} w_{xf} + h_{t - 1} w_{hf} + b_{f}), \\ o_{t} & = σ (x_{t} w_{xo} + h_{t - 1} w_{ho} + b_{o}), \\ {\tilde{c}}_{t} & = \tanh (x_{t} w_{xc} + h_{t - 1} w_{hc} + b_{c}), \\ c_{t} & = f_{t} \odot c_{t - 1} + i_{t} \odot {\tilde{c}}_{t}, \\ h_{t} & = o_{t} \odot \tanh (c_{t}), \end{aligned}$ (9) where σ is the sigmoid function, $\odot$ denotes element-wise multiplication, and the weight matrices $w_{xi}, w_{xf}, w_{xo}, w_{xc} \in ℝ^{d \times h}$, $w_{hi}, w_{hf}, w_{ho}, w_{hc} \in ℝ^{h \times h}$ and biases $b_{i}, b_{f}, b_{o}, b_{c} \in ℝ^{1 \times h}$ are the model parameters.
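For readers who prefer code to notation, the following sketch performs one LSTM step for a single sample (n = 1) using plain Python lists. It follows the standard LSTM formulation, in which the candidate cell state uses tanh and the previous cell state is scaled by the forget gate f_t. The helper names and the `params` dictionary layout are assumptions made for this illustration; a practical implementation would use a tensor library instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(x, h_prev, w_x, w_h, b):
    """Compute x @ w_x + h_prev @ w_h + b for one sample (row vectors as lists)."""
    out = list(b)
    for j in range(len(b)):
        out[j] += sum(x[i] * w_x[i][j] for i in range(len(x)))
        out[j] += sum(h_prev[k] * w_h[k][j] for k in range(len(h_prev)))
    return out

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params maps a gate name to its (w_x, w_h, b) triple."""
    i_t = [sigmoid(z) for z in affine(x, h_prev, *params["i"])]      # input gate
    f_t = [sigmoid(z) for z in affine(x, h_prev, *params["f"])]      # forget gate
    o_t = [sigmoid(z) for z in affine(x, h_prev, *params["o"])]      # output gate
    c_tilde = [math.tanh(z) for z in affine(x, h_prev, *params["c"])]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Illustrative check with d = h = 2 and all-zero parameters:
# every gate equals 0.5 and the candidate state is 0, so c_t = 0.5 * c_{t-1}.
zeros = ([[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
params = {name: zeros for name in "ifoc"}
h1, c1 = lstm_step([1.0, -1.0], [0.0, 0.0], [2.0, -2.0], params)
```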

Therefore, we propose an LSTM-based agent to learn the features of massive stock data and adopt a reinforcement learning framework to generate trading strategies and improve the scalability and robustness of the model in financial trading. Figure 4 shows the structure of the decision function with LSTM. The whole structure can be divided into two parts, the LSTM network and a layer of fully connected nonlinear transformations. The financial time series data are represented as price returns with window length M, then the features are extracted by the LSTM module, and finally, the corresponding actions are generated by the fully connected layer.

Fig. 4

The structure of the LSTM-based decision function.

3.2.3 BiLSTM-RL

When our input is time series data with fixed length M, we can extract features from more perspectives. For example, the output of a moment is not only related to the information of past moments, but also to the information of subsequent moments. Therefore, we can add a network layer that transmits information in the reverse order of time to enhance the network’s capability. The bidirectional RNNs were introduced by [46]. The bidirectional LSTM consists of two layers of LSTM, which have the same input but different directions of information transmission [47]. Figure 5 shows the structure of the BiLSTM-based decision function.

Fig. 5

The structure of the BiLSTM-based decision function.

BiLSTM is a deep recurrent neural network. Unlike the standard LSTM, the input flows in both directions, so the network can utilize information from both directions. The forward LSTM captures feature information in the forward direction, while the backward LSTM captures feature information in the backward direction. BiLSTM exploits bidirectional temporal information through an additional hidden layer that is updated backward in time. For any time step t, suppose the input is $x_{t} \in ℝ^{n \times d}$ (n: batch size, d: the number of input units). The forward hidden state $\overrightarrow{h}_{t} \in ℝ^{n \times h}$, the backward hidden state $\overleftarrow{h}_{t} \in ℝ^{n \times h}$ (h: the number of hidden units), the concatenated hidden state $h_{t} \in ℝ^{n \times 2 h}$, and the output $o_{t} \in ℝ^{n \times q}$ (q: the number of output units) are calculated as follows: $\begin{aligned} \overrightarrow{h}_{t} & = φ (x_{t} w_{xh}^{(f)} + \overrightarrow{h}_{t - 1} w_{hh}^{(f)} + b_{h}^{(f)}), \\ \overleftarrow{h}_{t} & = φ (x_{t} w_{xh}^{(b)} + \overleftarrow{h}_{t + 1} w_{hh}^{(b)} + b_{h}^{(b)}), \\ h_{t} & = [\overrightarrow{h}_{t}, \overleftarrow{h}_{t}], \\ o_{t} & = h_{t} w_{hq} + b_{q}, \end{aligned}$ (10) where φ is the hidden-layer activation function, and the weight matrices $w_{xh}^{(f)} \in ℝ^{d \times h}, w_{hh}^{(f)} \in ℝ^{h \times h}, w_{xh}^{(b)} \in ℝ^{d \times h}, w_{hh}^{(b)} \in ℝ^{h \times h}, w_{hq} \in ℝ^{2 h \times q}$ and biases $b_{h}^{(f)} \in ℝ^{1 \times h}, b_{h}^{(b)} \in ℝ^{1 \times h}, b_{q} \in ℝ^{1 \times q}$ are the model parameters.
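The bidirectional computation can be illustrated with a deliberately small sketch. To keep it short, it assumes scalar inputs and scalar hidden states (d = h = q = 1), uses tanh for the activation φ, and replaces the LSTM cell by a simple recurrent update; the function names and parameter values are hypothetical. Its only purpose is to show how the backward layer reads the window in reverse and how the two hidden states are combined at each time step.

```python
import math

def run_direction(xs, w_xh, w_hh, b_h):
    """Scalar recurrent pass h = tanh(x * w_xh + h_previous * w_hh + b_h)."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(x * w_xh + h * w_hh + b_h)
        states.append(h)
    return states

def bidirectional_outputs(xs, fwd, bwd, w_hq, b_q):
    """Combine both directions: o_t = [h_fwd_t, h_bwd_t] . w_hq + b_q."""
    forward = run_direction(xs, *fwd)
    # The backward layer reads the window from the last step to the first;
    # its states are reversed again so that index t lines up with forward[t].
    backward = run_direction(xs[::-1], *bwd)[::-1]
    return [f * w_hq[0] + b * w_hq[1] + b_q for f, b in zip(forward, backward)]

window = [0.2, -0.1, 0.4]                      # illustrative returns
outputs = bidirectional_outputs(window, (1.0, 0.5, 0.0), (1.0, 0.5, 0.0),
                                w_hq=(0.5, 0.5), b_q=0.0)
```

In this sketch the backward state at the last position depends only on the last input, while the backward state at the first position has seen the whole window, which mirrors the forward layer in reverse.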

Therefore, the agent adopts LSTM/BiLSTM based on the current state to provide the corresponding strategy directly and still considers the transaction cost in the reward. The detailed procedure of training LSTM/BiLSTM-RL has been summarized in Algorithm 1.

Algorithm 1 LSTM/BiLSTM-RL algorithm


Input: Price series p₁, ⋯ , p_T;
Output: the optimal trading strategy F_t;
1: Initialize the LSTM/BiLSTM network with random weights θ;
2: for episode = 1 to N (or until convergence) do
3: for t = 1 to T do
4: Generate the feature vector r_t from the price of the stock index;
5: Agent obtains the feature vector r_t and outputs y_t by LSTM/BiLSTM;
6: Set: $F_{t} = \begin{cases} - 1, & y_{t} \leq - 0.33, \\ 0, & - 0.33 < y_{t} < 0.33, \\ 1, & y_{t} \geq 0.33 . \end{cases}$
7: Calculate the current Sharpe ratio S_t;
8: Compute the gradient of the Sharpe ratio with respect to the parameters, $\nabla_{θ} S_{t}$;
9: Update the strategy’s parameters $θ_{t} = θ_{t - 1} + ρ \frac{d S_{t}}{d θ_{t}}$ ;
10: end for
11: end for
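The overall loop of Algorithm 1 can be sketched in plain Python under several simplifying assumptions, all of which differ from the setup described in this paper: the policy is a single tanh unit rather than an LSTM/BiLSTM, the continuous output y_t is used directly as the position so that the objective stays smooth, the Sharpe ratio is computed over the whole series rather than online, and the gradient is approximated by central finite differences instead of backpropagation. The function names, the step size, and the toy data are hypothetical.

```python
import math

def positions(theta, returns, window):
    """Continuous positions y_t = tanh(w . x_t + b) for every valid t."""
    *weights, bias = theta
    out = []
    for t in range(window - 1, len(returns)):
        x_t = [returns[t - k] for k in range(window)]
        out.append(math.tanh(sum(w * x for w, x in zip(weights, x_t)) + bias))
    return out

def sharpe_of(theta, returns, window, delta):
    """Sharpe ratio of R_t = F_{t-1} r_t - delta |F_t - F_{t-1}| (Eq. 3)."""
    f = positions(theta, returns, window)
    r = returns[window - 1:]
    rewards = [f[k - 1] * r[k] - delta * abs(f[k] - f[k - 1])
               for k in range(1, len(f))]
    mean = sum(rewards) / len(rewards)
    var = sum((x - mean) ** 2 for x in rewards) / len(rewards)
    return mean / math.sqrt(var + 1e-12)

def train(returns, window=3, delta=0.005, rho=0.05, episodes=50, eps=1e-5):
    """Gradient ascent on the Sharpe ratio with finite-difference gradients."""
    theta = [0.1] * window + [0.0]          # weights followed by the bias
    for _ in range(episodes):
        grad = []
        for i in range(len(theta)):
            up = theta[:i] + [theta[i] + eps] + theta[i + 1:]
            down = theta[:i] + [theta[i] - eps] + theta[i + 1:]
            grad.append((sharpe_of(up, returns, window, delta)
                         - sharpe_of(down, returns, window, delta)) / (2 * eps))
        theta = [p + rho * g for p, g in zip(theta, grad)]   # ascent step, Eq. (7)
    return theta
```

With a fixed step size the Sharpe ratio is not guaranteed to increase at every episode, so this should be read as a structural outline of the loop rather than a tuned optimizer.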

4 Experiment and result

The proposed trading system is evaluated on U.S. stock markets, Dow Jones (DJI), S&P500 (S&P), and NASDAQ (IXIC). We first describe the data, evaluation metrics, and baseline methods, and then present the performance results.

4.1 Datasets

The data used in the experiment consist of 16 years of daily data of DJI (Fig. 6), S&P (Fig. 7), and IXIC (Fig. 8), ranging from 01/01/2005 to 12/31/2020, split into a training set (80%) and a test set (20%). Only daily closing prices are used in this study.

Fig. 6

DJI stock index [01/01/2005 - 12/31/2020][Train set in orange, Test set in blue].

Fig. 7

S&P stock index [01/01/2005 - 12/31/2020][Train set in orange, Test set in blue].

Fig. 8

IXIC stock index [01/01/2005 - 12/31/2020][Train set in orange, Test set in blue].

4.2 Evaluation metrics

Total Profit (TP) As shown in Equation (11), the total profit directly evaluates the return of the strategy in the backtest, where P_t is the return in the period of (t - 1, t]. $TP = \sum_{t = 1}^{T} P_{t},$ (11)

Sharpe ratio (SR) It measures the average return in excess of the risk-free rate per unit of total risk, where R_f is the return on the risk-free asset, $𝔼 [R_{p}]$ is the expected portfolio return, and σ_p is the standard deviation of the portfolio return. The Sharpe ratio is given in Equation (12), assuming R_f = 0 in our study. $SR = \frac{𝔼 [R_{p}] - R_{f}}{σ_{p}},$ (12)

4.3 Baseline methods

Buy and Hold (B&H) The buy-and-hold strategy is a classic investment strategy: buy stocks with the maximum amount of money, keep the position unchanged for a period of time, and finally realize the gains or losses of the stocks. This strategy ignores short-term price movements and technical indicators, but can effectively reduce transaction costs and tax costs. In this strategy, the investor buys the asset at the first time step of the investment and holds it until the last time step, when it is sold, regardless of the price fluctuations during the holding period.

RRL In this strategy, an agent trades a fixed position size in a single security. It assumes that the trader can take a short, neutral, or long position of constant magnitude, with the objective of maximizing the differential Sharpe ratio. See Section 3.1 for details.
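As a worked example of the evaluation metrics applied to the buy-and-hold baseline, the sketch below computes per-period returns for a constant long position and then the total profit of Equation (11) and the Sharpe ratio of Equation (12). The function names and the price series are made up, and charging the transaction cost once, when the position is opened, is an assumption of this sketch.

```python
import statistics

def buy_and_hold_returns(prices, delta=0.005):
    """Per-period returns of a constant long position.

    The position is opened at the first time step (one transaction cost)
    and then left unchanged, so later periods earn r_t = z_t - z_{t-1}.
    """
    returns = []
    for t in range(1, len(prices)):
        r_t = prices[t] - prices[t - 1]
        cost = delta if t == 1 else 0.0   # assumed: cost is paid only at entry
        returns.append(r_t - cost)
    return returns

def total_profit(period_returns):
    """Eq. (11): TP is the sum of the per-period returns."""
    return sum(period_returns)

def sharpe(period_returns, risk_free=0.0):
    """Eq. (12): (mean return - R_f) / standard deviation of the returns."""
    excess = statistics.mean(period_returns) - risk_free
    return excess / statistics.pstdev(period_returns)

prices = [100.0, 101.0, 103.0, 102.0]          # illustrative closing prices
bh = buy_and_hold_returns(prices)
tp, sr = total_profit(bh), sharpe(bh)
```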

4.4 Experimental setup

We compare the DNN-RL, LSTM-RL, and BiLSTM-RL methods with two baselines: the buy-and-hold (B&H) strategy and recurrent reinforcement learning (RRL). For the deep learning part, the return window length is 20, the network has three layers (input, hidden, and fully connected layers), and the number of units is set to 64. The input units (features) consist of the series of price returns (r_t = z_t - z_{t-1}) over the past M = 20 days, and the output unit corresponds to the best of the three trading actions. The transaction cost is set to 0.005.

To ensure that the inputs to the neural network are all within a reasonable range, we normalize all price returns using their mean and variance, and use Adam to optimize all models with a learning rate of 0.001. The models are implemented using the PyTorch library in Python, and training is stopped after 1000 iterations. For the RRL method, the code provided by [48] is used to build the system; the learning rate is set to 0.3 and the maximum number of training epochs is set to 1000.
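The preprocessing described in this section can be sketched as follows. The function names and the price values are hypothetical, and computing the normalization statistics on the training portion only is an assumption of this sketch; the text above does not state which portion of the data the mean and variance are computed on.

```python
import statistics

def price_returns(prices):
    """r_t = z_t - z_{t-1} for consecutive closing prices."""
    return [prices[t] - prices[t - 1] for t in range(1, len(prices))]

def split(series, train_fraction=0.8):
    """Chronological split: the first 80% for training, the rest for testing."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

def standardize(values, mean, std):
    """Z-score the values with the supplied statistics."""
    return [(v - mean) / std for v in values]

prices = [100.0, 101.0, 99.5, 102.0, 103.5, 103.0,
          104.0, 106.0, 105.0, 107.5, 108.0]          # illustrative prices
returns = price_returns(prices)
train, test = split(returns)
# Assumption: statistics come from the training set and are reused for the test set.
mean, std = statistics.mean(train), statistics.pstdev(train)
train_z = standardize(train, mean, std)
test_z = standardize(test, mean, std)
```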

4.5 Experimental results

Figures 9–11 show the total profit curves for the five algorithms on the test sets of the Dow Jones Industrial Average (DJI), the S&P 500 Index (S&P500), and NASDAQ (IXIC), respectively. The vertical axis in each figure represents the cumulative returns of the assets and the horizontal axis represents the out-of-sample test time points. As can be seen from the figures, the total profit curves of DNN-RL, LSTM-RL, and BiLSTM-RL are significantly higher than those of the RRL and B&H methods. Comparing DNN-RL, LSTM-RL, BiLSTM-RL, and RRL, the former three perform more consistently than the latter. DNN-RL, LSTM-RL, and BiLSTM-RL perform well in all three markets, whereas RRL performs worse than the B&H method on S&P500. The advantage of LSTM-RL and BiLSTM-RL can be attributed to two main points. One is the ability of LSTM and BiLSTM to detect market states from raw and noisy data. The other is the online nature of the method, which can adapt quickly to new market states. In particular, BiLSTM-RL outperforms LSTM-RL, because BiLSTM can capture both earlier and later information within the input window simultaneously and take the reverse relationship of the data into account.

Fig. 9

Comparison of total profit by different methods on DJI test set.

Fig. 10

Comparison of total profit by different methods on S&P test set.

Fig. 11

Comparison of total profit by different methods on IXIC test set.

As shown in Fig. 12, the agents generate proper trading signals with RRL, DNN-RL, LSTM-RL, and BiLSTM-RL on the test sets of the three stock indices. The trading signal plots reveal further interesting observations. The trading signals of DNN-RL, LSTM-RL, and BiLSTM-RL suggest that the agents have learned to act differently in different market situations. For example, in a falling market, the agent learns that selling is more profitable than the other actions and therefore tends to hold a short position. In a rising market, it learns that buying is more profitable than the other actions and therefore tends to hold a long position. This sensitivity to market status is mainly attributed to the ability of the DNN, LSTM, and BiLSTM to discover the market state from vast and noisy historical price signals.

Fig. 12

Comparison of the trading signals on different assets. The top panel shows the trading signals of RRL; the second panel shows those of DNN-RL; the third panel shows those of LSTM-RL; the fourth panel shows those of BiLSTM-RL. -1 represents selling the considered security, 0 represents a neutral position, and 1 represents buying the considered security.

A more detailed quantitative comparison is shown in Table 1, where we report two widely used metrics for stock trading: total profit and Sharpe ratio. The results show that our methods outperform the baseline methods in terms of both total profit and Sharpe ratio.

Table 1

Performances of different trading methods on different stock indices

Method	DJI TP	DJI SR	S&P500 TP	S&P500 SR	IXIC TP	IXIC SR
B&H	3.77	0.03	3.95	0.03	4.00	-0.01
RRL	56.75	0.07	-19.92	-0.02	-18.39	-0.02
DNN-RL	346.24	0.46	347.52	0.47	384.27	0.52
LSTM-RL	455.44	0.64	295.02	0.38	499.53	0.72
BiLSTM-RL	469.57	0.65	462.20	0.64	504.90	0.72

*TP: total profit; SR: Sharpe ratio.
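
For readers who wish to reproduce the two metrics reported in Table 1, the following sketch shows one common way to compute them. It assumes the per-step trading return used in Moody's RRL formulation [10], R_t = F_{t-1} r_t - c |F_t - F_{t-1}|, with positions F_t in {-1, 0, 1} and transaction cost c = 0.005; it also assumes a Sharpe ratio with no risk-free rate and no annualization. These choices and the function names are illustrative assumptions, not a statement of the exact evaluation code.

```python
import math
from typing import List, Sequence

COST = 0.005  # transaction cost stated in the experimental setup


def trading_returns(positions: Sequence[int], returns: Sequence[float],
                    cost: float = COST) -> List[float]:
    """Per-step return R_t = F_{t-1} * r_t - cost * |F_t - F_{t-1}|.

    positions[t] is the position F_t chosen at step t; the position before
    the first step is assumed to be 0 (flat).
    """
    out = []
    prev = 0
    for f, r in zip(positions, returns):
        out.append(prev * r - cost * abs(f - prev))
        prev = f
    return out


def total_profit(step_returns: Sequence[float]) -> float:
    """Total profit as the sum of the per-step trading returns."""
    return sum(step_returns)


def sharpe_ratio(step_returns: Sequence[float]) -> float:
    """Mean over standard deviation of the per-step returns (0.0 if constant)."""
    n = len(step_returns)
    mean = sum(step_returns) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in step_returns) / n)
    return mean / std if std > 0 else 0.0
```

Given a sequence of positions from any of the compared methods and the matching price returns, `total_profit(trading_returns(positions, returns))` and `sharpe_ratio(trading_returns(positions, returns))` would yield values comparable in kind to the TP and SR columns, subject to the assumptions above.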

5 Conclusions

In this paper, we propose a new hybrid method that extends Moody’s RRL algorithm by integrating DNN, LSTM, and BiLSTM into RRL, yielding DNN-RL, LSTM-RL, and BiLSTM-RL, respectively. We employ the DNN, LSTM, and BiLSTM to approximate the decision function that maps the state space to the action space, and then maximize the Sharpe ratio by gradient ascent to generate the trading strategies. The performance of the DNN-RL, LSTM-RL, and BiLSTM-RL methods is evaluated on three major U.S. stock indices, the S&P500, Dow Jones, and NASDAQ, and the results show that our methods have higher scalability and better robustness than the baseline methods. In particular, BiLSTM-RL performs better than the other methods because BiLSTM is able to make full use of information in both directions in an attempt to capture more effective information. Our proposed method uses the hidden state of the LSTM/BiLSTM to carry the information recorded at the current time step over to the next time step, and thus captures information in the financial time series more effectively.

Nevertheless, the performance of the proposed algorithm can still be improved, and we suggest four research directions. First, this paper uses only daily closing prices for trading, but many other kinds of data, such as fundamental data and news data, are available and could be integrated to achieve better performance. Second, the neural network structure proposed in this paper is relatively simple and uses a single type of network; other structures, such as an attention mechanism or a convolutional neural network, could be added. Third, multi-scale data such as weekly, hourly, and minute data could be introduced into our approach. Finally, we can consider substantial improvements based on other reinforcement learning algorithms, such as value-based or actor-critic algorithms.

Acknowledgments

The authors thank the anonymous reviewers for their valuable comments and suggestions, which greatly helped to improve the manuscript. This work is supported in part by the Faculty Research Grants, Macau University of Science and Technology (Project no. FRG-22-001-INT).

References

1. Lin, Prediction and Analysis of Financial Volatility Based on Implied Volatility and GARCH Model, Modern Economics & Management Forum 3(1) (2022), 48–56.

2. Ampountolas, The Effect of COVID-19 on Cryptocurrencies and the Stock Market Volatility: A Two-Stage DCC-EGARCH Model Analysis, Journal of Risk and Financial Management 16(1) (2023), 25.

3. Zolfaghari and Gholami, A hybrid approach of adaptive wavelet transform, long short-term memory and ARIMA-GARCH family models for the stock index prediction, Expert Systems with Applications 182 (2021), 115149.

4. Tang, Xu and Ye, The Way to Invest: Trading Strategies Based on ARIMA and Investor Personality, Symmetry 14(11) (2022), 2292.

5. Rustam, Vibranti D.F. and Widya, Predicting the direction of Indonesian stock price movement using support vector machines and fuzzy Kernel C-Means, Proceedings of the 3rd international symposium on current progress in mathematics and sciences 2017 (ISCPMS2017), 2017.

6. Soni, Tewari and Krishnan, Machine Learning Approaches in Stock Price Prediction: A Systematic Review, Journal of Physics: Conference Series 2161 (2022).

7. Sutton R.S. and Barto A.G., Reinforcement Learning: An Introduction, Cambridge, MIT press, 2018.

8. Mnih, et al., Human-level control through deep reinforcement learning, Nature 518(7540) (2015), 529–533.

9. Silver, Huang, Maddison C.J., et al., Mastering the game of Go with deep neural networks and tree search, Nature, 2016.

10. Moody J.E. and Saffell, Learning to trade via direct reinforcement, IEEE Transactions on Neural Networks 12(4) (2001), 875–889.

11. Parisi and Vasquez, Simple technical trading rules of stock returns: evidence from to in Chile, Emerging Markets Review 1 (2000), 152–164.

12. Gerritsen D.F., Bouri, Ramezanifar and Roubaud, The profitability of technical trading rules in the Bitcoin market, Finance Research Letters 34 (2020), 101263.

13. Persio L.D. and Honchar, Artificial neural networks architectures for stock price prediction, Comparisons and applications, 2016.

14. Zhou, Xu and Zhao, Tales of emotion and stock in China: volatility, causality and prediction, World Wide Web 21(4) (2018), 1093–1116.

15. Verma, Dey and Meisheri, Detecting, quantifying and accessing impact of news events on Indian stock indices, International Conference on Web Intelligence ACM, 2017.

16. Roondiwala, Patel and Varma, Predicting stock prices using LSTM, International Journal of Science and Research (IJSR) 6(4) (2017), 1754–1756.

17. Althelaya K.A., El-Alfy E.-S.M. and Mohammed, Evaluation of bidirectional LSTM for short- and long-term stock market prediction, 2018 9th international conference on information and communication systems (ICICS), pp. 151–156, 2018.

18. Huang, Kong, Li, Yang and Li, Discovery of trading points based on Bayesian modeling of trading rules, World Wide Web 21(6) (2018), 1473–1490.

19. and Cohen S.B., Stock movement prediction from tweets and historical prices, In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1970–1979, 2018.

20. Chen and Ge, Exploring the attention mechanism in lstm-based hong kong stock price movement prediction, Quantitative Finance 19(9) (2019), 1507–1515.

21. Shen and Shafiq M.O., Short-term stock market price trend prediction using a comprehensive deep learning system, Journal of big Data 7(1) (2020), 1–33.

22. Al Aradi and Hewahi, Prediction of stock price and direction using neural networks: Datasets hybrid modeling approach, 2020 International Conference on Data Analytics for Business and Industry: Way Towards a Sustainable Economy (ICDABI), pp. 1–6, 2020.

23. Mundra A.K., Mundra, Verma V.K. and Srivastava J.S., A deep learning based hybrid framework for stock price prediction, Journal of Intelligent & Fuzzy Systems 38 (2020), 5949–5956.

24. Sunny M.A.I., Maswood M.M.S. and Alharbi A.G., Deep learning-based stock price prediction using LSTM and bi-directional LSTM model, In 2020 2nd Novel Intelligent and Leading Emerging Sciences Conference (NILES), pp. 87–92, 2020.

25. Shah, Jain, Jolly and Godbole, Stock Market Prediction using Bi-Directional LSTM, 2021 International Conference on Communication information and Computing Technology (ICCICT), pp. 1–5, 2021.

26. Zhang, Dai H.N., Zhou, Mondal S.K., García M.M. and Wang, Forecasting cryptocurrency price using convolutional neural networks with weighted and attentive memory channels, Expert Systems with Applications 183 (2021), 115378.

27. , Wang and Cheng, A hybrid approach for stock trend prediction based on tweets embedding and historical prices, World Wide Web 24(3) (2021), 849–868.

28. Yang and Wang, Adaptability of Financial Time Series Prediction Based on BiLSTM, Procedia Computer Science 199 (2022), 18–25.

29. Lee M.C., Chang, Yeh, Chia, Liao and Chen, Applying attention-based BiLSTM and technical indicators in the design and performance analysis of stock trading strategies, Neural Computing & Applications 34 (2022), 13267–13279.

30. Chandola, Mehta, Singh, Tikkiwal V.A. and Agrawal, Forecasting Directional Movement of Stock Prices using Deep Learning, Annals of Data Science, pp. 1–18, 2022.

31. A.Q., Kapoor, A.V. C.J., Sivaraman A.K., Tee K.F., S.H., and J.N., Novel optimization approach for stock price forecasting using multi-layered sequential LSTM, Applied Soft Computing, 2022.

32. Bhandari H.N., Rimal, Pokhrel N.R., Rimal, Dahal K.R. and Khatri R.K., Predicting stock market index using LSTM, Machine Learning with Applications, 2022.

33. Deng, Bao, Kong, Ren and Dai, Deep Direct Reinforcement Learning for Financial Signal Representation and Trading, IEEE Transactions on Neural Networks and Learning Systems 28(3) (2017), 653–664. doi: 10.1109/TNNLS.2016.2522401.

34. David W., Agent inspired trading using recurrent reinforcement learning and lstm neural networks, arXiv preprint arXiv:1707.07338, 2017.

35. Almahdi and Yang S.Y., An adaptive portfolio trading system: A risk-return portfolio optimization using recurrent reinforcement learning with expected maximum drawdown, Expert Systems with Applications 87 (2017), 267–279.

36. Aboussalah A.M. and Lee C.G., Continuous control with stacked deep dynamic recurrent reinforcement learning for portfolio optimization, Expert Systems with Applications 140 (2020), 112891.

37. Nguyen H.T. and Luong N.H., Applying Deep Reinforcement Learning in Automated Stock Trading, In Soft Computing: Biomedical and Related Applications, pp. 285–297, 2021.

38. , Wang and Zhou, Ensemble Investment Strategies Based on Reinforcement Learning, Scientific Programming, 2022.

39. , Qin, Li, Huang and Hu, Single stock trading with deep reinforcement learning: A comparative study, 2022 14th International Conference on Machine Learning and Computing (ICMLC), 2022.

40. , Liu and Wang, Stock Trading Strategies Based on Deep Reinforcement Learning, Scientific Programming, 2022.

41. Zou, Lou, Wang and Liu, A Novel Deep Reinforcement Learning Based Automated Stock Trading System Using Cascaded LSTM Networks, arXiv preprint arXiv:2212.02721, 2022.

42. Malibari, Katib I.A. and Mehmood, Smart Robotic Strategies and Advice for Stock Trading Using Deep Transformer Reinforcement Learning, Applied Sciences, 2022.

43. Zhou, Tang and Li, Research on investment strategies of stock market based on sentiment indicators and deep reinforcement learning, In International Conference on Statistics, Applied Mathematics, and Computing Science (CSAMCS 2021) 12163 (2022), 1151–1156.

44. Sharpe W.F., The Sharpe ratio, Streetwise – the Best of the Journal of Portfolio Management, pp. 169–185, 1998.

45. Hochreiter and Schmidhuber, Long short-term memory, Neural computation 9(8) (1997), 1735–1780.

46. Schuster and Paliwal K.K., Bidirectional recurrent neural networks, IEEE transactions on Signal Processing 45(11) (1997), 2673–2681.

47. Graves and Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Networks 18(5-6) (2005), 602–610.

48. Koker T.E. and Koutmos, Cryptocurrency trading using machine learning, Journal of Risk and Financial Management 13(8) (2020), 178.