Abstract
Algorithmic trading of financial assets has developed rapidly with the rise of deep learning. In particular, deep reinforcement learning, which combines deep learning and reinforcement learning, stands out among decision-making approaches because of its high performance, strong generalization, and high fitting ability. In this paper, we propose a hybrid method of recurrent reinforcement learning (RRL) and deep learning to address the algorithmic trading problem of determining the optimal trading position in daily stock market trading. We adopt a deep neural network (DNN), a long short-term memory neural network (LSTM), and a bidirectional long short-term memory neural network (BiLSTM), respectively, to automatically extract higher-level abstract features from sequential trading data, and then generate trading strategies by interacting with the environment in a reinforcement learning framework. In particular, the BiLSTM, which consists of two LSTM models running in opposite directions, can use information from both directions of the input sequence. In the experiments, daily data of the Dow Jones, S&P500, and NASDAQ indices (from Jan-01, 2005 to Dec-31, 2020) are used to verify the performance of the proposed DNN-RL, LSTM-RL, and BiLSTM-RL trading systems. Experimental results show that the proposed methods significantly outperform the benchmark methods, RRL and Buy and Hold, with higher scalability and better robustness. Among them, BiLSTM-RL performs best.
Introduction
In recent years, algorithmic trading has played an important role in financial trading and has become a trend in today’s financial markets. The trading process can be described as a decision-making process that aims to maximize returns while reducing risk. Various financial assets are analyzed and traded within the framework of traditional algorithms such as GARCH [1, 2] and ARIMA [3, 4], as well as more recent machine learning-based algorithms [5, 6]. As part of the broader family of machine learning methods, deep learning is often used to forecast stock prices or trend movements in order to build financial trading strategies. However, such methods depend mainly on the accuracy of the forecast and can easily over-fit. Moreover, these methods cannot handle the continuous decision-making problems found in financial markets.
Compared with forecast-based methods, deep reinforcement learning learns a mapping from the state space to the action space through continuous, online self-learning. DQN (Deep Q-Network), the first major method to combine deep learning and reinforcement learning [7], outperformed humans on the Atari game platform [8]. AlphaGo, developed by the Google DeepMind team, then brought deep reinforcement learning wide attention and became a milestone in the history of artificial intelligence [9]. Deep reinforcement learning has achieved remarkable success in solving complex sequential decision problems, and more and more research therefore focuses on combining deep reinforcement learning with investment decision-making. It can extract features directly from high-dimensional raw financial data in a deep learning module, and then find dynamic trading strategies that maximize risk-adjusted returns by interacting with the environment in a reinforcement learning module.
In 2001, Moody et al. proposed the recurrent reinforcement learning (RRL) algorithm, with returns as the input and the differential Sharpe ratio as the objective function, for single assets and portfolios with a transaction cost of 0.005 [10]. However, that model extracts features from time series data in a linear manner, whereas financial markets are noisy and a nonlinear model is needed to extract higher-level features. In contrast, deep learning combines features through multilayer network structures and nonlinear transformations, which gives it strong perception and representation capabilities. In this paper, we propose an extended RRL algorithm based on DNN, LSTM, and BiLSTM, respectively, to learn higher-level abstract features from low-level raw time series data and to find the optimal trading strategy by reinforcement learning with the Sharpe ratio as the objective function.
In summary, the main contributions of this paper are as follows. First, we propose an improved deep reinforcement learning algorithm based on Moody’s RRL algorithm by incorporating DNN, LSTM, and BiLSTM into RRL to extract higher-level features. Compared with Moody’s RRL, our method uses only the sequence of returns and does not take the trading position of the previous moment as input. Experimental results show that our method can achieve higher profits without the previous trading position. Second, we compare the performance of our DNN-RL, LSTM-RL, and BiLSTM-RL models with the baseline models. Since BiLSTM can capture more effective information by using both directions of the input sequence, the results show that the BiLSTM-based approach outperforms the other methods on the U.S. index market.
The remainder of this paper is organized as follows. Section 2 reviews related work on deep reinforcement learning for algorithmic trading. Section 3 describes RRL in detail and presents the new hybrid methods, DNN-RL, LSTM-RL, and BiLSTM-RL. Section 4 presents the experiments and compares DNN-RL, LSTM-RL, and BiLSTM-RL with the baseline models. Section 5 gives the conclusions and future work.
Related work
Traditional algorithmic trading is based on mathematical techniques or the experience of human experts, such as trend-following and mean-reversion strategies. Parisi et al. used moving averages and trading range breakout rules to construct stock trading decisions and found that the strategy achieved consistent profits [11]. Gerritsen et al. applied seven trend-following indicators to assess the profitability of technical trading rules on Bitcoin data, finding that trading range breakout rules have significant predictive power for Bitcoin prices and were profitable [12]. More recently, artificial intelligence methods have come to play an important role in the study of trading in financial markets.
Among the different methods and techniques of machine learning, deep learning mainly fuses different data sources and extracts time series patterns to forecast stock prices and generate trading signals. Persio et al. considered MLP, CNN, and LSTM techniques to predict trends [13]. Zhou et al. proposed a prediction model that incorporates social media information by analyzing stock-related tweets on Weibo. Experiments show that the proposed model can outperform models that rely only on financial series data [14]. Verma et al. presented an LSTM-based model to forecast stock movement direction [15]. Roondiwala et al. proposed a recurrent neural network (RNN) and long short-term memory (LSTM) approach to forecast stock market indices [16]. Althelaya et al. applied bidirectional and stacked LSTMs to short-term and long-term forecasting of financial time series. The experimental results show that the bidirectional and stacked LSTMs outperformed shallow neural networks and vanilla LSTM models in predicting stock movements [17]. Huang et al. used Bayesian models to predict trading rules from trading patterns detected by a proposed double clustering algorithm [18]. Xu et al. introduced StockNet to forecast stock movements, using GRU to learn from tweets and historical prices [19]. Chen et al. investigated the use of an attention mechanism in LSTM networks for stock price trend prediction. The experimental results on the Hong Kong stock market show that the proposed model greatly enhanced the predictive performance of the LSTM [20]. Shen et al. proposed a comprehensive customization of feature engineering and an LSTM-based model to predict the price trend of the stock market [21]. AlAradi et al. proposed a framework combining LSTM and DNN to forecast price and movement direction by using multiple datasets including news sentiment, social sentiment, company earnings announcements, and technical indicators. The experimental results on the Apple Inc. stock show that the LSTM model outperformed the DNN model [22]. Mundra et al. proposed a hybrid approach to predict stock prices based on support vector machines (SVM) and LSTM. The experimental results showed that the proposed model has a prediction accuracy of 97% on the TATA Global Beverage stock dataset [23]. Mai et al. analyzed LSTM and BiLSTM for forecasting stock prices, which allows the model to obtain the best performance by capturing the temporal evolution of information [24–26]. Ni et al. proposed a hybrid method for forecasting the stock market based on tweet embeddings and historical prices [27]. Yang et al. applied BiLSTM to financial time series prediction and compared it with LSTM, support vector regression (SVR), and the autoregressive integrated moving average (ARIMA) model. Experiments show that the BiLSTM model has the highest prediction accuracy [28]. Lee et al. applied technical indicators to LSTM-attention time series models for stock price forecasting and trading strategy design [29]. Chandola et al. proposed a hybrid deep learning model combining Word2Vec and LSTM algorithms to predict directional movements in stock market prices using financial time series and news headlines as inputs [30]. Md et al. proposed a novel optimization method based on a multi-layer sequential long short-term memory model for predicting stock prices by considering past performance information as well as past trends and patterns [31]. Bhandar et al. adopted an LSTM to predict the next-day closing price of the S&P 500 index using fundamental market data, macroeconomic data, and technical indicators [32].
Unlike the above methods, which forecast stock prices directly, reinforcement learning methods find a dynamic trading strategy that maximizes profits by interacting with the environment. Deng et al. introduced a recurrent deep neural network for real-time financial signal representation and proposed a task-aware backpropagation method to alleviate the vanishing gradient problem in DRL [33]. Lu proposed a model that combines a long short-term memory (LSTM) recurrent structure with reinforcement learning. Experiments show that the system created successful strategies that achieve human-level long-term reward on GBPUSD [34]. Almahdi et al. used the recurrent reinforcement learning method to solve a dynamic portfolio optimization problem. Experiments show that an objective function based on expected maximum drawdown risk yields higher returns than the previously proposed RRL objective functions [35]. Aboussalah et al. proposed a stacked deep dynamic recurrent reinforcement learning (SDRRL) architecture to construct real-time optimal portfolios by capturing the latest market conditions and rebalancing the portfolios accordingly. Experiments show that the proposed SDRRL achieves better performance than three benchmarks: the rolling horizon mean-variance optimization (MVO) model, the rolling horizon risk parity model, and the uniform buy-and-hold (UBAH) index [36]. Nguyen et al. investigated three DRL algorithms, namely Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor-Critic (SAC), for their ability to solve automated stock trading problems [37]. Li et al. proposed a set of ensemble automated trading strategies in a deep reinforcement learning-based framework, which combined the advantages of three algorithms, PPO, A2C, and SAC [38]. Ge et al. applied the A2C, PPO, DDPG, TD3, and SAC models to automate the trading of a single stock [39]. Li et al. proposed a new deep reinforcement learning model for stock trading, which fuses features extracted by deep neural networks from different data sources into the state of the stock market; the agent then makes trading decisions through reinforcement learning [40]. Zou et al. proposed a deep reinforcement learning-based stock trading system with cascaded LSTMs, which first uses an LSTM to extract time-series features from daily stock data and then feeds the extracted features to the agent for training, while the strategy functions in reinforcement learning use another LSTM [41]. Malibari et al. combined deep reinforcement learning with a transformer network to produce a decision transformer architecture for online trading [42]. Zhou et al. proposed an improved deep recurrent DRQN-ARBR model by changing the fully connected layer in the original model to an LSTM layer and selecting more technical indicators, especially the emotion indicator ARBR, to construct a trading strategy [43].
However, the RRL method uses only a simple fully connected network with one hidden layer to generate the trading signal from the previous trading signal and the return sequence. In general, the more strategy rules and parameters a model has, the more likely it is to suffer from data accommodation bias. At the same time, according to the definition of the Markov decision process, when the environment state is fully observable, the optimal strategy depends only on the current state and not on historical information. Unlike RRL, we propose a simplified variant of RRL that automatically perceives dynamic market conditions through DNN/LSTM/BiLSTM, extracts features, generates trading signals directly, and optimizes the parameters of the trading system by backpropagation through time.
Methodology
Recurrent reinforcement learning
Intuitively, Moody’s RRL framework trains trading systems and portfolios by optimizing risk-adjusted investment returns (such as differential Sharpe ratio), taking into account the impact of transaction costs. It can be viewed as a stochastic control problem with investment decision-making. Fig. 1 shows the trading system based on recurrent reinforcement learning.

Fig. 1. Trading system based on recurrent reinforcement learning.
Moody’s RRL considered agents that trade fixed position sizes in a single security. It assumes that the trader can take a short, neutral, or long position with constant magnitude. The position $F_t \in \{-1, 0, 1\}$ is established or maintained at the end of each time interval $t$ and is evaluated at the end of period $t + 1$. Thus, it is possible to trade at the end of each period. The decision function is defined as follows,
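In Moody and Saffell's formulation [10], this decision function is generally written in the following form, where $\theta_t$ denotes the system parameters at time $t$, $z_t$ the price series, and $y_t$ any additional external variables:

$$F_t = F(\theta_t;\, F_{t-1},\, I_t), \qquad I_t = \{z_t, z_{t-1}, z_{t-2}, \ldots;\; y_t, y_{t-1}, y_{t-2}, \ldots\}$$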
Specifically, the decision function is formalized as follows,
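A common concrete instance, again following [10], is a single-layer system acting on the last $m + 1$ price returns $r_t = z_t - z_{t-1}$, with weights $u$, $v_0, \ldots, v_m$ and bias $w$:

$$F_t = \mathrm{sign}\left(u F_{t-1} + v_0 r_t + v_1 r_{t-1} + \cdots + v_m r_{t-m} + w\right)$$

During training, the sign function is usually replaced by $\tanh$ so that the decision function is differentiable.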
Therefore, the trading systems can be optimized by maximizing performance functions such as the Sharpe ratio [44],
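In its usual form, and assuming a zero risk-free rate and a fixed position size $\mu$, the Sharpe ratio over $T$ periods and the per-period trading return $R_t$ with transaction cost rate $\delta$ can be written as:

$$S_T = \frac{\mathrm{Average}(R_t)}{\mathrm{Standard\ Deviation}(R_t)}, \qquad R_t = \mu\left(F_{t-1}\, r_t - \delta\, \lvert F_t - F_{t-1} \rvert\right)$$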
The RRL model considers the effects of transaction costs, market impact, and taxes on a trader’s decision-making. Thus, transaction costs and market impact on the trading system are taken into account in the decision function. However, the reward at time $t$ already accounts for the transaction costs. Therefore, there is no need to include the transaction costs in the decision function as well. The details are discussed in Section 3.2.
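To make the role of transaction costs in the reward concrete, the following minimal sketch computes per-period trading returns and their Sharpe ratio. It assumes the additive form $R_t = F_{t-1} r_t - \delta |F_t - F_{t-1}|$ with unit position size and a zero risk-free rate; the function names `trading_returns` and `sharpe_ratio` are illustrative and do not come from the paper.

```python
import math

def trading_returns(price_returns, positions, delta=0.005):
    """Per-period rewards R_t = F_{t-1} * r_t - delta * |F_t - F_{t-1}|.

    price_returns[t] is r_t and positions[t] is F_t in [-1, 1].
    Assumption: the position before the first period is 0 (flat).
    """
    rewards = []
    prev = 0.0
    for r, f in zip(price_returns, positions):
        rewards.append(prev * r - delta * abs(f - prev))
        prev = f
    return rewards

def sharpe_ratio(rewards):
    """Mean of the rewards divided by their population standard deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((x - mean) ** 2 for x in rewards) / n
    return mean / math.sqrt(var) if var > 0 else 0.0
```

Because the cost term is already part of each reward, a policy that changes position often is penalized through the objective itself, without a separate cost input to the decision function.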
Herein, we extend RRL by integrating DNN, LSTM, and BiLSTM. We adopt DNN, LSTM, and BiLSTM, respectively, to replace the decision function that maps the state space to the action space, and then maximize the Sharpe ratio by gradient ascent to generate the trading strategies.
A simple decision function is defined as
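Assuming, as stated in the contributions, that the previous position is dropped and only the last $M$ returns are used, such a decision function can be written as follows, where $F$ is realized by a DNN, LSTM, or BiLSTM and $\theta$ denotes its parameters (this notation is an assumption for illustration):

$$F_t = F(\theta;\, r_t, r_{t-1}, \ldots, r_{t-M+1}), \qquad F_t \in [-1, 1]$$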
Traders learn strategies through trial-and-error exploration: they take actions and receive positive or negative reinforcement depending on the results. A trading performance function $U(\theta)$, such as profit, utility, or risk-adjusted return, is used to directly optimize the trading system parameters $\theta$. Therefore, we adopt reinforcement learning to update the weights of a deep neural network (DNN, LSTM, or BiLSTM) via gradient ascent on the utility function $U(\theta)$. Figure 2 shows the structure of DNN/LSTM/BiLSTM-RL.

Fig. 2. The structure of DNN/LSTM/BiLSTM-RL.
Some basic elements for constructing a trading system based on DNN/LSTM/BiLSTM-RL are listed as follows:
In the following sections, we will focus on the details of integrating DNN, LSTM, and BiLSTM into RRL, which can effectively capture information on financial time series.
A deep neural network (DNN) is a neural network with more than one hidden layer, also known as a deep feedforward network (DFN) or multi-layer perceptron (MLP). Its layers are divided by position into an input layer (the first layer), an output layer (the last layer), and hidden layers (all layers in between). Herein, we transform the 20 features of the network input into one output using a fully connected layer and map that output to [-1, 1] using the tanh activation function.
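The final mapping described above, 20 input features to one output squashed by tanh, can be sketched as follows. This is a minimal illustration that shows only that last fully connected step; the function name `dnn_position` and the weight values are hypothetical, not trained parameters from the paper.

```python
import math

def dnn_position(window, weights, bias):
    """Map a window of returns to a position in [-1, 1] via one dense layer + tanh."""
    if len(window) != len(weights):
        raise ValueError("window and weights must have the same length")
    pre_activation = sum(w * x for w, x in zip(weights, window)) + bias
    return math.tanh(pre_activation)

# Illustrative values only: 20 returns and 20 equal weights.
window = [0.01] * 20
weights = [0.5] * 20
position = dnn_position(window, weights, bias=0.0)  # a small long position
```

A positive output is read as a long position and a negative output as a short one, with the magnitude bounded by the tanh.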
However, extracting information from time-series data such as stock data remains challenging, and DNNs may be limited in their ability to deal with such data. When dealing with sequential data, a DNN cannot process each time step of the correlated time series in turn and retain the entire state of the sequence. Therefore, we further introduce LSTM/BiLSTM to efficiently capture both long-term and short-term information in financial time series. In the next section, we propose a new hybrid method based on recurrent reinforcement learning and LSTM/BiLSTM for algorithmic trading. The LSTM/BiLSTM is used to approximate the decision function, which maps the state space to the action space, and the Sharpe ratio is then maximized by gradient ascent to generate the trading strategy. Our proposed method is inspired by the RRL method and takes advantage of a property of LSTM/BiLSTM: their hidden state depends not only on the input at the current moment but also on the hidden state at the previous moment.
LSTM-RL
According to the time characteristics of financial data, we adopt a long short-term memory (LSTM) network to capture the temporal evolution of information [45]. An LSTM can use its internal state (memory) to process variable-length input sequences. A common LSTM cell consists of a cell $c_t$, an input gate $i_t$, an output gate $o_t$, and a forget gate $f_t$. The cell memorizes values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell. Figure 3 shows the detailed structure of the LSTM.

Fig. 3. The structure of LSTM.
The cell ensures that the state of the memory cell $c_t$ can remain constant from one time step to another. The input gate $i_t$ allows incoming signals to change the state of the memory cell or blocks them. The output gate $o_t$ allows the state of the memory cell to affect other neurons or blocks it. Finally, the forget gate $f_t$ regulates the self-recurrent connection of the memory cell, allowing the cell to remember or forget its previous state as needed. Suppose the input is
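For reference, the standard LSTM update is commonly written as follows, with input $x_t$, previous hidden state $h_{t-1}$, and previous cell state $c_{t-1}$. The weight matrices $W$, $U$ and biases $b$ are generic notation assumed here, $\sigma$ is the logistic sigmoid, and $\odot$ denotes element-wise multiplication:

$$\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$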
Therefore, we propose an LSTM-based agent to learn the features of massive stock data and adopt a reinforcement learning framework to generate trading strategies and improve the scalability and robustness of the model in financial trading. Figure 4 shows the structure of the decision function with LSTM. The whole structure can be divided into two parts, the LSTM network and a layer of fully connected nonlinear transformations. The financial time series data are represented as price returns with window length M, then the features are extracted by the LSTM module, and finally, the corresponding actions are generated by the fully connected layer.
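The two-part structure described above, an LSTM module followed by a fully connected nonlinear transformation, can be sketched in plain Python as follows. This is a minimal illustration using the standard LSTM update with one scalar return per time step; the parameter layout, the helper names (`init_params`, `lstm_step`, `lstm_position`), and the small hidden size are assumptions for readability and do not reproduce the paper's implementation.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_params(hidden, seed=0):
    """Small random parameters for the four gates plus the output layer."""
    rng = random.Random(seed)
    p = {}
    for g in ("i", "f", "o", "g"):
        p["W" + g] = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]
        p["U" + g] = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)]
                      for _ in range(hidden)]
        p["b" + g] = [0.0] * hidden
    p["V"] = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]  # dense weights
    p["d"] = 0.0                                              # dense bias
    return p

def lstm_step(x, h, c, p):
    """One LSTM step for a scalar input x and state lists h (hidden), c (cell)."""
    size = len(h)
    new_h, new_c = [], []
    for j in range(size):
        pre = {}
        for g in ("i", "f", "o", "g"):
            recurrent = sum(p["U" + g][j][k] * h[k] for k in range(size))
            pre[g] = p["W" + g][j] * x + recurrent + p["b" + g][j]
        i_t = sigmoid(pre["i"])    # input gate
        f_t = sigmoid(pre["f"])    # forget gate
        o_t = sigmoid(pre["o"])    # output gate
        g_t = math.tanh(pre["g"])  # candidate cell value
        c_j = f_t * c[j] + i_t * g_t
        new_c.append(c_j)
        new_h.append(o_t * math.tanh(c_j))
    return new_h, new_c

def lstm_position(window, p):
    """Run the LSTM over a window of returns, then dense layer + tanh."""
    hidden = len(p["V"])
    h, c = [0.0] * hidden, [0.0] * hidden
    for x in window:
        h, c = lstm_step(x, h, c, p)
    return math.tanh(sum(v * hj for v, hj in zip(p["V"], h)) + p["d"])
```

Only the hidden state after the last time step is passed to the dense layer here, which is one common choice; other pooling schemes over the hidden states are equally possible.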

Fig. 4. The structure of the LSTM-based decision function.
When our input is time series data with fixed length M, we can extract features from more perspectives. For example, the output of a moment is not only related to the information of past moments, but also to the information of subsequent moments. Therefore, we can add a network layer that transmits information in the reverse order of time to enhance the network’s capability. The bidirectional RNNs were introduced by [46]. The bidirectional LSTM consists of two layers of LSTM, which have the same input but different directions of information transmission [47]. Figure 5 shows the structure of the BiLSTM-based decision function.

Fig. 5. The structure of the BiLSTM-based decision function.
BiLSTM is a deep recurrent neural network. Unlike the standard LSTM, the input flows in both directions, so the network can use information from both directions. The forward LSTM captures feature information in the forward direction, while the backward LSTM captures feature information in the backward direction. BiLSTM exploits directional temporal information through a hidden layer that is updated in reverse time order. For any time step t, suppose the input is
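In the usual bidirectional formulation, with input $x_t$ and arrows marking the forward and backward passes (generic notation assumed here), the two hidden states are computed separately and then concatenated:

$$\overrightarrow{h}_t = \mathrm{LSTM}\left(x_t, \overrightarrow{h}_{t-1}\right), \qquad \overleftarrow{h}_t = \mathrm{LSTM}\left(x_t, \overleftarrow{h}_{t+1}\right), \qquad h_t = \left[\overrightarrow{h}_t;\, \overleftarrow{h}_t\right]$$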
Therefore, the agent adopts LSTM/BiLSTM based on the current state to provide the corresponding strategy directly and still considers the transaction cost in the reward. The detailed procedure of training LSTM/BiLSTM-RL has been summarized in Algorithm 1.
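The overall training idea, repeatedly adjusting the policy parameters in the direction that increases the Sharpe ratio, can be sketched as follows. This is not Algorithm 1: as a deliberately simplified stand-in, the policy is a tanh(linear) function rather than an LSTM/BiLSTM, and the gradient is estimated by central finite differences with plain gradient ascent instead of backpropagation with Adam. All names (`policy`, `sharpe_of_policy`, `train`) and default values other than the transaction cost are hypothetical.

```python
import math
import random

def policy(theta, window):
    """tanh(linear) position in [-1, 1]; theta[-1] is the bias."""
    return math.tanh(sum(w * x for w, x in zip(theta, window)) + theta[-1])

def sharpe_of_policy(theta, windows, next_returns, delta=0.005):
    """Sharpe ratio of the rewards earned by the policy.

    The position chosen from windows[t] earns next_returns[t]; the cost of
    moving into that position is charged in the same step (a simplification).
    """
    rewards, prev = [], 0.0
    for window, r in zip(windows, next_returns):
        f = policy(theta, window)
        rewards.append(f * r - delta * abs(f - prev))
        prev = f
    mean = sum(rewards) / len(rewards)
    var = sum((x - mean) ** 2 for x in rewards) / len(rewards)
    return mean / math.sqrt(var) if var > 0 else 0.0

def train(windows, next_returns, lr=0.05, iterations=30, eps=1e-5, seed=0):
    """Gradient ascent on the Sharpe ratio; returns the best parameters seen."""
    rng = random.Random(seed)
    theta = [rng.uniform(-0.1, 0.1) for _ in range(len(windows[0]) + 1)]
    best_theta = theta[:]
    best_value = sharpe_of_policy(theta, windows, next_returns)
    history = [best_value]
    for _ in range(iterations):
        grad = []
        for k in range(len(theta)):
            up, down = theta[:], theta[:]
            up[k] += eps
            down[k] -= eps
            diff = (sharpe_of_policy(up, windows, next_returns)
                    - sharpe_of_policy(down, windows, next_returns))
            grad.append(diff / (2 * eps))
        theta = [t + lr * g for t, g in zip(theta, grad)]  # ascent step
        value = sharpe_of_policy(theta, windows, next_returns)
        history.append(value)
        if value > best_value:
            best_theta, best_value = theta[:], value
    return best_theta, history
```

Keeping the best parameters seen, rather than the last ones, guards against a step that overshoots; with automatic differentiation the finite-difference loop would be replaced by a single backward pass.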
Experiments
The proposed trading system is evaluated on U.S. stock markets, Dow Jones (DJI), S&P500 (S&P), and NASDAQ (IXIC). We first describe the data, evaluation metrics, and baseline methods, and then present the performance results.
Datasets
The data used in the experiments consist of 16 years of daily data for DJI (Fig. 6), S&P (Fig. 7), and IXIC (Fig. 8), ranging from 01/01/2005 to 12/31/2020, split into a training set (80%) and a test set (20%). Only daily closing prices are used in this study.
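As a small illustration of the 80/20 split, the sketch below assumes the split is chronological, with the earliest 80% of observations used for training; the function name `chronological_split` is hypothetical.

```python
def chronological_split(series, train_fraction=0.8):
    """Split a time-ordered series into (train, test) without shuffling."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

# Toy example with 10 observations: the first 8 train, the last 2 test.
train_part, test_part = chronological_split(list(range(10)))
```

Keeping the test set strictly after the training set avoids leaking future information into training.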

Fig. 6. DJI stock index, 01/01/2005 to 12/31/2020 (training set in orange, test set in blue).

Fig. 7. S&P stock index, 01/01/2005 to 12/31/2020 (training set in orange, test set in blue).

Fig. 8. IXIC stock index, 01/01/2005 to 12/31/2020 (training set in orange, test set in blue).
Experimental setup
We compare the DNN-RL, LSTM-RL, and BiLSTM-RL methods with two baselines: the buy-and-hold (B&H) strategy and recurrent reinforcement learning (RRL). For the deep learning part, the return window length is 20, the network has three layers (input, hidden, and fully connected layers), and the number of units is set to 64. The input features are the price returns ($r_t = z_t - z_{t-1}$) over the past $M = 20$ days, and the output unit corresponds to the best of the three trading actions. The transaction cost is set to 0.005.
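The input construction described above can be sketched as follows: price differences $r_t = z_t - z_{t-1}$ are computed from closing prices and grouped into sliding windows of length $M$. The function names `price_returns` and `sliding_windows` are illustrative assumptions.

```python
def price_returns(prices):
    """r_t = z_t - z_{t-1} for consecutive closing prices z."""
    return [b - a for a, b in zip(prices, prices[1:])]

def sliding_windows(returns, m=20):
    """All length-m windows of consecutive returns, oldest value first."""
    return [returns[i:i + m] for i in range(len(returns) - m + 1)]

# Toy example with a short window so the result is easy to inspect.
toy_returns = price_returns([1, 3, 6, 10])
toy_windows = sliding_windows(toy_returns, m=2)
```

Each window would then be fed to the decision function to produce the position for the following period.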
To ensure that the inputs to the neural network are within a reasonable range, we standardize all price returns using their mean and variance. All models are optimized with Adam using a learning rate of 0.001. The models are implemented with the PyTorch library in Python, and training is stopped after 1000 iterations. For the RRL method, the code provided by [48] is used to build the system, with the learning rate set to 0.3 and the maximum number of training epochs set to 1000.
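The normalization step can be sketched as a standard z-score. The sketch assumes the mean and standard deviation are estimated on the training returns and then reused for the test returns, which is a common convention but is not stated in the text; the function names are hypothetical.

```python
import math

def fit_standardizer(train_returns):
    """Return (mean, std) of the training returns (population std)."""
    n = len(train_returns)
    mean = sum(train_returns) / n
    var = sum((r - mean) ** 2 for r in train_returns) / n
    return mean, math.sqrt(var)

def standardize(returns, mean, std):
    """Apply (r - mean) / std to every return."""
    if std == 0:
        raise ValueError("standard deviation is zero; cannot standardize")
    return [(r - mean) / std for r in returns]

mean, std = fit_standardizer([1.0, 3.0])
scaled = standardize([1.0, 3.0, 5.0], mean, std)
```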
Experimental results
Figures 9–11 show the total profit curves of the five algorithms on the test sets of the Dow Jones Industrial Average (DJI), S&P 500 Index (S&P500), and NASDAQ (IXIC), respectively. The vertical axis represents the cumulative returns of the assets and the horizontal axis represents the out-of-sample test time points. As can be seen from the figures, the total profit curves of DNN-RL, LSTM-RL, and BiLSTM-RL are significantly higher than those of the RRL and B&H methods. Comparing DNN-RL, LSTM-RL, and BiLSTM-RL with RRL, the former three perform more consistently than the latter. DNN-RL, LSTM-RL, and BiLSTM-RL perform well in all three markets, whereas RRL performs worse than the B&H method on S&P500. The advantage of LSTM-RL and BiLSTM-RL can be attributed to two main points. One is the ability of LSTM and BiLSTM to detect market states from raw and noisy data. The other is their online nature, which allows them to adapt quickly to new market states. In particular, BiLSTM-RL outperforms LSTM-RL because BiLSTM can capture past and future information within the input window simultaneously and take the reverse-order relationships in the data into account.

Fig. 9. Comparison of total profit by different methods on DJI test set.

Fig. 10. Comparison of total profit by different methods on S&P test set.

Fig. 11. Comparison of total profit by different methods on IXIC test set.
As shown in Fig. 12, the agents generate trading signals with RRL, DNN-RL, LSTM-RL, and BiLSTM-RL on the test sets of the three stock indices. Several interesting observations can be made from the trading signal plots. The signals of the DNN-RL, LSTM-RL, and BiLSTM-RL methods suggest that the agents have learned to act differently in different market situations. For example, in a falling market, the agent learns that selling is more profitable than other actions and therefore tends to hold a short position. In a rising market, it learns that buying is more profitable than other actions and therefore tends to hold a long position. This sensitivity to market status is mainly attributed to the ability of the DNN, LSTM, and BiLSTM to discover the market state from large volumes of noisy historical price signals.

Fig. 12. Comparison of the trading signals on different assets. The top panel shows the trading signals of RRL; the second panel shows those of DNN-RL; the third panel shows those of LSTM-RL; the fourth panel shows those of BiLSTM-RL. -1 represents selling the considered security, 0 represents a neutral position, and 1 represents buying the considered security.
A more detailed quantitative comparison is shown in Table 1, where we report two widely used metrics for stock trading: total profit and Sharpe ratio. The results show that our methods outperform the baseline methods in terms of both total profit and Sharpe ratio.
Table 1. Performance of different trading methods on different stock indices.
*TP: total profit; SR: Sharpe ratio.
Conclusions
In this paper, we propose a new hybrid method that extends Moody’s RRL algorithm by integrating DNN, LSTM, and BiLSTM into RRL to obtain DNN-RL, LSTM-RL, and BiLSTM-RL, respectively. We employ DNN, LSTM, and BiLSTM to approximate the decision function that maps the state space to the action space, and then maximize the Sharpe ratio by gradient ascent to generate the trading strategies. The performance of the DNN-RL, LSTM-RL, and BiLSTM-RL methods is evaluated on three major indices of the U.S. stock market, S&P500, Dow Jones, and NASDAQ, and the results show that our methods have higher scalability and better robustness than the baseline methods. In particular, BiLSTM-RL performs better than the other methods because BiLSTM can use information from both directions of the input sequence. Our proposed method uses the hidden state of the LSTM/BiLSTM to carry information recorded at the current time step to the next time step, and thereby captures information in financial time series more effectively.
Nevertheless, the performance of the proposed algorithm can still be improved. Four research directions are suggested to improve the framework. First, this paper uses only daily closing prices for trading, but many other kinds of trading data exist, such as fundamental data and news data, and they could be integrated to achieve better performance. Second, the neural network structure proposed in this paper is relatively simple and uses a single type of network, so other structures, such as an attention mechanism or a convolutional neural network, could be added. Third, multi-scale data such as weekly, hourly, and minute data could be introduced into our approach. Finally, substantial improvements could be pursued with other reinforcement learning algorithms, such as value-based or actor-critic algorithms.
Acknowledgments
The authors thank the anonymous reviewers for their valuable comments and suggestions that greatly helped to improve the manuscript. This work is supported partly by the Faculty Research Grants, Macau University of Science and Technology (Project no. FRG-22-001-INT).
