Abstract
Portfolio management remains a crucial challenge in finance, with traditional methods often falling short in complex and volatile market environments. While deep reinforcement learning approaches have shown promise, they still face limitations in dynamic risk management, exploitation of temporal market characteristics, and incorporation of complex trading strategies such as short-selling. These limitations can lead to suboptimal portfolio performance, increased vulnerability to market volatility, and missed opportunities in capturing potential returns from diverse market conditions. This paper introduces a Deep Reinforcement Learning Portfolio Management Framework with Time-Awareness and Short-Selling (MTS), offering a robust and adaptive strategy for sustainable investment performance. This framework utilizes a novel encoder-attention mechanism to address the limitations by incorporating temporal market characteristics, a parallel strategy for automated short-selling based on market trends, and risk management through innovative Incremental Conditional Value at Risk, enhancing adaptability and performance. Experimental validation on five diverse datasets from 2019 to 2023 demonstrates MTS’s superiority over traditional algorithms and advanced machine learning techniques. MTS consistently achieves higher cumulative returns, Sharpe, Omega, and Sortino ratios, underscoring its effectiveness in balancing risk and return while adapting to market dynamics. MTS demonstrates an average relative increase of 30.67
Introduction
Portfolio management is a crucial component of financial investment, aiming to optimize the balance between risk and return. 1 A well-constructed portfolio can mitigate risks and capitalize on market opportunities, thereby achieving sustainable growth. Traditional portfolio management relies on diversification and periodic rebalancing to manage risk and enhance returns. 2 However, the dynamic nature of financial markets necessitates more sophisticated approaches that can adapt to changing conditions.3,4 Market volatility, economic shifts, and unforeseen global events can all impact investment outcomes, making traditional static strategies insufficient.5,6
To address complex decision-making problems, the Deep Q-Network was introduced, marking the inception and widespread recognition of Deep Reinforcement Learning (DRL). 7 DRL has emerged as a promising approach in portfolio management due to its ability to learn and adapt to complex environments.8,9 It integrates reinforcement learning with deep learning techniques, empowering algorithms to dynamically optimize strategies through continuous learning from historical patterns and real-time market feedback, thereby demonstrating robust adaptability to the inherent volatility and complex dynamics of financial markets. 10
Leveraging these advancements, various frameworks have been developed to enhance trading strategies. The Ensemble of Identical Independent Evaluators (EIIE) and its variants use ensemble learning and policy gradients for cryptocurrency portfolios. 11 The Investor-Imitator (IMIT) mimics investor behavior for knowledge extraction. 12 The FinRL framework supports various algorithms, such as Deep Deterministic Policy Gradient (DDPG), Soft Actor-Critic (SAC), and Proximal Policy Optimization (PPO), for single and multi-stock trading. 13 The Ensemble Strategy (ES) selectively applies these algorithms to different time intervals. 14 TradeMaster is an open-source platform for reinforcement learning-based trading that features a multi-strategy integration framework. 15 SARL integrates price movement predictions to enhance trading decisions. 16 Additionally, our previous work introduced a framework utilizing the Memory Instance Gated Transformer (MIGT) for effective portfolio management. 17
Despite the advancements in DRL-based portfolio management, the design of these strategies lacks comprehensive risk management and adaptability, which can be identified as three limitations. First, due to the reward structures used in training these models, most DRL-based portfolio management strategies overly emphasize return maximization, often at the expense of risk control, which might prioritize immediate gains over long-term stability, leading to potential large portfolio losses. 18 Some frameworks attempt to address risk management issues, but their approaches often rely on ex-post risk evaluation, such as liquidating all investments upon reaching a predefined risk threshold.13,19 This unbalanced approach can result in algorithms that pursue extreme returns without adequately managing the accompanying risks, which leads to significant losses or heightened volatility. Second, existing strategies assume that market conditions are stationary and often do not account for temporal market characteristics, such as the weekend effect or the turn-of-the-month effect, which can significantly influence portfolio performance.20–24 Some studies employ the Fast Fourier Transform (FFT) to capture temporal or seasonal features, but FFT assumes that the input data is stationary, which does not align with the realities of the stock market. 25 Third, most portfolio management strategies assume that only long positions are allowed while short selling is prohibited, which does not reflect the realities of the stock market.26,27 This results in strategies failing to generate positive returns in declining or volatile markets. Despite short-selling being considered in some studies, thresholds and caps are mostly set manually or used only for risk hedging and without adequate risk controls, which can lead to significant potential for increased volatility and unforeseen losses.28–30
The key contributions of our research are three-fold. First, we incorporate risk management into DRL algorithms using Incremental Conditional Value at Risk (ICVaR), addressing the limitation of overly emphasizing return maximization at the expense of risk control. These measures are integrated into the reward function to guide the algorithm towards better risk control, making the algorithms more adaptable to real-world scenarios. Second, we propose a new encoder and attention mechanism for DRL networks to incorporate temporal market characteristics and capture both long-term and short-term temporal patterns, enhancing their performance in varied market conditions. This innovation aims to develop robust and adaptive portfolio management strategies that better align with the complexities of real-world financial markets and effectively address the challenge of underutilizing temporal market characteristics. Third, we propose a parallel strategy for managing short-selling based on market trends, such as bull and bear markets. This strategy not only regulates the allowance of short-selling but also automates the control of risk aversion levels, enhancing the overall risk management framework. This comprehensive approach resolves the issues related to managing short-selling risks and limitations in existing strategies.
This paper is organized as follows: Section 2 provides the necessary definitions and assumptions for the portfolio management environment. In Section 3, we present the proposed portfolio management framework, including the Markov decision process, risk controls, short-selling control framework, and time-aware embedding and attention. Experiments are conducted in Section 4 to evaluate the performance of our approach. Finally, we discuss the results and conclude the paper with future research directions in Section 5.
Definition
Transaction process
In the DRL framework, the agent dynamically adjusts its capital allocation across stock classes at each discrete trading period

Interaction Process for Time Series.
For the constructed trading environment to be as realistic as possible, the following assumptions are made: Our simulated trades, grounded in historical data, are presumed not to influence stock prices. This premise is rooted in our trading volume being infinitesimal relative to the market’s overall size, making any impact on stock prices negligible.11,37 Given the considered daily trading frequency, each transaction price is set to the previous day’s adjusted closing price, and the trading takes place in real time. 13
Using the adjusted closing price accurately reflects the impact of dividends and stock splits on stock prices, ensuring data consistency and accuracy in investment decisions. Actions in period To reflect various market expenses like trading and execution fees, we implement a universal transaction fee rate
Portfolio management framework
The Markov decision process of portfolio management
The stock market is inherently stochastic, characterized by random and unpredictable movements in stock prices. 38 To address the complexities of stock trading, we model the portfolio management task as a Markov Decision Process (MDP), as shown in Figure 2. In this framework, the portfolio management problem is formulated as a decision-making process whose goal is to optimize the trading strategy.

The Markov decision process of the DRL environment, reflecting the interaction of state, reward, action, DRL agents, and the stock market environment.
In our MDP model, the state
The training process of the DRL agent involves observing the changes in state
The iterative nature of this learning process allows the DRL agent to refine its strategy over time. Through this reward-driven feedback loop, the agent can discover complex patterns and relationships in market data that are not immediately apparent. This process involves exploration of different actions and exploitation of known strategies to maximize cumulative rewards over time. The ultimate objective is to develop a trading strategy that generates the highest possible returns while managing risks and adapting to changing market conditions.
We first establish the DRL environment, including arrays representing stock prices and technical indicators. The state space is then defined to include elements such as the account cash amount, market volatility index, and technical indicators, using a Box space (a continuous range with defined upper and lower bounds for each variable) to indicate continuous states. The action space is defined so that actions represent adjusting the weight of each stock, which is also continuous. The cash and stock holdings are randomly initialized to reset the environment, and the total stocks are calculated. Stock trades are executed based on the actions, and the resulting rewards and new states are calculated and returned. Portfolio details such as cash, holdings, and market indices are converted into state vectors to obtain the state. In addition, constraints such as a minimum number of shares are added to limit the impact of market volatility and transaction costs on trades.
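The environment steps described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the class name `PortfolioEnvSketch`, the default `fee_rate` and `initial_cash` values, the long-only weight check, and the fixed (rather than random) initialization are all assumptions made to keep the example short. A Gym-style Box space is omitted for the same reason.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PortfolioEnvSketch:
    """Toy daily-rebalancing environment (illustrative only)."""
    prices: List[List[float]]          # prices[t][i]: adjusted close of stock i on day t
    fee_rate: float = 0.001            # assumed universal transaction fee rate
    initial_cash: float = 1_000_000.0  # assumed starting capital
    t: int = field(default=0, init=False)
    cash: float = field(default=0.0, init=False)
    shares: List[float] = field(default_factory=list, init=False)

    def reset(self) -> List[float]:
        # Simplification: fixed initial cash and empty holdings.
        self.t = 0
        self.cash = self.initial_cash
        self.shares = [0.0] * len(self.prices[0])
        return self._state()

    def total_assets(self) -> float:
        p = self.prices[self.t]
        return self.cash + sum(s * q for s, q in zip(self.shares, p))

    def _state(self) -> List[float]:
        # State vector: cash, current prices, current holdings.
        return [self.cash, *self.prices[self.t], *self.shares]

    def step(self, weights: List[float]) -> Tuple[List[float], float, bool]:
        """Rebalance to the target weights at today's prices, then advance one day."""
        if any(w < 0 for w in weights) or sum(weights) > 1 + 1e-9:
            raise ValueError("weights must be non-negative and sum to at most 1")
        before = self.total_assets()
        p = self.prices[self.t]
        target = [w * before / q for w, q in zip(weights, p)]
        traded_value = sum(abs(n - s) * q for n, s, q in zip(target, self.shares, p))
        fee = self.fee_rate * traded_value          # fee is paid out of cash
        self.cash = before - sum(n * q for n, q in zip(target, p)) - fee
        self.shares = target
        self.t += 1
        after = self.total_assets()
        reward = after / before - 1.0               # percentage change in total assets
        done = self.t == len(self.prices) - 1
        return self._state(), reward, done
```

`reset()` must be called before the first `step()`. Because the fee is deducted from cash, target weights that sum to exactly 1 leave a slightly negative cash balance in this sketch; a fuller environment would reserve a cash buffer or scale the trade.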
The action
At each time period
The value of stocks
To enhance our model’s ability to navigate complex and dynamic market conditions, we make our risk judgments and reward functions more sensitive to incremental changes in tail risk. Tail Risk refers to the probability of extreme, low-frequency market events that reside in the tails of a return distribution. As an integral component of our risk assessment and reward function, we propose Incremental Conditional Value at Risk (ICVaR) to effectively model and manage this risk. This risk metric builds upon two foundational concepts in financial risk management: Conditional Value at Risk (CVaR) and Incremental Value at Risk (IVaR).39–41 Their base formula Value at Risk (VaR) is
In Equation (8),
Building on VaR, we introduce the Conditional Value at Risk, 39 i.e.,
While CVaR provides a more comprehensive view of potential losses beyond VaR, it does not account for the evolving nature of risk over time. As market conditions can change rapidly, a static measure may not be sufficient for real-time risk management. Therefore, to enhance our model’s responsiveness to market fluctuations and provide a more accurate assessment of changing risk profiles, we define ICVaR:
The environment moves to a new state based on the current state and action taken, with the reward function calculating the percentage change in total assets considering risk, as computed by the ICVaR averse utility function:
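For orientation, the tail-risk quantities that ICVaR builds on can be estimated from a sample of returns as sketched below. This is a minimal illustration using standard historical-simulation estimators of VaR and CVaR, not the ICVaR formula or utility function of this paper. The function names, the confidence level `alpha`, and in particular `incremental_cvar` (which measures the change in CVaR when a candidate position is blended into the portfolio, following the usual incremental-VaR convention) are assumptions.

```python
import math
from typing import Sequence


def historical_var(returns: Sequence[float], alpha: float = 0.95) -> float:
    """Empirical alpha-quantile of the loss distribution (loss = -return)."""
    losses = sorted(-r for r in returns)
    k = min(len(losses) - 1, math.ceil(alpha * len(losses)) - 1)
    return losses[max(k, 0)]


def historical_cvar(returns: Sequence[float], alpha: float = 0.95) -> float:
    """Mean loss over the observations at or beyond the VaR threshold."""
    var = historical_var(returns, alpha)
    tail = [-r for r in returns if -r >= var]
    return sum(tail) / len(tail)


def incremental_cvar(
    portfolio_returns: Sequence[float],
    candidate_returns: Sequence[float],
    weight: float,
    alpha: float = 0.95,
) -> float:
    """Change in CVaR after blending in a candidate position (assumed definition)."""
    blended = [
        (1.0 - weight) * p + weight * c
        for p, c in zip(portfolio_returns, candidate_returns)
    ]
    return historical_cvar(blended, alpha) - historical_cvar(portfolio_returns, alpha)
```

A negative `incremental_cvar` means the candidate position reduces estimated tail loss; for example, shifting half of the portfolio into a zero-return cash position halves the CVaR in this estimator.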
The system architecture diagram of our framework is shown in Figure 3. Central to the framework is the primary investment strategy (Algorithm 1), determining and executing the final portfolio allocation and position deployment. Running in parallel is a sub-strategy which operates independently and is not directly involved in the final master portfolio management strategy but is used to generate analytical judgment and inform decision-making.

The proposed MTS framework for portfolio management. The first part is data input and preprocessing, which involves separating the time features and feeding them along with the stock data into the neural network. The second part involves the training of neural networks and reinforcement learning, where the core of the neural network is Time-Aware Embedding and Attention. It outputs actions
The overall system architecture of this DRL portfolio framework is as follows. The system begins by setting up the financial investment environment, configuring asset holdings, cash balances, and relevant parameters such as risk aversion and ICVaR parameters. A parallel sub-strategy operates to assess current market conditions and determine appropriate trading actions in real time. At each time step, the system retrieves market data and updates the internal state accordingly. The agent then uses a deep neural network to select an action based on the observed state, which may include long or short trading positions. After executing the chosen trade, the system updates the portfolio, recalculates the asset holdings and cash balance, and computes the total assets. Risk is managed by calculating ICVaR, which adjusts for risk aversion, ensuring that the agent takes risk-aware actions. Finally, the system calculates the reward for the agent, accounting for both profitability and risk, and then checks if the episode has ended before resetting for the next cycle.
We incorporate a parallel strategy (Algorithm 2) into the main strategy to identify the market trend and thereby decide whether short-selling is executed and what its upper limit is. It uses the same structure as the main strategy, except that its portfolio consists only of the Dow Jones index and the cash account. Furthermore, the objective of its reward function is to maximize returns, and no fees are assumed, because it is a simplified model intended to gauge broader market trends and track overall market movement efficiently. Its reward function is therefore simpler than that of the main strategy, i.e.,
Finally, the master strategy uses all reference information to make the final asset investment and portfolio construction decisions to achieve the predefined investment objectives. In this way, the independent sub-strategy serves as a valuable reference and benchmark for comparison. The main portfolio strategy remains the primary focus: it makes the final investment deployment, short-selling judgment, and execution decisions, and all position control and risk adjustments are attributed to it.
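One way to picture how a trend signal from the parallel strategy could constrain the main strategy's actions is the following sketch. It is purely illustrative: the function `apply_short_selling_gate`, the `"bull"`/`"bear"` labels, and the `short_cap` default are hypothetical and are not taken from Algorithm 1 or Algorithm 2.

```python
from typing import List


def apply_short_selling_gate(
    raw_weights: List[float],
    market_trend: str,
    short_cap: float = 0.3,
) -> List[float]:
    """Clip per-stock target weights according to the detected market trend.

    market_trend: "bull" or "bear" (assumed output of the parallel strategy).
    short_cap: assumed upper limit on the short position per stock.
    """
    if market_trend not in ("bull", "bear"):
        raise ValueError("market_trend must be 'bull' or 'bear'")
    # Short positions (negative weights) are only allowed in a bear market.
    lower = -short_cap if market_trend == "bear" else 0.0
    return [min(1.0, max(lower, w)) for w in raw_weights]
```

In a bull market the gate removes short positions entirely, while in a bear market it allows them up to the cap, so the agent's raw action never exceeds the limit set by the trend signal.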
The novel Time-aware Relative Multi-Head Attention algorithm (Figure 3b and Algorithm 3) introduces advancements in processing sequential data with complex temporal relationships. The Fast Fourier Transform (FFT) is commonly used for analyzing frequency components in time series data. 42 However, FFT has limitations in capturing non-stationary patterns and complex temporal dependencies, making it difficult to process real-world data that exhibit irregular and multi-scale temporal patterns. 25 The Time-aware Relative Multi-Head Attention algorithm can model these intricate temporal dependencies directly in the time domain, allowing for a more comprehensive understanding of the data’s temporal structure.
At the core of this algorithm lies an innovative time feature embedding mechanism, which captures long-term and short-term temporal patterns with weekly and monthly effects. The Time Feature Embedding (Algorithm 4 and part of Figure 3b) starts by initializing parameters for output dimension
The attention mechanism in this algorithm builds upon the multi-head attention structure of traditional Transformers but with several key enhancements. The process begins with the projection of the input into queries
The advantages of our attention mechanism are threefold. First, it enables the model to capture and utilize temporal patterns at multiple scales simultaneously, addressing a key limitation of existing Transformers in handling time series data. 44
Second, using masked queries
Experiments
Comparison with other approaches
To test the performance of the new framework in real markets, we carry out experiments using historical data, with the 30 stocks of the Dow Jones Industrial Average (DJIA) as the portfolio. We use data from each of the five most recent full calendar years, 2019 to 2023, as separate test sets (see Table 1). We assess each strategy using four key metrics: cumulative returns, Sharpe ratio, Omega ratio, and Sortino ratio.46–48
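The four metrics can be computed from a series of portfolio values as in the following sketch. It uses common textbook definitions; the annualization factor of 252 trading days, the zero risk-free rate, and the zero threshold for the Omega and Sortino ratios are assumptions, since the exact conventions used in the experiments are not restated here.

```python
import math
from typing import List, Sequence

TRADING_DAYS = 252  # assumed annualization factor


def cumulative_return(values: Sequence[float]) -> float:
    return values[-1] / values[0] - 1.0


def daily_returns(values: Sequence[float]) -> List[float]:
    return [b / a - 1.0 for a, b in zip(values, values[1:])]


def sharpe_ratio(returns: Sequence[float], risk_free: float = 0.0) -> float:
    """Annualized mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((x - mean) ** 2 for x in excess) / (len(excess) - 1)
    return mean / math.sqrt(var) * math.sqrt(TRADING_DAYS)


def sortino_ratio(returns: Sequence[float], target: float = 0.0) -> float:
    """Like the Sharpe ratio, but penalizing only returns below the target."""
    excess = [r - target for r in returns]
    mean = sum(excess) / len(excess)
    downside = math.sqrt(sum(min(x, 0.0) ** 2 for x in excess) / len(excess))
    return mean / downside * math.sqrt(TRADING_DAYS)


def omega_ratio(returns: Sequence[float], threshold: float = 0.0) -> float:
    """Sum of gains above the threshold divided by sum of losses below it."""
    gains = sum(max(r - threshold, 0.0) for r in returns)
    losses = sum(max(threshold - r, 0.0) for r in returns)
    return gains / losses
```

The Sortino and Omega ratios are undefined when no return falls below the target or threshold; the sketch does not guard against that case.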
Time composition of the training and test set.
The most direct way to evaluate a portfolio management framework is to compare cumulative returns:
To assess the performance of our strategy, we compare it with traditional statistical and DRL strategies. Our traditional statistical strategies are based on mean reversion, trend following, cost optimization, and machine learning principles. Mean reversion strategies are based on the assumption that stock prices will revert to their historical averages over time:
Confidence Weighted Mean Reversion (CWMR) 49 models the portfolio vector as a Gaussian distribution, updating it sequentially according to the mean reversion trading principle. Online Moving Average Reversion (OLMAR) 50 leverages multi-period moving average regression to inform its strategy. Passive Aggressive Mean Reversion (PAMR) 51 utilizes the mean reversion relationship in financial markets and applies online passive-aggressive learning techniques from machine learning. Robust Median Reversion (RMR) 52 uses the mean reversion properties of financial markets and implements robust L1-median estimation to address outliers in mean reversion. Weighted Moving Average Mean Reversion (WMAMR) 53 predicts stock price trends by taking a weighted moving average of stock prices.
Cost optimization strategies focus on reducing transaction costs to improve overall returns. Transaction Costs Optimization (TCO) 54 combines L1 parametrization of the difference between consecutive allocations with the principle of maximizing expected logarithmic returns, accommodating non-zero transaction costs. Trend-following strategies identify and capitalize on the momentum of stock prices. BEST 55 selects the stock with the best performance from the previous day. Machine learning strategies utilize advanced algorithms to identify patterns and make predictions based on data. Nearest Neighbor-based Strategy (BNN) 56 classifies or predicts groups of data points based on proximity. Correlation-driven Nonparametric Learning Approach (CORN) 57 uses correlations to infer relationships between variables in a nonparametric learning context.
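To make the simplest of these baselines concrete, the rule described above for BEST (hold the stock that performed best on the previous day) can be sketched as follows. The function name and the one-hot weight representation are assumptions for illustration and are not taken from the cited implementation.

```python
from typing import List, Sequence


def best_previous_day_weights(
    prev_prices: Sequence[float],
    curr_prices: Sequence[float],
) -> List[float]:
    """Put all capital into the stock with the highest previous-day price relative."""
    relatives = [c / p for p, c in zip(prev_prices, curr_prices)]
    winner = max(range(len(relatives)), key=relatives.__getitem__)
    return [1.0 if i == winner else 0.0 for i in range(len(relatives))]
```

Ties are resolved in favor of the first stock with the highest price relative.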
Our experiments use MIGT, EIIE, IMIT, FinRL, ES, TradeMaster, and SARL as DRL comparative frameworks. MIGT aims to maximize investment returns while ensuring the learning process’s stability and reducing outlier impacts. EIIE has proven effective in managing cryptocurrency portfolios, 55 and we have transitioned this framework to the stock portfolio domain, optimizing it specifically for the stock market. IMIT models trading knowledge by emulating investor behavior with logical descriptors and introduces the Rank-Invest model, which optimizes various evaluation metrics to preserve the diversity of these descriptors. 12 FinRL offers a well-structured and robust framework for automated trading using DRL. 13 ES integrates the best features of three actor-critic algorithms into a novel portfolio management framework. 14 TradeMaster supports DRL-based quantitative trading (including portfolio management) and incorporates automated machine learning techniques to fine-tune hyperparameters for training reinforcement learning algorithms. 15 We compare against its PPO-based configuration, TradeMaster_PPO (TMP). Lastly, SARL enhances robustness to environmental uncertainties by using asset information and price trend predictions as additional states based on financial data. 16
The experimental results in Table 2 and Figures 4 to 8 provide evidence for the efficacy of the MTS strategy. Across five diverse datasets, MTS consistently outperforms existing strategies, including traditional algorithms like OLMAR and PAMR, as well as state-of-the-art machine learning approaches such as MIGT and EIIE. If we average the results of the five experiments, our MTS achieves a cumulative return of 0.3255, which is 30.67

Results of the comparative Experiments (Dataset 1).

Results of the comparative Experiments (Dataset 2).

Results of the comparative Experiments (Dataset 3).

Results of the comparative Experiments (Dataset 4).

Results of the comparative Experiments (Dataset 5).
Results of the comparative experiments, where the best results for each metric are in bold.
This combined effect is especially noticeable in Dataset 1, where MTS shows the most significant performance across all metrics. One possible reason is that this dataset represents a market environment characterized by high volatility (benefiting ICVaR), strong temporal effects (leveraged by time-aware attention), or distinct trends (captured by parallel DRL). Conversely, in Dataset 2, where the performance in Omega and Sortino ratios are smallest (despite still positive at 0.84
To validate MTS’s performance over the long term, we use the 2019-2023 five-year data, i.e., combining datasets 1 to 5 (Table 3 and Figure 9), as a comparative experiment for the test set. The experimental results demonstrate that MTS significantly outperforms other strategies across all evaluated metrics over the five-year period. Specifically, MTS shows a cumulative return of 2.2739, which is 115

Results of the comparative Experiments (Dataset 1-5).
Five years of data from 2019–2023 were used as a test set to compare the results of the experiments, with the best results for each metric given in bold.
Overall ablation study
To evaluate the individual contributions of each component in our MTS strategy, we conducted an ablation study across five datasets representing different market conditions from 2019 to 2023. We compared the full MTS model against three variants, each with one key component removed:
MTS_w/o_ICVaR: MTS without the ICVaR risk component
MTS_w/o_Time-Awareness: MTS without the time-aware attention mechanism
MTS_w/o_Short-Selling: MTS without the ability to short-sell
The ablation study reveals that the time-aware attention mechanism has the most significant impact on the performance metrics across the datasets, with its omission leading to the largest decreases. On average, the metrics decreased by about 43.7
Risk management through ICVaR
At the core of MTS’s success is its novel risk management framework, ICVaR, which incorporates adjustable risk preferences into the reward function. The impact of this contribution is most evident in the risk-adjusted performance metrics. Across all datasets from Table 2, MTS achieves higher Sharpe ratios, outperforming the next-best strategy by a margin ranging from 12.04
Furthermore, the Sortino ratios, which focus on downside deviations, demonstrate a greater advantage. MTS’s Sortino ratios are 4.01
In the overall ablation study, the ICVaR component consistently improves performance across all datasets. Its impact is particularly notable in Dataset 4, where removing ICVaR significantly drops all metrics (e.g., Sharpe ratio decreases from 0.71168 to 0.59399). This suggests that ICVaR plays a crucial role in risk management, especially in challenging market conditions.
Capturing temporal market inefficiencies
MTS has a redesigned encoder and attention mechanism, incorporating temporal market characteristics such as the weekend and turn-of-the-month effects. This time-aware attention mechanism translates into consistent outperformance in cumulative returns, a primary measure of strategy effectiveness. MTS surpasses the second-best strategy by 4.58
The degree of outperformance in cumulative returns varies across datasets, from 4.58
To verify the effect of time feature embedding on portfolio returns, we generate a synthetic dataset (Figure 10) with strong weekend and turn-of-the-month effects; that is, we add random fluctuations around the weekend and in the early part of each month to a simulated smooth market. Figure 10 shows that, from the 47th trading day onward, the DRL strategy with time feature embedding has an advantage, indicating that it extracts the temporal features of the stock market more fully.
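A synthetic market of this kind can be generated along the following lines. This is a hedged sketch rather than the generator used for Figure 10: the drift, the shock size, treating Monday as the session that carries the weekend effect, and treating the first three calendar days of a month as "early month" are all assumptions.

```python
import random
from datetime import date, timedelta
from typing import List, Tuple


def simulate_calendar_effect_market(
    start: date,
    n_days: int,
    drift: float = 0.0005,        # assumed smooth daily growth
    shock: float = 0.02,          # assumed size of the calendar-driven fluctuation
    month_start_days: int = 3,    # assumed length of the "early month" window
    seed: int = 0,
) -> List[Tuple[date, float, bool, bool]]:
    """Return (day, price, after_weekend, early_month) for n_days business days."""
    rng = random.Random(seed)
    day, price = start, 100.0
    rows: List[Tuple[date, float, bool, bool]] = []
    while len(rows) < n_days:
        if day.weekday() < 5:  # Monday to Friday only
            after_weekend = day.weekday() == 0
            early_month = day.day <= month_start_days
            ret = drift
            if after_weekend or early_month:
                # Random fluctuation only on calendar-effect days.
                ret += rng.uniform(-shock, shock)
            price *= 1.0 + ret
            rows.append((day, price, after_weekend, early_month))
        day += timedelta(days=1)
    return rows
```

The two boolean flags returned with each row are the kind of calendar information a time feature embedding can consume, while a model that sees prices alone has to infer the same pattern from the price series.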

Comparison of markets with strong weekend effect and turn-of-the-month effect.
MTS’s parallel DRL for market condition determination complements the previous two by adapting trading strategies based on detected market trends. This component’s impact is most evident in challenging market conditions, exemplified by Dataset 4. Some strategies that achieved positive returns in other datasets suffered losses in Dataset 4. In contrast, our MTS strategy maintained impressive returns.
This performance in adverse conditions can be attributed to the parallel DRL’s adaptive trading rules, such as limiting buying during downtrends and managing short-selling. These rules, informed by real-time trend analysis, contribute significantly to downside protection, as reflected in the superior Sortino ratios. Moreover, we hypothesize that the trend identification itself benefits from the time-aware attention mechanism, creating a synergistic effect that enhances MTS’s adaptability.
To test the profitability of our strategy that includes short-selling in a declining market, we have simulated an extreme market that declines by 5

Comparison of market declining by 5
The experimental results show that when
Results of ablation study, where the best results for each metric are in bold.
We also compared the impact of applying short-selling in a simulated market of up to 5
This study proposes a novel strategy, MTS, for portfolio management using deep reinforcement learning. Our approach addresses key challenges in algorithmic trading through three innovations: ICVaR for dynamic risk management, a time-aware attention mechanism to capture temporal market inefficiencies, and a parallel DRL framework for adaptive trading with short-selling control. The experimental results, conducted over five diverse datasets spanning from 2019 to 2023, demonstrate the superior performance of MTS compared to a wide range of existing strategies, including both traditional algorithms and state-of-the-art machine learning approaches. MTS consistently outperformed other strategies across multiple performance metrics, including cumulative returns, Sharpe ratio, Omega ratio, and Sortino ratio.
Despite its promising performance, our MTS strategy has three major limitations. First, our current implementation primarily focuses on equity markets and, like existing strategies, lacks comprehensive support for diverse financial markets. This limitation restricts the strategy’s applicability across the broader financial landscape. Second, while the reinforcement learning method we use, PPO, is effective, it is not specifically tailored for portfolio management tasks, potentially missing out on domain-specific optimizations. Third, although we incorporate risk management through ICVaR, our model does not fully utilize the wide array of advanced mathematical models available in financial theory, which could potentially enhance the strategy’s risk assessment and decision-making capabilities.
We propose three directions for future work to address these limitations. First, we aim to extend MTS to support a broader range of financial markets simultaneously, including bonds and foreign exchange. This expansion would involve adapting our model to handle these markets’ unique characteristics and data structures, potentially leading to a more versatile and robust strategy. Second, we plan to develop a novel reinforcement learning algorithm for portfolio management tasks instead of existing algorithms like PPO, DDPG and SAC. This new algorithm could incorporate financial domain knowledge directly into its architecture, potentially including custom policy and value networks that better capture the complexities of financial markets. Additionally, we will explore Meta-Reinforcement Learning 58 to enable the agent to rapidly adapt its strategy to new market dynamics, improving the model’s ability to generalize from historical data to live trading environments. Third, we intend to incorporate more advanced mathematical models from financial theory into our framework. This could include integrating stochastic volatility models for improved risk estimation or incorporating regime-switching models to capture market dynamics better. These enhancements would improve our approach’s theoretical foundation and potentially improve real-world performance across various market conditions.
Footnotes
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the 2024 Jiangsu Provincial Construction Science and Technology Project (No. 2024ZD056), Suzhou Key Laboratory of Multi-modal Intelligent Agents for Industrial Applications (SZS2025126RC), Suzhou Multimodal Big Data Innovation Application Lab, Suzhou Science and Technology Plan Project - Innovation Consortium Project (LHT202417), Taicang Key Research and Development Program (Social Development) Project (TC2024SF09), and XJTLU Research Development Funding (RDF-21-01-069).
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
