Abstract
Reward signal reinforcement learning algorithms can be used to solve sequential learning problems. However, in practice, they still suffer from the problem of reward imbalance, which limits their use in many contexts. To solve this unbalanced reward problem, in this paper, we propose a novel model-based reinforcement learning algorithm called the expected n-step value iteration (EnVI). Unlike traditional model-based reinforcement learning algorithms, the proposed method uses a new return function that changes the discount of future rewards while reducing the influence of the current reward. We evaluated the performance of the proposed algorithm on a Treasure-Hunting game and a Hill-Walking game. The results demonstrate that the proposed algorithm can reduce the negative impact of unbalanced rewards and greatly improve the performance of traditional reinforcement learning algorithms.
Introduction
Reinforcement learning (RL) is a computational approach for understanding and automating goal-directed learning [1–3]. Unlike supervised learning, RL agents never receive correct or incorrect labeled data to facilitate learning [4, 5]; instead, agents learn by receiving positive and negative rewards for the behavior they choose. In recent years, RL combined with deep learning has achieved successful results on a wide range of problems in various fields [6, 7].
In prior works, the reward function modeled by the programmer was usually balanced. That is, the reward function at the intermediate state had no local extremum except for the terminal state. This can help to reduce the number of extreme points of the state value function and improve the veracity of the optimal policy. For example, in the environment of a cliff-walking task, such as the windy grid world task [5, 11], the reward function in an intermediate state is usually defined as a constant, such as 0.1, -0.1 or 0. In the terminal state, the reward function has a global extreme point, and the reward is usually defined as 1 or -1, which guarantees that the RL environment has a balanced reward.
However, in practice, the reward function can have various patterns, some of which are unbalanced. An unbalanced reward environment can reduce the performance of traditional reinforcement learning algorithms and even lead to safety risks [12]. For example, reward hacking is a risk that can cause unexpected behavior in practical applications [13–15]. The problem of reward imbalance hampers the agent’s work. With the ongoing applications of RL, it is worth paying serious attention to the problem of unbalanced rewards, since unbalanced rewards can break traditional RL algorithms in surprising, counterintuitive ways [13, 17]. The main reason is that in traditional RL algorithms, the return function emphasizes only the immediate reward as the main effect. In an unbalanced reward environment, the RL agent requires a significantly higher number of iterations to obtain the optimal policy.
To avoid the problem of unbalanced rewards, some methods have proposed transforming an unbalanced reward environment into a balanced reward environment [18, 21], for example, by clipping all positive rewards at 1 and all negative rewards at -1 while leaving 0 rewards unchanged. Although this approach alleviates the negative impact of unbalanced rewards, the performances of the resulting RL methods are degraded because they ignore the magnitude information of different rewards during learning [4]. Therefore, it is particularly important to establish a simple and effective method to solve the unbalanced reward problem.
Thus, in this paper, we propose a novel model-based RL algorithm that combines multistep methods with dynamic programming to solve the unbalanced reward problem. This new algorithm is called the expected n-step value iteration (EnVI). In contrast to the traditional multistep methods, the proposed algorithm uses a new return function composed of n standard return functions. This new return function alters the discount of future rewards while reducing the influence of the current reward, which enables the proposed method to reduce the negative impact of unbalanced rewards and substantially improves RL performance.
The remainder of this paper is organized as follows: The problem definition is introduced in Section 3.1; then, the new return function and the new multistep method are presented in Sections 3.2 and 3.3. In Section 4, the performance of the proposed method is evaluated on the Treasure-Hunting game. Finally, some brief discussions and conclusions are provided in Sections 5 and 6, respectively.
Background
RL methods have been effectively applied to solve sequential decision problems, which are often formalized as Markov decision processes (MDPs). An MDP can be described as a tuple (S, A, P, γ, R), where:
•S is a finite state space,
•A is a finite action space,
•P : S × A × S → [0, 1] is a state transition matrix. P (s′|s, a) is a transition probability of an agent taking action a in state s and reaching state s′, and P π : S × S → [0, 1] is a state transition submatrix under a fixed policy π; when π (s) = a, we define P π (s′|s) = P (s′|s, a).
•γ ∈ [0, 1) is a discount factor specifying how future rewards are weighted with respect to the immediate reward.
•
In traditional RL algorithms, the return function G
t
is defined as the discounted sum of the rewards observed after time step t,
The goal of the agent is to maximize the expected return by finding an optimal policy π*, which is a mapping function from state s to action a. Under a given policy π, the state value function at state s is defined as follows:
The optimal state value function is
To estimate the optimal state value function, the value iteration (VI) algorithm, a classic learning method, was proposed in [1]. The update rule is given as
Problem Definition
In practice, the reward functions of some RL tasks are unbalanced. That is, except for the terminal state, the reward function at intermediate states has some local extreme points. In traditional one-step RL algorithms, such as VI, unbalanced rewards cause data inefficiency problems and degrades the learning performance. This result mainly occurs because in traditional RL algorithms, the return function emphasizes only the immediate reward as the main influence. Consequently, in an unbalanced reward environment, an RL agent requires a significantly higher number of iterations to obtain the optimal policy. Moreover, unbalanced rewards can even change the action tendency of the RL agent, moving it far away from the target state. Although the optimal policy obtained by traditional RL algorithms ensures that the agent has an optimal state value function, it does not ensure that the agent can achieve the goal. This type of behavior can undermine the designer’s intentions, therefore, it is particularly important to establish a simple and effective method to solve the unbalanced reward problem.
In this paper, we propose the novel model-based EnVI method. In contrast to traditional dynamic programming methods, the proposed method uses a new return function that effectively weakens the adverse effects of unbalanced rewards and improves RL performance.
New return function
First, we present a new return function, which is defined as follows:
The relative weight is an important parameter that can serve as an indicator for estimating the relative contribution of the reward signal. For the first n steps, the relative weights of the discount factor of the reward at time step t + k - 1 are calculated by
This is an important characteristic that can help us solve the unbalanced reward problem. Consider the following Markov Chain:
In our study, we randomly determine a set of parameters, i.e., n = 5, γ = 0.8, r
i
= -1, i ∈ {t, …, t + k
n
- 2, t + k
n
, …}. In particular, we assume that rt+k
n
-1 = 0, which shows that the robot has an unbalanced reward case (the reward function has a local extremum in state st+k
n
). According to Eq. (9), we calculate that the value of k
m
is 2. For diverse values of k
n
∈ [1, 5], Table 1 shows the changes in relative weights
The changes in the relative weight
Moreover, the results of Table 2 also support this conclusion. Table 2 shows the changes in the return function
The changes in the return function
Note that the second part of the new return function is
Under a given policy π, for an agent in state s, the new state value function
According to the Markov Property, the new return function
Initialize π arbitrarily
Given the model tuple (S, A, P, γ, R)
Initialize n
Repeat
For each state s ∈ S
Calculate P π and {ρ π (s′)}
L
π
(s) = ∑s′∈Sρ
π
(s′) r (s′)
In Section 3, we presented the proposed method EnVI. In this section, a series of experiments are conducted to evaluate the performance of EnVI. First, a simple Treasure-Hunting game, which has a sparse reward function, is used as the test platform. Then, a more complicated Hill-Walking game is used to verify the scalability of the algorithm. All the experiments were conducted using the hardware and software environment of a desktop computer with an Intel-i7 3.6 Ghz CPU, 16 GB of DDR3 RAM, Windows 7, Python 3.5.2 and Sublime Text 3.
Treasure Hunting Game
As shown in Fig. 1(a), the Treasure-Hunting game has 23 grid squares, each of which represents a robot state s. The goal of the game is to train the agent to find the treasure, which is located in state s22. In states s21 and s23, there are two pirates, which can "eat" the agent. In this game, states {s21, s22, s23} are terminal states; the game will end when the agent either encounters a pirate or finds the treasure. In each episode, the agent can start in any intermediate state s i , i = {1, 2, …, 20}. The thick black line on the outer boundary represents a wall, which restricts the robot’s movement.

Treasure Hunting Game. (a) shows the state distribution of the game.
For each intermediate state s i , the robot action space is A = {right, up, left, down}. The agent moves one square in the corresponding direction, unless it encounters a wall in that direction. When the agent bumps into a wall, it stays in the same position. Each step results in a reward of -0.1, except when the agent encounters a pirate or finds the treasure. The agent obtains a reward of 1 when it finds the treasure; however, if the agent encounters a pirate, it is penalized 1.
The purpose of this study is to evaluate the performance of the RL algorithm under an unbalanced reward scenario. Hence, we specifically modified the reward function in Fig. 1(b) to ensure that the game has an unbalanced reward environment. As shown in Fig. 2, there are two versions of unbalanced reward environments. In the first version, the reward at state s3 is 0.5, while in the other version, the reward at state s8 is -0.5. The rewards remain the same for the other states. These two environments provide two typical unbalanced reward problems because the reward function at state s3 has a local maximum and that at state s8 has a local minimum.

The reward function of the two unbalanced rewards environments. We modify the reward function in Fig. 1. In the first one, we assume the reward at state s3 is 0.5, and in the second one we assume the reward in state s8 is -0.5.
In this study, we adopt the accuracy rate of the optimal policy as the performance evaluation index since it can evaluate whether the optimal policy is satisfactory. The accuracy rate is defined as
Tables 3 and 4 present the accuracy rate results in the two unbalanced reward environments. As γ and n increase, the accuracy rate of the optimal policy improves. This result illustrates that we can reduce the adverse effect of an unbalanced reward by increasing the values of the discount factor γ and n. However, during training, a large γ will cause the number of iterative loops to increase, resulting in longer learning times. With a small γ, EnVI can still obtain a better optimal policy than does the traditional VI algorithm.
The results for the accuracy rate δ in an unbalanced reward environment, r (s3) =0.5
The results for the accuracy rate δ in an unbalanced reward environment, r (s8) = -0.5
Figure 3 shows the smallest number of iterations that the agent needs to obtain the best accuracy rate. Clearly, as the value of n increases, the proposed method requires fewer iterative loops to obtain the best optimal policy.

The results of iterative loop for an agent at least needed to obtain the best accuracy rate.
In Section 4.1, we designed a grid world simulation to evaluate the performance of our proposed method. In the game, the reward functions in most intermediate states are equal. To fully evaluate the performance of our proposed method, we designed a new simulation, the Hill-Walking game, in which the reward functions in each state are dissimilar.
In the Hill-Walking game, the agent gradually moves stepwise from an intermediate state to a terminal state. As shown in Fig. 6(a), the state space is discretized and can be expressed as S = {sx,y|x = i, y = j}, where x and y are the coordinate values of the robot and i ∈ 1, 2, …, 8, j ∈ 1, 2, …, 7. In the game, the state s8,7 = (8, 7) is the terminal state. For any state si,j, the reward function is r (si,j) = exp (-0.07 (s - s8,7) 2). Thus, the agent obtains a greater reward as it approaches the terminal state. However, in this experiment, we specifically changed the reward function in states s3,3, s6,3, s3,5 and s6,5 to ensure that the game has a complex unbalanced reward environment by setting r (s3,3) =0.4, r (s6,3) =0.7, r (s3,5) =0.5 and r (s6,5) =0.8. As shown in Fig. 6(b), the reward function has several local maxima. The action space of this new game is the same as that in the Treasure-Hunting game. The robot arm can move to an adjacent state by manipulating the joystick.

The Hill-Walking game: (a) the state distribution of the game. Here, s8,7 is the terminal state. At each time step, the robot can move one grid space in any direction until it reaches a terminal state; (b) the distribution of the reward function. At states s3,3,s6,3,s3,5 and s6,5, the reward function has several local maxima.
Table 5 shows the number of iterative loops that the robot spends completing the learning process across a range of γ and n values. As the discount factor γ increases, the robot requires a larger number of iterative loops to obtain the optimal value function. For a fixed γ, the results for various values of n are similar. Table 5 shows the accuracy rate results in the unbalanced reward environment. As γ and n increase, the accuracy rate of the optimal policy improves.
The number of iterative loops the robot requires to complete the learning process in the Hill-Walking game
The number of iterative loops the robot requires to complete the learning process in the Hill-Walking game
The results for the accuracy rate δ in the Hill-Walking game5
As shown by the results presented in Sections 4.2 and 4.4, the proposed EnVI method outperforms the traditional VI algorithm when solving unbalanced reward problems. The reason for this outcome is that the proposed algorithm uses a new return function that changes the discount of future rewards while reducing the influence of the current reward. The changes in the discount factor of the reward not only enable the unbalanced reward to be perceived early but can also help to reduce the number of local extreme points of the state value function in an intermediate state. As shown in Figs. 4, 5 and 7, as the value of n increases, the number of local extreme points is reduced, and the policy also improves.

The results of state value function in the unbalanced rewards environment, r (s3) =0.5. When the parameters n = 1 and γ = 0.1, the state value function has a local extreme point at state s3. With the γ and n increasing, the number of local extreme points is reduced and the state value function becomes more smoothness.

The results of state value function in the unbalanced reward environment, r (s8) = -0.5. When the parameters n = 1 and γ = 0.1, The state value function has two local extreme points at states s1 and s5. With the γ and n increasing, the number of local extreme points is reduced and the state value function becomes more smoothness.

The results of state value function in the Hill-Walking Game. When the parameters n = 1 and γ = 0.1, The state value function has several local extreme points. With the γ and n increasing, the number of local extreme points is reduced and the state value function becomes more smoothness.
Moreover, as shown by the results presented in Sections 4.2 and 4.4, we find that with as γ increases, the accuracy rate of the optimal policy also improves. The reason is that as γ increases, the relative weight of the immediate reward r t for the state value function decreases. This weakens the effect of the unbalanced reward function and reduces the number of local extreme points of the state value function. However, as shown above, a larger γ will lead to more iterative loops and a greater learning time, because as γ increases, the optimal state value function becomes larger, as shown in Fig. 8. Fortunately, the proposed method can reduce the dependence of the algorithm’s performance on a large γ. Even when using a small γ, EnVI still obtains a better optimal policy than does the traditional VI algorithm.

The results of the sum of absolute state value function on the Treasure Hunting game and Hill-Walking Game. In the both games, as the γ increasing, the value function becomes more large.
Note that in this study, the proposed EnVI method can only be applied in model-based learning, and the agent must know the environment precisely; however, it is very difficult to guarantee this in real-world situations. Hence, in our next study, we plan to improve EnVI and propose a new method of learning in a model-free environment. Additionally, the value n can be viewed as a reward-scaling parameter. We argue that a better understanding of its use is an important area for future research.
In this paper, we proposed a new learning method called EnVI by adopting a new return function. In EnVI, we changed form of the discount of future rewards and no longer treat the immediate reward as the primary influencing factor. We evaluated the proposed method on a Treasure-Hunting game and the results demonstrate that EnVI not only weakens the adverse effects of unbalanced reward problems but also significantly improves the optimal policy.
Footnotes
Appendix.A
In the first, we define:
In this study, we have a theorem:
Theorem 1 In a MDP, for a fixed initial state s and policy π, the probability of state in n - step can be obtained through initial probability distribution
Assume we have a Markov chain {s
t
, π (s
t
) : t ≥ 0}, where π is a fixed policy and a
t
= π (s
t
). We define the probability of state s
i
in n - step later is:
where, in the second to last equality we used the Markov property to conclude that
When n = 2 we can get
Combining with our definition in the beginning, we have
The Eq. 23 can be written as the form of vector
Hence, if policy π and initial state s are given, we can obtain the probability distribution of state in n - step. The total probability of state s
i
in n steps is
