Abstract
A long-standing challenge in artificial intelligence is lifelong reinforcement learning, where learners are given many tasks in sequence and must transfer knowledge between tasks while avoiding catastrophic forgetting. Policy reuse and other multi-policy reinforcement learning techniques can learn multiple tasks but may generate many policies. This paper presents two novel contributions, namely 1) Lifetime Policy Reuse, a model-agnostic policy reuse algorithm that avoids generating many policies by optimising a fixed number of near-optimal policies through a combination of policy optimisation and adaptive policy selection; and 2) the task capacity, a measure for the maximal number of tasks that a policy can accurately solve. Comparing two state-of-the-art base-learners, the results demonstrate the importance of Lifetime Policy Reuse and task capacity based pre-selection on an 18-task partially observable Pacman domain and a Cartpole domain of up to 125 tasks.
The importance of efficient learning over multiple tasks
During their lifetime, animals may be subjected to a large number of unknown tasks. In some environmental conditions, nutritious food sources may be readily available, while in others they may be sparse, hidden, or even poisonous, and dangerous predators may roam in their vicinity. To address these challenging conditions, various behaviours must be selectively combined, such as avoidance, reward-seeking, or even fleeing. When direct perception provides limited or no cues about the current task, animals have to infer the task or use a strategy that works for many different tasks it may encounter. If task sequences are long and diverse, the animal may need to find different strategies, each of which apply to a large sub-domain of the tasks it encounters. Selecting how many such strategies are required represents a trade-off between optimality and the animal’s limited cognitive resources.
In artificial intelligence, variants of the above problem have been formulated within the field of lifelong learning [11,62,68], in which new tasks may appear at any time and where a learned task might be forgotten when it has not been seen regularly. The lifelong learning setting combines the problems of transfer learning and catastrophic forgetting into a single scenario. In transfer learning, learners should leverage the knowledge gained from a set of previously learned tasks with similar characteristics to a new task, whilst avoiding transferring knowledge that is not relevant [37,50,65]. In catastrophic forgetting, knowledge learned on one task removes or deteriorates knowledge learned on some previous tasks [19,22]. Due to combining both necessary aspects of learning as well as the emphasis on long-term problem solving, the lifelong learning setting is an exciting frontier for advancing modern artificial intelligence systems.
Lifelong reinforcement learning represents the intersection of reinforcement learning (RL) [64] and lifelong learning in the above sense. Lifelong RL requires representing the commonalities among tasks which may be done by considering a single policy with task-specific features or through multiple policies. Single policy learning emphasises feature sharing which is beneficial when it is known which aspects are task-specific and which are shared. Multi-policy learning has improved applicability to larger task sets with more widely varying tasks and it allows greater task specialisation within the task set. A key challenge is that to store solutions to different tasks, multi-policy RL (e.g. [17,25]) generates more policies as more tasks are presented, which results in excessive memory consumption. Moreover, in both single- and multi-policy RL, there is a fundamental challenge regarding the task capacity of policy representations: how many tasks can be represented or learned within a policy without a significant loss of performance? So far, this question has only been investigated indirectly by tracking forgetting and transfer metrics [32,65] or by growing a policy library until there is no more benefit of adding policies (e.g. [16]).
To address both challenges, this paper focuses on two novel contributions:
Lifetime Policy Reuse: a policy reuse algorithm which maintains a fixed-size policy library. The algorithm adaptively assigns policies to tasks based on bandit learning over the lifetime reinforcement on the different tasks. Contrasting to previous approaches, the algorithm is model-agnostic in the sense that it can take any base-learner as the underlying policy optimisation and exploration on the single-task level is preserved as is. Measures for task capacity: the paper presents theoretical and empirical measures for task capacity, which can be used to determine how many policies to use in a lifelong RL problem. The theoretical task capacity applies arguments from representational capacity [4,18,20,28,56,71] to the multi-task multi-policy context. To account for sequential effects and learning, the empirical task capacity is based on the relative performance compared to task-specific policies.
Using lifetime policy reuse with pre-selection based on task capacity allows high performance in lifelong learning domains, as demonstrated in a Cartpole domain and a partially observable Pacman domain, both of which consist of a long sequence of randomly ordered tasks.
The paper is structured as follows. Section 2 provides preliminaries to the lifelong RL task sequences. Section 3 provides an overview of lifelong reinforcement learning algorithms and their relation to Lifetime Policy Reuse. Section 4 develops the three lifetime principles behind Lifetime Policy Reuse. Section 5 develops useful definitions of task capacity. Section 6 provides the experimental setup including lifelong learning domains and the various experimental conditions. Section 7 provides the performance evaluation of Lifetime Policy Reuse and analyses the impact of adaptive policy selection, the importance of task capacity, and the difference between the base-learners.
Reinforcement learning setup
In RL, an agent interfaces with the environment in the following control cycle. First, it receives the state of the environment, a state s from the state space
The most common task-modelling frameworks in RL are Markov Decision Processes (MDPs) and Partially Observable Markov Decision Processes (POMDPs).
An MDP is defined by a tuple
The POMDP generalises the MDP to account for partial observability of the current environment state, making it more widely applicable. The POMDP
While traditional RL solves a single MDP or POMDP, this paper will focus on how to solve lengthy sequences of randomly presented MDPs and POMDPs.
Lifelong reinforcement learning: A review
Lifelong reinforcement learning (LRL) is one of the branches of lifelong learning which focuses on the problem setting of RL. This section surveys the field, identifying scalability to a large number of widely varying tasks and applicability to different base-learners as key gaps in the current state-of-the-art.
Single-policy methods
Most LRL approaches consider how to learn with a single policy and provide some task-specific aspects. The main aim of this approach is that it allows transferable features to be efficiently learned within a single representation while adding some additional task-specificity. Specificity may be represented in architectural components or within the training algorithm itself to improve performance on multiple tasks [15,29,30,34,43,55,57]. While a single policy is compact and the neural network’s hidden layers close to the input layer are able to represent transferable features, it can bring disadvantages. The number of neurons may grow with the number of tasks [55], or expert knowledge of the task’s relevant features may be needed as inputs to the learning system [15,57]. Further, there may be no common representation to all tasks; for example, one lifelong learning approach, which reused a single Deep Q-Network policy on all of the tasks, has also been outperformed by task-specific DQN policies [34,46]. More generally, these approaches typically have a few limiting assumptions on the relation between tasks. For example, [12] and [38] require some bounds on the change in reward function and transition function, implying it is more suited to continuously changing scenarios rather than a sequence of tasks from a wide task domain. Techniques and insights relevant for single-policy RL may also be found in related fields; for example, parameter sharing for multi-agent RL [21,66], which uses a single-network to represent all policies of the agents but which may add agent indicators to the observation to account for heterogeneity of different agents.
Policy reuse
To avoid some of the above-mentioned challenges, some have recognised the need for learning with multiple policies. In particular, policy reuse methods learn a limited number of policies each specialised to their own subset of tasks as a means to improve transfer to a new unseen task [16,40,42,52,73]. The approach represents a step forward in improving the scalability of multiple policy approaches as some empirical results have shown the viability of transfer learning. For example, 50 tasks in an office domain were learned with 14 Q-learning policies; the results show this was possible by learning a common policy for all of the tasks which have their goal locations in the same room [16,74]. While there have been recent advances in policy reuse which demonstrate improved transfer on a smaller number of tasks (e.g. [40]), the high number of 50 tasks in [16] has not been replicated in other LRL studies (policy reuse or otherwise). Of primary interest here, the high number of tasks indicates policy reuse as a promising approach for scalable LRL.
To scale the policy reuse approach to memory-consuming deep reinforcement learners or to even more policies than is currently the case, current methods have their limitations. In particular, online learning of the policy library without generating too many temporary or permanent policies remains a challenge in policy reuse. Most works have focused on how to select the new policy for a new, unlabeled task (see e.g. [52]). Other works have provided tools to build the policy library. In [16], new, temporary policies are formed for new tasks and these may be added to the library if after convergence on the task the resulting policy performs significantly better than the existing policies in the library. A downside of this method is that when convergence is not allowed, and many tasks are randomly presented, memory issues may become apparent. In [25], one also forms new policies for new tasks but all policies are added to the library, which implies memory issues will be apparent even more rapidly.
Policy reuse has been limited in its applicability as base-learners other than Q-learning methods have yet to be incorporated with policy reuse. This would be particularly desirable after results suggesting Q-learning may not be optimal for reuse, since (i) policy reuse with Q-learning policies may result in negative transfer, being outperformed by task-specific Q-learning [40]; and (ii) a single-policy lifelong learning approach with DQN has been outperformed by task-specific DQN policies [34,46].
Alternative multi-policy methods
A variety of alternative approaches have been taken to multi-policy LRL. These methods assume some similarity across all tasks and are therefore not the method of choice to learn large sets of tasks with widely varying dynamics and reward functions.
One approach is to provide an initial policy based on the prior tasks and then generate a new policy for each additional task encountered during operation [1,17,35,69,75]. However, each task requires a new policy to be stored, causing memory problems when a long sequence of tasks is considered. Moreover, finding an initial policy may be difficult to locate in parameter space when tasks require widely differing policies.
Hierarchical approaches are proposed to improve transfer learning by learning sub-policies that achieve a subgoal or search for a rewarding high-level state [8,35,41,67,68]. Sub-policies used in this way represent solutions to sub-tasks which occur across a number of different tasks, reducing the number of required policies. While there are a few exceptions (e.g., [39]), a downside of this approach is that (i) there is a need for two separate phases, one to learn sub-policies, and one to combine them; and (ii) that tasks may come in clusters in which similar tasks are more efficiently solved by a common policy.
Meta-learning
Meta-learning, in which a meta-level algorithm optimises the underlying learning algorithm itself by means of a meta-level objective, has been investigated in domains relevant for lifelong learning. In general, such methods share key aims, such as learning over an extended lifetime and improving generalisation but do not consider the same learning setting or the same goal to learn widely varying task sequences.
Meta-learning methods are particularly prevalent in supervised learning. For example, hyper-networks, in which a top-level super-ordinate network performs gradient descent to optimise a sub-ordinate network, can solve a variety of tasks from within a similar class [2,27].
In meta-RL, the literature focuses on generic aspects of RL such as improved exploration (e.g., [76]). Other meta-RL approaches have focused specifically on settings relevant to lifelong learning: multi-task learning, in which multiple tasks are learned in batch (e.g., [17]); continual learning, in which a limited number of similar tasks of increasing difficulty are presented (e.g., [51]); and lifetime RL methods that are guided by the lifetime average reward to improve the system in long-term environments (e.g., [6]). The lifetime policy reuse method is partly inspired by the latter approach.
Lifetime policy reuse
Lifetime Policy Reuse avoids excessive memory consumption by using a number of policies fixed at design-time which are continually refined based on incoming tasks across the lifetime. Contrasting to earlier policy reuse systems [16,52,73], Lifetime Policy Reuse has the following features (to which it owes its name):
Rather than assuming convergence on the task, the successive presentation of one task may be a much more limited number of episodes as part of a much longer randomised sequence, and the policies in the library are adjusted throughout the entire lifetime of the RL agent based on the subset of tasks they are assigned to. Policies are assigned to subsets of tasks by comparing their lifetime average reward compared to the other policies. The number of policies is fixed at all times, and thereby it scales better to lengthy lifetimes with many tasks.
The following provides a concrete description of Lifetime Policy Reuse, including the problem setting (see Section 4.1), the policy selector, details on the adaptive mechanism for policy selection (Section 4.2), and the model-agnostic algorithm (see Section 4.3). The full algorithm of Lifetime Policy Reuse is described with pseudo-code in Algorithm 1.

Lifetime policy reuse algorithm with conservative ϵ-greedy policy selector
Let
The task sequence consists of consequent task-blocks, each of which is a time interval during which the same task is repeated for a number of episodes (or a particular time interval for non-episodic environments). At the start of each task-block, the current task τ is sampled with replacement from an unknown distribution
The goal is to develop a fixed number of policies by assigning to each policy their best task specialisation based on their task-specific lifetime average
The task-specific lifetime average reward is favoured over a few alternatives as the objective of the adaptive policy selector. First, the jumpstart, which is the performance difference at the start of the new task, and the most rapid improvement, which considers the improvement in performance over an initial time interval in the new task, may lead to rapid initial improvement but may be followed by stagnation. Second, while the asymptotic performance on the task is especially important for long-term scenarios, it is unknown during the learning process and therefore cannot provide online learning; instead, the lifetime average performance on the task allows online, incremental learning of which policy to select and if it converges, it converges to the asymptotic performance. Third, while the discounted cumulative reward is used to optimise the policies, this objective is not used in the policy selection because discounting defines the solution method (i.e., low-patience vs high-patience) rather than the goal. To give a practical example, in Pacman or Atari games, the RL community evaluates the performance of algorithms not on the objective of the learner but on the actual score, be it asymptotic or lifetime average performance. Discounting in continuing environments does not optimise the lifetime average but the lifetime average does (see e.g., [47]).
Policy selector
In Lifetime Policy Reuse, there is a natural trade-off between exploration, selecting an untested policy for the task, and exploitation, selecting the well-tested policy for the task. Bandit algorithms such as Upper Confidence Bound (UCB) and ϵ-greedy are therefore prime candidates to function as policy selectors. The experiments in this paper will use ϵ-greedy as policy selector. At the start of each episode, ϵ-greedy selects the best policy with a probability
Model-agnostic algorithm
The algorithm first initialises a library of policies

Lifetime policy reuse. The following cycle is repeated: first, a policy selector selects a policy to be used; second, the policy interfaces with the environment for a prolonged amount of time (e.g. one episode), during which time it not only acts but also learns in the environment. When the policy has finished, it feeds back its performance to the policy selector.
Lifelong reinforcement learning algorithms would greatly benefit from measures that help to determine how many tasks they can learn. In Lifelong Policy Reuse, for example, this would help to select a suitable number of policies,
Solving multiple tasks accurately
A near-optimal lifetime policy reuse system yields for each task
In addition to providing a notion of near-optimality, Lemma 1 below provides a sufficiency condition that illustrates that to obtain near-optimal policies, it is sufficient to analyse the states in which the policy performs sub-optimally. The lemma will make the following assumptions. First, rewards and expected rewards are normalised in
An
Lemma 1 implies that to find an ϵ-optimal policy, it is sufficient to focus on those states
Further, Lemma 1 comes with two corollaries.
When a given a policy
When a given a policy
When both reward function and transition dynamics are dissimilar, transferability of one task to another is low, resulting in scalability issues. This effect can be observed in a sequence of POcman tasks, where many policies are required for epsilon-optimality across the task set (see Section 7). By contrast, the Cartpole problem maintains the same reward function across tasks, improving the overlap between tasks.
Epsilon-optimality introduced in Lemma 1 can be used to define how many policies are needed or equivalently, how many tasks a single policy can represent accurately. This quantity, called the theoretical task capacity, is defined below and illustrated on simple examples as well as the 27-task cartpole domain, which will be revisited in Section 6.3.
The theoretical task capacity is defined representationally based on (i) the hypothesis class
The
An important aspect of the theoretical task capacity is its dependency on the task set
Simple examples. To illustrate the representational argument made for selecting the number of policies, below are two examples on how the representation of the base-learner limits its theoretical task capacity for a given task set.
Since the minimal number of linear functions that can be selected while maintaining expected error smaller than
Each task
Task cluster analysis of the 27-task Cartpole domain. When exact arguments are not possible, as is the case for deep-reinforcement learners in relatively complex tasks, the theoretical task capacity can be roughly approximated by identifying distinct clusters of tasks. This cluster analysis is now exemplified for deep reinforcement learners on the 27-task Cartpole domain, a domain that will be revisited in the empirical experiments of this paper.
The Cartpole domain includes different MDPs characterised by different transition dynamics. In this domain, a cart must balance a pole placed on its top by moving left or right with fixed force of 1 N. Each time step is rewarded with
A single deep RL policy can represent a single task accurately since it can map arbitrary states onto arbitrary low-entropy probability distributions over actions. However, by Corollary 1, when different MDPs have different transition dynamics, a subset of such tasks will make the policy reach states in which its action is suboptimal. In the Cartpole domain, such suboptimal states can only be the states preceding the above-mentioned terminal states of the type
A computationally cheap experiment is performed, requiring only a few minutes runtime. The experiment tests a random uniform policy on all 27 tasks for 600,000 time steps each, which amounts to 20,000 to 60,000 episodes depending on task difficulty. Since case (b) is observed rarely (at most 15 times), while case (a) happens between 17,000 and 30,000 times, depending on the length of the cart, the focus is on which states precede event (a) and how frequently. In this analysis, clear differences in the angular velocity and the episode length before event (a) are observed across tasks. The cartpole problem is symmetric, so the lowest absolute angular velocity is computed as a boundary between the safe and unsafe regions. With this methodology, 8 clusters of nearby points are observed, with means
A near-optimal set of policies may not necessarily be found, even if it is possible to represent them. The theoretical task capacity ignores various learnability issues within the hypothesis class
Definition. Base-learners with high empirical task capacity are those for which a low number of policies results in a performance close to, or even better than, a one-to-one mapping of tasks to policies. The
The choice of
Suggested usage. The empirical task capacity is proposed for two use cases. First, it can be used to compare base-learners over a domain and observe which base-learner scales best over tasks. Second, it can be used, analogous to the theoretical task capacity, as a tool for pre-selecting the number of policies. In this case, the user selects a task sequence that represents the application of interest and then estimates the optimal number of policies.
Experiment setup
Base-learners
Two base-learners, Deep Q-Network and Proximal Policy Optimisation, are compared in the study. They represent two classes of state-of-the-art deep RL methods, value-based learners and actor-critic learners. In both cases, a single actor is chosen instead of distributed RL with many actors, because in a realistic, continual lifelong learning setting the learners are not able to perfectly simulate copies of their environment. Their architectures are chosen to be as similar as possible, as depicted in Fig. 2.

Architectures used in the experiments. The observation
Deep (recurrent) Q-network. For MDP tasks, we investigate Deep Q-Network [46], a state-of-the-art value-based reinforcement learner suitable for discrete action spaces and complex state spaces, as is the case for example in Atari game environments. Its policy depends on a state-action value, and its updates are done in an off-policy style, allowing the training to be independent of the current policy and to repeat events in the distant past. For POMDP tasks, Deep Recurrent Q-Network (DRQN) is chosen due to its track record in partially observable video games [10,23,36,61]. DRQN extends Deep Q-Network with a different experience replay method that samples traces of observations rather than a single observation. This allows it to learn from sequences of observations with a Long Short-Term Memory [26] layer, making it suitable for partially observable environments. A key question in both learners is whether negative transfer results in DQN-based, and other Q-learning-based lifelong learning systems [34,40], can be replicated in other types of tasks and whether these results also hold for DRQN.
DQN repeatedly performs control cycles in which first it outputs for each action its Q-value,
For initialising DRQN, burn-in initialisation [31] is chosen, where the agent selects a number of random actions to initialise the trace of the recurrent network; for DQN this step is not taken. In DRQN, the updates and the loss function are the same as in DQN except that the sampled states are traces of observations which are passed through an LSTM layer such that the system remembers previous time steps.
In the present paper’s experiments, the buffer of D(R)QN is treated as part of the policy, and therefore each policy will have a separate experience buffer; this ensures that experiences are sampled only from the subset of tasks on which the policy specialises. The convolutional layers are replaced by a single densely connected layer due to the small observation without spatial correlations. Further details of its implementation, including the hyper-parameters, are given in Appendix A.
Proximal policy optimisation with/without LSTM network. Proximal Policy Optimisation (PPO) [60] is a policy optimisation algorithm which is often applied as an actor-critic reinforcement learner. PPO is chosen because of its robustness and transfer learning benefits [9,24,49]. It applies a clipped loss function, shown in Equation (7), that takes into account that the policy updates should not be too large, to allow monotonic improvements,
Analogous to the D(R)QN setting, for POMDP tasks the neural network architecture is supplemented with a Long Short-Term Memory layer [26] and burn-in initialisation [31] is the chosen initialisation at the start of the episode. The architecture differs from DRQN in the sense that there are two output layers, one for the actor and one for the critic, that share their connections to the LSTM layer. The actor’s output,
To assess forgetting and transfer, we now construct a forgetting ratio and a transfer ratio, which define forgetting by a negative forgetting ratio and negative transfer using a negative transfer ratio. The ratios are similar to transfer learning metrics and are directly based on the performance after transitioning from earlier tasks to the task of interest [5,65], and in particular, the area ratio [65], which compares the performance of a policy with experience on other tasks to a policy that is randomly initialised. Like the area ratio, our metrics integrate the performance across the entire task-block to capture the complete learning behaviour, contrasting to jumpstart, time to threshold, and asymptotic performance. Due to the existence of many task-blocks of the same task, the proposed transfer and forgetting ratios are first computed for all task-block transitions across the lifetime. Forgetting ratios are then aggregated for different bins of the number of interfering task-blocks while transfer ratios are aggregated for different bins of the number of prior task-blocks.
The
The
The transfer ratio differs from the area ratio in the use of the uniform random policy performance as the denominator; this compares different base-learners in the same units and avoids the sensitivity to scale mentioned in [65]. The forgetting ratio also is formulated using the uniform random policy in the denominator; dividing by the uniform random policy as opposed to the performance of the base-learner on a previous block avoids sequential effects (e.g. consequent task-blocks after a completely forgotten task could potentially have a performance near zero, giving a high score and dominating the aggregated forgetting ratio). The correction for the 1-to-1 policy’s improvement
Policy spread. An additional metric, called the
Lifelong learning domains
The experimental setup tests the impact of the number of policies on performance, probing in particular the effects of changing reward functions and transition dynamics as these limit the scalability (see Corollary 1 and 2). The experiments are conducted in a 27-task Cartpole MDP domain, in which a feedforward DQN and PPO are used, and an 18-task POcman POMDP domain, in which case an LSTM layer is added for a recurrent version of PPO and DRQN.
These domains are chosen to illustrate the challenge of lifetime policy reuse, involving randomly chosen tasks presented in rapid succession. In both domains, the learner does not have time to converge its parameters, memories of earlier tasks can be lost by the time they are presented again, and policies learned on one task may transfer negatively to other tasks. The differences between these two domains help to analyse critical factors underlying its success: (i) changing reward functions; (ii) changing transition dynamics; (iii) partial observability; and (iv) task similarity. The MDP domain has a relatively large task-similarity and has full observability. The POMDP domain has lower task-similarity and has changes in reward functions, which combined with the partial observability makes it challenging to infer the task from within the policy; this contrasts to Atari and other video games where a single observation is sufficient to recognise the task. Such partial observability is often true in the real world, where opposite policies are often required but this is unknown to the learner due to the effect of hidden variables. Therefore, the POMDP domain allows to assess strategies for lifetime policy reuse in low task-capacity scenarios.
27-task Cartpole MDP domain. In the 27-task Cartpole MDP domain, reinforcement learners are tasked with moving a cart back and forth to balance a pole for as long as possible. The domain includes varying transition dynamics but not varying reward functions. 27 unique tasks are based on varying properties of the cart and the pole: (i) the mass of the cart varies in
The lifetime of the learner starts at
In this experiment, tasks are presented in a random sequence from a uniform random distribution, with each task sequence providing a unique order. To account for the variability in the experiments due to the stochasticity in the actions selected by the reinforcement learner, as well as the random task presentation, the learners are evaluated based on 27 independent runs. Each independent run consists of 40.5 million time steps and corresponds to a unique task sequence. Each task sequence consists of 675 task-blocks. Each task-block consists of 60,000 time steps, or at least 300 episodes, of the same given task. With 27 unique tasks, this allows on average 25 blocks per task. Each episode takes at most 200 time steps, depending on the success of the learner. To minimise the effects of task order, the tasks are spread evenly between the different task sequences, as is illustrated in Fig. 3.

Illustration of the task sequences assuming a number of tasks
18-task POcman POMDP domain. In the 18-task POcman POMDP domain, reinforcement learners interact with 18 tasks each of which contains an object of interest and has an unknown grid-world topology. While the smaller state space reduces the number experiences required, the lifelong learning domain is extremely challenging due to including varying transition dynamics, varying reward functions, and partial observability.
As illustrated in Fig. 4, the environment’s 18 distinct tasks are based on three dimensions,
The defensive strategy is to move away with a 50% probability and otherwise to stand still, and the aggressive strategy is to move towards the agent with a 50% probability and otherwise to take a random action.

Illustration of POcman tasks based on three defining task characteristics. (a) in a positive dynamic task, the learner must touch the defensively moving ghost; (b) in a negative dynamic task, the learner must avoid the aggressively moving ghost; (c) in a positive static task, the learner must touch the food pellet; (d) in a negative static task, the learner must avoid touching the poison bottle. Note that the objects are illustrated in this way purely for interpretation, whereas the learner perceives the objects as the same observation on all tasks.
The lifetime of the learner starts at
As in the MDP task sequences, tasks are presented in a random sequence from a uniform random distribution and each task sequence provides a unique order. Independent runs account for the stochasticity in the learner and environment and task order effects are minimised by spreading tasks evenly between the different task sequences. In the POMDP domain, each task sequence consists of 450 task-blocks with 200 episodes of the same given task.
To demonstrate the proposed approach, a first study compares the performance of Lifetime Policy Reuse across different settings of the number of policies as well as an ablation without the adaptive policy selector. For the unadaptive policy selector, the policy is selected based on a fixed, balanced partitioning of tasks. For unadaptive mappings, the experiments manipulate the number of policies
The above-mentioned conditions are abbreviated based on three properties: (i) adaptivity, either
The performance evaluation experiments are conducted on the IRIDIS4 supercomputer [70] using a single Intel Xeon E5-2670 CPU (2.60 GHz) with a varying upper limit to RAM proportional to the number of policies (approximately
After the performance demonstration, a further analysis then investigates three hypotheses to support the understanding of the Lifetime Policy Reuse and the importance of task capacity.
The first hypothesis is that adaptivity in the policy selection is beneficial if a particular policy is situated in a bad region of parameter space, in which case selecting one of the alternative policies may rapidly find a favourable location of the parameter space. This hypothesis is assessed by making use of the policy spread of learners and relating it to the performance benefit of adaptivity (if any).
The second hypothesis is that task capacity pre-selection is important in the sense that it can be used together with Lifetime Policy Reuse with the following benefits: a) it improves performance and learning; b) it reduces the memory cost; and c) it generalises towards related task spaces. While aspect a) is assessed based on the performance evaluation and the corresponding learning metrics, aspect b) and c) require additional experiments. To assess aspect b), the proposed approach of using a fixed policy library is compared to the growing policy library used in traditional policy reuse methods across different settings of the task capacity, the probability of accepting a policy into the policy library, and the number of task-blocks until convergence. To assess aspect c), additional experiments are conducted on a 125-task Cartpole MDP domain which is based on the same three task-features within the same range as the 27-task domain but with more values included per task-feature.
The third hypothesis is that PPO and DQN are subject to the same representational limits of the theoretical task capacity but due to the more aggressive updates, DQN will push these limits more rapidly; thereby learning more rapidly but also forgetting more rapidly. Additional experiments on the 27-task Cartpole domain investigate the effect of increasing task-block sizes and relaxing the clipped objective as evidence for the hypothesis. To rule out an alternative hypotheses, experiments with alternative experience replay methods are conducted, including modifying the replay buffer through distribution matching, task-matching, and larger buffer sizes.

Performance evaluation of lifetime policy reuse compared to single-policy and 1-to-1 policy baselines.

Average score (mean ± standard error) in the 27-task Cartpole MDP domain for different multi-policy variants of the DQN and PPO base-learners. “Ours” indicates the proposed lifetime policy reuse algorithm with its ϵ-greedy policy selection and a number of policies indicated by the task capacity. To demonstrate the importance of selecting a suitable number of policies and the effectiveness of the ϵ-greedy policy selector, the number of policies is manipulated and the ϵ-greedy policy selector is compared to a fixed policy selector which balances task assignments equally across policies. The inclusion vs exclusion of ϵ-greedy is indicated by adaptive vs unadaptive while the number of policies is indicated by the suffix (e.g.
Having defined the experimental conditions, this section provides the experimental results, including the performance evaluation for the 27-task Cartpole and 18-task POcman as well as a further analysis. Before providing detailed results with ablations and parametric studies, Fig. 5 provides an initial demonstration of the effectiveness of Lifetime Policy Reuse: despite using significantly fewer policies, Lifetime Policy Reuse provides comparable performance levels to the 1-to-1 policy setting for both DQN and PPO base-learners.
27-task Cartpole MDP domain
The development of the average score in the 27-task Cartpole MDP domain is shown in Fig. 6 for different settings of the number of policies. The following main trends are observed:
For DQN, settings of the number of policies For PPO, settings of the number of policies Overall DQN demonstrates rapid and pronounced performance improvements while PPO learns more slowly but more stably. Adaptivity has a small negative effect on the performance.
The average and final performances are summarised in Table 1. For DQN, the best performances are reached by 14- and 27-policy settings, with an average lifetime performance of around 150 and a final performance of 170–180. The 1-to-1 27-policy significantly outperforms the conditions where
All significance values reported in this paper are based on pair-wise F-tests between the conditions, and
Performance in the 27-task Cartpole MDP domain as a function of the number of policies and the inclusion of adaptivity with the ϵ-greedy policy selection. The

Cumulative reward (mean ± standard error) in the 18-task POcman POMDP for different multi-policy variants of the DQN and PPO base-learners. “Ours” indicates the proposed lifetime policy reuse algorithm with its ϵ-greedy policy selection and a number of policies indicated by the task capacity. To demonstrate the importance of selecting a suitable number of policies and the effectiveness of the ϵ-greedy policy selector, the number of policies is manipulated and the ϵ-greedy policy selector is compared to a fixed policy selector which balances task assignments equally across policies. The inclusion vs exclusion of ϵ-greedy is indicated by adaptive vs unadaptive while the number of policies is indicated by the suffix (e.g.
The development of the cumulative reward in the 18-task POMDP POcman domain is shown in Fig. 7 for different settings of the number of policies. Two differences with the 27-task MDP domain are the positive effect of adaptivity and the 1-to-1 policy being much higher-performing than settings with 2 or 3 tasks per policy. However, the other trends are comparable to the previous domain:
The cumulative reward intake improves as the number of policies is increased, with the 18-policy algorithm performing best in both DRQN and PPO.
For DRQN, increasing the number of policies substantially increases the performance and the single policy’s performance even decreases over time.
For PPO, all learners are able to improve over the lifetime and the differences between conditions are less pronounced than for DRQN; PPO’s single policy performance is better but its multi-policy is worse than the corresponding DRQN condition.
DRQN’s best-performing setting, here
An upper bound for the optimal performance is achieving a reward of 1 in tasks with
The average and final performances are summarised in Table 2. For DRQN, the single policy has a lifetime average performance of 0.016, and including multiple policies improves the lifetime average performance monotonically with increasing number of policies, with a factor ranging between 3, for the unadaptive 2-policy learner with its performance of 0.049, and 20, for the 18-policy learner with its performance of 0.309. All learners with multiple policies, including both adaptive and unadaptive conditions and all settings of
Performance (mean ± standard-deviation) in the 18-task POcman POMDP domain as a function of the number of policies and the inclusion of adaptivity with the ϵ-greedy policy selection. Two performance metrics, based on the cumulative reward function
The final performance yields results similar to the lifetime average, with two notable exceptions. First, the variability is much higher, resulting in higher p-values and a lower confidence in the results. Second, the single policy approach in DRQN deteriorates strongly, down to 0.002, despite the other final performances being in the range of 0.072, for the unadaptive 2-policy, to 0.323, for the 18-policy DRQN approach. This means that the improvement obtained by including multiple policies is between 35 and 133 times in performance. By contrast, the single policy approach for PPO has not deteriorated and scores similarly to its lifetime average, 0.055.
With 2–4 policies, adaptive DRQN significantly outperforms unadaptive DRQN; however, with 9 policies, there is no significant difference. Adaptivity can have a beneficial effect comparable to increasing the number of policies: although comparisons of a larger to a smaller setting of
This section further analyses three core hypotheses to explain the above results, illustrating Lifetime Policy Reuse and the importance of task capacity.
Hypothesis 1: Adaptive policy selection provides escape from low-performing regions of parameter space. The first hypothesis is that adaptivity in the policy selection is beneficial if a particular policy is situated in a bad region of parameter space, in which case selecting one of the alternative policies may rapidly find a favourable location of the parameter space. Due to the positive effect of adaptivity only occurring in the DRQN conditions in the 18-task POMDP domain, this hypothesis is supported empirically if DRQN has a higher policy spread than PPO in the 18-task POMDP domain, but not in the 27-task MDP domain. Empirical results support the hypothesis: in the 18-task POMDP domain, adaptive DRQN conditions have a policy spread within

Policy spread (mean ± standard error) in the 18-task POcman POMDP, representing the average variability between the stochastic policies in the policy library. The different base-learners are compared, showing that PPO has more limited variability in its policies compared to DRQN, explaining why DRQN gains more performance from adaptive policy selection.
Hypothesis 2: The importance of task capacity. The importance of task capacity is now demonstrated on three metrics, showing that task capacity pre-selection a) improves performance and learning metrics; b) reduces the memory cost; and c) generalises towards related task spaces, such that the same number of policies can be used with the same performance benefits. To this end, the analysis below compares settings of the number of policies above task capacity to settings below task capacity.
a) Improved performance and learning metrics: In the Cartpole domain, the theoretical task capacity for deep reinforcement learners has been estimated as Improved performance: with 9 or more policies, the performance converges to the highest level for the base-learner; with 1-, 2-, and 4-policy learners, a degrading performance is observed towards the end of the run. These findings are consistent with the hypothesis from the theoretical task capacity that the representational limit of the policies is being exceeded. Improved DQN forgetting ratio: with 9 or more DQN policies, the forgetting ratio is near-zero; with 1-, 2-, and 4-policy DQN learners have a degrading forgetting ratio between Marginally improved PPO forgetting ratio: for PPO, similar trends are demonstrated except that the differences are much smaller.
A further observation is that while DQN displays a higher transfer ratio than PPO for low-policy settings, this effect does not improve performance because it is negated by DQN’s stronger forgetting. In the POcman domain, a theoretical task capacity estimate is not available; however, it can be observed that the empirical task capacity is lower (
b) Reduced memory cost: The analysis of memory cost considers the growth of the number of policies as a function of (i) the task capacity; (ii) the number of task-blocks until convergence; and (iii) and the acceptance probability, defined as the probability that newly proposed policies are accepted into the policy library. The analysis fixes the number of tasks to
c) Generalisation to 125-task domain The hypothesis that the pre-selected number of policies generalises across related tasks spaces is investigated below by an empirical demonstration on a 125-task Cartpole MDP domain. The domain has the same three task-features within the same range as the 27-task domain but with more values included per task-feature. That is, (i) the mass of the cart varies in
Hypothesis 3: DQN and PPO performance difference explained by unbounded updates vs monotonic improvement. A third and last hypothesis is that the performance results are explained by the difference between unbounded updates of DQN compared to the clipped updates of PPO for monotonic improvement. That is, PPO and DQN are subject to the same representational limits of the theoretical task capacity but due to the more aggressive updates, DQN will push these limits more rapidly. The empirical results (see section on DQN vs PPO in Appendix B) are consistent with the hypothesis. First, increasing task-block sizes and relaxing the clipped objective demonstrates that the monotonic improvement objective of PPO slows down the learning progress on a single task but helps to maintain performance across many sequentially presented tasks. Second, modifying the replay buffer, through distribution matching, task-matching, and larger buffer sizes, does not reduce DQN’s forgetting metrics, thereby ruling out an alternative hypothesis.
Average score (Mean ± Standard Error) in the 27-task Cartpole MDP domain for different multi-policy variants of the DQN and PPO base-learners. “ours” indicates the proposed Lifetime Policy Reuse algorithm with its ϵ-greedy policy selection and a number of policies indicated by the task capacity. To demonstrate the importance of selecting a suitable number of policies and the effectiveness of the ϵ-greedy policy selector, the number of policies is manipulated and the ϵ-greedy policy selector is compared to a fixed policy selector which balances task assignments equally across policies. The inclusion vs exclusion of ϵ-greedy is indicated by Adaptive vs Unadaptive while the number of policies is indicated by the suffix (e.g.

Average score (mean ± standard error) in the 125-task Cartpole MDP domain for different multi-policy variants of the DQN and PPO base-learners. The pre-selected number of policies, as indicated by
This paper presents an approach to policy reuse that scales to lifetime usage, in the sense that (a) its number of policies is fixed and does not result in memory issues; (b) its evaluation is based on the lifetime average performance; and (c) its policies are continually refined based on the lifetime of experience with newly incoming tasks. The approach is widely applicable to MDPs and POMDPs and various base-learners, as is illustrated empirically for DQN and PPO base-learners in cartpole and POcman domains. To help select the number of policies required for a domain, theoretical and empirical task capacity metrics are developed and illustrated for a few examples. These metrics are proposed for three purposes, each of which has been shown in this paper to have empirical support:
To pre-select, based on theoretical arguments, the number of policies required on a task sequence. This purpose has been demonstrated by the observation that violating the task capacity requirement leads to negative transfer and forgetting.
To select the number of policies required for a lengthy task sequence based on a limited but fairly representative task-sequence. This purpose has been demonstrated by the use of the 27-task Cartpole domain to select the number of policies for the 125-task Cartpole domain.
To benchmark base-learners in terms of their scalability. Having a representation of comparable complexity, D(R)QN and PPO are found to have the same theoretical task capacity. However, they demonstrate observable differences in the empirical task capacity metric due to learning and forgetting at different rates.
Lifetime Policy Reuse comes with certain benefits but also certain limitations compared to existing alternatives. Compared to other algorithms, it can be applied when task sequences involve (i) a rapid succession of randomly presented tasks; and (ii) distinct task clusters each of which allow significant positive transfer within clusters and negative transfer between clusters. Compared to traditional policy reuse [16,25,40,42,52,73,79], the lifetime set-up allows online learning without growing the policy library, requiring policies to converge on the task, or modifications to the base-learner’s usual learning algorithm. Lifetime Policy Reuse continually refines its existing policies, overcoming a downside of offline methods for policy reuse (e.g., [52]): since the simulation may not represent the real-world environment or as unexpected problems may come up during the application of interest, no suitable policy will be available in the library. However, a challenge that follows is to avoid catastrophic forgetting; the results in this paper show that this can be avoided through appropriate selection of the number of policies guided by the task capacity. As in other policy reuse methods, Lifetime Policy Reuse avoids task-specific parameters and task-descriptive features (see e.g., [15,43,53]), allowing policies that are uniquely specialised to subsets of tasks and requiring less expert knowledge on the task domain. Lifetime Policy Reuse similar to other policy reuse algorithms and LRL algorithms using task-specific parameters requires that tasks can be identified with a task index. Various LRL algorithms (e.g. [1,8]) focus only on learning a new task from various source tasks rather than also considering the problem of retaining knowledge of prior tasks; as such these do not require a task index but applying them as given would imply relearning any task re-occurring in a long task sequence. The requirement of the task index is not strong, since instead of being provided by the environment it could be identified automatically by the agent using task identification. For example, a Hidden Markov Model can predict the task from observations [34] or an observed change in reward distribution can indicate a new task is being presented [44].
The performance and adaptivity findings in the Cartpole domain and the POcman domain have significance for RL in challenging lifelong learning domains. For the Cartpole domain, learners with one policy for every three tasks can similarly achieve a near-optimal performance despite the challenging domains that may perhaps be paraphrased as an “anti-curriculum”: tasks are presented in an unordered sequence, follow in quick succession, and vary in orthogonal dimensions. For the POcman domain, partial observability further has the detrimental effect of making the different tasks more difficult to learn but also more difficult to distinguish from each other.4
Although anti-curricula do not appear too often in structured man-made environments, they tend to appear often in the animal kingdom or in robotic missions, where environmental changes require rapid adaptation with limited if any cues from the environment.
The theoretical task capacity provides a representational argument for the limited number of tasks that can be solved by a single policy. While a single policy can represent the optimal behaviour required for any one particular task, it can only represent the optimal behaviours for a limited number of tasks due to the conflicting state-action pairs that are required for different tasks. The empirical results show that the theoretical task capacity is one of the key limiting factors. The difficulty of epsilon-optimality over many tasks is illustrated for D(R)QN and PPO as a low-policy PPO could learn all tasks with low accuracy while a high-policy D(R)QN could learn all tasks with high accuracy. PPO’s objective function is clipped specifically to stimulate monotonic improvement over time [60]. This slow learning allows time to learn new patterns without erasing previously learned patterns and rather than representing a few tasks optimally, the policy represents a wide set of tasks sub-optimally. By contrast, the D(R)QN objective allows more aggressive weight updates resulting in either a near-optimal performance if a sufficient number of policies allows all tasks to be represented near-optimally, or in strong catastrophic interference and negative transfer otherwise. Further evidence of the importance of theoretical task capacity is that selective experience replay [29] or other modifications to experience replay do not improve single-policy DQN.
In lifelong reinforcement learning settings, a challenge is to avoid catastrophic forgetting and to stimulate selective transfer. This paper makes two novel contributions, namely 1) Lifetime Policy Reuse, a model-agnostic policy reuse algorithm that avoids generating many policies as tasks are being added across the lifetime; and 2) measures for task capacity, which help to pre-select a number of policies and analyse the performance of lifelong reinforcement learning systems in terms of the number of tasks a policy can learn or represent. Combining task capacity based pre-selection with Lifetime Policy Reuse demonstrates improved performance, transfer, forgetting, and memory consumption on 18-task POMDP sequences and 27- to 125-task MDP sequences with random temporal structure.
Footnotes
Acknowledgements
This research was funded by the Engineering and Physical Sciences Research Council and by Lloyd’s Register Foundation. The authors thank Nicholas Townsend for comments on an early draft of this paper and acknowledge the use of the IRIDIS High Performance Computing Facility and associated support services at the University of Southampton.
