Abstract
Organizations are vulnerable to cyber attacks as they rely on computer networks and the internet for communication and data storage. While Reinforcement Learning (RL) is a widely used strategy to simulate and learn from these attacks, RL-guided offensives against unknown scenarios often lead to early exposure due to low stealth resulting from mistakes during the training phase. To address this issue, this work evaluates if the use of Knowledge Transfer Techniques (KTT), such as Transfer Learning and Imitation Learning, reduces the probability of early exposure by smoothing mistakes during training. This study developed a laboratory platform and a method to compare RL-based cyber attacks using KTT for unknown scenarios. The experiments simulated 2 unknown scenarios using 4 traditional RL algorithms and 4 KTT. In the results, although some algorithms using KTT obtained superior results, they were not so significant for stealth during the initial epochs of training. Nevertheless, experiments also revealed that throughout the entire learning cycle, Trust Region Policy Optimization (TRPO) is a promising algorithm for conducting cyber offensives based on Reinforcement Learning.
Keywords
Introduction
According to Oreyomi [27], the cyber security landscape has been significantly changed by the exponential growth of autonomous and intelligent threats. Because cyber attacks cause financial loss and damage the reputation of organizations, many companies have resorted to offensive security initiatives such as penetration testing [4,5,13], warfare [46] and professional training [22,44] in an attempt to fight these attacks effectively. However, the majority of research and analysis on cyber security and machine learning [18,42,43,45,51] has focused primarily on detection aspects, including intrusion detection, phishing, malware analysis, and logic-flaw-exploiting network attacks. This study offers a distinct perspective.
From the attacker’s standpoint, this research aims to investigate specific issues related to autonomous cyber offensives. Most of the studies identified in this investigation treat autonomous cyber offensives as fully observable Markov Decision Processes (MDPs) [3,9,12,31,41,47,48]. In an MDP, the scenario is completely observable, meaning the agent has access to all variables of the environment. Because attackers often engage in cyber offensives against scenarios whose states are only partially observed, the MDP model falls short of capturing this reality. In contrast, Standen and Li [19,39] view the problem as a partially observable MDP (POMDP).
In a POMDP, the scenario is not fully revealed to the agent, introducing an element of uncertainty in its observations. A POMDP is more aligned with the perspective of a real attacker, who often operates in environments where complete information is lacking. Moreover, all of these studies [3,9,12,19,31,39,41,47,48] used Reinforcement Learning (RL), a machine learning paradigm that has shown promise for solving problems modeled as MDPs and POMDPs. RL is a trial-and-error approach that allows an agent to learn from its environment through rewards or penalties and gradually improve its performance over time. Furthermore, the studies that apply RL to this type of problem [3,9,19,31,39,41,47,48] focus on the training process as a whole, analyzing stages and rewards until the generalization of the algorithms, which occurs after several epochs.
Thus, as this research takes the perspective of an autonomous attack against unknown scenarios, the chosen approach considers an RL-guided offensive against a POMDP environment. Nevertheless, cyber offensives guided by RL against unknown scenarios have an important drawback: they usually present a high probability of early exposure due to mistakes committed during their training phase, such as connection errors and access permission errors.
In the face of this problem, this work evaluates whether the use of Knowledge Transfer Techniques (KTTs) can contribute to making attacks stealthier, thereby reducing the likelihood of early exposure by minimizing errors during the training of RL-based cyber offensives against unknown scenarios. This research considers knowledge transfer as any approach in which knowledge or experience acquired in one task or domain is utilized to enhance performance in another related task or domain. Therefore, according to Guo et al. [10], Transfer Learning (TL) and Imitation Learning (IL) can be regarded as forms of knowledge transfer, as both involve the utilization of information from a source to enhance performance in a target task.
Given that this research focuses on stealth, the evaluation of the efficiency includes error rates and number of steps, considering attacks driven by basic RL algorithms and TL techniques for this specific problem. However, it has been acknowledged that before applying these KTTs in an attack on an unknown scenario, they must first be trained in a known benchmark environment to analyze their effectiveness in a different unknown scenario.
To compare RL-based cyber attacks with the ones that use KTTs, the present work developed a laboratory and a method that allow simulations of different cyber offensive algorithms in different unknown scenarios from the Network Attack Simulator (NASim). The experiments ran simulations for 2 unknown scenarios using 4 traditional RL algorithms: Deep Q-Networks (DQN), Advantage Actor-Critic (A2C), Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO); and 4 KTTs: TL, Behavior Cloning (BC), Dataset Aggregation (DAgger) and Generative Adversarial Imitation Learning (GAIL).
The evaluations showed that, although certain RL algorithms combined with KTTs produced superior results, the gains in stealth during the first epoch were not significant for certain experiments. The combination nevertheless remains a promising avenue to explore in seeking to reduce the probability of early exposure of cyber attacks against unknown scenarios.
Finally, this article is organized into 5 sections: Section 2 presents the background, covering the basic concepts related to cyber attacks, simulations, MDP, RL, KTT and related work; Section 3 describes the methodology; Section 4 presents the results and discussion; and Section 5 gives the final considerations, which include the conclusion and points for future work.
Background
This section presents the fundamental concepts and related works that underpin the experiments carried out in this research, such as cyber attacks and simulations, reinforcement learning and KTTs.
Cyber attacks and simulations
According to [17], cyber attacks are offensive actions that seek to exploit vulnerabilities in systems and devices against an organization or computer network. They can be direct, in which the attacker seeks unauthorized access to a system or network, or they can be of the APT (Advanced Persistent Threat) type, which involves a set of techniques and persistent, targeted activities with the goal of gaining and maintaining unauthorized access to a system. APTs are characterized by their sophistication and persistence, and they are often carried out by well-funded and highly skilled attackers who are looking to gain access to sensitive information or systems. Furthermore, APTs typically involve multiple stages, including reconnaissance, delivery, exploitation, installation, command and control, and action.
To carry out cyber attacks, attackers use a variety of techniques and tools, organized into a logical sequence of steps known as a kill-chain. The first was the Lockheed Cyber Kill Chain [6], which is made up of seven stages: reconnaissance, weaponization, delivery, exploitation, installation, command and control, and actions on objectives. Another kill-chain is the Unified Kill Chain [30], which includes a pre-attack phase, reconnaissance, weaponization, delivery, exploitation, installation, command and control, and actions on objectives.
According to Balto [1], simulators are tools used to create virtual environments that mimic real-world situations. As can be seen in the studies [3,19,39,41,47,48], a popular application of simulators is in the field of artificial intelligence, specifically in the development of reinforcement learning agents. In the context of this research, there are simulators known as cyber-ranges that allow the simulation of attacks on computer environments and networks [1]. Some examples of these simulators are NASim [38] and CAGE/CybORG [21], both of which are compatible with the OpenAI Gym API [2]. Moreover, using these simulators, it is possible to create custom scenarios or use a set of pre-existing scenarios to simulate the behavior of reinforcement learning agents through a simulated attack.
According to Yang and Liu [47], NASim is an open-source simulation platform that enables the creation of various network scenarios with a high level of abstraction to test attackers using RL algorithms. In NASim, the agent’s objective is to compromise all target hosts on the network, while minimizing the number or cost of actions used. Agent rewards are calculated based on the values of compromised hosts minus the cost of actions. The actions available in NASim include scanning to collect information from hosts and subnets, escalating privileges through the use of procedures to raise the level of access for executing processes, and exploiting vulnerable services on the hosts.
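The reward rule described above can be made concrete with a small worked example. The sketch below is not NASim's implementation; the function name, host values and action costs are hypothetical and were chosen only for illustration.

```python
def episode_reward(compromised_host_values, action_costs):
    """Return the total value of compromised hosts minus the total cost of actions."""
    return sum(compromised_host_values) - sum(action_costs)


# Hypothetical episode: two sensitive hosts worth 100 each are compromised
# after three scans (cost 1 each) and two exploits (cost 3 each).
reward = episode_reward([100, 100], [1, 1, 1, 3, 3])
print(reward)
```

Under this rule, an agent that reaches the same hosts with fewer or cheaper actions obtains a higher reward, which is why minimizing the number of actions is part of the objective.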
The NASim benchmark scenarios encompass a range of predefined instances designed for the evaluation of algorithmic performance. Among these benchmark scenarios, the ones classified as static remain unchanged upon successive loading and are characterized by distinct configurations outlined in their respective configuration files. These benchmark scenarios showcase varying levels of complexity and scale, featuring a spectrum of attributes including subnets, hosts, operating systems, services, processes, exploits, and privilege escalations. The intricacy extends to the dimensionality of observations and states, encompassing elements such as compromised, reachable, and discovered components. Furthermore, these scenarios offer differing levels of access potential on hosts. The scenarios’ scope spans from relatively compact instances like tiny scenarios, featuring 4 subnets, 3 hosts, and minimal features, to more elaborate setups like medium-multi-site, comprising 7 subnets, 16 hosts, and expanded feature sets. These benchmark scenarios serve as standardized environments for gauging the efficacy of algorithms in diverse settings, aiding the assessment of algorithmic robustness and adaptability under varying complexities.
Reinforcement learning
Artificial intelligence (AI) refers to the simulation of human intelligence in machines, allowing them to perform tasks that typically require human intelligence such as perception, reasoning, and learning. Machine learning is a subset of AI that enables machines to learn from data without being explicitly programmed [32]. There are three main paradigms of machine learning: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning involves training a model on labeled data, while unsupervised learning involves discovering patterns in unlabeled data. RL is a type of machine learning in which an agent interacts with an environment, learns from the feedback it receives in the form of rewards, and makes decisions to maximize the cumulative reward over time [35].
According to Sutton and Barto [40], RL agents learn from their own experiences, making decisions based on the feedback they receive from the environment. The RL model is composed of an agent, an environment, a set of states, a set of actions, a reward function, and a policy. The agent takes an action in the environment based on its current state and receives a reward based on its action. The goal of the agent is to learn a policy that maximizes its expected cumulative reward over time. Moreover, RL can be modeled as an MDP or a POMDP. In an MDP, the agent has full observability of the environment, while in a POMDP, the agent only has partial observability of the environment.
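The interaction loop and the effect of partial observability can be illustrated with a toy example. The environment, policy and reward values below are hypothetical and far simpler than any NASim scenario; they only show that, in a POMDP, the agent acts on an observation that hides part of the state until it is discovered.

```python
class ToyPartialEnv:
    """Hypothetical environment: a host reveals its vulnerability only after a scan."""

    def __init__(self, vulnerable):
        self.vulnerable = list(vulnerable)         # hidden state
        self.scanned = [False] * len(vulnerable)

    def observe(self):
        # None marks information the agent has not discovered yet.
        return [v if s else None for v, s in zip(self.vulnerable, self.scanned)]

    def step(self, action):
        kind, host = action
        if kind == "scan":
            self.scanned[host] = True
            reward = -1                             # scanning has a small cost
        elif self.vulnerable[host]:
            reward = 10                             # successful exploit
        else:
            reward = -5                             # failed exploit
        return self.observe(), reward


def scan_then_exploit(obs):
    """Scan unknown hosts first, then exploit the first host known to be vulnerable."""
    if None in obs:
        return ("scan", obs.index(None))
    target = obs.index(True) if True in obs else 0
    return ("exploit", target)


def run_episode(env, policy, max_steps):
    """Run the agent-environment loop and return the cumulative reward."""
    obs, total = env.observe(), 0
    for _ in range(max_steps):
        obs, reward = env.step(policy(obs))
        total += reward
    return total


env = ToyPartialEnv([False, True])
print(env.observe())                     # nothing is known before scanning
print(run_episode(env, scan_then_exploit, max_steps=3))
```

In a fully observable MDP the agent would see the complete state from the first step and could skip the two scans.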
According to Ozalp [29], there are different types of RL algorithms, including model-free and model-based algorithms. Model-free algorithms, such as DQN [24], learn directly from experience without building a model of the environment; DQN, for example, learns the Q-value of each action in each state. In contrast, model-based algorithms learn the dynamics of the environment and use them to derive the optimal policy. In addition, RL algorithms can be classified as value-based or policy-based. Value-based algorithms, such as DQN, learn the optimal Q-value function and derive the policy from it. Policy-based algorithms, such as TRPO [36] and PPO [37], learn the policy directly. A2C [23] is an actor-critic algorithm that combines both value-based and policy-based methods, so it could be classified as either depending on the perspective.
Knowledge transfer techniques
Traditional machine learning often demands training models from scratch for each new task, a process that consumes time and computational resources. To address this issue, KTTs facilitate the transfer of knowledge between tasks and domains, thereby reducing the need for extensive data and computational resources to achieve optimal performance. Fundamental KTT approaches, such as IL and TL, include pre-trained models, which provide a foundational basis for new tasks.
KTTs such as IL and TL are interconnected concepts, though they exhibit distinct characteristics. According to Guo [10], IL constitutes a machine learning paradigm wherein a model learns to replicate the actions of a human expert. Through observation of expert actions in a given task, the model seeks to emulate these behaviors when encountering similar scenarios. Operating within the realm of supervised learning, the central aim of IL is the reproduction of demonstrated behavior. In contrast, as described by Zhuang [50], TL entails leveraging knowledge from one task to enhance performance in another related task. TL manifests in various forms, such as transferring layer weights between models or refining a pre-trained model on analogous tasks.
Pre-trained models serve as a fundamental TL approach, furnishing a foundation for new tasks. Another avenue for knowledge transfer is imitation learning, which involves training a new agent to mirror the actions of an expert agent. According to Hussein et al. [14], this approach proves valuable when sourcing data for the new task is arduous or costly. Imitation learning encompasses various techniques, including BC [28], DAgger [34], and GAIL [11].
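To make the idea of imitation as supervised learning concrete, the following minimal sketch implements a tabular form of behavior cloning: for each observation, the clone selects the action the expert chose most often. It is an illustration only, not the BC implementation used in this study, and the observation and action names are hypothetical.

```python
from collections import Counter, defaultdict


def fit_behavior_cloning(demonstrations):
    """Map each observation to the action the expert chose most often for it."""
    counts = defaultdict(Counter)
    for observation, action in demonstrations:
        counts[observation][action] += 1
    return {obs: actions.most_common(1)[0][0] for obs, actions in counts.items()}


def cloned_policy(table, observation, fallback_action):
    """Imitate the expert where data exists, otherwise use a default action."""
    return table.get(observation, fallback_action)


# Hypothetical expert demonstrations as (observation, action) pairs.
expert_demos = [
    ("host_unknown", "scan"),
    ("host_unknown", "scan"),
    ("ssh_open", "exploit_ssh"),
    ("user_access", "escalate"),
    ("ssh_open", "exploit_ssh"),
    ("ssh_open", "scan"),
]
table = fit_behavior_cloning(expert_demos)
print(cloned_policy(table, "ssh_open", "scan"))
print(cloned_policy(table, "ftp_open", "scan"))   # unseen observation
```

The fallback for unseen observations shows the main weakness of plain behavior cloning: the clone has no guidance in states the expert never visited. DAgger addresses this by asking the expert to label the states the learner itself visits and retraining on the aggregated data.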
Lastly, according to Da Silva and Costa [7], the following metrics are generally used to evaluate TL for RL-based agent systems: jumpstart (offset), which measures the improvement in the initial performance of the agent; time to threshold (speed), the learning time needed to reach generalization; asymptotic performance (generalization), the performance of the agent once it has converged, possibly to a suboptimal policy; total reward; and transfer ratio.
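These metrics can be computed from two learning curves, one with and one without knowledge transfer. The sketch below is one possible reading of the definitions: the curves are hypothetical, and the transfer ratio is assumed here to be the plain ratio of total rewards, which is only one of the definitions found in the literature.

```python
def jumpstart(with_transfer, without_transfer):
    """Improvement in first-epoch performance due to transfer (offset)."""
    return with_transfer[0] - without_transfer[0]


def time_to_threshold(curve, threshold):
    """First epoch (1-based) whose value reaches the threshold, or None (speed)."""
    for epoch, value in enumerate(curve, start=1):
        if value >= threshold:
            return epoch
    return None


def asymptotic_performance(curve, last_n):
    """Mean performance over the final last_n epochs (generalization)."""
    tail = curve[-last_n:]
    return sum(tail) / len(tail)


def total_reward(curve):
    """Sum of the rewards over all epochs."""
    return sum(curve)


def transfer_ratio(with_transfer, without_transfer):
    """Assumed definition: total reward with transfer divided by total reward without."""
    return total_reward(with_transfer) / total_reward(without_transfer)


# Hypothetical reward curves, one value per epoch.
baseline = [-10, 0, 10, 20, 20]
transfer = [0, 10, 20, 20, 20]

print(jumpstart(transfer, baseline))
print(time_to_threshold(baseline, 20), time_to_threshold(transfer, 20))
print(asymptotic_performance(transfer, last_n=2))
print(transfer_ratio(transfer, baseline))
```

For a study concerned with early exposure, the jumpstart is the metric of most interest, since it captures behavior in the first epoch rather than after convergence.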
Related works and experiments
The investigation into simulation environments for cyber attack scenarios and related topics has seen notable advancements in recent research. Table 1 provides an overview of key contributions in this domain. It is organized into 7 columns, each representing a different aspect of the experiments: Simul. denotes the simulator used in the experiment, Model represents the Markovian model, Algorithm refers to the reinforcement learning algorithm, KTT indicates the utilization of knowledge transfer techniques, MSPE stands for the maximum number of steps per epoch, Generalization indicates the epoch when generalization occurred, and Ref. indicates the research in which the experiment was performed.
As can be seen in Table 1, the experiments are based on 2 simulators: NASim and CybORG. Regarding the Markovian model used, although [47] uses a multi-objective MDP, all except [39] considered the problem as a completely observable MDP, which does not accurately represent the real perspective of an attack. Among the algorithms used to guide reinforcement learning agents, Chen [3] was the only one that used imitation techniques with GAIL in conjunction with A3C and DPPO, although he did not consider an attack on an unknown scenario. The algorithms NDSPI-DQN Decoupling proposed by Zhou [49], CLAP by Yang [47] and HA-DQN by Tran [41], are improvements and modifications from base RL algorithms or in their method of use with the aim of achieving improvements in speed and generalization. Finally, Standen’s experiment [39] is the closest to this research, considering a POMDP using a DQN with LSTM (Recurrent DQN) in a standard CybORG scenario, comparable in size to tiny-hard from NASim. In this experiment, the direct application of the algorithm achieves convergence after 2500 epochs.
As can be seen above, all related works, including Chen [3], who used the GAIL imitation technique, focus on the training process as a whole, analyzing steps and rewards until the algorithms’ generalization. Unlike the present work, they did not assess stealth against unknown scenarios. Furthermore, all algorithms presented in the research referenced in Table 1 are based on the algorithms described in the Reinforcement learning section. These include variations of DQN such as HA-DQN, DQN with LSTM and NDSPI-DQN Decoupling; PPO, which serves as the foundation for DPPO and CLAP; and A2C, which forms the basis of A3C.
Methodology
With the objective of evaluating the effectiveness of established algorithms regarding the stealth of RL agents engaged in attacks on unknown scenarios, a dedicated reinforcement learning laboratory has been developed. This laboratory serves as a comprehensive environment oriented towards the development and analysis of RL agents assigned to execute cyber attacks. The RL algorithms DQN, A2C, TRPO, and PPO have been integrated into this laboratory, constituting the foundational elements for experiments pertinent to this research field [3,19,39,41,47,48]. All of these algorithms are described by Raffin et al. [33]. In this laboratory, they were tested on instances of the tiny-hard and medium-multi-site scenarios in order to assess their performance, particularly concerning number of steps and error rates.
Furthermore, this laboratory has been configured not only to facilitate the comparison of the isolated deployment of the aforementioned RL algorithms, but also to explore their combined utilization with KTT. This entails examining the application of these algorithms in conjunction with techniques such as weight transfer across models using standard TL, as well as the training of IL techniques such as BC, DAgger, and GAIL alongside the RL algorithms.
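Weight transfer across models can be sketched as copying the parameters of a trained source model into a freshly initialized target model wherever the layers are compatible. The example below uses plain nested lists and hypothetical layer names and values; it is an assumption-laden illustration of the idea, not the transfer procedure implemented in the laboratory.

```python
def transfer_weights(source, target):
    """Copy parameters from source into a copy of target where name and shape match."""

    def shape(matrix):
        return (len(matrix), len(matrix[0]))

    transferred = {}
    for name, weights in target.items():
        if name in source and shape(source[name]) == shape(weights):
            transferred[name] = [row[:] for row in source[name]]   # reuse learned layer
        else:
            transferred[name] = [row[:] for row in weights]        # keep fresh initialization
    return transferred


# Hypothetical models: the hidden layers match, but the output layers differ
# in size (for example, if the two instances had different numbers of actions).
source_model = {
    "hidden": [[0.5, -0.2], [0.1, 0.3]],
    "output": [[0.7, 0.1, 0.4], [0.2, 0.9, 0.6]],
}
target_model = {
    "hidden": [[0.0, 0.0], [0.0, 0.0]],
    "output": [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
}
result = transfer_weights(source_model, target_model)
print(result["hidden"])
print(result["output"])
```

Layers that cannot be transferred keep their initial values and must be learned in the target instance.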
Scenarios
As can be seen in Table 2, NASim’s benchmark scenarios have been specifically designed to test the effectiveness of RL algorithms in simple and complex environments. The header of Table 2 is organized as follows: Env refers to the name of the scenario, followed by the number of networks (N), hosts (H), operating systems (O), services (S), processes (P), exploits (E), privilege escalations (PE), actions (Actions), states (States) and the step limit for each scenario (Limit).
NASim’s benchmark scenarios used in this research
In the context of NASim, the terms steps, reward, and error play pivotal roles in evaluating the performance of RL algorithms. Steps refers to the number of actions executed during training. Moreover, Reward denotes the positive reinforcement or feedback that an RL algorithm receives upon taking specific actions in a given scenario. In experimental procedures, the reward serves as a quantitative measure of the algorithm’s success, influencing its learning process. Conversely, error encompasses deviations or inaccuracies in the algorithm’s predictions or actions, reflecting its divergence from optimal behavior. In the case of NASim, the agent can encounter connection errors and permission errors. Steps, reward and error are fundamental metrics used to assess the efficacy and efficiency of RL algorithms.
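The three metrics can be tallied from a log of the agent's transitions. The sketch below assumes that each transition is recorded as a reward together with an information dictionary carrying `connection_error` and `permission_error` flags; these key names and the logged values are assumptions made for illustration and should be checked against the simulator's actual output.

```python
def summarize_episode(transitions):
    """Count steps, total reward and errors from a list of (reward, info) pairs."""
    steps = len(transitions)
    total_reward = sum(reward for reward, _ in transitions)
    errors = sum(
        1
        for _, info in transitions
        if info.get("connection_error") or info.get("permission_error")
    )
    return {"steps": steps, "reward": total_reward, "errors": errors}


# Hypothetical episode log: a scan, two failed actions and a successful exploit.
episode_log = [
    (-1, {}),
    (-3, {"connection_error": True}),
    (-3, {"permission_error": True}),
    (97, {"connection_error": False, "permission_error": False}),
]
summary = summarize_episode(episode_log)
print(summary)
```

From a stealth perspective, the error count is the metric most directly tied to exposure, since each failed connection or denied permission is an event a defender could notice.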

Fig. 1. NASim tiny-hard scenario.
The laboratory platform and method developed in this study play a crucial role in enabling the simulation of attacks in unknown scenarios, particularly due to the inherent nature of TL. TL relies on training in a specific scenario to be effectively applied in another. Consequently, when the attack target is unknown, it becomes impractical to train the TL technique on the target scenario. Considering that TL is designed to enhance agent performance in RL, it is essential to evaluate its efficacy in unknown scenarios.
Regarding IL, the algorithms BC, DAgger, and GAIL were chosen due to their implementation and availability in Stable Baselines 3. To perform IL, the proposed training methodology first trains the agents in an instance derived from the known scenarios shown in Table 2 and then applies the acquired knowledge to an unknown benchmark instance of the respective scenario in Table 2. Each derived instance takes the name of its base benchmark scenario with the suffix “-derived” and keeps the same number of hosts, although the entire design of networks, operating systems, vulnerabilities and firewall policies has been changed. Figures 1, 2, and 3 illustrate the differences between the tiny-hard and medium multi-site scenarios for the corresponding benchmark and derived instances. In both scenarios, the green lines represent the protocols and services allowed by the firewalls, while the red lines represent what is blocked. In these examples, the operating systems of the hosts are represented by a penguin for Linux and a window symbol for Windows.

Fig. 2. NASim medium multi-site scenario benchmark instance.

Fig. 3. Instance derived from NASim medium multi-site scenario.
All assets may or may not have vulnerabilities that can be exploited with a certain probability. The colored hosts represent those that hold sensitive information, which is of value to the attacker, while those in gray are not considered sensitive.
The objective of this methodology is to enable the agents to learn the fundamental strategies required for executing successful attacks by training in known derived instances, so that they can be more accurate when attacking unknown instances. Once the agents have been trained in an instance derived from a scenario, they can transfer their knowledge or be imitated in a different and unknown instance. This approach is in line with Lu [20], because it enables the agents to learn the fundamental strategies required for executing successful attacks and transfer this knowledge to new instances of scenarios, thereby reducing errors from the initial phases of the attack.
Experiments were conducted on the benchmark and derived instances from 2 scenarios, using the initialization parameters outlined in Table 3. All agents, configured with DQN, A2C, TRPO and PPO algorithms, were started with the same Seed, Gamma and Learning Rate parameters described in Table 3.
Experiment initialization parameters for each scenario and agent
For all these algorithms, there exist specific parameters that significantly influence the behavior and learning process of each algorithm. For instance, the Seed parameter initializes the random number generator, thereby influencing the reproducibility of results. The Gamma parameter, often referred to as the discount factor, determines the trade-off between immediate rewards and future rewards. The Learning Rate parameter governs the degree to which the algorithm adjusts its policy or value estimates based on new experiences. These parameters play a crucial role in configuring the algorithms to attain favorable learning outcomes and achieve desired performance characteristics.
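The role of the Gamma parameter can be seen in a short worked example of the discounted return, in which the reward at step t is weighted by gamma raised to the power t. The reward sequence and gamma values below are hypothetical and are not those of Table 3.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: two costly actions followed by a valuable compromise.
rewards = [-1, -1, 100]
print(discounted_return(rewards, gamma=1.0))   # future rewards count fully
print(discounted_return(rewards, gamma=0.5))   # future rewards are halved each step
print(discounted_return(rewards, gamma=0.0))   # only the immediate reward counts
```

A gamma close to 1 makes the agent willing to pay early action costs for a later payoff, whereas a small gamma makes it short-sighted.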
The remaining parameters of each algorithm were kept at their default configurations, in accordance with the Stable Baselines 3 implementation by Raffin et al. [33]. The Fully observable setting, set to False, indicates that the simulation environment was configured as a POMDP. Additionally, parameters such as the Maximum Number of Steps, Episodes, Timesteps, Action Space, State Space, Benchmark Steps and Rewards are considered. The Benchmark Steps signify the average steps required to attain the goal (+/− standard deviation), as observed in a random experiment conducted by NASim’s own research [38]. This serves as a reference for the benchmark scenarios, involving agents operating in a randomized mode.
The experimental phase involves an analysis process through the aggregation of results from two distinct scenarios: the tiny-hard scenario and the medium multi-site scenario. This process consists of 4 sequential steps aimed at investigating the performance of both strict RL algorithms and RL algorithms integrated with KTT under varying conditions.
1. Initial RL-only simulation: During this phase, RL algorithms are exclusively applied to benchmark instances from each scenario. The primary objective in this stage is to identify the most efficient RL algorithm in terms of Rewards, Errors, and Steps within the unknown scenario. To determine the most efficient algorithm, the learning curve for the combined results of the two scenarios is plotted for each of the 3 factors: Rewards, Errors, and Steps. This helps identify the offset, speed, and generalization achieved by each algorithm in its simulations.
2. Integration of KTTs: After determining the most efficient RL algorithm in the previous phase, the chosen algorithm is used in conjunction with KTTs on derived instances from each scenario. This setup allows the algorithm to accumulate knowledge during the learning process in known scenarios. To rule out anything that could compromise this selection, for instance the possibility that the chosen RL algorithm does not learn effectively from KTTs, or that KTTs are not helpful at all, all algorithms are tested and analyzed both in a strict sense and with the use of KTTs before the choice is confirmed. This procedure aims to ensure that the chosen algorithm exhibits equal or superior performance compared to the others.
3. Execution of RL with KTTs on an unknown instance: The selected RL algorithm, combined with KTTs, is executed against unknown benchmark instances from each scenario. The goal is to assess the combined performance of these techniques when dealing with an unknown scenario.
4. Analysis and visualization of results: The outcomes of both the strict RL algorithm and the RL algorithm combined with KTTs are tabulated and presented graphically. These visualizations facilitate a comprehensive analysis of the algorithm’s performance under different circumstances.
Results and discussion
In accordance with the instructions provided in the initial phase of the experimental procedures, simulations were executed employing A2C, TRPO, PPO, and DQN exclusively on the unknown benchmark instances of each scenario. The outcomes related to Rewards, Steps, and Errors resulting from these simulations were aggregated for each epoch.

Fig. 4. Sum of rewards from unknown benchmark instances using strict RL algorithms.
The results pertaining to Rewards for the strict algorithms are illustrated in Fig. 4. Figure 4(a) delineates the learning curve, showcasing that only the TRPO and PPO algorithms exhibited convergence towards the generalization of optimal rewards. As the learning curve represents the accumulation of rewards from two distinct scenarios, the point of generalization was established by identifying the mode among the rewards of each algorithm. Consequently, a reference value of 360 rewards was employed to ascertain the epoch at which the algorithm demonstrating the most favorable reward performance achieved or exceeded this value. This threshold was narrowly attained by TRPO at epoch 110 and by PPO at epoch 141. From these identified points of generalization, the speed of learning can be deduced, with TRPO demonstrating the most adept performance.
Regarding the offset, Fig. 4(b) portrays the total rewards achieved in the initial epoch, during which all algorithms yielded negative values. Among them, DQN exhibited comparatively less negative performance, reaching −589 rewards, whereas the remaining three algorithms tied at a score of −1369.
These three algorithms (A2C, TRPO and PPO) exhibited the same offset performance, all attaining relatively low rewards in the initial epoch. The intrinsic stochastic nature of policy-based RL algorithms, distinct from the value-based DQN, coupled with identical hyperparameters, could have influenced the outcomes, potentially impacting the algorithms’ capacity to effectively explore the solution space.
Figure 4(b) depicts the rewards obtained by the agents during the initial epochs of the experiment. Given that the studied problem relates to a real-world cyber attack scenario, the focus lies on the first epoch, that is, on the offset performance. Failing to perform well in this crucial phase could lead to premature exposure. The subsequent initial epochs in Fig. 4(b) illustrate trends in the initial learning speed during training, which is not the primary focus of this research.

Fig. 5. Sum of steps from unknown benchmark instances using strict RL algorithms.
Continuing with the experiment, the same methodology applied to rewards was replicated for the Steps and Errors variables, as depicted in Figs 5 and 6, respectively. The only distinction is that, when determining the reference value for generalization, the mode was derived from the minimum values of Steps and Errors. Unlike the approach for maximizing Rewards, optimizing efficiency for these variables involves minimizing them.
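The procedure for locating the generalization point can be sketched as follows. The function takes the mode of a learning curve as the reference value and returns the first epoch at which the curve reaches it, from below for Rewards and from above for Steps and Errors. This is one interpretation of the method described here, and the curves are hypothetical rather than the experimental data.

```python
from statistics import mode


def generalization_epoch(curve, maximize=True, reference=None):
    """Return (reference value, first 1-based epoch reaching it, or None)."""
    if reference is None:
        reference = mode(curve)            # most frequent value of the curve
    for epoch, value in enumerate(curve, start=1):
        reached = value >= reference if maximize else value <= reference
        if reached:
            return reference, epoch
    return reference, None


# Hypothetical per-epoch curves for one algorithm.
rewards = [-50, 10, 120, 360, 355, 360, 360]
steps = [900, 400, 60, 20, 22, 20, 20]

print(generalization_epoch(rewards))                   # Rewards are maximized
print(generalization_epoch(steps, maximize=False))     # Steps are minimized
print(generalization_epoch(rewards, reference=400))    # threshold never reached
```

Passing an explicit `reference` allows several algorithms to be compared against the same threshold, so that the algorithm reaching it in the earliest epoch is the fastest learner.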
In Fig. 5(a), only TRPO and PPO showcased generalization, employing a reference value of 20 steps to identify the generalization point. TRPO reached values equal to or below 20 by epoch 119, while PPO accomplished this by epoch 168. This observation suggests that TRPO demonstrates superior speed performance.
Similarly, Fig. 5(b) mirrored the reward outcomes. DQN displayed the best offset performance, achieving the objective in 685 steps during the initial epoch, whereas the other algorithms tied at 1199 steps.
With regard to the Errors variable, the outcomes followed a pattern similar to previous simulations. In Fig. 6(a), both PPO and TRPO demonstrated generalization, with 2 errors identified as the threshold for generalization. Moreover, TRPO exhibited superior speed by achieving or maintaining below 2 errors by epoch 138, in comparison to PPO which reached this threshold by epoch 166. Despite the consistent trend in the results, the offset illustrated in Fig. 6(b) demonstrates that for Errors in the initial epoch, DQN’s performance lagged behind the other algorithms, committing 241 errors compared to 214 by the remaining algorithms.

Sum errors from unknown benchmark instances using strict RL algorithms.
Based on the results of this experiment, TRPO emerges as the clear frontrunner in overall performance. The learning-curve analysis shows TRPO converging rapidly towards optimal rewards, surpassing the reference threshold at epoch 110. TRPO is also efficient at minimizing both Steps and Errors, reaching the specified thresholds earlier than its counterparts. This speed may be attributed to TRPO’s trust region approach, which limits how far each policy update can move from the previous policy and so keeps learning stable. While PPO also exhibits competitive results, TRPO’s consistent and rapid generalization across rewards, steps, and errors highlights its robustness on the benchmark instances. TRPO is therefore the algorithm of choice in this experimental context. Note that the offset carries less weight in this decision: during the initial phase, when the algorithms are used strictly, they lack knowledge of the scenario, so their performance during exploration is stochastic.
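For reference, the trust region approach mentioned above is usually written as a constrained optimization over the policy parameters. The formulation below is the standard one from the TRPO literature rather than something defined in this study; the symbols (policy π_θ, previous policy π_θold, advantage estimate A, and trust-region size δ) are introduced here for illustration.

```latex
\max_{\theta}\;
\mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
\left[
  \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\,
  A^{\pi_{\theta_{\text{old}}}}(s,a)
\right]
\quad \text{subject to} \quad
\mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}
\left[
  D_{\mathrm{KL}}\!\left(
    \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)
  \right)
\right] \le \delta
```

The constraint bounds the average KL divergence between the old and the new policy, so each update stays close to the behavior that generated the data.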
The experiment continues with the selection of TRPO as the algorithm for the second phase, which involves its utilization alongside KTTs on known derived instances. Moving on to the third phase, TRPO, in conjunction with KTTs, is simulated against the unknown benchmark instances from both scenarios.
Figures 7, 8, and 9 depict the learning curves and initial epochs for each of the algorithms, analyzing Rewards, Steps, and Errors respectively. Furthermore, in Figs 7 and 8, a gray dashed line represents the sum of the means and standard deviations for each benchmark scenario. This representation serves as a reference channel, illustrating the values of these variables in simulations involving random agents, as outlined in Table 3.

Sum rewards from unknown benchmark instances using KTTs with TRPO.

Sum steps from unknown benchmark instances using KTTs with TRPO.
Examining Rewards, Steps, and Errors in Figs 7(a), 8(a), and 9(a), it becomes evident that none of the KTTs converged to optimal values. Transfer learning performed worst, unable to leverage the knowledge acquired during training to gain an advantage in an unknown scenario. For imitation learning, the absence of convergence can be attributed to the nature of imitation techniques, which aim to replicate learned behavior rather than to seek optimal values through new exploration. Additionally, Figs 7(a) and 8(a) reveal that, for imitation learning, both Rewards and Steps stay within the limits observed for random agents. This raises the question of whether any substantial learning occurred during the training process.
Nair et al. [25], addressing the challenges of offline RL with online fine-tuning, note that fine-tuning frequently suffers an initial performance dip due to the shift in data distribution. One contributing factor to the performance observed in the experiments may be the use of the flat actions feature in the simulator. With flat actions, the action space is composed of N discrete actions, where N depends on the number of hosts in the network and the number of exploits and scans available. Flat actions are required to run NASim through the OpenAI Gym API with the Stable-Baselines3 algorithms. One potential mitigation would be to work without flat actions, so that each action is a vector whose elements each specify a parameter of the action; another would be to adopt strategies for online fine-tuning after offline learning, such as AWAC [25] and Cal-QL [26]. Neither alternative was pursued in this research because of the complexity of the modifications that would be required in the Stable-Baselines3 algorithms, the OpenAI Gym API, and the simulator.
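To make the distinction between flat and parameterised actions concrete, the sketch below enumerates a toy action space both ways. It is a simplified illustration that does not call NASim: the host addresses, the exploit and scan names, and the rule N = hosts × (exploits + scans) are assumptions chosen for the example, and the simulator's real action list may contain further action types.

```python
from itertools import product

# Hypothetical toy network (not one of the benchmark scenarios).
hosts = [(1, 0), (2, 0), (2, 1)]          # assumed (subnet, host) addresses
exploits = ["e_ssh", "e_ftp"]             # assumed exploit names
scans = ["service_scan", "os_scan"]       # assumed scan names

# Flat action space: every (target host, action) pair gets one integer index.
flat_actions = list(product(hosts, exploits + scans))
n_actions = len(flat_actions)             # N = |hosts| * (|exploits| + |scans|)


def decode(index: int) -> dict:
    """Map a flat action index back to its parameterised form."""
    target, name = flat_actions[index]
    return {"target": target, "action": name}


print(n_actions)
print(decode(5))
```

An agent with a single discrete output can select any of the N indices directly, whereas the parameterised form requires the policy to produce each element of the action vector, which is the modification that was judged too complex here.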
However, upon analyzing the offset of rewards and steps as presented in Figs 7(b) and 8(b), a slight advantage emerges for the imitation learning algorithms compared to the strict utilization of TRPO. Notably, in the context of the Errors offset, demonstrated in Fig. 9(b), DAgger outperformed TRPO, committing 194 errors compared to TRPO’s 214 errors.
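The size of this offset gap can be expressed as a relative difference. The short calculation below uses only the two first-epoch error counts reported above (214 for strict TRPO and 194 for DAgger with TRPO); the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(baseline: float, candidate: float) -> float:
    """Fractional reduction of `candidate` with respect to `baseline`."""
    return (baseline - candidate) / baseline


trpo_errors = 214      # strict TRPO, first epoch (Fig. 9(b))
dagger_errors = 194    # DAgger with TRPO, first epoch (Fig. 9(b))

reduction = relative_reduction(trpo_errors, dagger_errors)
print(f"{reduction:.1%}")  # prints 9.3%
```

In other words, DAgger with TRPO committed roughly 9.3% fewer first-epoch errors than strict TRPO.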

Sum errors from unknown benchmark instances using KTTs with TRPO.
These observations underscore the complexities inherent in integrating KTTs into RL algorithms. Although transfer and imitation learning might not consistently lead to optimal convergence, they demonstrate advantages in terms of offset performance.
The next phase, covering the evaluation of TRPO with KTTs in unknown instances and the comprehensive analysis of the results, will provide more information on the effectiveness of these techniques in relation to stealth.
In assessing performance from the stealth perspective, which primarily centers around the variables of Steps and, more importantly, Errors committed, Fig. 10 serves to consolidate the sum of first epoch values observed in the experiment for both strictly utilized algorithms and those involving KTTs in conjunction with TRPO.

Evaluation of stealth in the first epoch.
In Fig. 10, DAgger with TRPO achieved the most favorable metrics in terms of Steps, Errors, and even Rewards, despite the latter being negative. Although it surpassed the second-place contender, DQN (the best of the strict algorithms in terms of offset), by only a slight margin, the significance of this advantage should not be understated: in the context of stealth, even the slightest margin is important.
In this analysis, one of the primary issues arises from the highly parametrizable nature of the NASim action space. TL and IL techniques typically rely on knowledge gained from a source domain to improve performance in a target domain. However, in NASim, the context, including hosts, networks, and vulnerabilities, can vary significantly between training and testing scenarios. Therefore, when TL or IL models are trained on scenarios with specific hosts, networks, or vulnerabilities, they tend to learn context-specific strategies. These strategies may not generalize well to scenarios with different contexts. In summary, what is learned during training is tightly bound to the specific characteristics of the training environment.
Conclusion
As a consequence of the escalating harm inflicted upon organizations by global cyber threats, investigations into cyber attacks have gained substantial relevance. These inquiries aim to comprehend these threats and devise solutions to mitigate their detrimental impacts.
In this context, this study examines autonomous cyber offensives from the perspective of the attacker, with a specific focus on the issue of stealth. The investigation reveals that prevailing research models autonomous cyber offensives as either a fully observable MDP or a POMDP. While RL is widely adopted to solve these decision processes in this domain, the literature concentrates predominantly on the broader training process, encompassing steps and reward mechanisms leading up to the point of generalization.
While the core motivation of this research revolves around mitigating errors during the training phase, the distinctiveness of the approach lies in the exploration of experiments with KTTs, specifically in unknown scenarios. The novelty is not merely in the integration of KTTs but, crucially, in assessing their performance in unfamiliar environments.
The primary objective is to scrutinize the effectiveness of both the strict application of RL algorithms and their integration with KTTs. The innovation stems from the methodology, which enables the conduct of experiments with KTTs in scenarios where their training data is known, and the target is unknown. This unique aspect allows for an assessment of whether the utilization of KTTs, as an alternative to the strict use of RL, proves effective in unforeseen circumstances.
Therefore, this research addresses the issue of executing stealthy cyber attacks orchestrated by intelligent agents, guided by reinforcement learning, within unknown scenarios. The inherent risk lies in the possibility of exposing an ongoing attack due to errors made by RL-guided agents, particularly in the initial stages when the target environment is uncharted. Consequently, the research poses the problem as a POMDP, as this perspective mirrors the real-world attacker scenario, where engagements are executed against unfamiliar settings.
Faced with this problem, this work evaluated the use of KTTs as a way to reduce the probability of early exposure by smoothing the number of errors made while training RL-based cyber offensives against unknown scenarios. The efficiency evaluation therefore includes error rates and focuses mainly on the initial attack phase, considering attacks conducted by strict RL algorithms as well as by RL in conjunction with KTTs. Before these KTTs were applied in an attack on an unknown scenario, they were first trained on known derived instances of the scenarios to evaluate their effectiveness.
In order to compare RL-based cyber attacks with those incorporating KTTs, this study has devised a laboratory platform and methodology capable of facilitating simulations involving a range of cyber offensive algorithms across instances of unknown scenarios. Given the inherent challenge associated with applying TL and IL to target unfamiliar environments, the laboratory platform and method assume pivotal roles in addressing this limitation. The proposed methodology encompasses preliminary training in well-defined derived instances from established scenarios, subsequently followed by the integration of KTTs in an unknown instance to assess agent performance within the realm of RL. The significance of this evaluation is rooted in its capacity to ascertain whether the performance enhancements offered by KTTs extend to unexplored scenarios.
The experimental results revealed a noteworthy advantage demonstrated by DAgger with TRPO in the realm of stealth, underscoring the potential merits of employing IL strategies to achieve effective performance. Nevertheless, the advantage, while sometimes resulting in improved outcomes, may not be overly significant. Within the context of stealth, even the slightest disparities in terms of steps and errors carry considerable significance. It was observed during simulations that IL-guided agents tend to exhibit behavior closely aligned with the limits of benchmark steps and rewards (random mode) within each scenario. Additionally, these agents did not demonstrate enhancements in terms of speed and generalization. On a different note, despite not being considered in the reference studies of this research, the TRPO algorithm displayed robustness and promise, outperforming others in terms of speed and generalization performance.
TL and IL techniques traditionally harness knowledge from a source domain to enhance performance in a target domain. Nevertheless, in the context of NASim, the environment’s intricacies, including hosts, networks, and vulnerabilities, can exhibit substantial variations between training and testing scenarios. Consequently, as observed in the experiments, when TL or IL models undergo training within scenarios featuring specific hosts, network configurations, or vulnerability profiles, they tend to acquire strategies closely tied to those particular contexts. Regrettably, these context-specific strategies often struggle to generalize effectively to scenarios characterized by distinct contexts.
In conclusion, this study introduces a promising methodology for assessing and training RL-based offensive agents in unknown scenarios. Future work could include averaging results from experiments across multiple derived instances, exploring alternative simulators such as CybORG, integrating methods that improve KTT performance through offline RL training and online fine-tuning (e.g., CQL [16], IQL [15], TD3+BC [8], AWAC [25] and Cal-QL [26]), and developing context-independent KTTs with the potential to substantially enhance stealth performance in terms of offset, speed, and generalization. This prospect is particularly relevant to the initial stages of an attack, where achieving optimal performance remains a critical consideration.
