Abstract
This work explores the use of generative adversarial networks (GANs) to tackle cyber-security challenges, including threat identification, anomaly detection, and mitigation strategies, particularly in complex systems and critical infrastructures like industrial control systems for energy production and distribution. GANs address a key obstacle in machine learning (ML)-based systems: the scarcity of quality data for training models capable of fully leveraging ML and deep learning in cyber-security applications, such as intrusion detection systems and malicious behaviour detectors. The study highlights GANs’ potential to enhance data augmentation by generating realistic synthetic network traffic flows. These flows simulate common cyber-attacks targeting operational technologies (OTs) and information technologies (ITs). A primary contribution of this research is the creation of a large, high-quality dataset of OT and IT network traffic, designed to improve the robustness of ML models used in cyber-defense systems. Additionally, the work includes statistical analyses to evaluate the reliability of GAN-based data augmentation, laying the foundation for further research. This approach promises significant advancements in developing resilient ML models capable of addressing evolving cyber-security threats.
Keywords
Introduction and background
Recent advances and upgrades in industrial information technologies (ITs) have brought many benefits but also a lot of new cyber-security threats to industries and organizations.
Malicious actors have become exceptionally skilled at infiltrating and compromising their victims’ systems and critical infrastructures (CIs), including industrial control systems (ICSs), and at stealing valuable production resources and data.
In the past, ICSs and their communication networks were generally thought to be quite safe, since they were typically ‘obscured’ and isolated from the external world of the corporate network or the internet. The current scenario is completely different: such systems are now fully connected to the external environment and have thus become more exposed and vulnerable.
Therefore, a very different approach to facing security issues is required. Such an approach must be focused on identifying new cyber threats, vulnerabilities, and points of failure of ICSs.
Recent advances in artificial intelligence (AI)-based cybersecurity tools and solutions can offer valid support for the security assessment of CIs and ICSs such as energy and power generation and distribution systems.
Furthermore, several AI-based approaches and methodologies can currently be deployed to assess the cyber-security risks affecting industrial systems, and to address current issues and trends in the design of security countermeasures for energy and power management and distribution systems. More precisely, this work focuses on specific case studies concerning the application of an adversarial machine learning (ML) approach, exploiting generative adversarial networks (GANs),1,2 to the industrial systems domain.
ICSs can be considered a standardized approach to organizing diverse and extensively connected sets of technologies that automate and control significant systems that have become essential to our daily lives. These include railway switches, power moving through electrical grids, oil and water flowing through pipelines, and systems controlling pharmaceutical and food manufacturing, to name just a few.
ICSs are part of the wider operations technology (OT) environment in industrial enterprises. They consist of two main subsystems:
ICSs were originally designed to achieve goals such as increasing performance, reliability, and safety, and reducing manual effort.
The easiest way of achieving cyber-security for ICSs was to create an ‘air gap’ and implement ‘physical isolation’, an approach better known as ‘security by obscurity’. 3
Moreover, computational and AI techniques have significantly enriched systems in several application domains, among them power electronics and power engineering. 4
Modern smart grids (SGs) and renewable energy systems (RESs) provide tangible examples of industrial systems that have been provided with tools for design, simulation, control, fault diagnostics, as well as anomaly detection and estimation. The contribution of AI to such kinds of systems has been discussed in a vast number of research works and recent scientific literature, with a focus placed on the potential of the combination of AI and energy control and management systems.
Bose 5 discusses some examples of AI techniques, such as expert systems, fuzzy logic, and ML (e.g. neural networks), applied to SG and RESs, evidencing the contribution of computational and cognitive services to enhance a multitude of tasks, such as automated design, simulation, control and health monitoring of modern energy generation and distribution systems.
Emerging AI techniques are among the most effective tools for fighting cyber threats. By combining threat information with real-time data monitoring, orchestration, and automated response, AI analytics solutions are proving preferable to legacy systems and to approaches driven by human intervention, particularly in terms of response times.
The paper is organized as follows: Section 1 provides a general introduction and the background of the work; Section 2 describes the research aims and the scope of the work. Section 3 presents related work. Section 4 describes the adopted methodology and the experimental design of the GAN. Section 5 discusses the evaluation metrics adopted to assess the performance of the experimental activities. Section 6 concludes the work and suggests some future directions for further investigation.
The aim of this research is to assess the potential of generative methods for automatically producing data useful for analysing and addressing cyber-security issues.
More precisely, the adoption of GANs was investigated to assess whether this kind of solution can effectively support meaningful, multivariate data augmentation that meets the requirements of complex network communication domains: traditional networks, Internet of Things infrastructures, and ICSs, especially energy production, control, and distribution infrastructures.
Related work
ML is a branch of AI based on statistical learning mechanisms. As such, ML models need data of adequate quantity, variety, and quality to be accurate in accomplishing a desired task, such as classification, prediction, and recognition. Data generation models are hungry for data too, since they are typically based on unsupervised learning mechanisms that are even more sensitive to the amount of available data.
Generative models need to be provided with meaningful data, since they must learn to mimic genuine, naturally generated data describing actual real systems and to generate closely similar artificial data.
Several data collections from various application domains have been made available in the recent past. Unfortunately, for the specific domain of cyber security defence of ICSs, such as energy infrastructures, available datasets are very few and most of them often suffer from class imbalance problems.
While it is true that learning from imbalanced datasets is a common but challenging task for standard classification algorithms, the research described in this work focuses on the generation of artificial data to enhance ML-based defence systems against cyber-attacks. 6
Therefore, a brief overview of other available oversampling methods is given below, to introduce and support the motivations in favour of a GAN-based data augmentation strategy. A more detailed review of the most well-assessed data generation methods can be found in the literature.7–9
Oversampling methods
Oversampling methods are typically adopted to solve data between-class imbalance problems, where synthetic examples are generated to enrich the minority classes and to add them to the training set of an ML model. The ‘Random Oversampling’10,11 is a simple method that generates data by copying random minority class samples. Its main drawback is the increased possibility of overfitting since a mere replication of training examples is performed.
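The random oversampling idea can be illustrated with a minimal sketch. The function name, the list-based data layout, and the choice to balance every class up to the size of the largest one are assumptions made for illustration; they are not taken from the cited works.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Copy randomly chosen minority-class samples until every class
    has as many samples as the largest class (illustrative choice)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        # Pool of original samples belonging to this class.
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - count):
            # Each added sample is an exact copy of an existing one,
            # which is why this method is prone to overfitting.
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels
```

Because no new information is created, a classifier trained on the result sees the same minority points repeatedly.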
Another method, namely ‘SMOTE’, instead generates synthetic data along the line segment that joins minority class samples, thus avoiding the overfitting problem caused by mere replication. This method has the disadvantage of producing several noisy samples, since the majority and minority cluster boundaries are often not clearly identifiable. 12 Several improvements of the SMOTE method have been proposed to avoid this problem, as described in the literature.13,14
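The interpolation step at the core of SMOTE can be sketched as follows. This is a simplified illustration, not a reference implementation: the function name, the default `k`, and the use of plain Python lists are assumptions, and at least two minority samples are required.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points, each lying on the segment between
    a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x, excluding x itself.
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # position along the segment from x to nb
        synthetic.append([xi + u * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic
```

Since only minority neighbours are consulted, a segment may cross a region occupied by the majority class, which is the source of the noisy samples mentioned above.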
Another type of problem faced when adopting synthetic data augmentation is the within-class imbalance15,16 that occurs when sparse or dense sub-clusters of minority or majority instances exist. Clustering-based oversampling methods have recently been proposed to mitigate both the between-class and within-class imbalance problems. Most of these methods are based on initially partitioning the input space and then applying sampling methods to adjust the size of the various clusters, while other variants such as Cluster-SMOTE 17 apply the k-means algorithm and then generate artificial data by applying SMOTE within the clusters.
As the literature overview shows, most of the existing oversampling methods are variations of the SMOTE algorithm. These approaches are based on local information, rather than on the overall minority class distribution.
Contrary to this kind of algorithm, the conditional version of GANs (C-GANs) 17 can be used to approximate the true data distribution and generate data for the minority class of various imbalanced datasets. The performance of C-GANs has been compared in the recent scientific literature against multiple standard oversampling algorithms. Most of the documented comparisons report significant improvements in the quality of the generated data when C-GANs are used instead of more classic oversampling algorithms.
GANs and C-GANs
Recent scientific literature and growing industrial interest have highlighted the potential of GANs1,2 to address real-world challenges beyond experimental settings. These AI-driven systems offer a viable alternative to traditional laboratory simulations, particularly in terms of data generation.
One of the key advantages of GANs is their ability to reduce the time and effort required for experimental setup. Unlike simulation environments that necessitate hardware deployment, software configuration, and complex management systems, GANs primarily require a sample dataset as input. By learning from this data, GANs can generate synthetic yet realistic data that mimics the characteristics of the real-world phenomenon being studied.
This efficiency makes GANs a promising tool for various applications, from medical imaging to autonomous driving, where the generation of large quantities of high-quality data is crucial for training and testing ML models.
A GAN operates as a dynamic system that receives ‘draft’ input data (i.e. an input that does not conform to the requirements, typically referred to as ‘noise’) and chases a reference output, the ‘desired output’. The system learns how to modify the input so that the output is as similar as possible to the desired one (i.e. the deviation between the output and the desired output is contained within a predetermined margin of error).
The learning process is based on the progressive production of outputs that are evaluated by an oracle, which decrees their adherence or non-adherence to the pursued reference. When the oracle's evaluation is such that it meets a certain threshold of validity, typically measured by the adherence or verisimilitude of the distributions of the generated outputs with those of the references, the trained generative network can be extracted from the training cycle and employed for the generation of synthetic instances adherent to the model characteristics of real data samples.
GANs definition
In this section, we provide a summary of the GAN and C-GAN frameworks following closely the notation proposed in 1, 2, and 17.
A GAN is based on a competitive or adversarial mechanism, in which two neural networks, namely a generator G and a discriminator D, try to outsmart each other. The goal of the generator is to cheat the discriminator, while the discriminator has the objective of telling the artificial instances produced by the generator apart from the genuine instances belonging to the original dataset. If the discriminator can easily identify the artificial or synthetic instances, then the generator is producing low-quality data. Setting up a GAN consists of training the generator while the discriminator, which also improves during training, provides feedback about the quality of the generated instances, forcing the generator to increase its performance at each step.
Formally, the generative network is a model G, defined as G: Z → X, where Z is the noise space of arbitrary dimension dZ that corresponds to a hyperparameter and X is the data space representing the reference data distribution. The discriminative model, defined as D: X → [0, 1], is typically a binary classifier that estimates the probability that a sample comes from the data distribution rather than G. These two models, which can be implemented with several kinds of deep neural networks, compete in a two-player min–max game with value function:
The training procedure alternates between k optimization steps for D and one optimization step for G, applying stochastic gradient descent. During training, D is therefore optimized to correctly classify training data and samples generated by G, assigning them 1 and 0, respectively. The generator, on the other hand, is optimized to confuse the discriminator, so that D assigns the label 1 to samples generated by G. The unique solution of this adversarial game corresponds to G recovering the data distribution and D being equal to ½ for any input (equation (1)).
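For reference, the value function of the two-player game is not reproduced in this text. In the standard GAN formulation, which is assumed here to be the one meant by equation (1), it takes the form below. The symbols p_data (the reference data distribution over X) and p_z (the noise distribution over Z) are notational assumptions introduced for this sketch.

```latex
\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
\tag{1}
```

The first term rewards D for assigning values close to 1 to real samples, and the second rewards D for assigning values close to 0 to generated samples, while G is trained to make the second term as small as possible.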
The C-GAN is an extension of the above GAN framework. An additional space Y is introduced, which represents the external information coming from the training data. The C-GAN framework modifies the generative model G, to include the additional space Y, as follows:
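The conditional mappings are not reproduced in this text. Under the standard conditional formulation, assumed here, both networks receive the external information y ∈ Y as an additional input. The symbols p_data(x, y), p_z and p_y are notational assumptions introduced for this sketch.

```latex
G : Z \times Y \rightarrow X, \qquad D : X \times Y \rightarrow [0, 1]

\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{(x, y) \sim p_{\mathrm{data}}(x, y)}\bigl[\log D(x, y)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z),\, y \sim p_{y}(y)}\bigl[\log\bigl(1 - D(G(z, y), y)\bigr)\bigr]
```

In an oversampling setting, y is typically the class label, so that the trained generator can be asked for samples of a chosen (for example, minority) class.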
In this section, we describe the methodology adopted to retrieve and identify a set of samples to use as the input reference for the training process of a GAN. The GAN model obtained as the result of the training process is then used to generate synthetic traffic datasets that can be used to simulate ad hoc attacks on infrastructures based on OT and IT protocols, along with a subset of Linux-based systems.
Reference data retrieval
The process of retrieving effective reference data to generate synthetic data related to network traffic for OT and IT protocols was divided into three steps:
Retrieval/generation of network traffic for OT protocols.
Retrieval/generation of network traffic for IT protocols.
Retrieval/generation of network activity traces associated with attack patterns, such as lateral movements, elevation of user privileges, etc.
The process of finding useful data for GAN training has been divided into the three phases listed above for the following reasons:
Mastering the complexity of the problem, since the corresponding operational scenarios have characteristics that partially intersect, while other characteristics are specific and peculiar to each protocol.
Finding and using publicly available, reliable datasets (open source/paid source) that are available for consultation and use, in compliance with the respective terms of use and copyrights.
While reliable datasets to reproduce attacks against IT protocols are available, it is not possible to find open-source datasets that can be used to reproduce attack patterns against OT protocols.
To overcome this limitation for OT protocols, a set of software services has been designed to reproduce a suitable environment to simulate network traffic samples. Both normal and attack scenarios were simulated, with the aim of capturing a variety of communication data flows. Details concerning the scenarios are provided in the following sections.
For this reason, most of the data used for the generation of synthetic datasets related to attacks against OT protocols were simulated by software. The process was conducted to generate a sample of data used to train the GAN and produce the ‘desired output (datasets containing attack patterns)’.
Finally, it should be noted that different instances of GAN were trained to generate traffic to use in attack scenarios involving different operational protocols covered by the study. The datasets generated by the GAN include both malicious traffic and ‘normal’ traffic (also referred to as ‘background’ traffic).
This section describes the case studies considered for the generation of synthetic data samples. More precisely, the artificial data generated were divided into three sub-categories related to OT, IT, and Linux-based traffic characteristics. We also provide some samples from the whole research study performed.
OT protocols attack traffic
For the case study of OT protocols, four protocols were considered to produce synthetic traffic.
More precisely, the following ones have been considered:
The set-up operations needed to prepare the data generation process performed by the C-GAN, for the MODBUS protocol example, are detailed in the following sub-section.
MODBUS over TCP
The traffic generated according to this application protocol consists of message pairs of the ‘Query/Response’ type, exchanged between a master station (client) and a slave station (server).
The total volume of traffic generated includes both normal and attack traffic exchanged on port 502 without transport layer security (TLS), that is, without master/slave authentication and without data encryption.
So, two datasets were generated for this protocol, corresponding to:
‘Background’ traffic, that is, benign traffic.
Attacker traffic in the presence of cyber-attacks, covering different types of attacks that can be implemented at both the application protocol layer (MODBUS) and the communication layers: transport (TCP), network (internet protocol (IP)), datalink (Address Resolution Protocol) and physical.
The specific types of attacks are detailed below. The operational assumptions of the scenario related to this type of attack are described below:
Specifically, in the illustrated scenario, the attacker generates 16 IP addresses belonging to the server station subnet. IP-spoofing and suspicious activity can be traced by observing the packets at layer 2, in which it is verified that the MAC Address corresponding to the 16 different IP addresses is identical.
The 16 client nodes open normal TCP/MODBUS connections with the server but do not transmit any actual requests for substantive operations, such as register reads/writes. Having completed the handshake phase, the clients send numerous consecutive [PUSH-ACK] packets to the server, causing it to terminate the connections by sending RESET [RST] messages.
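The layer-2 check described for this scenario, namely several source IP addresses sharing a single MAC address, can be sketched as follows. The function name, the `(src_mac, src_ip)` input format, and the threshold are assumptions for illustration; the addresses used in any example are placeholders and do not come from the dataset.

```python
from collections import defaultdict


def macs_with_multiple_ips(packets, min_ips=2):
    """packets: iterable of (src_mac, src_ip) pairs read from captured frames.
    Returns the MAC addresses seen with at least min_ips distinct source IPs."""
    ips_by_mac = defaultdict(set)
    for mac, ip in packets:
        ips_by_mac[mac].add(ip)
    return {
        mac: sorted(ips)
        for mac, ips in ips_by_mac.items()
        if len(ips) >= min_ips
    }
```

In the scenario above, the attacker's MAC address would be reported together with its 16 spoofed IP addresses, while legitimate stations would map to a single address each.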
A more specific attack aimed at MODBUS is ‘control-logic injection’. In its most typical scenario, the attacker attempts to force the download of a file containing executable code (control logic) to a device, typically a programmable logic controller (PLC), in order to manipulate it; the MODBUS traffic exchanged during the attempted download is logged. This type of attack is documented by experiments conducted on Schneider Modicon M221 PLC-type devices. 21
Typically, ‘control logic injection’ attacks aim to transfer blocks of malicious code to a PLC. This type of attack can be prevented by blocking network packets whose payload contains a block of code. The PLC protocol header contains fields that indicate the type of payload and are used to intercept packets that contain ‘code-blocks’ by means of appropriate signatures. For some intrusion detection systems (IDSs), such as Snort, signatures are available to monitor ICS traffic, including code-block transfer. When code-block-related signatures are intercepted in network packets, exceptions are raised on the MODBUS. The scenario considered relates to attempts to inject or transfer malicious control logic over the network to Schneider Electric Modicon PLC-type devices. The injection attempt is revealed by response messages that the server sends with the function code set to ‘90’, which indicates an alert raised by attempts to upload/download control logic to a device.
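As an illustration of how such responses could be spotted in captured traffic, the sketch below reads the function code from a MODBUS/TCP frame. It assumes the standard 7-byte MBAP header (transaction identifier, protocol identifier, length, unit identifier) followed by the one-byte function code. The function and constant names are hypothetical, and this is not the signature logic used by any specific IDS.

```python
import struct

ALERT_FUNCTION_CODE = 90  # value reported in the text for this scenario


def modbus_function_code(frame: bytes):
    """Return the function code of a MODBUS/TCP frame, or None if the
    payload is too short or is not MODBUS (protocol identifier != 0)."""
    if len(frame) < 8:
        return None
    # MBAP header: transaction id (2), protocol id (2), length (2),
    # unit id (1), followed by the function code (1); all big-endian.
    _tid, proto_id, _length, _unit, function_code = struct.unpack(
        ">HHHBB", frame[:8]
    )
    if proto_id != 0:
        return None
    return function_code


def is_control_logic_alert(frame: bytes) -> bool:
    """True when the frame carries the function code 90 discussed above."""
    return modbus_function_code(frame) == ALERT_FUNCTION_CODE
```

A frame would be obtained from the TCP payload of packets exchanged on port 502; extracting that payload from a capture file is outside the scope of this sketch.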
The attack scenario presented belongs to the DoS family of attacks, and the attack technique adopted is ‘SYN-ACK Flooding’. In this type of attack, one or more clients send numerous TCP packets with the SYN flag set, to open many connections with the server and try to saturate its resources and/or slow down its main services, as it is continually flooded with messages opening new, entirely fictitious connections. In this scenario, a single client performs a ‘SYN-FLOODING’ attack without ever finalizing any of these connections in order to exchange data with, or request services from, the MODBUS server.
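The behaviour described above, many connection openings that are never finalized, suggests a simple counting heuristic, sketched below. The input format, the threshold value, and the function name are assumptions for illustration; the flag constants are the standard TCP header bits.

```python
from collections import Counter

SYN = 0x02  # TCP SYN flag bit
ACK = 0x10  # TCP ACK flag bit


def syn_flood_suspects(packets, threshold=100):
    """packets: iterable of (src_ip, tcp_flags) for client-to-server segments.
    Flags sources that sent at least `threshold` pure SYN segments and never
    sent a pure ACK segment, i.e. never completed a handshake."""
    syn_only = Counter()
    acked = Counter()
    for src, flags in packets:
        if flags & SYN and not flags & ACK:
            syn_only[src] += 1
        elif flags & ACK and not flags & SYN:
            acked[src] += 1
    return [
        src for src, n in syn_only.items()
        if n >= threshold and acked[src] == 0
    ]
```

A real detector would also bound the observation window in time; this is omitted here to keep the sketch short.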
The type of traffic generated to attack MODBUS over TCP protocol vulnerabilities (without TLS) is as follows:
The traffic generated for this application protocol consists of message pairs exchanged between a client (requestor) and a server. The phase of requesting action execution for data retrieval (MMS-PDU) by the client is preceded by a phase of establishing a connection between client and server.
The client initiates communication by sending an MMS-type message ‘initiate-RequestPDU’ to the server (Figure 9) to establish a connection. The server can confirm that the connection has been properly established by responding to the client with an MMS-type message ‘initiate-ResponsePDU’.
The client forwards the actual data request to the server by sending an MMS-type message ‘confirmedRequestPDU’, to which the server responds with either a ‘confirmed-ResponsePDU’ message containing the requested data or an error message of type ‘confirmedErrorPDU’.
The dataset consists of traffic traces (‘*.pcaps’) related to the underlying TCP. The traffic exchanged over the (insecure) TCP transport channel is identified on port number 102.
Four datasets were generated for this case study, corresponding to the following scenarios:
‘Background’ traffic, that is, ‘normal’ or benign traffic.
‘DoS - Request Flooding’ type of attacking traffic.
‘Resource Scan’ type of attacking traffic.
‘False Data Injection’ type of attacking traffic.
The traffic generated for this application protocol is composed of messages exchanged in accordance with a PUBLISH/SUBSCRIBE paradigm. Both the publisher (the provider of published data and services) and the subscriber (the requester of data and services) behave as clients of a third entity called the BROKER. The broker acts as a server for both types of clients and is in charge of handling both requests for publication of topics, issued by publishers, and requests for subscription to one or more topics, issued by subscribers. The generated dataset consists of traffic traces (‘*.pcaps’) related to the underlying TCP. The traffic exchanged over the TCP transport channel is identified on port number 1883.
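To make the publisher/subscriber/broker roles concrete, the sketch below classifies MQTT control packets from the first byte of their fixed header, whose upper four bits carry the packet type, and counts PUBLISH packets per client. The function names and the `(client, payload)` input format are assumptions for illustration; only a subset of packet types is listed.

```python
from collections import Counter

# Subset of MQTT control packet types (upper nibble of the first byte).
MQTT_PACKET_TYPES = {
    1: "CONNECT",
    2: "CONNACK",
    3: "PUBLISH",
    8: "SUBSCRIBE",
    9: "SUBACK",
    12: "PINGREQ",
    14: "DISCONNECT",
}


def mqtt_packet_type(payload: bytes):
    """Return the packet type name for an MQTT payload, or None if empty."""
    if not payload:
        return None
    return MQTT_PACKET_TYPES.get(payload[0] >> 4, "OTHER")


def publish_counts(messages):
    """messages: iterable of (client_id, payload) pairs sent to the broker.
    Returns the number of PUBLISH packets sent by each client."""
    counts = Counter()
    for client, payload in messages:
        if mqtt_packet_type(payload) == "PUBLISH":
            counts[client] += 1
    return counts
```

Per-client PUBLISH counts of this kind are one simple way to characterize a publish-flooding pattern against the background traffic.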
Four datasets were generated for the case study of the MQTT over TCP protocol, corresponding to the following scenarios:
‘Background’ traffic, that is, normal communication traffic.
Attacking traffic of three types:
  Brute Force attack.
  Distributed DoS attack of the Publish Message Flooding type.
  DoS attack of the Malformed Packets Injection type.
Over the past few years, several datasets collecting IT network protocol traffic have been published. One of the most widely adopted in recent research on ML tasks, including the training of network traffic classifiers and IDSs, is the CIC-IDS2017 dataset, published by the Canadian Institute for Cybersecurity. 22 A fragment of this dataset was selected and adopted for the experiments described in this work, as follows.
Features engineering
The CIC-IDS2017 dataset was created by the Canadian Institute for Cybersecurity and the University of New Brunswick. The network traffic collected in this dataset was generated using several IT protocols, among which HTTP, HTTPS, the file transfer protocol, the secure shell protocol, and email protocols, in order to monitor 25 users and to infer users’ behaviour. The CIC-IDS2017 dataset includes several complex features and a large amount of traffic and attributes that can be used to detect anomalies and malicious activities. For example, some complex features included in CIC-IDS2017 that are not available in other frequently used datasets, such as NSL-KDD, 23 are (a) Subflow Forward Bytes and (b) Total Length of Forward Packets, which are significant for detecting Infiltration and Bot attack types.
Feature selection (FS) is the first data pre-processing step to accomplish before implementing any data analytics task. Data consisting of several features affect the computational complexity, thus significantly increasing the amount of resource usage and time consumption for data processing.
The CIC-IDS2017 dataset includes a large number of network traffic samples and features, which had to be reduced before use.
So, the first step of the process consisted of producing a normalized version of the initial CIC-IDS2017 dataset: the incomplete instances of the original dataset were deleted, and a further feature reduction was applied before proceeding to conditional generation by means of training the C-GAN.
There are several feature reduction techniques in the scientific literature.24–26 In the experiment described in this work, the information gain was used as the FS technique, since it represents the most used technique in IDS research.
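As a reminder of what the information gain criterion measures, the sketch below computes it for one discrete feature as the reduction in class-label entropy obtained by grouping the samples by feature value. The function names are illustrative, and the sketch assumes the feature is discrete or has already been discretized; it is not the tooling used in the experiment.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (c / total) * math.log2(c / total) for c in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Entropy of the labels minus their expected entropy after
    splitting the samples by the value of the feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(
        len(group) / total * entropy(group) for group in groups.values()
    )
    return entropy(labels) - conditional
```

Features can then be ranked by this score, with higher values indicating features that carry more information about the traffic class.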
Features were ranked and grouped according to the minimum weight values to select relevant and significant features, and then the JRip classifier algorithm was applied to extract 18 features from the initial set of 83 features provided in the CIC-IDS2017 dataset.

Figure 1. Overall system architecture for generative adversarial networks (GANs) federation.
Moreover, a fragment of the initial CIC-IDS2017 dataset, comprising the 18 most salient features and about 90,000 samples, split into two balanced parts of about 45,000 samples each for normal and attacking traffic, was prepared to train the C-GAN used for synthetic IT traffic generation, as summarized below.
Source data samples: ‘CIC-IDS2017’.
Number of total instances of attacker traffic: 44,193.
Number of total instances of normal traffic: 44,577.
Total number of traffic classes: 15, where:
  Number of attack categories: 14.
  Number of background traffic categories: 1.
Number of considered features to describe the network traffic: 19, where:
  Eighteen (18) features were extracted by applying the JRip algorithm.
  One (1) feature is the classification label, with values in [1, 15].
Feed-forward network. Number of hidden layers: 2. Layers dimension: [256, 256, 256]. Activation function: Leaky rectified linear unit (ReLU). Output function: Sigmoid.
Feed-forward network. Number of hidden layers: 2. Layers dimension: [256, 256, 256]. Activation function: ReLU activation function. Output class selection function: Hyperbolic tangent (tanh) and ‘softmax’ function for multiple class selection.
Number of training epochs: 300. Input block size (batch size): 100. Optimization function: ‘Adam optimizer’.
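One common way of feeding the class label to a conditional generator is to concatenate a one-hot encoding of the label with the noise vector. The sketch below assembles one such input batch using the class count and batch size listed above. The concatenation strategy, the noise dimension, the Gaussian noise, and all names are assumptions for illustration; the text does not specify how the conditioning was implemented.

```python
import random

N_CLASSES = 15    # total number of traffic classes, from the text
BATCH_SIZE = 100  # input block size, from the text
NOISE_DIM = 32    # assumption: the noise dimension is not stated in the text


def one_hot(label, n_classes=N_CLASSES):
    """One-hot vector for a class label in the range [1, n_classes]."""
    vec = [0.0] * n_classes
    vec[label - 1] = 1.0
    return vec


def generator_input_batch(labels, noise_dim=NOISE_DIM, seed=0):
    """Build one generator input row per requested label:
    Gaussian noise followed by the one-hot encoded class label."""
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 1.0) for _ in range(noise_dim)] + one_hot(y)
        for y in labels
    ]
```

For example, `generator_input_batch([3] * BATCH_SIZE)` would request a full batch of samples of class 3 from a generator trained in this way.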
As introduced in the previous section, multiple generative network instances have been trained to produce traffic reproducing cyber-attacks against different OT and IT protocols. Each instance was trained to respect the peculiarities of background and attacker traffic by working on different features and implementing the output requirements specific to each network.
Each GAN was designed to solve a specific task, such as generating synthetic traffic for a specific protocol and/or attack scenario.
The implemented federation of GANs is detailed in Figure 1. The image reports a hypothetical architecture of a system embedding both the logic for selecting the protocols and types of traffic to be generated and the actual data generation systems (the GANs).
Statistical evaluation
To carry out a preliminary evaluation of the performance of the synthetic traffic generation process using the C-GAN, several statistical tests were selected that are useful for verifying the ‘verisimilitude’ of the artificially generated samples with respect to those used as the training set of the GAN instances.
Choosing a suitable metric for evaluating the data generation process, that is, for estimating the quality of the synthetic data when compared with the ‘natural’ ones, requires a preliminary analysis of the traffic. This analysis aims to identify the types of values assumed by the features considered (quantitative or qualitative), their ranges of values, and whether they exhibit statistical properties (e.g. mean value and variance) that allow them to be described with specific distribution functions.
The preliminary analysis performed, for both types of protocol families, IT and OT, revealed the following:
Both datasets have both quantitative and qualitative (categorical) features.
All quantitative features, for both protocol families, take on values that follow different and entirely arbitrary distributions, in the sense that their descriptive statistical indicators do not make them reasonably attributable to, for example, symmetric or uniform distributions.
The results of the preliminary screening of the data, which lumps together both the ‘natural’ data used as a training set and those generated synthetically via GAN, suggest that the choice of metric for comparing sets of features with different distributions of values should fall to that of non-parametric statistical-type tests.
Therefore, based on these considerations, the non-parametric Kolmogorov–Smirnov test (KS test) 27 was selected for evaluating the similarity between the synthetically generated traffic samples and the ‘natural’ (training data) samples.
This type of test is suitable for assessing the pointwise closeness (distance) between the distributions of individual features and ultimately allows a ‘cumulative’ score of the similarity of the two overall sets of distributions to be assessed quantitatively.
Evaluation metrics
The KS test is performed by evaluating the distance between pairs of homologous features belonging to the two reference datasets. More precisely, the dataset of non-synthetic traffic flows, used as the training set of the GAN, provides the baseline, that is, the reference distributions. For each individual feature extracted from the synthetically generated dataset, the test evaluates the probability that it was drawn from the distribution of the homologous baseline feature.
The KS test makes it possible to determine how different two arbitrary distributions of values, A and B, are. Its result is a summary numerical indicator, called the KS score, defined on the normalized interval [0,1], which quantifies this difference:
The condition KS score = 1 indicates that the distance between the two distributions is maximum.
The condition KS score = 0 indicates that the two distributions are identical (minimum distance).
Typically, in lieu of the distance measure provided by the KS test, its complement (KS complement score = 1 - KS score) is considered, to indicate the ‘closeness’, that is, the degree of verisimilitude between the distributions. This measure of similarity is also defined on the normalized interval [0,1], but the semantics of its values are reversed with respect to those of the KS score, that is:
The condition KS complement score = 1 indicates that the distance between the two distributions is minimum, that is, their similarity is maximum.
The condition KS complement score = 0 indicates that the two distributions are completely different (maximum distance).
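To make the two indicators concrete, the following minimal sketch computes the two-sample KS score as the largest vertical gap between the empirical cumulative distributions of two samples, and the KS complement score as one minus that value. It is an illustration under stated assumptions, not the implementation used in this work: the function names `ks_score` and `ks_complement_score` are hypothetical, and only the Python standard library is used.

```python
from bisect import bisect_right
from typing import Sequence


def ks_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    if len(a) == 0 or len(b) == 0:
        raise ValueError("both samples must be non-empty")
    sorted_a, sorted_b = sorted(a), sorted(b)
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every value that occurs in either sample.
    for x in set(sorted_a) | set(sorted_b):
        cdf_a = bisect_right(sorted_a, x) / len(sorted_a)
        cdf_b = bisect_right(sorted_b, x) / len(sorted_b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def ks_complement_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity in [0, 1]: 1 means identical distributions, 0 means disjoint."""
    return 1.0 - ks_score(a, b)
```

Two identical samples give a KS score of 0 and a complement of 1, while two samples with no overlap in their value ranges give a KS score of 1 and a complement of 0, matching the conditions listed above.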
Statistical evaluation of synthetic data generation for IT protocols
For IT protocol traffic, given the pre-processing applied to the initial dataset extracted from CIC-IDS2017, the 19 features considered are not all quantitative: features such as IP addresses, MAC addresses and port numbers, as well as the traffic category label, are qualitative.
The KS test was carried out only on the 18 descriptive flow features, because the feature identifying the traffic class is used to select the classes of samples to be compared.
Moreover, since these 18 features are of mixed type, that is, partly quantitative and partly categorical (qualitative), the latter were mapped onto ranges of discrete quantitative values, to make the comparison and the calculation of the KS test score consistent.
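One possible way to map a qualitative feature onto discrete numeric values is sketched below. The text does not specify the mapping actually used, so this is only an assumption: a single code book is built from the values seen in both datasets, so that the same category receives the same integer in the baseline and in the synthetic sample. The function name `encode_shared` and the example protocol values are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def encode_shared(
    baseline: Sequence[Hashable], synthetic: Sequence[Hashable]
) -> Tuple[List[int], List[int], Dict[Hashable, int]]:
    """Encode a categorical feature with one code book shared by both datasets."""
    # Sorting by string form gives a deterministic order, also for mixed types.
    categories = sorted(set(baseline) | set(synthetic), key=str)
    codes = {value: index for index, value in enumerate(categories)}
    encoded_baseline = [codes[value] for value in baseline]
    encoded_synthetic = [codes[value] for value in synthetic]
    return encoded_baseline, encoded_synthetic, codes
```

For a purely nominal feature the integer order is arbitrary, and since the KS statistic depends on ordering, the resulting score should be read with that limitation in mind.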
The test is run by an iterative algorithm, implemented in Python 3.8, that performs 18 comparisons, one for each pair of homologous features extracted from the baseline and from the distribution under test (the synthetically generated one), respectively.
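The iterative comparison can be sketched as follows. This is not the code used in this work: the feature names (`duration`, `dst_port`, `label`) are hypothetical, the KS statistic is computed with a small standard-library helper rather than any particular package, and averaging the per-feature scores into one overall value is an assumption, since the aggregation rule is not stated in the text.

```python
from bisect import bisect_right
from statistics import mean
from typing import Dict, Sequence, Tuple


def ks_complement(a: Sequence[float], b: Sequence[float]) -> float:
    """1 minus the largest gap between the empirical CDFs of a and b."""
    sa, sb = sorted(a), sorted(b)
    gap = max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in set(sa) | set(sb)
    )
    return 1.0 - gap


def compare_features(
    baseline: Dict[str, Sequence[float]],
    synthetic: Dict[str, Sequence[float]],
    exclude: Tuple[str, ...] = ("label",),
) -> Dict[str, float]:
    """One KS complement score per pair of homologous features."""
    scores = {}
    for name in baseline:
        if name in exclude:
            continue  # the class label selects the samples, it is not compared
        scores[name] = ks_complement(baseline[name], synthetic[name])
    return scores


# Hypothetical toy data: two descriptive features plus the class label.
baseline = {"duration": [1, 2, 3, 4], "dst_port": [80, 80, 443, 443], "label": [0, 0, 1, 1]}
synthetic = {"duration": [1, 2, 3, 4], "dst_port": [80, 80, 80, 443], "label": [0, 1, 1, 1]}

per_feature = compare_features(baseline, synthetic)
overall = mean(per_feature.values())  # assumed aggregation: simple average
```

Keeping the per-feature scores alongside the overall value makes it easy to see which features the generator reproduces poorly.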
The outcome of these tests for the data generated for the IT protocols is summarized in the histograms in Figure 2, where the distributions of the features extracted from the synthetic samples are shown in red and those of the features extracted from the baseline samples in blue.

KS complement test over IT samples.
The feature values were normalized to the scale [0,1] so that they could be compared and represented on the same graph. Once the values of the individual features are normalized, it is possible to plot the smoothing curve, that is, the curve that approximates the cumulative probability distribution of all features pertaining to the same population (baseline or synthetic traffic). The smoothing curves of the two compared distributions are shown in Figure 3, which highlights their almost complete overlap, indicating a high degree of ‘fitting’ of the population under test (synthetic traffic dataset, green curve) to the distribution of the baseline population (original training set, cream-colored curve).

Overall data distributions overlapping plot.
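The normalization and the cumulative curve described above can be illustrated with the sketch below. It rests on assumptions not stated in the text: each feature is rescaled with min-max bounds shared by the baseline and the synthetic data, the normalized values of all features of one population are pooled, and the curve is the empirical cumulative distribution evaluated on a fixed grid. All names and toy values are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple


def min_max(values: Sequence[float], lo: float, hi: float) -> List[float]:
    """Rescale values to [0, 1] using the given bounds."""
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


def normalize_pair(
    baseline: Dict[str, Sequence[float]], synthetic: Dict[str, Sequence[float]]
) -> Tuple[List[List[float]], List[List[float]]]:
    """Normalize each feature with bounds shared by both populations."""
    norm_base, norm_syn = [], []
    for name in baseline:
        lo = min(min(baseline[name]), min(synthetic[name]))
        hi = max(max(baseline[name]), max(synthetic[name]))
        norm_base.append(min_max(baseline[name], lo, hi))
        norm_syn.append(min_max(synthetic[name], lo, hi))
    return norm_base, norm_syn


def pooled_ecdf(columns: Sequence[Sequence[float]], grid: Sequence[float]) -> List[float]:
    """Empirical CDF of all normalized values of one population, on a grid."""
    pooled = sorted(v for column in columns for v in column)
    return [bisect_right(pooled, g) / len(pooled) for g in grid]


# Hypothetical toy populations with two features each.
baseline = {"duration": [2.0, 4.0, 6.0], "packets": [10.0, 30.0, 50.0]}
synthetic = {"duration": [3.0, 4.0, 6.0], "packets": [10.0, 20.0, 50.0]}

grid = [i / 4 for i in range(5)]
norm_base, norm_syn = normalize_pair(baseline, synthetic)
curve_base = pooled_ecdf(norm_base, grid)
curve_syn = pooled_ecdf(norm_syn, grid)
```

Plotting `curve_base` and `curve_syn` against `grid` gives two step curves whose overlap can be inspected visually, in the spirit of the figure above.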
Figure 4 shows the quantitative result of the KS test, expressed as the KS complement score, that is, as a measure of similarity between the distribution of features from the baseline and the distribution of features from the synthetic dataset: the score was 0.72 (72% similarity).
Statistical evaluation of synthetic data generation for OT protocols
As discussed for IT traffic, the population of features describing the communication flows of OT protocols is of mixed type (both quantitative and qualitative), and their values do not exhibit regular statistical properties: again, the feature values follow different and arbitrary distributions.
For this family of protocols, the evaluation of the KS test was narrowed down by excluding the flow identifier and the features indicating the traffic class, which are involved in identifying the samples to be compared. This left:
Sixty-two (62) features, out of 66 overall, for flows related to the MODBUS and MQTT protocols.
Sixteen (16) features, out of 20 overall, for flows related to the IEC 61850 MMS protocol.
In addition, since the useful features considered for all three OT protocols under analysis are of mixed type (both quantitative and qualitative), the qualitative ones were mapped onto ranges of discrete quantitative values, to make the comparison and the calculation of the KS test score consistent.
The test is run by an iterative algorithm, implemented in Python 3.8, that carries out all the comparisons between pairs of homologous features extracted from the baseline and from the distribution under test (the synthetically generated one), respectively.
The outcomes of these tests, performed by comparing the baselines with the data generated synthetically by the GAN for the OT protocols, are summarized in the histograms in Figures 5 to 7 for the MODBUS, IEC 61850 MMS and MQTT protocols, respectively, where the distributions of the features extracted from the synthetic samples are shown in red and those of the features extracted from the baseline samples in blue.

Quantitative evaluation of the KS test, represented as KS complement score.

MODBUS TRAFFIC Kolmogorov–Smirnov (KS) complement test.

IEC 61850 MMS TRAFFIC Kolmogorov–Smirnov (KS) complement test.

MQTT TRAFFIC KS complement test. MQTT: message queuing telemetry transport; KS: Kolmogorov–Smirnov.
The feature values were normalized to the scale [0,1] so that they could be compared and represented on the same graph. Once the values of the individual features are normalized, it is possible to plot the smoothing curve, that is, the curve that approximates the cumulative probability distribution of all features pertaining to the same population (baseline or synthetic traffic). The smoothing curves of the compared distributions are shown in Figures 8 to 10, which show a good overlap of each pair of distributions, indicating an acceptable degree of ‘fit’ of the population under test (synthetic traffic dataset, orange curve) to the distribution of the baseline population (original training set, blue curve).

Smooth curves for both distributions for MODBUS.

Smooth curves for both distributions for IEC 61850 manufacturing message specification (MMS).

Smooth curves for both distributions for MQTT. MQTT: message queuing telemetry transport.
Conclusions
In this work, we presented a pioneering study aimed at evaluating the feasibility and effectiveness of automatically generating synthetic but realistic network traffic flows to train ML-based cyber-defence systems.
These synthetic data were created using GAN-based solutions. The work also aimed at modelling a subset of the most common cyber-attacks against OTs, ITs and Linux-based systems.
The use of GANs, that is, deep neural networks based on adversarial generative logic, may prove to be of paramount importance for increasing cyber resilience and, specifically, for improving the robustness of AI-based defence systems for energy systems.
Experience with GANs, and with ‘anomaly’ detection activities in particular, is still limited in the energy infrastructure sector. Possible reasons include the complexity of the application domain; the high specificity of the data to be analysed and transformed into formats suitable for ML systems such as GANs; and, finally, the still limited availability, in both quantity and quality, of data that can be used to characterize attack scenarios that resemble each other but exploit specific vulnerabilities of operational and communication systems and protocols.
The experimental activity conducted in this work highlighted some of these critical issues, chief among them the limited availability of traffic samples of adequate size and characteristics for training a network that automatically generates synthetic traffic.
The activities conducted have provided evidence of the ‘physical feasibility’ and effective applicability of GANs to data augmentation problems, even in complex domains and scenarios such as those under consideration. They have also shown the scalability of the implemented federated solution, which made it possible to expand the range of scenarios covered with an effort that can be considered acceptable relative to the number of cases considered.
Furthermore, the results of the evaluation metrics are encouraging enough to warrant further tests, although they may be partly attributable to the careful selection of the input sample used in the experiment. The input sample used to train the GANs was carefully prepared to provide a fairly balanced dataset, evenly covering the 14 attack categories in addition to normal traffic. As significant future work, more tests should be performed with imbalanced datasets, to provide stronger evidence of the capability of C-GANs to cope with unbalanced input while still generating high-quality results.
Finally, the preliminary analysis of the results allowed us to identify some areas for improving the GAN-based data augmentation strategy and to suggest, as a future development, extending the types of attack scenarios considered: for example, expanding the dataset used to generate attacker traffic aimed at IT protocols by means of additional datasets.
Footnotes
Author contributions
Not applicable.
Funding
This work is original and has been supported by a joint collaboration between RSE S.p.A. and Cybhorus S.r.L., financed by the Research Fund for the Italian Electrical System under the Three-Year Research Plan 2022–2024 (DM MITE n. 337, 15 September 2022), in compliance with the Decree of 16 April 2018.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Correction (February 2025):
The article has been updated with funding statement since its original publication.
