Abstract
The development of intrusion detection systems (IDS) for the in-vehicle Controller Area Network (CAN) bus is one of the main efforts being taken to secure the in-vehicle network against various cyberattacks, which have the potential to cause vehicles to malfunction and result in dangerous accidents. These CAN IDS are evaluated in disparate experimental conditions that vary in terms of the workload used, the features used, the metrics reported, etc., which makes direct comparison difficult. Therefore, there have been several benchmarking frameworks and comparative studies designed to evaluate CAN IDS in similar experimental conditions to understand their relative performance and facilitate the selection of the best CAN IDS for implementation in automotive networks. This work provides a comprehensive survey of CAN IDS benchmarking frameworks and comparative studies in the current literature. A CAN IDS evaluation design space is also proposed in this work, which draws from the wider CAN IDS literature. This is not only expected to serve as a guide for designing CAN IDS evaluation experiments but is also used for categorising current benchmarking efforts. The surveyed works have been discussed on the basis of the five aspects in the design space – namely, IDS type, attack model, evaluation type, workload generation, and evaluation metrics – and recommendations for future work have been identified.
Introduction
The adoption of drive-by-wire technology means that today’s vehicles are equipped with as many as 150 Electronic Control Units (ECUs) [15], which control the different subsystems of a vehicle and enable various functionalities related to performance, safety, and comfort. The operations of a vehicle rely on the communication of these ECUs among themselves, which occurs on the internal vehicular network connecting these ECUs together. One such protocol for internal vehicular networks is the Controller Area Network (CAN), which is used in nearly all modern vehicles due to its simple, inexpensive, and reliable implementation. However, the CAN bus also lacks security features, namely encryption and authentication, which makes it vulnerable to a range of attacks that can be conducted through any of the various communication interfaces a modern vehicle is equipped with, such as Bluetooth, Wi-Fi, and cellular.
Therefore, the need to secure the CAN bus has resulted in the development of CAN intrusion detection systems (IDS), particularly because a CAN IDS can be implemented without affecting the real-time performance of a CAN bus in the resource-constrained environment of an in-vehicle network [2,91]. As such, there have been numerous intrusion detection methods proposed for the CAN bus over the years [4,41,91].
These works provide results of evaluation experiments to demonstrate the efficacy of their proposed methods. But across the CAN IDS literature, these evaluation methods vary in terms of the workload used, the features used, the parameters chosen, and the metrics reported. Since the evaluation of these CAN IDS is performed with different experimental setups and in different contexts, it is not known how they fare in comparison to each other. Furthermore, the evaluation experiments reported in these works may not be replicable because the implementations are not readily available or documentation is not sufficiently comprehensive to aid reproduction [2,39]. This has resulted in efforts to evaluate CAN intrusion detection methods on an equal footing using benchmarking frameworks and in comparative studies. Evaluating CAN intrusion detection methods in similar settings facilitates the understanding of the relative performance of the CAN IDS under test and, ultimately, the selection of the best CAN IDS for implementation.
Given the various ways in which intrusion detection methods can be evaluated [50], this work proposes an evaluation design space for the CAN IDS context that includes IDS types, attacks tested, evaluation type, workload source, and evaluation metrics. This design space enumerates the various evaluation methods found in the CAN IDS literature and is aimed at serving as a guide for planning future CAN IDS evaluation experiments. A survey of benchmarking and comparison studies of CAN IDS has also been conducted and the works discussed in terms of the proposed design space to understand current efforts at benchmarking and identify avenues for future work. The contributions of this paper can thus be summarised as follows:
Outlining a CAN IDS evaluation design space to aid the design of evaluation and benchmarking experiments.
Providing a comprehensive survey of benchmarking frameworks and comparative studies of CAN IDS in the current literature, as well as categorising and discussing the surveyed works according to the proposed design space.
Discussing trends in current benchmarking and comparison efforts and providing recommendations for future work.
Unlike prior surveys of CAN intrusion detection, this paper does not focus on surveying CAN IDS of different categories but rather on cataloguing the various methods used for evaluating and benchmarking CAN IDS and organising the information into an evaluation design space. We also survey benchmarking studies of CAN IDS; comparable surveys have so far been carried out only for conventional IDS for computer networks. Table 1 outlines how this work compares with related literature on CAN IDS and conventional IDS.
Comparison of present study with related works
The rest of this paper is organised as follows. Section 2 provides background information on the CAN bus and relevant threats. Section 3 provides an overview of related works in the literature, including the state of the art in CAN IDS and the benchmarking and evaluation of IDS for traditional computer networks. The proposed CAN IDS evaluation design space is described in Section 4. Section 5 presents the survey of benchmark frameworks and comparative studies of CAN IDS, while Section 6 discusses the surveyed works in terms of the five parts of the design space. Section 7 discusses opportunities for future work, while Section 8 concludes this paper.
Controller area network
CAN is a serial multi-master communication protocol and is one of the most commonly used for internal vehicular networks. It particularly finds use for subsystems such as powertrain and chassis, which are integral to the operation of a vehicle and include functionality such as transmission, braking, and steering [4].
The CAN protocol spans the physical and data link layers of the Open Systems Interconnection (OSI) model. It is a message-based protocol whereby ECUs (i.e., nodes) communicate information related to the current state of the vehicle via message broadcasts that are received by all the other nodes on the network. A CAN message mainly consists of an arbitration identifier (AID) and a message payload of up to 8 bytes, along with other fields like the data length code (DLC) and cyclic redundancy check (CRC). The AID is used in the bit-wise arbitration process in the event of collision, whereby AIDs with lower values have higher priority. CAN also provides error checking and error confinement mechanisms [22].
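The arbitration rule can be illustrated with a minimal Python sketch. It assumes 11-bit standard identifiers that are all distinct and models the bus as a wired-AND, where a dominant bit (0) overrides a recessive bit (1). The function name `arbitrate` and the example identifiers are illustrative and are not taken from any cited work.

```python
def arbitrate(aids, id_bits=11):
    """Return the AID that wins bit-wise arbitration among contending nodes.

    Assumes distinct identifiers of `id_bits` bits, transmitted MSB first.
    """
    contenders = set(aids)
    for bit in range(id_bits - 1, -1, -1):
        # Wired-AND bus: the level is dominant (0) if any contender sends 0.
        bus_level = min((aid >> bit) & 1 for aid in contenders)
        # Nodes that sent recessive (1) but read back dominant (0) back off.
        contenders = {aid for aid in contenders if (aid >> bit) & 1 == bus_level}
    # With distinct AIDs, exactly one contender remains.
    return contenders.pop()


winner = arbitrate([0x2A0, 0x130, 0x7FF])
print(hex(winner))
```

In this model the winner is always the numerically lowest AID among the contenders, which is the priority rule described above.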
CAN attack model
The CAN protocol was designed at a time when the internal vehicular network operated in isolation from external networks. Therefore, the protocol does not provide for security in its design and lacks encryption and authentication. Aspects of the CAN protocol, such as the arbitration mechanism, can also be exploited. These factors, along with the fact that modern vehicles are equipped with various interfaces for external communication, make the CAN bus vulnerable to a range of cyber attacks that can disrupt the operations of a vehicle and cause dangerous – even fatal – accidents.
Cho and Shin [17] provide an attack model for the CAN bus that is used in several works, such as [36,90]. This attack model assumes a compromised node on the CAN bus and uses the following terminology: a weakly compromised ECU is one that has been silenced or suspended by an attacker but cannot be used by the attacker to inject messages; on the other hand, a fully compromised ECU is one that the attacker has full control over and can use to inject messages into the CAN bus. This attack model classifies CAN bus attacks into the following categories:
Injection of messages that were previously seen and captured from the CAN bus, commonly termed as a
The attacks thus described range in their sophistication and in the methods that would be effective for detecting them. Attacks such as DoS and fuzzing do not require knowledge of the meaning and semantics of CAN broadcasts, which is often confidential, proprietary information and can vary among vehicle makes and models. They can also be detected by simpler methods, such as those that analyse the timing of AIDs. On the other hand, a masquerade attack is an advanced attack requiring greater skill to mount and would also require detection methods that analyse the payload of the messages [90].
Related works
CAN intrusion detection
Apart from the development of CAN encryption and authentication methods, the development of intrusion detection systems has been one of the main approaches being taken to secure the CAN bus [4]. This is because it is possible to implement a CAN IDS in the resource-constrained environment of an internal vehicular network without impacting CAN bus traffic or requiring changes to the CAN protocol [2,41,91].
A large number of CAN IDS have been proposed in the literature that vary in terms of the techniques used, features used, deployment location, etc., with several survey works enumerating and categorising them. Young et al. [93] provide an earlier survey of eight CAN IDS that are organised into message timing, signature-based, and anomaly-based techniques. Anomaly-based techniques encompass a variety of techniques such as cyberphysical, entropy, message rate, and CAN-field anomaly detection. In their survey of in-vehicle network (IVN) CAN IDS, Wu et al. [91] categorise attacks spanning different layers and discuss the characteristics of IVN IDS and constraints associated with their design and implementation. CAN IDS are organised into (i) fingerprints-based methods that operate on the bus level, (ii) parameter monitoring-based methods on the message level, (iii) information theoretic-based methods on the data-flow level, and (iv) machine learning (ML)-based methods. They further discuss datasets, tools, and evaluation metrics used for CAN IDS evaluation and present observations from comparing results of the surveyed CAN IDS before discussing trends and opportunities for future work.
Aliwa et al. [4] not only survey CAN IDS, but also include cryptographic schemes that aim to enable CAN message authentication and encryption. The approach taken to classifying CAN IDS is similar to that in the present work, with CAN IDS divided into signature-based IDS and anomaly detection-based IDS, depending on the technique. For anomaly detection-based IDS, they distinguish between statistical, ML, and physical characteristic-based approaches. Limitations of the state of the art are also identified – the lack of global attack signatures for signature-based techniques, the high computational resource requirements of ML-based approaches, and the difficulty of detecting low-volume attacks with entropy and throughput methods.
Al-Jarrah et al. [3] also review CAN IDS and organise them into flow-based, payload-based, and hybrid IDS. Furthermore, the features and feature selection methods used for detection, datasets used, targeted attack types, evaluation metrics, and benchmark models used in these works are examined and the challenges in each of these aspects are identified. This survey is extended by Nappi [56], who uses the same classification scheme as in [3] but reports more recently published CAN IDS works. The author compares the results of this survey with those in [3] and zeroes in on the challenge of reducing detection latency by providing a theoretical exploration of hardware solutions based on Field Programmable Gate Arrays (FPGA).
Rajapaksha et al. [65] focus on reviewing only AI-based CAN IDS and provide a taxonomy for classifying them. This taxonomy distinguishes between IDS that use only AID, only payload, the entire CAN frame, and physical features. This survey includes a total of 102 works that employ supervised and unsupervised algorithms and include traditional ML, deep learning, sequence learning, and hybrid models. An overview of the feature selection techniques employed in the surveyed works and of publicly available CAN intrusion datasets is provided before findings and opportunities for future work are discussed.
Given the increasing numbers of not just CAN IDS works but also surveys thereof, Karopoulos et al. [41] draw on these numerous surveys and various classification schemes to propose a unified meta-taxonomy for CAN IDS that considers different characteristics. It distinguishes between network- and host-based CAN IDS depending on location; signature-based, anomaly-based, and hybrid IDS based on technique; single- and cross-layer IDS depending on the OSI layer data used; and active and passive IDS depending on how the IDS reacts to intrusions. Recent CAN IDS are thus catalogued according to this taxonomy. This work also covers datasets and simulation tools not just for CAN, but also other in-vehicle network technologies such as Ethernet, GPS, Wi-Fi, and Bluetooth.
As highlighted in Table 1, these works survey IDS for vehicular CAN and invariably discuss evaluation methods, datasets, and evaluation metrics. However, the present work differs from these by surveying CAN IDS benchmark frameworks and comparative evaluation works to understand current approaches and challenges to CAN IDS evaluation and benchmarking. Such surveys have been carried out in the literature on conventional computer network IDS; these works are also listed in Table 1 and described in detail in the following subsection.

CAN IDS evaluation design space.
Depending on the technique used for intrusion detection, we adopt the following classification of CAN IDS (shown in Fig. 1), which is similar to that proposed in [4] and the classification by type proposed by Karopoulos et al. [41] in their meta-taxonomy. For each category, we cite representative works and guide the reader to more detailed surveys of CAN IDS such as [3,4,91,93]:
Apart from timing and frequency, other statistical CAN IDS use features such as AIDs and message payloads for intrusion detection. Marchetti and Stabili [48] identify that normal CAN bus traffic contains recurring sequences of AIDs, and the occurrence of any unusual transition between AIDs is indicative of an attack. This approach is similar to the graph-based approach taken in [36], whereby a graph representing valid AID transitions is built from normal CAN traffic, and the chi-square test is applied to features derived from this graph for anomaly detection. The entropy of AIDs and message payloads has also been used as a means to detect anomalies in CAN traffic, in [9,49,55]. In order to detect attacks involving manipulation of data fields, an IDS based on calculating Hamming distances between consecutive payloads of the same AID has also been proposed [76], which is found to be effective against fuzzing attacks but not attacks involving the injection of a previously recorded message sequence (replay attack).
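The idea of flagging unusual AID transitions can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the algorithm of [48] or [36]: it learns the set of consecutive AID pairs from an attack-free trace and reports any pair that was never seen. The function names and the AID values are hypothetical.

```python
def learn_transitions(benign_aids):
    """Collect every (previous AID, next AID) pair seen in attack-free traffic."""
    return set(zip(benign_aids, benign_aids[1:]))


def unseen_transitions(aids, allowed):
    """Return indices of messages reached via a transition absent from `allowed`."""
    return [i for i in range(1, len(aids)) if (aids[i - 1], aids[i]) not in allowed]


# Hypothetical benign trace with a recurring AID sequence.
benign = [0x100, 0x200, 0x300, 0x100, 0x200, 0x300, 0x100]
allowed = learn_transitions(benign)

# A frame with AID 0x7FF is injected between 0x200 and 0x300.
observed = [0x100, 0x200, 0x7FF, 0x300, 0x100]
alerts = unseen_transitions(observed, allowed)
print(alerts)
```

Note that a single injected frame produces two unseen transitions, one into and one out of the injected AID. A replayed sequence that follows only valid transitions would not be flagged by this sketch.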
Statistical intrusion detection methods are effective in detecting only those attacks that impact the features being analysed. As such, while timing- or frequency-based CAN IDS can be useful for detecting fabrication and suspension attacks, they may be unsuitable for masquerade attacks, which manifest as manipulated data fields. Such IDS are nevertheless lightweight compared to machine learning methods and can be easily implemented in vehicle-grade ECUs for real-time intrusion detection.
CAN intrusion detection has been treated as both a supervised and an unsupervised learning problem. While supervised learning methods require labelled datasets for training the detection models, unsupervised methods require only benign data with no attack traffic. Unsupervised methods can thus also detect novel, unknown attack types, as opposed to supervised learning methods, which can detect only the attack types they have been trained with. A detailed survey of ML-based CAN IDS can be found in [65].
CAN IDS can also be divided into two categories depending on the layer of the OSI model on which they operate:
As discussed in [87], the nature of the CAN in-vehicle network – characterised by limited computing resources, real-time response requirements, and a lack of sender and receiver identification – makes the design of CAN IDS different from those meant for computer networks. As such, a CAN IDS should ideally have low resource requirements, be able to detect and report attacks immediately, and be able to process the large number of CAN messages that are generated on the bus. In terms of attack detection accuracy, a CAN IDS should have a low false negative rate as well as a low false positive rate. Since CAN bus communications are used to control safety-critical subsystems in a vehicle, a low false negative rate is required to ensure that as many attacks as possible are detected. A low false positive rate is also required so that the CAN IDS does not raise an impractically large number of false alarms. ML-based IDS methods should also be designed to resist evasive, adversarial attacks.
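The two error rates mentioned above can be made concrete with a short Python sketch. The confusion-matrix counts and the bus load of 2000 messages per second are made-up values chosen only for illustration; the function name `detection_rates` is likewise hypothetical.

```python
def detection_rates(tp, fp, tn, fn):
    """Compute (false positive rate, false negative rate) from confusion counts."""
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Hypothetical per-message results of one evaluation run.
fpr, fnr = detection_rates(tp=90, fp=5, tn=995, fn=10)

# Assumed benign bus load, used to turn the FPR into an alarm rate.
messages_per_second = 2000
false_alarms_per_second = fpr * messages_per_second

print(f"FPR={fpr:.3f}, FNR={fnr:.3f}, false alarms/s={false_alarms_per_second:.1f}")
```

Under these assumed numbers, a false positive rate that looks small per message still translates into several false alarms per second, which is why both rates need to be kept low.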
We find a larger number of works that utilise Layer 2 features like AIDs and data fields for intrusion detection, as opposed to physical characteristics-based CAN IDS, which are still a burgeoning area of study. Al-Jarrah et al. [3] note the same and find that most works focus on evaluating detection capability by reporting security-related metrics and do not report performance metrics like detection and training time. They also identify a lack of benchmark models and benchmark datasets for CAN IDS evaluation. Nappi [56] finds that more CAN IDS works now contextualise their findings through comparison with some benchmark models. However, there still remains a lack of standard benchmark models, and works that report detection latency are still in the minority.
A brief overview of prior work in CAN IDS evaluation frameworks has been provided in [2], whereby prior evaluation frameworks [23,25] have been reviewed to highlight key features and distinguish the proposed framework. Apart from these, there have not been any comprehensive surveys of benchmarking and comparison efforts for CAN IDS, to the best of our knowledge.
Similar reviews of comparative studies have been carried out for conventional computer network IDS, which mainly focus on the comparison of ML methods and have been included in Table 1. Panigrahi et al. [62] note that central to the research into the usage of ML for intrusion detection is the selection of the most appropriate classifier for building IDS. Therefore, they provide a review of comparative studies that examine supervised learning methods for network intrusion detection, summarising the classification models evaluated, datasets used, evaluation metrics reported, and findings of the studies. An analysis of a total of 54 classifiers from six categories of classification models has also been presented, with 13 metrics related to detection capability reported. Similarly, Almomani et al. [5] also enumerate comparative studies of ML classifiers for network intrusion detection and present an evaluation of 10 supervised learning methods, reporting accuracy, precision, and F1-score. Finally, Kilincer et al. [43] conducted a survey of ML approaches to network intrusion detection by focusing on five datasets that are most commonly used for network IDS research. The authors note that network IDS studies are generally limited to a few datasets, examine only one or few classification methods, and examine only a few attack types. To address these issues, classical ML models like SVM, kNN, and DT have been developed and evaluated using the five datasets as benchmark models. Results of experiments are compared against prior work utilising the same datasets, thus contextualising past results.
Conventional computer network intrusion detection literature is surveyed by Milenkoski et al. [50] to examine methods employed in the evaluation of IDS, and an evaluation design space is proposed in an effort to categorise these common methods. A design space is described by Baum et al. [11] as a “multidimensional space of design choices,” consisting of a set of relevant dimensions that can be used to classify and describe entities in a specific domain. The IDS evaluation design space proposed by Milenkoski et al. [50] consists of three elements: workload, metrics, and measurement methodology. Workloads for evaluating IDS are classified as benign, malicious, and mixed, based on the presence of attacks in the workload. The authors also distinguish between workloads in executable forms for live testing of IDS and trace forms for later replay. Metrics are classified as security-related and performance-related metrics. The measurement methodology part of the design space identifies IDS properties that are of interest and the workload and metrics that are employed to evaluate these properties. This evaluation design space not only serves as a basis for categorising the literature but is also aimed at facilitating the planning of IDS evaluation exercises. The present work thus attempts to propose a design space for the context of CAN IDS evaluation, not only to categorise current comparative evaluation methods but also to provide a guideline for designing IDS evaluations for CAN, which differs from conventional computer networks in features and complexity.
Comparison with related works
Our work aims to contribute to the CAN IDS literature by proposing an evaluation design space as well as by surveying benchmarking and comparative evaluation efforts for CAN IDS, both of which are thus far available only for computer network IDS. Table 1 highlights the points of difference between this and related works. Like the surveys in [3,4,41,56,65,91,93], the present work concerns itself with intrusion detection for CAN, as opposed to conventional computer networks. Since our focus is on benchmarking and evaluation, we discuss aspects of evaluation such as IDS classification, evaluation approaches, datasets, attack types, and evaluation metrics, as done by other IDS surveys. However, unlike CAN IDS surveys that enumerate CAN intrusion detection techniques, we survey benchmarking and comparative evaluation studies of CAN IDS. While such work has been done for conventional computer network intrusion detection methods [5,43,62], there is a lack of such studies for CAN IDS that needs to be addressed given the large number and variety of CAN IDS in the literature and the numerous recent efforts at benchmarking these. We do not perform and report new benchmarking experiments as in [5,43,62], but we organise and discuss the surveyed works using our evaluation design space for CAN IDS, which is comparable to the design space proposed by Milenkoski et al. [50] for computer networks, to understand current trends in benchmarking and evaluation techniques. Finally, we use the insights gathered from compiling the design space and surveying the benchmarking studies to identify avenues for further work, as has been done in both CAN and computer network IDS works.
CAN IDS evaluation design space
Similar to the guidelines for evaluating conventional IDS proposed in [50], we understand that planning a CAN IDS evaluation study should begin with the identification of the goals of the study and the associated constraints. The goal of an evaluation study is usually the examination of the detection capability and/or the performance aspects of one or several CAN IDS. On the other hand, the main constraints in the evaluation of CAN IDS are the availability of resources such as vehicles, testbeds, and other tools, as well as access to requisite data and information such as proprietary CAN database files that contain rules to decode CAN messages. Consideration of these goals and constraints should then inform the design of the evaluation study.
Our proposed CAN IDS evaluation design space is not only aimed at cataloguing the various evaluation methods available but is also meant to provide a way to describe any CAN IDS evaluation study completely. To do so, we have divided the design space into five essential components, summarised in Fig. 1: the IDS types being evaluated, the attacks considered, the evaluation type, the workload being used, and the evaluation metrics being reported. The choices that should be made for each component thus depend on the goals and constraints of the study, as well as on the choices made for related components. Choosing to report the accuracy, precision, and recall metrics to describe detection capability is an example of a goal influencing the evaluation metric choice; while choosing online tests on a lower-cost testbed instead of a real vehicle represents a choice of evaluation type resulting from a constraint. The decision to include only fabrication and suspension attacks for testing timing-based CAN IDS is a further example of the choice for one component (CAN IDS type) influencing that for another component (attack types).
The remainder of this section describes in further detail each component of the CAN IDS evaluation design space and how consideration of goals, constraints, and related components influences the choices for each component.
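As a rough illustration of how the five components can describe a study completely, the following Python sketch records one evaluation as a small data structure. The class and field names are our own and are not part of any cited framework; the example values follow the timing-based IDS example given above.

```python
from dataclasses import dataclass, field
from enum import Enum


class EvaluationType(Enum):
    OFFLINE = "offline"
    ONLINE = "online"


@dataclass
class EvaluationStudy:
    """One CAN IDS evaluation described along the five design-space components."""
    ids_types: list
    attacks: list
    evaluation_type: EvaluationType
    workload: str
    metrics: list = field(default_factory=list)

    def missing_components(self):
        """Return the names of components that have been left empty."""
        values = {
            "ids_types": self.ids_types,
            "attacks": self.attacks,
            "workload": self.workload,
            "metrics": self.metrics,
        }
        return [name for name, value in values.items() if not value]


# Hypothetical study: a timing-based IDS tested offline on a collected log.
study = EvaluationStudy(
    ids_types=["statistical (timing-based)"],
    attacks=["fabrication", "suspension"],
    evaluation_type=EvaluationType.OFFLINE,
    workload="real benign traffic with real attacks",
    metrics=["accuracy", "precision", "recall"],
)
print(study.missing_components())
```

A record of this kind makes an incomplete description easy to spot: a study that omits its metrics, for example, would be reported by `missing_components`.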
IDS type
As described in Section 3.1, a CAN IDS can be either signature-based, anomaly-based, or hybrid, depending on the detection technique used. Anomaly-based methods are further divided into statistical and ML-based methods. Depending on the features used and the OSI layer they operate on, CAN IDS can either be a data link layer or a physical layer IDS.
The classification of a particular CAN IDS informs the type of data or workload required for its evaluation – while a physical layer CAN IDS would require Layer 1 data, a data link layer CAN IDS would require logs containing CAN message frames. As a further example, a CAN IDS based on analysis of AID sequences would require only the AIDs of CAN broadcasts, while a CAN IDS that incorporates timing-based features would require high precision message timestamps.
The detection technique employed by a given CAN IDS also informs the types of attacks that can be detected by the IDS and are relevant for inclusion in its assessment. This is because different attacks manifest as changes in different features of CAN bus traffic, and CAN IDS differ in the features used. Timing and frequency-based CAN IDS [14,52,60,74,93] can be good at detecting fabrication and suspension attacks, which alter the timing and frequencies of AIDs, but ineffective for masquerade attacks, which are observed as manipulated payloads. Masquerade attacks can instead be detected by CAN IDS that analyse the data fields of CAN messages, such as the Hamming distance CAN IDS [76] or ML-based CAN IDS [47]. Physical characteristics-based CAN IDS, which fingerprint ECUs using physical features that are difficult to spoof, are capable of identifying malicious nodes and can thus detect not only fabrication and suspension attacks but also masquerade attacks [17,70].
Evaluation type
This part of the CAN IDS evaluation design space identifies two types of evaluations: offline and online evaluations. In an offline evaluation, a CAN IDS is used to analyse a CAN bus log or dataset that has already been collected from a real or simulated CAN bus. This is opposed to an online assessment, where a CAN IDS performs real-time analysis of CAN bus data from a real vehicular CAN bus, testbed, simulation, or data log replay. The physical characteristic-based CAN IDS proposed in [17,18] are evaluated in CAN bus prototypes with nodes consisting of Arduino boards and CAN shields, and with a real vehicle whereby the IDS is implemented on a node connected to the CAN bus via the OBD-II port. Ujiie et al. [88] implement their rule-based CAN IDS on an ATMega162 microcontroller and test it against both a simulated CAN bus in Vector CANoe as well as with a real vehicle. Unlike these works, Desta et al. [24] evaluate their CAN IDS, which uses an LSTM model for AID sequence prediction, by replaying a CAN bus data log and using the SocketCAN API to perform detection.
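The difference between the two evaluation types can be sketched with a small log-replay harness in Python. This is a simplified stand-in that feeds frames to a detector callback one at a time and records per-frame processing time; it does not use SocketCAN or any CAN hardware, and the `replay` function, the whitelist detector, and the log contents are all hypothetical.

```python
import time


def replay(log, on_frame, speed=0.0):
    """Feed frames to `on_frame` one at a time, as an online test would.

    `log` holds (timestamp_s, aid, payload) tuples sorted by timestamp.
    `speed` is the replay speed-up factor (1.0 is real time); 0 disables waiting.
    Returns the per-frame processing times in seconds.
    """
    processing_times = []
    previous_ts = None
    for ts, aid, payload in log:
        if speed > 0 and previous_ts is not None:
            # Reproduce the original inter-frame gap, scaled by `speed`.
            time.sleep((ts - previous_ts) / speed)
        previous_ts = ts
        start = time.perf_counter()
        on_frame(ts, aid, payload)
        processing_times.append(time.perf_counter() - start)
    return processing_times


# Placeholder detector: alert on any AID outside a known set.
known_aids = {0x100, 0x200}
alerts = []


def detector(ts, aid, payload):
    if aid not in known_aids:
        alerts.append((ts, aid))


log = [(0.000, 0x100, b"\x00"), (0.010, 0x200, b"\x01"), (0.012, 0x7FF, b"\xff")]
times = replay(log, detector)
print(alerts, max(times))
```

An offline evaluation would apply the same detector to the stored log and report only the alerts, whereas the replay additionally yields timing information that relates to detection latency.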
Using real vehicles is advantageous in that they most closely resemble the real-world environment in which an in-vehicle CAN IDS would operate. However, it is relatively difficult to use real vehicles for CAN IDS assessments. Not only is it expensive to acquire and use a real vehicle for security testing, but mounting attacks like targeted ID, suspension, and masquerade on a real vehicular CAN bus can be difficult, time-consuming, and also pose a risk to passengers and bystanders [66,90]. To address these problems, a testbed for the purpose of online CAN IDS evaluation has been developed in [37], which uses the CARLA car simulator in combination with the Vector CANoe CAN bus simulator to generate realistic driving scenarios. The assessment of a clustering-based ML CAN IDS demonstrated inferior detection performance in the online assessment using the testbed compared to an offline experiment with a dataset collected from the same testbed, which highlights the importance of online assessment despite its drawbacks.
On the other hand, offline assessments with collected CAN bus logs (as well as online tests by replaying logs) can be performed repeatedly with relative ease to obtain statistically significant evaluation results. A large number of works, such as [14,47,49,57,74,76,96], use collected CAN bus logs to assess their proposed CAN IDS. These datasets are commonly collected either via the OBD-II port available in all vehicles or by tapping into the in-vehicle CAN bus. Offline evaluation is a good starting point to understand the detection capability in terms of accuracy, false positive rates, etc. of a CAN IDS before using online evaluations to understand performance aspects of the CAN IDS such as detection times. Using publicly available datasets further enhances the reproducibility of CAN IDS works and allows direct comparison with results from other CAN IDS assessments performed with the same datasets.
Workload
Workload can be described as the work that must be performed by a system and can be viewed as input to the system [29]. In the context of a CAN IDS, its workload comes from CAN bus traffic or measurements of physical characteristics. While CAN IDS datasets serve as the most common source of data to evaluate CAN IDS, in this paper we borrow the term ‘workloads’ from the wider conventional IDS literature to describe any input to a CAN IDS under test regardless of the source it originates from – which could be not just datasets but real-time CAN bus traffic or measurements of physical quantities from a vehicle, testbed, or simulation in an online test.
The workload used to evaluate an IDS may be generated in various ways [50]. For CAN IDS evaluation, we may distinguish between benign, attack-free workloads and malicious workloads containing attacks. Benign CAN IDS workloads may be from a real vehicle (real) or artificially generated (synthetic). Attack workloads can be obtained by conducting attacks on a real CAN bus (real attacks) or via simulation, which can be by manipulating a collected benign CAN trace to include attacks or by artificially generating malicious CAN workloads (simulated attacks). These various types of workloads are outlined in Fig. 2.

Fig. 2. CAN IDS workload types.
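To make the notion of a simulated attack workload concrete, the following minimal sketch merges fabricated frames into a benign trace. The `Frame` layout, the integer microsecond timestamps, the function name `inject_fabrication`, and all numeric values are assumptions made for illustration; they do not reproduce any of the cited datasets or tools.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Frame:
    timestamp_us: int        # capture time in microseconds
    can_id: int              # arbitration ID
    data: bytes              # payload (up to 8 bytes in classic CAN)
    is_attack: bool = False  # ground-truth label for later evaluation


def inject_fabrication(benign: List[Frame], can_id: int, data: bytes,
                       start_us: int, end_us: int, period_us: int) -> List[Frame]:
    """Return a new trace with fabricated frames added every period_us in [start_us, end_us)."""
    injected = [Frame(t, can_id, data, True)
                for t in range(start_us, end_us, period_us)]
    # Merge by timestamp; the benign frames themselves are left untouched.
    return sorted(benign + injected, key=lambda f: f.timestamp_us)


# Hypothetical benign trace: one ID sent every 10 ms for one second.
benign_trace = [Frame(i * 10_000, 0x123, bytes(8)) for i in range(100)]
# Simulated high-rate fabrication: ID 0x000 every 1 ms for 200 ms.
attack_trace = inject_fabrication(benign_trace, 0x000, bytes(8),
                                  start_us=200_000, end_us=400_000,
                                  period_us=1_000)
```

A merge of this kind ignores bus arbitration and any delay that injected frames would impose on legitimate traffic, which is one way in which a simulated sample may differ from a real attack trace.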
A number of publicly available CAN datasets have been published in recent years, which has made them a popular choice for the design and evaluation of CAN IDS [56]. The Hacking and Countermeasures Research Lab (HCRL) has published three CAN intrusion datasets [33,46,71], which contain benign CAN bus logs as well as logs of fabrication attacks such as DoS, fuzzing, and targeted ID conducted on a real in-vehicle CAN bus. These datasets have been used in [10,36,38,42,63,72] among others. Verma et al. [90], who also provide a comprehensive survey of CAN intrusion datasets, present the Real ORNL Dynamometer (ROAD) dataset, consisting of real benign samples and samples of fuzzing, targeted ID (flam delivery), and masquerade attacks. While the fabrication attacks were conducted on a real vehicle CAN bus, the masquerade attack samples were created by manipulating the targeted ID logs. This dataset has been used in [53].
The dataset provided in [26] consists of logs collected from both a CAN bus prototype and real vehicles. Unlike the datasets mentioned thus far, the attack samples are simulated: the benign data from vehicles has been augmented to create attack datasets (except for the attack samples collected from the prototype). This dataset has been used in works such as [14,73]. In a similar vein, the CrySyS lab has published real benign CAN bus logs and provided a log infector tool that can be used to manipulate benign logs to create masquerade attack samples. This tool has been used to create the attack samples used to evaluate the statistical CAN IDS in [30]. A completely synthetic dataset, SynCAN, with simulated targeted ID, suspension, and masquerade attacks is published by the authors of [34], who use it to evaluate their LSTM autoencoder-based CAN IDS. It has also been used to evaluate the autoencoder-based CAN IDS in [44,58]. Apart from ROAD, SynCAN is the only dataset providing translated signal values instead of raw CAN data fields.
Physical fingerprinting-based CAN IDS, unlike data link layer CAN IDS, cannot be evaluated using the aforementioned datasets. To address this, Foruhandeh et al. [28] have published a dataset consisting of voltage measurements from a real vehicle along with their Single-frame based Physical-Layer (SIMPLE) identification solution, which can detect intrusions and identify the sending ECU for each message. This dataset is also used in [45] for the detection of a hill-climbing style masquerade attack, whereby an attacker alters ECU fingerprints in an attempt to evade detection. Popa et al. [64] have also made available clock skew and voltage data from a total of 54 ECUs across 10 vehicles, which can be used for the development of physical layer CAN IDS.
Attack model
The attack model considered in this design space is the same as the one described in Section 2.2.
As discussed in Section 4.1, the attack types that can be detected by a CAN IDS depend on the features and techniques used by the CAN IDS. However, this is not the only consideration a researcher has to make when selecting attack types for CAN IDS assessment. Fabrication attacks are less complex than suspension and masquerade attacks and form the most common class of attacks found in the CAN IDS literature [90], which is reflected in the fact that real attack datasets are available for only these attack types [33,46,71,90]. In contrast, while CAN intrusion datasets with suspension and masquerade attacks are available, these are simulated attack samples created either from a purely synthetic benign dataset [34] or a real benign dataset [26]. The advantage of using real attacks over simulated attacks is that in the former case, the attacks are known to have caused an effect on the operations of the vehicle, i.e., the effects are physically verified. For example, while creating the ROAD dataset, the authors of [90] noted abnormal behaviour such as accelerator pedals becoming ineffective, false displays on the speedometer, and incorrect reverse light status. With simulated attacks, not only is it impossible to verify their physical effects, but the attack simulation method (such as manipulating benign CAN bus logs) may result in an unrealistic attack sample.
Another important consideration in selecting attack datasets for evaluation is the “difficulty” of the detection problem captured in the dataset [89]. Verma et al. [90] find that commonly used datasets such as [46] and [71] consist of unstealthy attacks with high-frequency injection of malicious messages, which can be detected by trivial timing-based detectors. Such datasets are less suitable for assessing more sophisticated CAN IDS that should be able to detect low-rate fabrication attacks [82,90].
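The following sketch suggests why high-frequency injection poses little challenge: a detector that only compares message inter-arrival times against a learned per-ID period already flags every injected frame in the toy trace. The function names, the `ratio` threshold, and the traces are hypothetical and do not correspond to any specific detector from the cited works.

```python
from collections import defaultdict


def learn_periods(trace):
    """Learn the mean inter-arrival time per ID from (timestamp_us, can_id) pairs."""
    last = {}
    gaps = defaultdict(list)
    for ts, cid in trace:
        if cid in last:
            gaps[cid].append(ts - last[cid])
        last[cid] = ts
    return {cid: sum(g) / len(g) for cid, g in gaps.items()}


def detect(trace, periods, ratio=0.5):
    """Flag frames that arrive much sooner than expected, or whose ID has no learned period."""
    last = {}
    flags = []
    for ts, cid in trace:
        if cid not in periods:
            flags.append(True)   # no baseline for this ID
        elif cid in last and ts - last[cid] < ratio * periods[cid]:
            flags.append(True)   # arrived too soon after the previous frame
        else:
            flags.append(False)
        last[cid] = ts
    return flags


# Hypothetical benign traffic: ID 0x100 every 10 ms.
benign = [(i * 10_000, 0x100) for i in range(50)]
periods = learn_periods(benign)

# Three frames injected 1 ms after a legitimate transmission of the same ID.
injected = [(201_000, 0x100), (211_000, 0x100), (221_000, 0x100)]
attacked = sorted(benign + injected)
flags = detect(attacked, periods)
```

In this toy setting the three flagged frames are exactly the injected ones. An attacker who injects at a rate close to the legitimate period would not cross this threshold, which is consistent with the observation that low-rate attacks call for more capable detectors.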
Since the usage of real test vehicles is outside the reach of many researchers, there is a need for comprehensive CAN intrusion datasets consisting of real attack traces of all known attack types, ranging from simple, easily detected attacks to complex, stealthy attacks, for robust CAN IDS evaluation and benchmarking. This has become necessary for physical layer CAN IDS as well, most of which continue to be evaluated with real vehicles and testbeds.
Evaluation metrics
As per [50], the metrics selected and reported while evaluating an IDS should depend on the properties of the IDS being assessed. Metrics can be security-related metrics, which quantify attack detection capability, or performance-based metrics, which quantify non-functional aspects of an IDS such as resource consumption. The authors of [50] also distinguish between basic security metrics, which quantify individual attack detection properties, and composite security metrics, which combine basic metrics.
Security-related metrics allow assessment of attack detection properties such as attack detection accuracy and resistance to evasive attacks. These include classification accuracy, precision, recall, F1-score, and false positive rate, which have been reported in [6,24,44,74,93,96] among others. Basic security metrics such as true positive rate, false positive rate, false negative rate, true negative rate, precision, and recall must be reported and analysed together to understand the performance of a given CAN IDS [50]. Some works choose to report the receiver operating characteristic (ROC) curve and area under the curve (AUC) [47,60], which are considered composite security metrics, to indicate the detection performance of an IDS at multiple operating points.
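As a brief illustration of how the basic security metrics relate to one another, the sketch below derives them from per-message labels. The function name `basic_metrics`, the label encoding (1 for attack, 0 for normal), and the example labels are assumptions made for this example only.

```python
def basic_metrics(y_true, y_pred):
    """Compute basic detection metrics from binary labels (1 = attack, 0 = normal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    def ratio(num, den):
        # Report 0.0 when a rate is undefined (empty denominator).
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # identical to the true positive rate
    return {
        "tpr": recall,
        "fpr": ratio(fp, fp + tn),
        "fnr": ratio(fn, fn + tp),
        "tnr": ratio(tn, tn + fp),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# Hypothetical labels for ten messages, three of which are malicious.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = basic_metrics(y_true, y_pred)
```

Here the accuracy of 0.8 hides the fact that one of the three attack messages was missed, which the true positive rate of 2/3 makes visible.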
An important problem to consider when using CAN datasets for assessment is the class imbalance in such datasets, whereby malicious messages that are a part of attack traffic are usually present as a very small percentage of the total captured CAN bus traffic. This is further reason to not rely on a single metric like accuracy, which would yield high values for a detector that only predicts the normal class for a highly imbalanced dataset, and instead use a suite of security metrics to understand the ability of the IDS to distinguish between normal and attack traffic. While some may choose to counter the imbalanced dataset problem by reporting balanced accuracy, Chicco et al. [16] recommend the use of the Matthews Correlation Coefficient (MCC), which has been described as a single metric that summarises the performance of a binary detector. MCC is especially suited for CAN intrusion detection since it is equally important for a CAN IDS to correctly classify both normal and attack traffic, i.e., to keep both false positive and false negative rates low. The MCC is reported alongside other security metrics in [57].
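The effect of class imbalance on these metrics can be illustrated with a small numerical sketch. The confusion-matrix counts below are invented for illustration, and the convention of reporting an MCC of 0 when its denominator vanishes is an assumption of this example.

```python
import math


def summary_metrics(tp, fp, tn, fn):
    """Return (accuracy, balanced accuracy, MCC) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    balanced_accuracy = (tpr + tnr) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is 0 when any marginal total is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, balanced_accuracy, mcc


# Hypothetical test set: 990 normal messages and 10 attack messages.
# Detector A always predicts "normal".
always_normal = summary_metrics(tp=0, fp=0, tn=990, fn=10)
# Detector B finds 8 of the 10 attacks at the cost of 20 false alarms.
detector_b = summary_metrics(tp=8, fp=20, tn=970, fn=2)
```

With these counts, detector A attains the higher accuracy (0.99 against 0.978) despite detecting nothing, whereas balanced accuracy and MCC both rank detector B higher.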
Considering the safety-critical nature and real-time requirements of the in-vehicle network, it may be argued that attack detection latency is an important metric that should be considered in assessing the attack detection capabilities of a CAN IDS [3,21,56]. Detection latency, or Time To Detection (TTD), has been defined as the time taken to classify a CAN message from the time it was received [56] and has been reported in [61]. Nichelini et al. [57] report Testing Time per Packet (TTP) as the ratio of the total detection time to the number of messages in their test dataset, which gives the average time taken by the CAN IDS to evaluate a single CAN message. Unlike these works, Sunny et al. [81] report best-case and worst-case computation time to examine if their proposed CAN IDS can be used for real-time evaluation of CAN messages, which can be published every 2 ms.
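A rough sketch of how such timing figures might be collected in an offline experiment is given below. The helper `time_detector`, the stand-in classifier `dummy_classify`, and the message format are hypothetical; summing per-message times only approximates a TTP-style average, and timings measured on a workstation do not carry over to an automotive microcontroller.

```python
import time


def time_detector(classify, messages):
    """Measure per-message classification time in seconds."""
    per_message = []
    for msg in messages:
        start = time.perf_counter()
        classify(msg)
        per_message.append(time.perf_counter() - start)
    return {
        "ttp": sum(per_message) / len(per_message),  # average time per message
        "best": min(per_message),                    # best-case computation time
        "worst": max(per_message),                   # worst-case computation time
    }


# Hypothetical stand-in for a CAN IDS: an allow-list check on the arbitration ID.
KNOWN_IDS = {0x100, 0x200, 0x300}


def dummy_classify(msg):
    can_id, _payload = msg
    return can_id not in KNOWN_IDS  # True means "flag as anomalous"


messages = [(0x100 + (i % 5) * 0x100, bytes(8)) for i in range(1000)]
stats = time_detector(dummy_classify, messages)

# Compare the worst case with the 2 ms message interval mentioned in the text.
meets_budget = stats["worst"] < 0.002
```

Reporting the worst case alongside the average matters for real-time use, since a single slow classification can exceed the message interval even when the average is well below it.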
Apart from detection latency, non-functional properties of a CAN IDS like resource consumption, performance overhead, and workload processing capacity are also of interest, particularly since CAN IDS are expected to be deployed in the resource-constrained environment of the in-vehicle network. A CAN IDS that is highly accurate in detecting attacks may still become impractical for implementation in the in-vehicle network if it is not able to process rapidly generated CAN bus traffic in time or requires significant computing resources to maintain quick response times. An analysis of computational complexity and memory requirement has been provided for the Hamming distance-based CAN IDS in [76]. Unlike works with offline assessment using datasets, memory footprint in kilobytes and inference time (similar to TTD) of the autoencoder-based detector in [44] have been measured by implementing the IDS on an automotive-grade microcontroller.
Survey of benchmark frameworks and comparative studies of CAN IDS
This section provides an overview of the evaluation frameworks and comparative studies of CAN IDS that have been detailed in the literature. The papers included in this study were published between 2017 and 2022 (September) and were selected by conducting a search on Google Scholar, IEEE Xplore, and the ACM Digital Library using the keywords “controller area network intrusion detection system benchmark”, “controller area network intrusion detection system evaluation”, “controller area network intrusion detection system testbed”, and “controller area network intrusion detection system comparative”. Papers were included in or excluded from this study on the basis of their abstracts.
The surveyed works differ in their scope in terms of the types of IDS evaluated, the attack types tested, and the metrics reported. Since the works mostly restrict themselves to particular types of CAN IDS, they have been categorised as those that evaluate statistical CAN IDS (listed in Table 2) and those that evaluate ML-based CAN IDS (listed in Table 3). The attack types that have been used for evaluation by the surveyed works are summarised in Table 4, while the reported metrics are provided in Table 5.
Table 2. Statistical CAN IDS evaluated in surveyed CAN IDS benchmark frameworks and comparative studies.
Table 2 notes: Signature-based CAN IDS. ML-based CAN IDS. Details of CAN IDS evaluated are not disclosed.
Table 3. ML-based CAN IDS evaluated in surveyed CAN IDS benchmark frameworks and comparative studies.
Table 3 notes: These studies model intrusion detection as a multi-class classification problem where the attack type is predicted.
Table 4. Workloads used and attack types covered in surveyed CAN IDS benchmark frameworks and comparative studies.
Table 4 notes: Only the clean dataset has been used in [77] to simulate attacks. The attack datasets available are not used in [77]. This dataset consists of only non-anomalous data and has been created for behaviour analysis. In [12], it has been used to evaluate the unsupervised learning methods only. Only benign dataset available.
Table 5. Evaluation metrics reported in surveyed CAN IDS benchmark frameworks and comparative studies.
Table 5 notes: False positive rate. Receiver operating characteristic. Precision-recall. Only for selected IDS under test.
The benchmark framework by
The study by
ML-based intrusion detection methods
The comparative study of
The benchmark framework provided by
Like in [13],
Discussion
In this section, the reviewed benchmarking frameworks and comparative studies are further categorised and discussed in terms of the proposed design space to understand current efforts as well as opportunities for future work.
IDS type
Regarding the type of CAN IDS, anomaly-based intrusion detection methods are clearly seen as the way forward for CAN intrusion detection – anomaly-based methods are the only type of IDS that was found to have been benchmarked and compared in the surveyed works. In the wider CAN IDS literature, anomaly-based IDS are indeed the most common type of IDS proposed [4,41]. Signature-based methods are not very common since complete CAN attack signatures do not exist, and such databases are difficult to create as the implementation of CAN messaging differs among vehicles of different models and makes [4].
Among the anomaly-based methods, each of the surveyed studies restricts itself to one of two types of IDS: statistical methods or ML-based methods. The exception to this is the evaluation framework in [2], where a neural network IDS is included along with the other statistical methods. Papers benchmarking statistical methods [2,25,39,77] include a variety of IDS that use different types of features and techniques for attack detection, including timing and frequencies of AIDs, AID sequences, Hamming distances in payload, and entropy of CAN messages. Meanwhile, studies of ML-based methods have applied algorithms like SVM, decision trees, Isolation Forest, ensemble algorithms, as well as various types of neural networks.
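Two of the statistical features named above can be sketched in a few lines. The functions below compute the Hamming distance between two payloads and the Shannon entropy of the AIDs observed in a window; the function names, window contents, and ID values are illustrative assumptions, and the surveyed IDS may define these features differently.

```python
import math
from collections import Counter


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length payloads."""
    if len(a) != len(b):
        raise ValueError("payloads must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def id_entropy(ids) -> float:
    """Shannon entropy (in bits) of the arbitration IDs in a window."""
    n = len(ids)
    counts = Counter(ids)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Consecutive payloads of the same ID that differ in two bits.
distance = hamming_distance(b"\x00\xff", b"\x01\xfe")

# Hypothetical windows: four IDs in equal proportion versus a window
# dominated by an injected high-priority ID.
normal_window = [0x100, 0x200, 0x300, 0x400] * 2
flooded_window = [0x000] * 6 + [0x100, 0x200]
normal_entropy = id_entropy(normal_window)
flooded_entropy = id_entropy(flooded_window)
```

In this toy example the flooded window has a visibly lower entropy than the normal one, which is the kind of deviation an entropy-based detector would compare against a baseline learned from benign traffic.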
Since intrusion detection methods are not equally effective at detecting all attack types, some studies focus on particular intrusion detection techniques or attack types. Blevins et al. [13] benchmark only timing-based intrusion detection methods, which are computationally inexpensive but are effective only against attacks that alter the timing of CAN messages on the CAN bus. This is why only fuzzing and targeted ID attacks are used for evaluation and not masquerade attacks, which do not impact message timing. Another work [83] examines the detection of fuzzing attacks in particular, which can be difficult to detect, using ensemble learning algorithms that use timestamps, AIDs, and payload data as features.
It is also observed that while most of the surveyed works treat anomaly detection as a binary classification problem, two of these – [54] and [59] – model intrusion detection as a multi-class classification problem aiming to classify attack data into attack types. Attack classification can indeed be a useful addition to an IDS since the type of attack can inform attack mitigation responses.
Finally, we find no benchmarking or comparative evaluation study that includes physical characteristics-based CAN IDS. Review of physical layer CAN IDS reveals that they are commonly evaluated in online tests with testbeds and real vehicles. These evaluation methods are not common in the surveyed works, which take advantage of publicly available datasets. This indicates a need for comparative evaluation frameworks for online testing as well as datasets (such as [28,64]), which would encourage similar benchmarking studies of physical layer CAN IDS.
Evaluation type
If we distinguish between offline and online evaluations, almost all of the works surveyed have performed benchmarking with offline evaluations, whereby the assessed CAN IDS analyse collected CAN bus datasets. This can be explained by the proliferation of publicly available datasets covering common attack types in the literature, such as fabrication attacks. With comprehensive documentation, offline experiments using datasets can be replicated easily. Offline assessments can also be conveniently used to evaluate multiple IDS on an equal footing against the same dataset under equivalent test conditions.
On the other hand, the assessment methodology in [78] is the only one among the surveyed works to include online evaluations of CAN IDS, whereby IDS products have been integrated into test vehicles for evaluation in different scenarios. As noted in Section 4.2, online tests are relatively difficult to perform with a real vehicle due to costs and safety risks. Notably, we did not find any studies that utilise testbeds (such as that proposed in [37]) or simulations (such as Vector CANoe, as used in [88]) for the purpose of benchmarking, both of which are options that provide a more realistic environment for assessing detection capability and performance without the disadvantages of using real vehicles. This indicates a need for mature frameworks for real-time assessment and benchmarking of CAN IDS that can not only be implemented without the costs and risks associated with test vehicles but also allow repeatable assessments of CAN IDS under identical experimental conditions.
Workload
As mentioned in the previous subsection, the most common method used for comparative evaluation of CAN IDS is using datasets for offline experiments. The popularity of real CAN bus datasets in the broader CAN literature is reflected among the surveyed benchmarking frameworks and comparative studies, wherein all the datasets used are derived from the CAN bus of real vehicles. However, they differ in whether the attacks themselves are real or simulated. Half of the works surveyed use publicly available datasets from HCRL [33,40,46,71], which consist of real attack traces. Another dataset including real attacks with physically verified effects is the ROAD dataset [90], which was more recently published and has been used in only two of the surveyed works [13,83]. Other than these, the attack datasets that are used have been created by modifying logs of real CAN bus traffic to simulate attacks on the CAN bus. One of the earliest works [39] performed its comparative evaluation with an unpublished dataset, for which benign data collected from a real car is replayed in a CAN bus simulation software, where attack scenarios are conducted to generate attack samples. The study by Taylor et al. [85] is unique in that it proposes a method for generating attack samples using a parametrised attack framework. While it may be argued that approaches such as these [39,85] may not produce realistic attack samples, these methods provide a customisable way of generating attack samples, enabling assessment against a variety of attack scenarios – from high-frequency message injection attacks to low-rate injection attacks and masquerade attacks.
Unlike the remaining studies, Stachowski et al. [78] perform online evaluations on real test vehicles, the CAN buses of which are subjected to different types of targeted ID attacks generated using attack scripts.
Attack model
Among all the works surveyed, fabrication attacks are the most common attack type tested, which is due to the fact that fabrication attacks comprise the majority of attacks found in CAN IDS literature, and there are several CAN datasets that provide real CAN bus logs with real attack samples [90]. On the other hand, few works evaluate intrusion detection methods against suspension and masquerade attacks. Only one study [25] performed IDS evaluations against all attack types in our attack model, but with simulated attack datasets. This highlights the need for realistic samples of these attack types.
The benefit of using real attacks over simulated attacks is realised in the online assessments in [78], where the physical effects of attacks like doors locking could be observed and verified and the evaluated CAN IDS could be confirmed to be effective in realistic attack scenarios. This is also the case for studies that use datasets like [71,90] that were collected from a real vehicle undergoing cyberattacks. However, real attack datasets that have been used in the surveyed works have their limitations. For instance, the dataset [71] used in [2,25] does not provide a difficult detection challenge since it consists of fabrication attacks that can be easily detected by trivial IDS methods and may not sufficiently test IDS methods that are capable of detecting more subtle attacks [90].
Apart from the attack classes identified in Section 2.2, ML-based CAN IDS have also been demonstrated to be vulnerable to adversarial attacks [19] that are designed to evade detection. The benchmark framework in [23] also evaluates the IDS under test against adversarial attack samples generated using Generative Adversarial Networks (GAN) and heuristics.
Evaluation metrics
Of the evaluation metrics described in Section 4.5, we find that the surveyed studies focus on examining attack detection accuracy and present security-related metrics almost exclusively. These metrics include classification accuracy, precision, recall, false positive rate (FPR), and the receiver operating characteristic (ROC) curve. In order to avoid the pitfalls associated with relying on only a single metric like classification accuracy [79], the majority of the works report multiple metrics to provide a complete picture of attack detection capability. The study in [83] also reports balanced accuracy in addition to other metrics to account for class imbalance in attack datasets. Furthermore, as noted in Section 6.4, resistance to evasive attacks is evaluated in the benchmark framework provided in [23].
In contrast, performance metrics are not widely reported in the surveyed studies. None of the surveyed works report measured detection latency; only one work [13] provides an analysis of detection latency in terms of the computational time of the detection algorithms. Detection speed and latency to filter benign messages are included in the suite of metrics in the online assessment by Stachowski et al. [78], but they are ultimately not recorded and reported. Another study [83] reports testing execution time in addition to training time as a means of indicating performance. This trend can be explained by the fact that, unlike the security-related metrics that can be obtained in offline experiments with datasets, obtaining precise measurements of detection latency requires some form of online testing, such as simulations or testbeds. This further highlights the necessity of online testing for benchmarking CAN IDS. We also find a limited examination of non-functional properties such as resource consumption, performance overhead, and workload processing capacity.
Recommendations for future work
In compiling our evaluation design space and conducting the survey of benchmarking and comparative evaluation studies, we found several concerns that lead to opportunities for future work. First, we observed a lack of studies that incorporate CAN IDS from different categories. Only [2] and [26] include an ML-based and a signature-based CAN IDS, respectively, in their studies of otherwise statistical CAN IDS. We were also not able to find studies that benchmark physical layer CAN IDS along with data link layer CAN IDS, or indeed, a study with only physical layer CAN IDS. Secondly, offline evaluations with CAN datasets were found to be the most prevalent methodology employed for benchmarking, with only one study [78] performing CAN IDS evaluations on real vehicular networks. However, this study focuses on presenting an assessment framework and does not disclose the mechanics of the CAN IDS being evaluated, thus providing no insight into detection techniques and their effectiveness against the selected attack types. We also observe that more complex attack types like suspension and masquerade are rarely considered in the surveyed works, while in the broader CAN IDS literature, there are numerous CAN IDS that consider the detection of masquerade attacks [53,57,96]. Finally, we find that the reviewed studies concentrate on the assessment of security-related properties, and most do not provide an assessment of performance factors such as detection latency or resource overhead. In light of these issues, we provide the following recommendations for future work:
Comprehensive benchmarking datasets. Benchmarking datasets collected from real vehicles and consisting of real attack traces are crucial for the development and evaluation of CAN IDS. As opposed to synthetic datasets and simulated attacks, real traces with real attacks capture the dynamics of CAN bus traffic and are best suited for evaluating CAN IDS. While a number of CAN datasets have been published for IDS evaluation [41,82,90], none has emerged as a benchmark dataset in the way the KDD Cup and DARPA datasets are used for computer network IDS evaluation [4]. There is also a need for comprehensive datasets of real attack traces of all known types. While several traces containing fabrication attacks are available, currently available datasets do not contain real traces of suspension or masquerade attacks, and more attacks are being discovered. The creation of such a dataset would allow robust benchmarking and evaluation of CAN IDS, ensuring that CAN IDS being developed can effectively detect at least all known attacks.
Methods of synthesising datasets and simulating attacks are also not without merit: creating real attack datasets requires considerable resources and skill, and simulation tools can allow the creation of customisable attacks, as the parametrised attack framework in [85] illustrates. Such datasets can therefore fill in gaps where real datasets are not available.
Online evaluation methods. Current benchmarking and evaluation studies mostly use offline methods with CAN bus datasets. This restricts the kinds of CAN IDS properties that can be assessed – the surveyed works almost exclusively focus on evaluating attack detection accuracy. Apart from attack detection accuracy, attack detection latency, resource consumption, and workload processing capability are important considerations for an in-vehicle IDS but have not been examined sufficiently in current CAN IDS benchmarking studies. Therefore, there is a need for online evaluation methods for CAN IDS, such as in [78], that can allow repeatable comparative evaluation studies. Benchmarking with online testing methods, such as using trace replay tools, simulations, and testbeds, can facilitate not just the assessment of non-functional properties but also allow a more accurate evaluation of attack detection capability in a manner close to real operating environments. Documentation becomes an important aspect of reporting online evaluations conducted using real vehicles, testbeds, and simulations; documenting and reporting hardware devices and components, network topologies, and source code used would further enhance the reproducibility of results.
Benchmarking Layer 1 CAN IDS. While most CAN intrusion detection methods in the literature use features derived from OSI Layer 2 data, i.e., CAN frames, the physical characteristics-based methods use Layer 1 information for attack detection. Consequently, current CAN attack datasets cannot be used to evaluate these physical IDS and, as a result, these IDS have not been benchmarked alongside the CAN IDS that use CAN messages. Hence, inclusive benchmarking methods are needed where Layer 1, Layer 2, and even hybrid CAN IDS can be evaluated and compared with each other under similar test conditions. Apart from using real vehicles and testbeds, this can be realised by creating datasets of physical measurements (such as [28,64]) and datasets with both Layer 1 and Layer 2 data for offline testing, as well as simulations of physical characteristics for online testing. Inclusive benchmarking of Layer 1 and Layer 2 CAN IDS will allow direct comparison among different types of CAN IDS and also facilitate the development of comprehensive intrusion detection methods that incorporate more diverse features and detect a wider range of attacks.
Comprehensive evaluation metrics. Among the surveyed studies, it is observed that the set of metrics reported differs, which hinders direct comparison across studies. Furthermore, the focus is almost exclusively on security-related metrics, with performance-related metrics largely not being measured or reported. However, to select an IDS for implementation in automotive networks, it is necessary to assess not just attack detection capability but also detection latency and other non-functional properties. This means that a comprehensive suite of evaluation metrics needs to be developed that includes both security- and performance-related metrics, covering the assessment of CAN IDS in all practical aspects. The selection of security-related metrics should consider factors such as class imbalance in attack datasets and prior likelihood of attack, which make using only classification accuracy or ROC insufficient [79] and necessitate metrics like balanced accuracy and MCC, which give equal importance to both normal and attack classes.
Conclusion
Many intrusion detection methods are being developed for the CAN bus in an endeavour to secure it against various types of cyberattacks that have the potential to cause vehicles to malfunction and result in dangerous accidents. The evaluation of these CAN IDS varies in terms of the CAN IDS type assessed, the attack types considered, the evaluation type, the workload used for evaluation, and the evaluation metrics reported. We thus propose a CAN IDS evaluation design space in the manner of [50] encapsulating these five aspects of CAN IDS assessment, with the aim of categorising current CAN IDS works and serving as a guide for planning evaluation studies by enumerating existing approaches to CAN IDS evaluation.
CAN IDS are usually evaluated under disparate experimental conditions, which hinders direct comparison. Therefore, there have been several benchmark frameworks proposed and comparative studies conducted that evaluate CAN IDS in similar experimental conditions to reveal how they perform in relation to each other. Such benchmarking efforts ultimately facilitate the selection of the most appropriate CAN intrusion detection methods for implementation in in-vehicle networks. This work surveys current efforts at benchmarking and comparing CAN IDS and discusses them in terms of the proposed CAN IDS evaluation design space to understand current trends as well as directions for future work.
From the surveyed works, it is apparent that anomaly-based CAN IDS are the most popular type of CAN IDS selected for benchmarking since they have the capability of detecting novel, unknown attacks and do not require attack signatures. Among anomaly-based CAN IDS, only statistical and ML-based methods are typically included in benchmarking studies. Because of the difficulties associated with conducting online tests, almost all comparative evaluations are offline evaluations using CAN bus datasets. There are a number of publicly available traces of CAN bus traffic collected from real vehicles, both under normal operation and under attack, which are commonly used for offline evaluations. However, such datasets are limited in the types of attacks they contain; while there are several datasets available with common fabrication attacks, there is a lack of datasets containing other classes of attacks like suspension and masquerade attacks. Offline experiments also allow measurement of only security-related metrics concerning attack detection accuracy. As such, attack detection latency and other non-functional properties are understudied in current benchmarking and comparative studies.
Examining surveyed works in terms of this design space reveals avenues for future work: benchmarking datasets, repeatable online evaluations, methods for comparing Layer 1 CAN IDS with Layer 2 CAN IDS, and comprehensive evaluation metrics.
